YouTube Crawler (ytcrawler) is a customized tool for retrieving metadata for publicly accessible YouTube videos. The metadata can be imported into ElasticSearch/Kibana for future analysis.
Analysis of the dataset appears in:
Feng Li, Jae Chung and Mark Claypool. “Three-year Trends in YouTube Video Content and Encoding”, In Proceedings of the 18th International Conference on Signal Processing and Multimedia Applications (SIGMAP), Virtual Conference, July 6-8, 2021. Online at: http://www.cs.wpi.edu/~claypool/papers/youtube-crawler-21/
This documentation describes: 1) the prerequisites for the software, 2) how to install ytcrawler and load the ytcrawler dataset (as a snapshot) onto an ElasticSearch cluster, and 3) the database description.
Because our snapshot was taken from a live server running ElasticSearch 7.11, we recommend installing or upgrading ElasticSearch to 7.x (see the ElasticSearch Installation Guide).
Installing Kibana is optional, but we highly recommend installing Kibana 7.x (see the Kibana Installation Guide).
Before starting, please ensure that there is at least 40 GB of free disk space on the ElasticSearch node, and that you have the privileges required to create and delete ElasticSearch indices.
directory created in Step 2
or the following command. Note that it may require root/sudo privileges.
curl -XPUT -H "Content-Type: application/json;charset=UTF-8" 'http://localhost:9200/_snapshot/esbackup' -d '{
  "type": "fs",
  "settings": {
    "location": "/elasticseacrhData/es-backup",
    "compress": true
  }
}'
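For readers who prefer to script this step, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not part of ytcrawler: the helper name `build_repo_request` is hypothetical, and the host, repository name, and location simply mirror the curl command above and should be adjusted for your cluster.

```python
import json
import urllib.request


def build_repo_request(host, repo, location, compress=True):
    """Build (but do not send) a PUT request that registers an "fs" snapshot repository.

    Hypothetical helper: it only mirrors the curl command shown above.
    """
    body = {
        "type": "fs",
        "settings": {"location": location, "compress": compress},
    }
    return urllib.request.Request(
        url=f"{host}/_snapshot/{repo}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=UTF-8"},
        method="PUT",
    )


# Values copied from the curl command above; they are assumptions about your setup.
repo_request = build_repo_request(
    "http://localhost:9200", "esbackup", "/elasticseacrhData/es-backup"
)
print(repo_request.get_method(), repo_request.full_url)
```

To actually send the request, pass `repo_request` to `urllib.request.urlopen` on a machine that can reach the ElasticSearch node. Building the request separately from sending it makes the payload easy to inspect first.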
(Optional) Use the following command to verify that the repository has been created successfully.
Sample output:
{
"yt_crawler_backup" : {
"type" : "fs",
"settings" : {
"compress" : "true",
"location" : "/extra_spaces/data/es_backup"
}
}
}
The sample output of the above command.
{
"snapshots" : [
{
"snapshot" : "yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua",
"uuid" : "m2Lpm_P_QX--lz3SKd-w9A",
"version_id" : 7110199,
"version" : "7.11.1",
"indices" : [
"ytcrawler-2015",
"ytcrawler-2016",
"ytcrawler-2019",
"ytcrawler-2011",
"ytcrawler-2021",
"ytcrawler-2007",
"ytcrawler-2008",
"ytcrawler-2012",
"ytcrawler-2018",
"ytcrawler-2017",
"ytcrawler-2006",
"ytcrawler-2010",
"ytcrawler-2020",
"ytcrawler-2005",
"ytcrawler-2014",
"ytcrawler-2013",
"ytcrawler-2009"
],
"data_streams" : [ ],
"include_global_state" : false,
"metadata" : {
"policy" : "yt_crawler_snapshot_sigmap21"
},
"state" : "SUCCESS",
"start_time" : "2021-04-26T00:25:48.657Z",
"start_time_in_millis" : 1619396748657,
"end_time" : "2021-04-26T00:50:40.339Z",
"end_time_in_millis" : 1619398240339,
"duration_in_millis" : 1491682,
"failures" : [ ],
"shards" : {
"total" : 17,
"failed" : 0,
"successful" : 17
},
"feature_states" : [ ]
}
]
}
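A response shaped like the sample output above can be checked programmatically before attempting a restore. The following is a minimal sketch under stated assumptions: `summarize_snapshots` is a hypothetical helper, and the embedded JSON is an abbreviated copy of the sample output (three of the 17 indices), not a live response.

```python
import json


def summarize_snapshots(response_text):
    """Return one summary dict per snapshot found in a snapshot listing response."""
    summaries = []
    for snap in json.loads(response_text)["snapshots"]:
        shards = snap["shards"]
        summaries.append({
            "snapshot": snap["snapshot"],
            "version": snap["version"],
            "state": snap["state"],
            "indices": sorted(snap["indices"]),
            # A snapshot is complete only if every shard was stored successfully.
            "all_shards_ok": shards["failed"] == 0
            and shards["successful"] == shards["total"],
        })
    return summaries


# Abbreviated copy of the sample output above (illustration only).
sample_listing = """
{
  "snapshots": [
    {
      "snapshot": "yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua",
      "version": "7.11.1",
      "indices": ["ytcrawler-2019", "ytcrawler-2015", "ytcrawler-2016"],
      "state": "SUCCESS",
      "shards": {"total": 17, "failed": 0, "successful": 17}
    }
  ]
}
"""

for summary in summarize_snapshots(sample_listing):
    print(summary["snapshot"], summary["state"], summary["indices"])
```

The snapshot name reported here is the one to supply when restoring.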
The restore can be done from the CLI with curl, or through the Kibana GUI (Snapshot and Restore).
The CLI command using curl
The restore process will take a while, so grab a cup of coffee, and then enjoy!
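As an illustration of what a scripted restore might look like, the sketch below builds a request against ElasticSearch's standard snapshot restore endpoint (`POST /_snapshot/<repository>/<snapshot>/_restore`). It is not the project's own command: `build_restore_request` is a hypothetical helper, the repository name is taken from the registration command earlier in this document, the snapshot name from the sample output, and the `ytcrawler-*` index pattern is an assumption based on the index names listed there.

```python
import json
import urllib.request
from urllib.parse import quote


def build_restore_request(host, repo, snapshot, indices="ytcrawler-*"):
    """Build (but do not send) a POST request that restores indices from a snapshot."""
    body = {
        "indices": indices,             # assumed pattern covering ytcrawler-2005 .. 2021
        "include_global_state": False,  # restore index data only
    }
    url = f"{host}/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}/_restore"
    return urllib.request.Request(
        url=url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Names below are copied from earlier examples and may differ on your cluster.
restore_request = build_restore_request(
    "http://localhost:9200",
    "esbackup",
    "yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua",
)
print(restore_request.get_method(), restore_request.full_url)
```

Sending the request with `urllib.request.urlopen(restore_request)` starts the restore. Indices with the same names must not already be open on the cluster, otherwise ElasticSearch rejects the request.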
The whole dataset is organized by upload date. For example, Baby Shark Dance was uploaded in 2016, so all of its crawling records are indexed in “ytcrawler-2016”. YouTube Crawler adds a new document each time it retrieves the metadata of Baby Shark Dance. For example, the following query returns 409 documents.
GET ytcrawler-2016/_search?size=0
{
  "fields": ["id", "upload_date", "visit_date", "display_id", "title"],
  "query": {
    "term": {
      "display_id.keyword": "XqZsoesa55w"
    }
  }
}
Sample output
{
"took" : 70,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 409,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
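The per-year index layout and the count query above can be expressed as a few small helpers. This is a minimal sketch with hypothetical function names (`index_for_upload_year`, `build_video_query`, `hit_count`). It assumes the index name is always `ytcrawler-` followed by the upload year, as in the example, and the sample response is an abbreviated copy of the output above.

```python
import json


def index_for_upload_year(year):
    """Name of the index holding all crawl records of videos uploaded in `year`."""
    return f"ytcrawler-{year}"


def build_video_query(display_id):
    """Term query matching every crawl record of a single video."""
    return {
        "fields": ["id", "upload_date", "visit_date", "display_id", "title"],
        "query": {"term": {"display_id.keyword": display_id}},
    }


def hit_count(response):
    """Number of matching documents reported by a _search response."""
    return response["hits"]["total"]["value"]


# Kibana console form of the count query for Baby Shark Dance (uploaded in 2016).
print(f"GET {index_for_upload_year(2016)}/_search?size=0")
print(json.dumps(build_video_query("XqZsoesa55w"), indent=2))

# Abbreviated copy of the sample output above.
sample_response = {"hits": {"total": {"value": 409, "relation": "eq"}, "hits": []}}
print(hit_count(sample_response))
```

With `size=0` the response carries only the total, which is why the inner `hits` array is empty.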
The following query returns the selected fields of one matched document of Baby Shark Dance.
GET ytcrawler-2016/_search?size=1
{
  "_source": false,
  "fields": ["id", "upload_date", "visit_date", "display_id", "title"],
  "query": {
    "term": {
      "display_id.keyword": "XqZsoesa55w"
    }
  }
}
Selected fields from one sample document for the video Baby Shark Dance:
visit_date: “2020-12-21T22:27:56.000Z”
upload_date: “2016-06-17T00:00:00.000Z”
title: “Baby Shark Dance | Most Viewed Video on YouTube | PINKFONG Songs for Children”
id and display_id: “XqZsoesa55w”
**_id**: “XqZsoesa55w_20201221_222756”
Note that **_id** is the system ID of this document. Our YouTube Crawler generates it as a human-readable string of the form “display_id_visitdate”.
"hits" : [
{
"_index" : "ytcrawler-2016",
"_type" : "_doc",
"_id" : "XqZsoesa55w_20201221_222756",
"_score" : 3.2870708,
"fields" : {
"display_id" : [
"XqZsoesa55w"
],
"id" : [
"XqZsoesa55w"
],
"title" : [
"Baby Shark Dance | Most Viewed Video on YouTube | PINKFONG Songs for Children"
],
"visit_date" : [
"2020-12-21T22:27:56.000Z"
],
"upload_date" : [
"2016-06-17T00:00:00.000Z"
]
}
}
]
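The “display_id_visitdate” convention for **_id** can be illustrated with a short sketch. The helper names `make_doc_id` and `parse_doc_id` are hypothetical, and the `YYYYMMDD_HHMMSS` layout and UTC interpretation are inferred from the single sample document above rather than taken from the crawler's source.

```python
from datetime import datetime, timezone


def make_doc_id(display_id, visit_date):
    """Join the video's display_id and the visit time as <display_id>_<YYYYMMDD>_<HHMMSS>."""
    return f"{display_id}_{visit_date.strftime('%Y%m%d_%H%M%S')}"


def parse_doc_id(doc_id):
    """Split a document _id back into (display_id, visit time)."""
    # YouTube IDs may themselves contain underscores, so split from the right.
    display_id, day, clock = doc_id.rsplit("_", 2)
    visited = datetime.strptime(day + clock, "%Y%m%d%H%M%S")
    # Assumption: the timestamp is UTC, matching the "Z" suffix of visit_date.
    return display_id, visited.replace(tzinfo=timezone.utc)


# visit_date of the sample document, parsed from its ISO 8601 form.
visit = datetime.strptime("2020-12-21T22:27:56.000Z", "%Y-%m-%dT%H:%M:%S.%fZ")
doc_id = make_doc_id("XqZsoesa55w", visit)
print(doc_id)
print(parse_doc_id(doc_id))
```

Because the visit time is part of the ID, repeated crawls of the same video produce distinct documents instead of overwriting each other.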
Please direct any questions or comments to: