YouTube Crawler Elasticsearch DataStore Snapshot

YouTube Crawler (ytcrawler) is a customized tool for retrieving metadata for publicly accessible YouTube videos. The metadata can be imported into Elasticsearch/Kibana for future analysis.

Analysis of the dataset appears in:

Feng Li, Jae Chung and Mark Claypool. “Three-year Trends in YouTube Video Content and Encoding”, In Proceedings of the 18th International Conference on Signal Processing and Multimedia Applications (SIGMAP), Virtual Conference, July 6-8, 2021. Online at: http://www.cs.wpi.edu/~claypool/papers/youtube-crawler-21/

This documentation describes: 1) the prerequisites for the software, 2) how to install ytcrawler and load the ytcrawler data set (as a snapshot) onto an Elasticsearch cluster, and 3) the database description.

1. Prerequisites

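Because our snapshot was taken from a live server running Elasticsearch 7.11, we recommend installing or upgrading Elasticsearch to 7.x (see the Elasticsearch Installation Guide). Note that Elasticsearch cannot restore a snapshot into a version older than the one that created it, so use 7.11.1 (the version recorded in the snapshot) or a later 7.x release.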

Installing Kibana is optional, but we highly recommend installing Kibana 7.x (see the Kibana Installation Guide).

Before starting, please ensure that there is at least 40 GB of free disk space on the Elasticsearch node, and that you have the privileges needed to create and delete Elasticsearch indices.
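The disk-space requirement can be checked before downloading anything. The following is a minimal sketch, not part of ytcrawler: has_free_gb is a hypothetical helper name, and it assumes a POSIX df and awk are available.

```shell
# Hypothetical helper: succeed if the filesystem holding DIR ($1) has at
# least MIN_GB ($2) gigabytes available.
has_free_gb() (
  # df -P prints one header line and one data line per filesystem;
  # with -k, column 4 is the available space in 1024-byte blocks.
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(($2 * 1024 * 1024)) ]
)

# Example use (the path is the example location used in Step 2):
# has_free_gb /extra_spaces/data 40 || echo "less than 40 GB free" >&2
```

The check covers only the filesystem that holds the given directory, so it is worth running it against both the snapshot location and the Elasticsearch data directory if they are on different disks.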

2. Installation

Step 1. Download the snapshot tarball, ytcrawler_snapshot_sigmap21.tgz.

wget -P ~/Downloads http://perform.wpi.edu/downloads/youtube-crawler-21/ytcrawler_snapshot_sigmap21.tgz

Step 2. Create a directory to hold the Elasticsearch snapshot backup, for example:

mkdir -p /extra_spaces/data/es_backup

Step 3. Untar the downloaded snapshot tarball, ytcrawler_snapshot_sigmap21.tgz, into the directory created in Step 2.

cd /extra_spaces/data/es_backup 
tar -zxvf ~/Downloads/ytcrawler_snapshot_sigmap21.tgz 

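Step 4. Give ownership of the snapshot directory to the elasticsearch user so that Elasticsearch can read and write it. Note that this may require root/sudo privileges.

sudo chown -R elasticsearch:elasticsearch /extra_spaces/data/es_backup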

Step 5. Add path.repo to the Elasticsearch configuration file, /etc/elasticsearch/elasticsearch.yml, either by editing the file with vim or by running the following command. Note that this may require root/sudo privileges.

cat >> /etc/elasticsearch/elasticsearch.yml << EOF
path.repo: ["/extra_spaces/data/es_backup"]
EOF
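Running the append command more than once leaves duplicate path.repo entries in the file, which Elasticsearch may reject at startup. A minimal sketch of an append that is safe to repeat follows; add_path_repo is a hypothetical helper name, and it only detects the flat "path.repo:" form at the start of a line, not a nested "path:" / "repo:" block.

```shell
# Hypothetical helper: append a path.repo entry to the config file ($1)
# for the repository directory ($2), unless an entry already exists.
add_path_repo() {
  if grep -q '^path\.repo' "$1"; then
    echo "path.repo already set in $1; leaving it unchanged" >&2
  else
    printf 'path.repo: ["%s"]\n' "$2" >> "$1"
  fi
}

# Example use (run as root, since the file is usually owned by root):
# add_path_repo /etc/elasticsearch/elasticsearch.yml /extra_spaces/data/es_backup
```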

Step 6. Restart the Elasticsearch service so that the path.repo setting added in Step 5 takes effect.

sudo service elasticsearch restart 

Step 7. Register the directory as a snapshot repository, either with curl or through Dev Tools in Kibana.

curl -XPUT -H "Content-Type: application/json;charset=UTF-8" 'http://localhost:9200/_snapshot/yt_crawler_backup' -d '{
  "type": "fs",
  "settings": {
     "location": "/extra_spaces/data/es_backup",
     "compress": true
  }
}'

(Optional) Use the following command to verify that the repository has been created successfully.

curl -XGET 'http://localhost:9200/_snapshot/_all?pretty'

Sample output:

{
  "yt_crawler_backup" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "/extra_spaces/data/es_backup"
    }
  }
}

Step 8. Verify that the ytcrawler sigmap21 snapshot is visible in the repository. Elasticsearch should pick up the snapshot automatically from the repository directory.

curl -XGET "http://localhost:9200/_snapshot/yt_crawler_backup/_all?pretty" 

Sample output of the above command:

{
  "snapshots" : [
    {
      "snapshot" : "yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua",
      "uuid" : "m2Lpm_P_QX--lz3SKd-w9A",
      "version_id" : 7110199,
      "version" : "7.11.1",
      "indices" : [
        "ytcrawler-2015",
        "ytcrawler-2016",
        "ytcrawler-2019",
        "ytcrawler-2011",
        "ytcrawler-2021",
        "ytcrawler-2007",
        "ytcrawler-2008",
        "ytcrawler-2012",
        "ytcrawler-2018",
        "ytcrawler-2017",
        "ytcrawler-2006",
        "ytcrawler-2010",
        "ytcrawler-2020",
        "ytcrawler-2005",
        "ytcrawler-2014",
        "ytcrawler-2013",
        "ytcrawler-2009"
      ],
      "data_streams" : [ ],
      "include_global_state" : false,
      "metadata" : {
        "policy" : "yt_crawler_snapshot_sigmap21"
      },
      "state" : "SUCCESS",
      "start_time" : "2021-04-26T00:25:48.657Z",
      "start_time_in_millis" : 1619396748657,
      "end_time" : "2021-04-26T00:50:40.339Z",
      "end_time_in_millis" : 1619398240339,
      "duration_in_millis" : 1491682,
      "failures" : [ ],
      "shards" : {
        "total" : 17,
        "failed" : 0,
        "successful" : 17
      },
      "feature_states" : [ ]
    }
  ]
}

Step 9. Restore the snapshot from the repository.

This can be done from the CLI with curl, or through the Snapshot and Restore GUI in Kibana.

The CLI command using curl:

curl -XPOST "http://localhost:9200/_snapshot/yt_crawler_backup/yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua/_restore?wait_for_completion=true"

The restore process will take a while, so grab a cup of coffee, and then enjoy!
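If disk space or time is limited, the restore does not have to cover all 17 indices at once: the Elasticsearch restore API accepts a JSON body whose "indices" field names the indices to restore as a comma-separated list. The sketch below builds that body; restore_body is a hypothetical helper name, and the index names are taken from the snapshot listing in Step 8.

```shell
# Hypothetical helper: print a restore request body that limits the
# restore to the index names given as arguments.
restore_body() (
  # "$*" joins the arguments with the first character of IFS.
  IFS=,
  printf '{"indices": "%s"}\n' "$*"
)

# Example use, restoring only the 2016 and 2017 indices:
# curl -XPOST -H "Content-Type: application/json" \
#   "http://localhost:9200/_snapshot/yt_crawler_backup/yt_crawler_snapshot_sigmap21-yruri4pmshu0lg5b1ckcua/_restore?wait_for_completion=true" \
#   -d "$(restore_body ytcrawler-2016 ytcrawler-2017)"
```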

3. Database Description

The whole dataset is organized by upload date: the crawling records of a video are stored in the index for the year in which the video was uploaded. For example, Baby Shark Dance was uploaded in 2016, so its crawling records are all indexed in “ytcrawler-2016”. YouTube Crawler adds a new document each time it retrieves the metadata of a video, so one video can have many documents. For example, the following query matches 409 documents for Baby Shark Dance.

GET ytcrawler-2016/_search?size=0
{ 
  "fields": ["id", "upload_date", "visit_date", "display_id", "title"],
  "query": { 
    "term": { 
      "display_id.keyword" :"XqZsoesa55w"
    }
  }
}

Sample output

{
  "took" : 70,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 409,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
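The request above is written in Kibana Dev Tools syntax. Without Kibana, the same term query can be sent with curl, for example to the Elasticsearch _count API. The sketch below only builds the query body; term_query_body is a hypothetical helper name, and it assumes that display_id values contain only letters, digits, "-" and "_", as in the example ID.

```shell
# Hypothetical helper: print a term query on display_id.keyword for the
# video ID given as $1. IDs with unexpected characters are rejected
# instead of being JSON-escaped.
term_query_body() {
  case "$1" in
    '' | *[!A-Za-z0-9_-]*) return 1 ;;
  esac
  printf '{"query": {"term": {"display_id.keyword": "%s"}}}\n' "$1"
}

# Example use, counting the documents stored for Baby Shark Dance:
# curl -s -H "Content-Type: application/json" \
#   "http://localhost:9200/ytcrawler-2016/_count?pretty" \
#   -d "$(term_query_body XqZsoesa55w)"
```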

The following query returns the selected fields of one matched document of Baby Shark Dance.

GET ytcrawler-2016/_search?size=1
{ 
  "_source": false, 
  "fields": ["id", "upload_date", "visit_date", "display_id", "title"],
  "query": { 
    "term": { 
      "display_id.keyword" :"XqZsoesa55w"
    }
  }
}

Selected fields from one sample document for the video Baby Shark Dance:

visit_date: “2020-12-21T22:27:56.000Z”,

upload_date: “2016-06-17T00:00:00.000Z”,

title: “Baby Shark Dance | Most Viewed Video on YouTube | PINKFONG Songs for Children”

id and display_id: “XqZsoesa55w”

**_id**: “XqZsoesa55w_20201221_222756”

Note that **_id** is the system ID of this document. Our YouTube Crawler generates it as a human-readable string of the form “display_id_visitdate”.

    "hits" : [
      {
        "_index" : "ytcrawler-2016",
        "_type" : "_doc",
        "_id" : "XqZsoesa55w_20201221_222756",
        "_score" : 3.2870708,
        "fields" : {
          "display_id" : [
            "XqZsoesa55w"
          ],
          "id" : [
            "XqZsoesa55w"
          ],
          "title" : [
            "Baby Shark Dance | Most Viewed Video on YouTube | PINKFONG Songs for Children"
          ],
          "visit_date" : [
            "2020-12-21T22:27:56.000Z"
          ],
          "upload_date" : [
            "2016-06-17T00:00:00.000Z"
          ]
        }
      }
   ]
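Because the index name and the document _id both follow simple conventions, they can be derived from a video's display_id, visit_date and upload_date. The sketch below is inferred from the single sample document shown above and is not taken from the crawler's source; doc_id and index_for are hypothetical helper names.

```shell
# Hypothetical helper: build the document _id from a display_id ($1) and
# a visit_date ($2) such as 2020-12-21T22:27:56.000Z.
doc_id() (
  day=${2%%T*}                # 2020-12-21
  clock=${2#*T}               # 22:27:56.000Z
  clock=${clock%%[.Z]*}       # 22:27:56
  day=$(printf '%s\n' "$day" | sed 's/-//g')       # 20201221
  clock=$(printf '%s\n' "$clock" | sed 's/://g')   # 222756
  printf '%s_%s_%s\n' "$1" "$day" "$clock"
)

# Hypothetical helper: name of the yearly index for an upload_date ($1).
index_for() {
  printf 'ytcrawler-%s\n' "${1%%-*}"
}

# Example use, fetching one crawl record directly by its _id:
# curl -s "http://localhost:9200/$(index_for 2016-06-17T00:00:00.000Z)/_doc/$(doc_id XqZsoesa55w 2020-12-21T22:27:56.000Z)?pretty"
```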

Please direct any questions or comments to:

Feng Li, Jae Won Chung, Mark Claypool