CommonCrawl – Extract 4.6 Billion Web Documents

Introduction

CommonCrawl produces and maintains a repository of web crawl data that is openly accessible to everyone. As of September 2013, the repository covered 4.6 billion pages and also included valuable metadata. The crawl data is stored as an Amazon S3 Public Data Set, allowing it to be downloaded as well as accessed directly for map-reduce processing in EC2. This makes wholesale extraction, transformation, and analysis of web data cheap and easy. Small startups can now access high-quality crawl data that was previously only available to large search engine corporations.

Data Statistics as of September 2013 (since 2010):

Total # of Web Documents: 4.6 billion
Total Uncompressed Content Size: 120 TB+
# of Domains: 70 million
# of PDFs: 97.2 million
# of Word Docs: 8.2 million
# of Excel Docs: 1.9 million

What’s Inside

Please note that this blog covers the data format that applies only to the 2012 and earlier corpus. CommonCrawl changed its data formats about five months ago; the details are located at http://commoncrawl.org/new-crawl-data-available/

Let’s look at the format of the data and the different types of data available in the 2012 and earlier corpus. The crawl data set includes three types of files: ARC (raw content), Text Only, and Metadata.

  • ARC Files – Raw Content:
    • ARC files contain the full HTTP response and payload for all pages crawled.
    • ARC files are a series of concatenated GZIP documents. They reside in the segment folders of the 2012 crawl repository.
  • Text Only:
    • These files take content returned as HTML or RSS and parse out just the text content – making it easier to perform text-based analysis
    • Text Only files are saved as Hadoop SequenceFiles using GZIP compression. The key and value are both Text, with the key being the URL and the value containing the actual text content.
    • For HTML pages, the text content includes the page title, page metadata, and all text content from the HTML body. These files are located in the segment directories with file names of the form “textData-nnnnn”.
  • Metadata:
    • The Metadata files contain status information, the HTTP response code, and file names and offsets of ARC files where the raw content can be found.
    • Users can scan the metadata files to pick up extracted links rather than extracting the links themselves
    • Metadata files are saved as Hadoop SequenceFiles using GZIP compression. The key is the URL and the value is a JSON structure of fields and subfields.
    • Similar to TextOnly, the Metadata files are also located in the segment directories, with file names of the form “metadata-nnnnn”.

Use Case

This use case shows how to download the different CommonCrawl data formats from Amazon S3 using the S3cmd tool, and how to extract the data using a Hadoop SequenceFile decoder.

What we want to do:

  • Ensure Python is set up
  • Install & Configure S3cmd tools
  • Download & Extract ARC Files
  • Create a Hadoop SequenceFile Reader Program
  • Download & Extract TextOnly Files
  • Download & Extract Metadata Files

Solution

Ensure Python is set up:

  • There are many blogs and articles explaining how to install Python; a couple of them are referenced below. For this use case, we use Ubuntu and Python 2.7.5

http://askubuntu.com/questions/101591/how-do-i-install-python-2-7-2-on-ubuntu

http://heliumhq.com/docs/installing_python_2.7.5_on_ubuntu
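
To confirm the setup, a quick version check of the interpreter is enough (2.7.5 is the version assumed in this use case):

    # Check which Python version is on the PATH
    $ python --version
    Python 2.7.5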

Install & Configure S3cmd tools:

  • Install S3cmd:
  • Configure S3cmd:
  • Verify S3cmd:
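
A minimal sketch of these three steps on Ubuntu, assuming S3cmd is installed from the distribution’s package repository (your AWS access key and secret key are entered during the interactive configuration):

    # Install S3cmd from the Ubuntu package repository
    $ sudo apt-get install s3cmd

    # Configure S3cmd interactively (prompts for the AWS access key, secret key, etc.)
    $ s3cmd --configure

    # Verify the installation
    $ s3cmd --version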

Download & Extract ARC Files:

  • Download from Amazon S3:

The CommonCrawl repository is stored as an Amazon S3 Public Data Set. In order to download it, we need to add the header “x-amz-request-payer:requester”.
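
A sketch of the download step with S3cmd is shown below. The S3 path is only a placeholder for an actual segment path in the repository, and the sketch assumes the requester-pays header can be passed with S3cmd’s --add-header option (newer S3cmd releases also provide a --requester-pays flag):

    # Download one ARC file from the CommonCrawl repository
    # (the s3:// path below is a placeholder, not a real segment path)
    $ s3cmd get --add-header="x-amz-request-payer:requester" \
        s3://<commoncrawl-bucket>/<segment-path>/<file>.arc.gz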

  • Uncompress & View the ARC file:

Note: Each ARC file is about 100 MB in size, so don’t use the vi editor to view the content as it will take a long time. Use Linux commands such as grep or tail to view the data more easily.
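
Because an ARC file is a series of concatenated GZIP documents, zcat can stream the decompressed content directly; for example (the file name is a placeholder):

    # Peek at the first records of the ARC file
    $ zcat <file>.arc.gz | head -n 50

    # Search for a particular host without opening the whole file
    $ zcat <file>.arc.gz | grep "example.com" | head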

Create a Hadoop SequenceFile Reader Program:

TextOnly and Metadata files are stored as Hadoop SequenceFiles rather than raw text. We need to write a small decoder program that takes a SequenceFile as input and outputs the key and value in readable text form. The key is the URL and the value is the actual content.
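
A minimal sketch of such a decoder is shown below. It assumes the old-style SequenceFile.Reader API from hadoop-core and Text keys and values as described earlier; the class name CCSequenceFileDecoder is only illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Illustrative decoder: reads a CommonCrawl TextOnly or Metadata SequenceFile
    // and prints "<url><TAB><content>" for every record.
    public class CCSequenceFileDecoder {

        public static void main(String[] args) throws IOException {
            if (args.length != 1) {
                System.err.println("Usage: CCSequenceFileDecoder <sequence-file>");
                System.exit(1);
            }

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);
            Path path = new Path(args[0]);

            // hadoop-core (0.20/1.x) style reader; the compression codec is read from the file header
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                Text key = new Text();     // the URL
                Text value = new Text();   // the text content (or the JSON metadata)
                while (reader.next(key, value)) {
                    System.out.println(key.toString() + "\t" + value.toString());
                }
            } finally {
                reader.close();
            }
        }
    }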

  • Create a Decoder: a minimal sketch of the decoder program is shown above.
  •  Download Hadoop Native Libraries:
    • Hadoop will sometimes complain that it cannot load the native-hadoop libraries when it can’t find the 64-bit native libraries.
    • The easiest way to get the native libraries is to download them from one of the Nutch sources at the link below, and get all the files

http://svn.apache.org/viewvc/nutch/branches/branch-1.1/lib/native/Linux-amd64-64/


    • Create a directory “linux-amd64-64” and store the above files in it
  • Create a Decoder Script (sketched below):
    • This script comes in handy to extract the content from the TextOnly and Metadata files. Please note that we need the hadoop-core.jar and commons-logging-1.1.1.jar files as dependencies.
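
A sketch of such a wrapper script, assuming the decoder class above has been compiled in the current directory, that hadoop-core.jar and commons-logging-1.1.1.jar sit next to it, and that the native libraries were placed in ./linux-amd64-64 (all names are illustrative):

    #!/bin/bash
    # decoder.sh <sequence-file>
    # Illustrative wrapper around the decoder class above; writes the decoded
    # key/value pairs to out.log.
    java -cp .:hadoop-core.jar:commons-logging-1.1.1.jar \
         -Djava.library.path=./linux-amd64-64 \
         CCSequenceFileDecoder "$1" > out.log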

     Note: Make the script executable

Download & Extract TextOnly Files:

  • Download from Amazon S3:
  • View the TextOnly file:
  • Note: The decoder script stores the key and value pairs in the out.log file. The download and decode steps for a TextOnly file are sketched below.
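
Putting the pieces together for a TextOnly file (the S3 path and the file name are placeholders):

    # Download one TextOnly SequenceFile (placeholder path)
    $ s3cmd get --add-header="x-amz-request-payer:requester" \
        s3://<commoncrawl-bucket>/<segment-path>/textData-00000

    # Decode it with the wrapper script; the key/value pairs land in out.log
    $ ./decoder.sh textData-00000
    $ head out.log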

Download & Extract Metadata Files:

  • Download from Amazon S3:
  • View the Metadata file:
  • Note: The decoder script stores the key and value (a JSON structure) pairs in the out.log file. A way to fetch a Metadata file and inspect its JSON values is sketched below.
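
The Metadata files are downloaded and decoded the same way as the TextOnly files. Since each value is a JSON structure, a single record can be pretty-printed, for example (the file name is a placeholder, and this assumes the decoder writes tab-separated key/value lines as in the sketch above):

    # Decode a Metadata file and pretty-print the JSON value of the first record
    $ ./decoder.sh metadata-00000
    $ head -n 1 out.log | cut -f2- | python -m json.tool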

Conclusion

  • CommonCrawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.
  • Starting with the 2013 dataset, CommonCrawl uses the WARC (Web ARChive) format instead of ARC files and Hadoop SequenceFiles.
  • Tools like S3cmd and Hadoop SequenceFile readers come in handy to download and extract different types of data.
