Boilerpipe – Web Content Extraction without Boilerplates

Introduction

Boilerpipe provides algorithms to detect and remove the surplus “clutter” around the main textual content of a web page. It parses a web page and extracts the principal content, be it a news article, a blog post or any other text. Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0.

It strips unwanted HTML tags and irrelevant text from the web page. Extraction is fast, typically taking only milliseconds, and needs very little input: it does not require global or site-level information, and it is usually quite accurate.

Benefits:

  • Much smarter than regular expressions.
  • Provides several extraction methods.
  • Returns text in a variety of formats.
  • Avoids the manual work of finding content patterns for each source site.
  • Removes boilerplate such as headers, footers, menus and advertisements.

The Boilerpipe library provides several extraction strategies, all aimed at scraping the principal content. They are listed below.

  • ArticleExtractor: A full-text extractor specialized in extracting articles. It is more accurate than DefaultExtractor on article pages.
  • DefaultExtractor: A generic full-text extractor, usually not as good as ArticleExtractor.
  • LargestContentExtractor: Like DefaultExtractor, but keeps only the largest content block.
  • KeepEverythingExtractor: Keeps everything. We can use this for extracting the title and description.
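
To make the differences between the strategies concrete, here is a minimal sketch that runs all four extractors over the same small page. It assumes the boilerpipe jar and its dependencies are on the classpath and that each extractor exposes a shared `INSTANCE` with a `getText(String)` method, as in boilerpipe 1.2.0. The class name `ExtractorComparison` and the sample page are made up for illustration.

```java
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.extractors.KeepEverythingExtractor;
import de.l3s.boilerpipe.extractors.LargestContentExtractor;

public class ExtractorComparison {

    // Hypothetical sample page: a link-only menu followed by one long paragraph.
    static final String SAMPLE_HTML =
        "<html><head><title>Lighthouse Keeping Basics</title></head><body>"
        + "<div><ul><li><a href=\"/\">Home</a></li>"
        + "<li><a href=\"/about\">About</a></li>"
        + "<li><a href=\"/contact\">Contact</a></li></ul></div>"
        + "<p>The lighthouse keeper climbs the spiral staircase every evening to light "
        + "the lamp, checks the lens for salt and dust, records the weather in a worn "
        + "logbook, and then watches the harbor until the last fishing boat has returned "
        + "safely to its mooring behind the long stone breakwater. In winter the same "
        + "routine starts two hours earlier because the light fades quickly.</p>"
        + "</body></html>";

    // Main content as judged by the article-specific strategy.
    static String articleText(String html) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(html);
    }

    // All text on the page, boilerplate included.
    static String everythingText(String html) throws BoilerpipeProcessingException {
        return KeepEverythingExtractor.INSTANCE.getText(html);
    }

    public static void main(String[] args) throws BoilerpipeProcessingException {
        System.out.println("ArticleExtractor:\n" + articleText(SAMPLE_HTML));
        System.out.println("DefaultExtractor:\n"
            + DefaultExtractor.INSTANCE.getText(SAMPLE_HTML));
        System.out.println("LargestContentExtractor:\n"
            + LargestContentExtractor.INSTANCE.getText(SAMPLE_HTML));
        System.out.println("KeepEverythingExtractor:\n" + everythingText(SAMPLE_HTML));
    }
}
```

On a page like this, the menu entries would be expected to show up only in the KeepEverythingExtractor output, while the other strategies should return just the paragraph.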

The output of the extraction can be HTML, text or JSON. The available output formats are listed below.

  • Html (default): Outputs the whole HTML document.
  • htmlFragment: Outputs only those HTML fragments that are regarded as main content.
  • Text: Outputs the extracted main content as plain text.
  • Json: Outputs the extracted main content as JSON.
  • Debug: Outputs debug information that shows how Boilerpipe internally represents a document.
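
As a rough illustration, several of these formats can be produced directly from the Java library. The sketch below assumes the boilerpipe 1.2.0 API (`BoilerpipeSAXInput`, `TextDocument`, `HTMLHighlighter`). The mapping from format names to library calls is an assumption made here, not something the library documents under these names, and JSON is left out because no library call for it is assumed. The class name `OutputFormats` and the sample page are hypothetical.

```java
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLHighlighter;
import org.xml.sax.InputSource;

import java.io.StringReader;

public class OutputFormats {

    // Hypothetical sample page with a single long paragraph.
    static final String SAMPLE_HTML =
        "<html><head><title>Lighthouse Keeping Basics</title></head><body>"
        + "<p>The lighthouse keeper climbs the spiral staircase every evening to light "
        + "the lamp, checks the lens for salt and dust, records the weather in a worn "
        + "logbook, and then watches the harbor until the last fishing boat has returned "
        + "safely to its mooring behind the long stone breakwater.</p>"
        + "</body></html>";

    // Parses the HTML into boilerpipe's block model and marks the content blocks.
    static TextDocument parseAndClassify(String html) throws Exception {
        TextDocument doc =
            new BoilerpipeSAXInput(new InputSource(new StringReader(html))).getTextDocument();
        ArticleExtractor.INSTANCE.process(doc);
        return doc;
    }

    // "Text": the main content as plain text.
    static String asText(String html) throws Exception {
        return parseAndClassify(html).getContent();
    }

    // "htmlFragment": only the HTML fragments regarded as main content.
    static String asHtmlFragment(String html) throws Exception {
        return HTMLHighlighter.newExtractingInstance().process(parseAndClassify(html), html);
    }

    // "Html": the whole document, with the main content highlighted.
    static String asHighlightedHtml(String html) throws Exception {
        return HTMLHighlighter.newHighlightingInstance().process(parseAndClassify(html), html);
    }

    // "Debug": a dump of the internal text blocks and their classification.
    static String asDebug(String html) throws Exception {
        return parseAndClassify(html).debugString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(asText(SAMPLE_HTML));
        System.out.println(asHtmlFragment(SAMPLE_HTML));
        System.out.println(asDebug(SAMPLE_HTML));
    }
}
```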

Use Case

Let’s look at two use cases, starting with the most basic one.

  • To extract the main content from an article website.
  • To scrape specific sections (title, description, body, etc.).

What we need to do:

  • Create a Java program and pass it the URL of the article page from which we need to scrape the content.
  • Create a Java program to get a specific part (e.g. the title) of the article.

Solution

Before solving our use cases, let’s get the prerequisites in place. Fortunately, Boilerpipe needs only a few jar files to get started.

Pre-requisites:

  • JDK 1.6+
  • Use the jars given below to execute the use case samples.
    boilerpipe-1.1.0.jar
    nekohtml-1.9.17.jar
    xerces-2_6_2.jar
  • Verify:
    • Download the jar files from the location given below

To extract main content from the article website:

Create a Java program:

Code:
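
A minimal sketch of such a program, assuming a boilerpipe version that ships `HTMLHighlighter` with `process(URL, BoilerpipeExtractor)` (1.2.0 does). The class name `MainContentExtractor` and the default URL are placeholders, and the output path is the one used in the verification step. The code sticks to JDK 1.6 syntax to match the prerequisites.

```java
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

import java.io.PrintWriter;
import java.net.URL;

public class MainContentExtractor {

    // Fetches the page and returns only the HTML fragments classified as main content.
    static String extractMainContent(URL url) throws Exception {
        BoilerpipeExtractor extractor = ArticleExtractor.INSTANCE;
        HTMLHighlighter highlighter = HTMLHighlighter.newExtractingInstance();
        return highlighter.process(url, extractor);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL (hypothetical); pass a real article URL as the first argument.
        URL url = new URL(args.length > 0 ? args[0] : "http://www.example.com/article.html");
        String outputFile = args.length > 1 ? args[1] : "c:/output/main-content.html";

        String mainContent = extractMainContent(url);

        PrintWriter out = new PrintWriter(outputFile, "UTF-8");
        try {
            // The base tag keeps relative links and images resolvable in the saved file.
            out.println("<base href=\"" + url + "\">");
            out.println(mainContent);
        } finally {
            out.close();
        }
    }
}
```

The output directory has to exist before the program runs, since `PrintWriter` does not create missing parent directories.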

Verify:

  • Run the above Java program; it generates an output file at the given output location (“c:/output/main-content.html”).
  • Open the output file and check the extracted main content.

To scrape specific sections (title, description, etc.):

Create a Java program:

Code:
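
A minimal sketch of such a program, assuming the boilerpipe 1.2.0 API: `HTMLFetcher` downloads the page, `BoilerpipeSAXInput` parses it into a `TextDocument`, and `getTitle()` returns the page title. The class name `TitleExtractor` and the default URL are placeholders.

```java
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLFetcher;
import org.xml.sax.InputSource;

import java.io.StringReader;
import java.net.URL;

public class TitleExtractor {

    // Parses the HTML behind the given source and returns the document title.
    static String titleOf(InputSource source) throws Exception {
        TextDocument doc = new BoilerpipeSAXInput(source).getTextDocument();
        return doc.getTitle();
    }

    // Convenience variant for HTML that is already in memory.
    static String titleOfHtml(String html) throws Exception {
        return titleOf(new InputSource(new StringReader(html)));
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL (hypothetical); pass a real article URL as the first argument.
        URL url = new URL(args.length > 0 ? args[0] : "http://www.example.com/article.html");
        System.out.println("Title: " + titleOf(HTMLFetcher.fetch(url).toInputSource()));
    }
}
```

After parsing, the same `TextDocument` can be passed to an extractor's `process` method, and `getContent()` then returns the body text.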

Verify:

  • Run the above Java program; it prints the title of the article page.

Conclusion

Boilerpipe is an excellent library for extracting a block of text together with associated elements such as the title. Regardless of whether the page is structured or unstructured, Boilerpipe seems to be particularly good at extracting text.

For image extraction, however, it seems a tad lacking. This may be because the image functionality is still in its early days: it is not yet part of a final release, so it is presumably less effective than the rest of the library.
