Results

Migrate from a commercial scraping product to open source for HomeBuilders

Save thousands of dollars in licensing costs and improve the performance and accuracy of scraped data

BUSINESS GOALS:

  • Avoid the licensing cost of the commercial scraping product by migrating to open source scraping technologies.
  • Ability to run the scraping in parallel so that data is ingested in hours rather than days.
  • Ability to use the same open source framework for complex websites based on maps and search-based forms.
  • Improve the accuracy of the scraped data by comparing it with historical data, which was not possible with the commercial product.
  • Ability to feed input parameters from the internal database and selectively scrape content, unlike the commercial product, which scraped all content.

CHALLENGES:

  • The team that built the solution around the commercial product a few years ago was no longer available.
  • The scraping process had been inactive for over a year, and all agents in the system had become obsolete.
  • The agents in the old system were written to scrape only 4 HomeBuilder categories, whereas the new requirement called for 30.
  • The data model that stored the final ingested data was out of date and not aligned with the new requirements.

THE SOLUTION:

  • The Treselle engineering and QA teams, experienced in open source scraping frameworks, chose Selenium, Scrapy, JSoup, and related libraries for the scraping process (a minimal spider sketch follows this list).
  • Reverse engineered the existing scraping agents and created multiple requirement documents to validate them with the business users.
  • Designed, architected, and implemented a customized web scraping platform while adhering to the EULA policies of each website.
  • Created a new data model and ported the old data into it using Talend, filling in sensible default values for missing data.
  • Deployed the scraping platform on AWS using a publish-subscribe model to run the scraping process in parallel (see the job-publishing sketch after this list).
  • Implemented multiple data provider adapters to store the ingested data at different stages (raw, transient, and refined) in different storage systems such as S3 and MySQL RDS (see the adapter sketch after this list).
  • Implemented 40 normalization and transformation rules for data cleansing and wrangling (example rules and a validation check are sketched after this list).
  • Created multiple test cases to validate the ingested and transformed data, ensuring business rules such as filtering, grouping similar categories, splitting a single category into multiple categories, and data type conversions were applied correctly.
  • Developed a separate UI using Zeppelin for business users, and used Talend with JasperReports for data quality checks.
  • Deleted all agents written in the commercial product and discontinued its use, saving thousands of dollars in licensing costs.
  • Provided the ability to create a new medium-complexity website scraper in about an hour, compared with the 2 days it previously took to write an agent.
  • Enabled the business to scrape complex websites based on geo-maps, distance search, ZIP code search, infinite scrolling pages, and more.
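
EXAMPLE SKETCHES:

The sketches below illustrate the approach described above; they are simplified examples under stated assumptions, not the production code. First, a minimal Scrapy spider of the kind that can now be written in about an hour for a medium-complexity listing site. The start URL, CSS selectors, and field names are hypothetical placeholders.

```python
# Minimal Scrapy spider sketch. The start URL, CSS selectors, and field
# names are hypothetical placeholders, not the actual builder sites.
import scrapy


class BuilderCommunitySpider(scrapy.Spider):
    name = "builder_communities"
    # Hypothetical seed URL; the real agents receive their seed URLs and
    # input parameters from the internal database.
    start_urls = ["https://www.example-builder.com/communities"]

    def parse(self, response):
        # One item per community card on the listing page (selector assumed).
        for card in response.css("div.community-card"):
            yield {
                "community_name": card.css("h3::text").get(),
                "city": card.css("span.city::text").get(),
                "price_from": card.css("span.price::text").get(),
            }
        # Follow pagination when the site provides it (selector assumed).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```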
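
Scrape jobs are fanned out with a publish-subscribe model on AWS. The sketch below assumes Amazon SQS as the transport and a queue named scrape-jobs; the case study does not state which AWS messaging service was used, so the service choice, queue name, and job fields are assumptions.

```python
# Sketch of the publishing side of the parallel scraping pipeline.
# Assumes an existing SQS queue named "scrape-jobs" (hypothetical).
import json

import boto3


def publish_scrape_jobs(job_params):
    """Publish one message per job so multiple scraping workers,
    subscribed to the queue, can process them in parallel."""
    sqs = boto3.resource("sqs")
    queue = sqs.get_queue_by_name(QueueName="scrape-jobs")  # assumed name
    for params in job_params:
        queue.send_message(MessageBody=json.dumps(params))


# Example input parameters as they might be selected from the internal
# database (builder, category, and zip values are made up).
publish_scrape_jobs([
    {"builder": "ExampleHomes", "category": "single-family", "zip": "75001"},
    {"builder": "ExampleHomes", "category": "townhome", "zip": "75002"},
])
```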
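
The data provider adapters store each record at the raw, transient, and refined stages in different backends. The sketch below shows one possible shape for that abstraction with an S3 provider and a MySQL RDS provider; the bucket, table, and column names are assumptions.

```python
# Sketch of pluggable data provider adapters for the staged storage model.
import json
from abc import ABC, abstractmethod

import boto3
import pymysql


class DataProvider(ABC):
    """Common interface so the pipeline can write to any stage/store."""

    @abstractmethod
    def save(self, stage: str, record: dict) -> None:
        ...


class S3Provider(DataProvider):
    """Stores raw and transient payloads as JSON objects in S3."""

    def __init__(self, bucket: str):
        self.bucket = bucket  # hypothetical bucket name supplied by caller
        self.s3 = boto3.client("s3")

    def save(self, stage: str, record: dict) -> None:
        key = f"{stage}/{record['builder']}/{record['id']}.json"
        self.s3.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=json.dumps(record).encode("utf-8"),
        )


class MySQLProvider(DataProvider):
    """Stores refined records in a MySQL RDS table (schema assumed)."""

    def __init__(self, **conn_kwargs):
        self.conn = pymysql.connect(**conn_kwargs)

    def save(self, stage: str, record: dict) -> None:
        with self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO refined_listings (id, builder, category, price) "
                "VALUES (%s, %s, %s, %s)",
                (record["id"], record["builder"],
                 record["category"], record["price"]),
            )
        self.conn.commit()
```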
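
Finally, a sketch of two of the normalization/transformation rules and a test case of the kind used for validation. The field formats, the category grouping map, and the expected values are illustrative, not the actual 40 production rules.

```python
# Sketch of normalization rules plus a pytest-style validation check.
# The category map and price format are illustrative assumptions.
CATEGORY_GROUPS = {"townhouse": "townhome", "town home": "townhome"}


def normalize_price(raw_price: str) -> float:
    """Convert a scraped price string such as 'From $325,900' to a number."""
    digits = "".join(ch for ch in raw_price if ch.isdigit() or ch == ".")
    return float(digits) if digits else 0.0


def normalize_category(raw_category: str) -> str:
    """Group similar category labels under one canonical name."""
    key = raw_category.strip().lower()
    return CATEGORY_GROUPS.get(key, key)


def test_price_and_category_rules():
    # Data type conversion and grouping of similar categories.
    assert normalize_price("From $325,900") == 325900.0
    assert normalize_category(" Townhouse ") == "townhome"
```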