
Entity Resolution of Healthcare Practitioners for Life Sciences

Automate data wrangling capabilities and integrate them with the existing data pipeline

BUSINESS GOALS:

  • Analyze and compare technology options that would let the data science team move away from Excel-based data munging to an automated data processing pipeline.
  • Capture the data science team’s 80 different Excel transformations, calculations, and other rules, and make them programmatically executable.
  • Benchmark the technology options on speed, data accuracy, and flexibility.
  • Improve the current error-prone, tedious, and slow process of identifying, disambiguating, and linking entities that come from more than 4000 data sources.
  • Enable the data science team to use the chosen technology without much engineering involvement.
  • Expose the data wrangling capabilities as an API so that the current data pipeline process can invoke them.

CHALLENGES:

  • Choose representative samples from the 4000 data sources that cover the different data munging and wrangling needs.
  • Compile detailed pros and cons of the candidate technology options, such as R with RServe; GATE NLP with ANNIE and JAPE Plus; the Hadoop ecosystem with Pig and UDFs; Talend data integration; OpenRefine; and other commercial wrangling tools.
  • Understand and extract the existing Excel macros, and identify options for creating equivalent user-defined functions with minimal effort.
  • Choose a technology option and implement 40 Excel macros across 100 data sources in less than 3 weeks.
  • Create the necessary test suite to validate that the PoC output matches the output of the existing process.

THE SOLUTION:

  • Treselle Systems’ big data engineering team quickly understood the client’s business goals and suggested the following technologies and integration points:
    1. R for data manipulation, munging, cleansing, transformation, and related tasks, integrated with the existing system in real time via RServe by reusing pre-developed R scripts (a minimal RServe call is sketched after this list).
    2. GATE (General Architecture for Text Engineering) with ANNIE, JAPE Plus, and over a dozen other plugins for the same data manipulation tasks, integrated with the existing system in real time via the Java Plugin Framework.
    3. The Hadoop ecosystem with Pig and custom User Defined Functions (UDFs) to run the data transformations in batch mode (see the UDF sketch after this list).
    4. OpenRefine (formerly Google Refine), which works well with messy data out of the box.
    5. Larger commercial platforms such as Talend, Pentaho Data Integration, Paxata, and Trifacta.
  • The client’s head of engineering decided to run the PoC with OpenRefine, since the other options were overkill at that point, and liked that OpenRefine handles messy data out of the box and feels similar to Excel.
  • Created jQuery scripts for custom menus and implemented new macros using complex GREL expressions that perform 3 to 4 data manipulation and transformation steps in one click.
  • Removed unnecessary OpenRefine menu options and added new ones that execute multiple GREL macros in one click.
  • Tracked the web calls between the OpenRefine UI and server with the Gatling sniffer and identified the API calls going over the wire, which made it possible to invoke those calls programmatically from an external system.
  • Created a custom Spring-based application that performs these API calls to OpenRefine so that previously created macros can be executed programmatically from an external application, proving that OpenRefine can be integrated with external applications and systems (a minimal sketch of such a call follows this list).
  • Created integration points within OpenRefine so that it can pull information from external systems to enrich the data using GREL and API calls.
  • Performed extensive benchmarks to understand OpenRefine’s performance and memory behavior by automating stress-test scenarios with different dataset sizes and applying simple to complex macros to them.
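
For the R option (1) above, the integration would typically go through the Rserve Java client. The following is a minimal sketch under stated assumptions: an Rserve daemon running locally on the default port, and a hypothetical script clean_practitioners.R exposing a hypothetical clean_names function standing in for the client's actual R scripts.

    import org.rosuda.REngine.REXP;
    import org.rosuda.REngine.Rserve.RConnection;

    public class RserveExample {
        public static void main(String[] args) throws Exception {
            // Connect to a locally running Rserve daemon (default port 6311).
            RConnection conn = new RConnection("localhost", 6311);
            try {
                // Load a pre-developed R script and call one of its cleansing
                // functions. Script and function names are hypothetical placeholders.
                conn.eval("source('clean_practitioners.R')");
                REXP result = conn.eval("clean_names(c(' John  Smith ', 'JANE  DOE'))");
                for (String name : result.asStrings()) {
                    System.out.println(name);
                }
            } finally {
                conn.close();
            }
        }
    }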
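
For the Hadoop option (3), each Excel macro would become a Pig User Defined Function. The sketch below assumes Pig's Java UDF API; the trim/collapse-whitespace/upper-case transformation is an illustrative stand-in, not one of the client's actual macros.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Illustrative stand-in for an Excel macro: trim the value, collapse
    // repeated whitespace, and upper-case the result.
    public class CleanName extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            String raw = (String) input.get(0);
            return raw.trim().replaceAll("\\s+", " ").toUpperCase();
        }
    }

In Pig Latin, such a UDF is packaged into a jar, loaded with REGISTER, and applied per row with FOREACH ... GENERATE, which is what makes batch-mode execution of the macros possible.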
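
The OpenRefine integration described above rests on the same HTTP commands the OpenRefine UI itself issues. Below is a minimal sketch using Spring's RestTemplate against OpenRefine's apply-operations command. The project id and column name are hypothetical; the GREL expression is a small example of a macro chaining several steps in one click; and the CSRF token step applies to newer OpenRefine releases (Google Refine era builds did not require it).

    import java.util.Map;
    import org.springframework.http.HttpEntity;
    import org.springframework.http.HttpHeaders;
    import org.springframework.http.MediaType;
    import org.springframework.util.LinkedMultiValueMap;
    import org.springframework.util.MultiValueMap;
    import org.springframework.web.client.RestTemplate;

    public class OpenRefineClient {
        private static final String BASE = "http://127.0.0.1:3333"; // default OpenRefine port

        public static void main(String[] args) {
            RestTemplate rest = new RestTemplate();

            // Newer OpenRefine versions require a CSRF token on POST commands.
            Map<?, ?> tokenResponse =
                    rest.getForObject(BASE + "/command/core/get-csrf-token", Map.class);
            String token = (String) tokenResponse.get("token");

            // A GREL "macro" chaining several steps in one expression: trim
            // whitespace, collapse inner spaces, and upper-case the value.
            // The column name "practitioner_name" is a hypothetical example.
            String operations = "[{"
                    + "\"op\":\"core/text-transform\","
                    + "\"engineConfig\":{\"mode\":\"row-based\",\"facets\":[]},"
                    + "\"columnName\":\"practitioner_name\","
                    + "\"expression\":\"grel:value.trim().replace(/\\\\s+/, ' ').toUppercase()\","
                    + "\"onError\":\"keep-original\",\"repeat\":false,\"repeatCount\":10}]";

            HttpHeaders headers = new HttpHeaders();
            headers.setContentType(MediaType.APPLICATION_FORM_URLENCODED);
            MultiValueMap<String, String> form = new LinkedMultiValueMap<>();
            form.add("project", "1234567890123"); // hypothetical project id
            form.add("operations", operations);

            String url = BASE + "/command/core/apply-operations?csrf_token=" + token;
            String response = rest.postForObject(url, new HttpEntity<>(form, headers), String.class);
            System.out.println(response);
        }
    }

The same pattern extends to commands such as create-project-from-upload and export-rows, which is one way PoC output could be exported and compared against the existing process for the validation suite.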