Healthcare Clinical Trial Text Analytics

Providing actionable insights from public healthcare feeds using Natural Language Processing, a proprietary ontology, and a lexicon data dictionary


  • Build an advanced text analytics application that gives insight into the data being collected and stored globally for publicly traded pharmaceutical, medical device, and other specialty pharma companies.
  • Pre-compute the context for the next set of questions a decision-maker is likely to ask; then automate the slow, mundane, and costly processes of aggregating, normalizing, monitoring, and discovering anomalies, enabling department heads to find deep patterns and insights.
  • Build a proprietary ontology based on companies, products, markets, and ingredients.
  • Trigger events of interest by mapping feed URL content against the ontology.
  • Build a large proprietary lexicon data dictionary to score the triggered events.
  • Enable internal SMEs and curators to enhance the ontology and create training models.
  • Create graph-based relationships among ontology metadata, scraped documents, and signals to quickly identify why a particular signal surfaced.
  • Automatically tag triggered events based on curator-provided business rules, making the events easier to search.
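The event-triggering, scoring, and rule-based tagging goals above can be sketched in a few lines of Python. The mini ontology, lexicon weights, and tag rules here are invented placeholders; the actual system used a large, SME-curated ontology and dictionary:

```python
import re

# Hypothetical mini-ontology and lexicon; the real ones were far larger.
ONTOLOGY = {
    "remdesivir": {"type": "product", "market": "antivirals"},
    "pacemaker": {"type": "product", "market": "cardiac devices"},
}
LEXICON = {"approval": 2, "recall": -3, "warning": -2}

# Curator-provided tagging rules: tag name -> predicate over the event.
TAG_RULES = {
    "regulatory-risk": lambda ev: ev["score"] < 0,
    "positive-signal": lambda ev: ev["score"] > 0,
}

def trigger_events(text):
    """Emit a scored, tagged event for each ontology term found in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(t, 0) for t in tokens)
    events = []
    for term, meta in ONTOLOGY.items():
        if term in tokens:
            ev = {"entity": term, "meta": meta, "score": score}
            ev["tags"] = [tag for tag, rule in TAG_RULES.items() if rule(ev)]
            events.append(ev)
    return events
```

For example, a headline mentioning a pacemaker recall triggers one event, scored negatively by the lexicon and tagged `regulatory-risk` by the business rules.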


  • Extracting the actual content from the feed URLs found in RSS feeds required navigating multiple pages.
  • Preparing the ontology on pharmaceutical, medical device, and specialty pharma domains required multiple hands-on sessions with subject matter experts and was time-consuming.
  • Inferring important lexicons in the healthcare industry and associating appropriate positive, neutral, and negative scores to build the custom lexicon data dictionary.
  • Deriving deep insights from the text content.
  • Extracting the needed content from web pages by reducing noise from banners, ads, headers, footers, and other elements that skew content scoring.
  • Tagging the content while aggregating it at run time.
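The noise-reduction challenge can be illustrated with a minimal, stdlib-only sketch that drops text inside common boilerplate tags. This is a crude stand-in for the custom-built adapters the project actually used, which applied per-site rules:

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate noise (an assumption;
# real adapters used richer, site-specific heuristics).
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collect text that falls outside known boilerplate regions."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside noise regions
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Stripping this noise before scoring matters because a sidebar ad mentioning "recall" would otherwise distort the lexicon score of the article itself.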


  • Treselle’s Engineering team implemented a complete data processing pipeline based on Apache Nutch, Scrapy, Selenium, Jsoup, ActiveMQ, Apache Camel, GATE, UIMA, Tika, R, Neo4j, and Elasticsearch.
  • Designed a complex ontology based on the OWL semantic language, stored as RDF triples, which was used for entity identification and resolution.
  • Apache Nutch, together with Jsoup and boilerplate removal, was used to scrape websites at multiple depth levels, RSS feeds, and XML content. Custom-built adapters reduced content noise and improved quality.
  • Selenium was used to scrape interactive websites involving jQuery, JavaScript popups, login authentication, and search box entries.
  • ActiveMQ was used to initiate batch and on-demand crawling of different sources, producing messages from either cron jobs or user actions.
  • Apache Camel was used for mediation and routing logic throughout the data pipeline.
  • GATE (General Architecture for Text Engineering) and R were used extensively, and many natural language processing algorithms were designed to match content against the ontologies.
  • GATE was also used to compute document scores based on document age, matched lexicons (each carrying different scoring values), and other custom rules.
  • Elasticsearch clusters were designed with tagging, boosting, and ranking to provide highly intuitive search capabilities, including topic modelling, POS tagging, lemmatization, fuzzy search, AND/OR search, lexicon dictionary lookup, valuation range searches, tag search, and others.
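As a rough illustration, the age-plus-lexicon document scoring described above might look like the following. The half-life constant and lexicon weights are assumptions for the sketch, not the project's actual rules:

```python
from datetime import date

# Hypothetical lexicon weights; the real dictionary was large and curated.
LEXICON = {"approval": 2.0, "recall": -3.0, "warning": -2.0}
HALF_LIFE_DAYS = 30  # assumed decay rate

def score_document(text, published, today=None):
    """Lexicon score weighted by exponential age decay: recent documents
    with strongly weighted lexicons score furthest from zero."""
    today = today or date.today()
    age_days = (today - published).days
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
    raw = sum(LEXICON.get(tok, 0.0) for tok in text.lower().split())
    return raw * decay
```

Under these assumed weights, a 30-day-old document mentioning both a recall and a warning scores -5.0 x 0.5 = -2.5, so older negative news gradually loses influence on the surfaced signals.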