Clinical Trial Marketplace Data Science Application

An advanced data analytics application that provides insight into the data collected and stored globally for pharmaceutical clinical trials


  • An advanced data analytics application based on a microservice architecture that provides near-real-time insight into the data collected and stored globally for pharmaceutical clinical trials, from initiation to completion. The data will be collected from six sources:
    1. Three CDISC-compliant structured data sources: SDTM, CDASH, and ADaM
    2. Unstructured data from handwritten clinical notes
    3. Clinical trial data scraped from the web
  • The platform must use entity identification, resolution, and disambiguation techniques to properly associate the following:
    1. Names of people, symptoms, drugs, and diagnoses
    2. Concepts or ideas, so that the same idea can be found no matter how it is expressed (e.g., high blood pressure vs. hypertension)
    3. Relationships, such as cause and effect, side effects, or exercise
    4. Categories, so that similar documents are grouped together
    5. Sentiment or opinion
    6. Location
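As a minimal illustration of the entity-resolution requirement above, the sketch below maps different surface forms of the same concept to one canonical term using a hand-built synonym table. The table entries and function name are hypothetical; the real platform would draw these mappings from a medical ontology and the NLTK/scikit-learn stack described later.

```python
import re

# Hypothetical synonym table mapping surface forms to a canonical concept.
# Illustrative only; a production system would load this from an ontology.
CANONICAL = {
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

def resolve_entities(text: str) -> list:
    """Return canonical concepts found in free text (longest match first)."""
    found = []
    lowered = text.lower()
    for surface in sorted(CANONICAL, key=len, reverse=True):
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            canon = CANONICAL[surface]
            if canon not in found:
                found.append(canon)
    return found
```

With this table, "high blood pressure" and "hypertension" resolve to the same concept, so a search for either finds both documents.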
  • Create a set of services, such as bots, that download clinical trial data from the web. The bots should simulate the user experience of downloading web data, adding it to the data store for analysis alongside other datasets.
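A rough sketch of such a download bot's control flow is shown below. The class and parameter names are hypothetical; in the real service the `fetch` callable would wrap a Selenium-driven browser session, whereas here it is injected so the retry/politeness logic can be shown without a live browser.

```python
import time
from typing import Callable

class ScraperBot:
    """Minimal sketch of a polite download bot (names are hypothetical)."""

    def __init__(self, fetch: Callable[[str], str],
                 delay_s: float = 1.0, retries: int = 3):
        self.fetch = fetch
        self.delay_s = delay_s   # politeness delay before a retry
        self.retries = retries
        self.store = {}          # stand-in for the raw-data store (MongoDB)

    def download(self, urls):
        for url in urls:
            for _attempt in range(self.retries):
                try:
                    self.store[url] = self.fetch(url)
                    break
                except IOError:
                    time.sleep(self.delay_s)  # back off, then retry
        return self.store
```

Injecting the fetcher keeps the bot testable and lets the same loop drive either a headless browser or a plain HTTP client.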
  • Index the scraped content and apply the necessary analyzers to support rich searches and facet queries.
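To make the search-and-facet requirement concrete, here is a toy inverted index with a single facet field. The real system would use Elasticsearch with proper text analyzers; this pure-Python sketch (all names hypothetical) lowercases and splits on whitespace only.

```python
from collections import defaultdict

class MiniIndex:
    """Toy inverted index with one facet field (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids
        self.docs = {}                    # doc id -> facet value

    def add(self, doc_id, text, phase):
        self.docs[doc_id] = phase
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        """Return sorted ids of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))

    def facet_counts(self, term):
        """Count matching documents per facet value (e.g. trial phase)."""
        counts = defaultdict(int)
        for doc_id in self.postings.get(term.lower(), set()):
            counts[self.docs[doc_id]] += 1
        return dict(counts)
```

A facet query then answers questions like "of the trials mentioning hypertension, how many are in each phase?" in a single pass over the postings.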
  • A RESTful API with the following capabilities:
    1. API Lifecycle Management
    2. API Governance
    3. Developer enablement
    4. API management solution and its cost
    5. Manageable solution pricing
    6. Reliability
    7. Scalability
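As a small illustration of the API lifecycle management capability listed above, the sketch below registers versioned routes and lets an old version be deprecated while it still serves traffic. All names are hypothetical; in the proposed architecture this policy would live in the Apache Camel gateway rather than in application code.

```python
class ApiRegistry:
    """Sketch of versioned-route registration for API lifecycle management."""

    def __init__(self):
        # (version, path) -> {"handler": callable, "deprecated": bool}
        self.routes = {}

    def register(self, version, path, handler):
        self.routes[(version, path)] = {"handler": handler, "deprecated": False}

    def deprecate(self, version, path):
        """Mark a route as deprecated without removing it."""
        self.routes[(version, path)]["deprecated"] = True

    def call(self, version, path, *args):
        entry = self.routes.get((version, path))
        if entry is None:
            raise KeyError(f"no route for {version} {path}")
        body = entry["handler"](*args)
        # Deprecated routes still work but carry a warning header,
        # giving client developers time to migrate to the new version.
        headers = {"Warning": "deprecated"} if entry["deprecated"] else {}
        return body, headers
```

Keeping deprecated versions callable but flagged is one common way to reconcile the lifecycle-management and developer-enablement requirements.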


  • The proposed architecture should be based on open-source software and a non-Hadoop stack, since the data exhibits more variety than velocity or volume.
  • The NLP component should be based on Python, given the customer's data science skill set.
  • None of the components in the architecture should rely on any particular cloud service, due to data confidentiality.


  • All the components are open source and do not rely on any cloud services.
  • Purpose of the specific components used in the architecture:
    1. Selenium & Nutch for scraping
    2. Logstash to process logs
    3. Apache Camel for mediation and routing for both data processing and as API Gateway
    4. Talend for ingestion mechanism and to perform mapping of input and output fields
    5. NLTK & scikit-learn for natural language processing, covering entity identification and resolution
    6. Storm for processing the messages from Kafka sent by the respective ingestion layers
    7. MySQL to store user preferences and metadata
    8. MongoDB to store raw and transient data, and to track data lineage
    9. Elasticsearch for end-user search, log aggregation, and usage metrics
    10. LDAP & JOSSO for security and single sign-on
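To show how the ingestion and processing components above fit together, here is a tiny single-process stand-in for the Kafka-to-Storm flow: the ingestion layer publishes records onto a queue and a consumer drains it, applying a transform before the result would be persisted (MongoDB for raw data, Elasticsearch for search). The function and its arguments are hypothetical and purely illustrative.

```python
from queue import Queue

def run_pipeline(records, transform):
    """Single-process sketch of the Kafka -> Storm message flow."""
    topic = Queue()              # plays the role of a Kafka topic
    for rec in records:          # the ingestion layer publishing records
        topic.put(rec)

    processed = []               # stand-in for the downstream data store
    while not topic.empty():     # Storm-like consumer draining the topic
        processed.append(transform(topic.get()))
    return processed
```

In the real architecture the producer and consumer run as separate services, so the queue also acts as a buffer that decouples ingestion rate from processing rate.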