Treselle systems services

Big Data


Over the past few years, businesses across all industries have been collecting massive amounts of data – be it machine, human, sensor, or smart device generated. In addition to these, businesses also have structured data in their enterprise legacy systems as well as in their traditional data stores. When you add social, web, and partner data on top, there is a rich trove of information that businesses can use to come up with meaningful and valuable insights – whether it is predictive analytics, targeted marketing, churn analytics, fraud detection, or others. Lately, newer technologies have come to the fore to help perform the analysis cost-effectively.

We strongly believe that businesses that consider Big Data as a strategic corporate asset and methodically exploit it will thrive. The data is there and the technology and tools are available. The questions are: what to do, how to do, and where to start?

We can help. For more than 4 years, one of our teams has been building a SaaS-based Big Data application for one of our clients in the financial services industry in San Francisco, US. We have the experience, skills, and resources to turn your business data into business insights. Contact us.


  • Big Data Strategy & Roadmap

  • Data Lake Architecture

  • Data Science

  • Big Data Infrastructure Management

  • Big Data Visualization

  • Big Data Engineering

  • Internet of Things (IoT)

Businesses are increasingly realizing the benefits of big data but are often unsure how and where to start. A big data strategy tailor-made to specific business needs, one that links to the organization’s business strategy and supports its initiatives, is crucial. Treselle’s big data strategy service guides the organization through the process of breaking down its business strategy and initiatives into potential big data use cases and the supporting data and analytics requirements. The high-level phases of the big data strategy are as follows:

  • Initiate Engagement: Understand the business mission, vision, pain points, and drivers for big data adoption
  • Identify Big Data use cases & requirements: Identify potential Big Data use cases based on interviews with stakeholders and define requirements around Big Data governance and security
  • Assess current data capabilities: Assess existing data capabilities, tools, and technologies, and identify the gaps in order to propose an appropriate strategy that coexists with current data capabilities and investments
  • Define Target state: Define the target state based on organizational drivers and the use cases identified, and assess the gaps between the current and target states
  • Develop Big Data Roadmap: Develop the Big Data roadmap in phases, starting from the analytical foundation and progressing to descriptive, diagnostic, and predictive analytical capabilities
  • Define Big Data Solution: Define the solution approach and propose an appropriate architecture based on the business use cases and effective use of existing technology investments
  • Execute PoC & refine next steps: Execute a Big Data proof of concept based on one or two use cases in agile mode, solicit feedback, document learnings, and define the next steps in the roadmap
Executing a successful big data strategy requires a tailor-made data lake architecture that makes it possible to store all the data, ask business questions of it, identify patterns and insights, and uncover new variables and metrics that are better predictors of business performance. Architecting a Data Lake should take business needs and data strategy into account so that appropriate tools and technologies can be put in place, rather than forcing the lake to be built on the Hadoop ecosystem.

Treselle’s deep expertise in Data Lake architecture helps us understand each client’s business needs and propose a scalable Data Lake architecture that protects the client’s existing investments. At a very high level, a typical Data Lake architecture contains the following layers:

  • Ingestion Layer: This layer is responsible for ingesting data from a variety of sources, including structured, unstructured, and semi-structured data
  • Data Storage Layer: This layer is responsible for storing raw, transient, and refined data as the ingested data goes through multiple transformations
  • Processing Layer: This layer is responsible for processing the data: cleansing, filtering, normalizing, correlating with other data, performing the necessary aggregations, and finding insights
  • Consumption Layer: This layer allows data consumers to consume the data that is available in the Data Lake
  • Management Layer: This layer is responsible for operational and data management tasks that are common across all layers, including Data Governance & Security, Metadata, and Information Lifecycle Management
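At a glance, the flow through these layers can be sketched in a few lines of plain Python. This is a toy illustration only; the function names, zone names, and record fields below are invented for the example, not taken from any particular product.

```python
# Toy sketch of the Data Lake layers: ingest -> store (zones) -> process -> consume.
# All names and fields here are hypothetical, for illustration only.

def ingest(raw_records):
    """Ingestion layer: accept heterogeneous input, dropping empty payloads."""
    return [r for r in raw_records if r]

def store(records, zone, lake):
    """Storage layer: keep data in named zones (raw / transient / refined)."""
    lake.setdefault(zone, []).extend(records)
    return lake

def process(records):
    """Processing layer: cleanse and normalize each record."""
    return [{"name": r["name"].strip().title(), "value": float(r["value"])}
            for r in records]

def consume(lake, zone):
    """Consumption layer: hand refined data to downstream consumers."""
    return lake.get(zone, [])

lake = {}
raw = [{"name": " alice ", "value": "10"}, None, {"name": "BOB", "value": "20"}]
cleaned = ingest(raw)
store(cleaned, "raw", lake)
store(process(cleaned), "refined", lake)
result = consume(lake, "refined")
```

In a real Data Lake the same separation of concerns holds, with each function replaced by a dedicated tool (e.g. an ingestion framework, zoned object storage, a processing engine, and query endpoints).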
Data science is the art and science of acquiring knowledge through data. Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to make decisions, predict the future, understand the past and present, and create new products and opportunities. It is about using data to gain new insights that you would otherwise have missed. With the sheer volume of data collected in various forms from different sources, often in very unorganized formats, it is impossible for a human, or a tool such as Excel, to parse it all and find insights.


Treselle’s expertise in data science & machine learning algorithms across multiple languages such as R, Python, and Spark ML allows us to provide specific solutions to address the unique challenges that businesses face. Treselle has helped clients perform machine learning analyses such as those listed below:

  • Regression Analysis: This kind of analysis helps us understand how the typical value of the dependent variable changes when any one or more of the independent variables is varied while the other independent variables are held fixed. It includes algorithms such as simple linear regression, multiple linear regression, polynomial regression, support vector regression, decision tree regression, and random forest regression, as well as evaluating different regression models for better performance
  • Clustering Analysis: This kind of analysis involves grouping sets of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. Cluster analysis can be achieved by various algorithms such as K-means clustering, hierarchical clustering, fuzzy K-means clustering, model-based clustering, and topic modeling using LDA
  • Classification Analysis: Classification is the process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Classification algorithms are at the heart of what is called predictive analytics, whose goal is to build automated systems that make decisions replicating human judgment. This includes algorithms such as logistic regression, K-NN, Fisher’s linear discriminant analysis, support vector machines, naive Bayes, decision tree, and random forest classification
  • Recommendation Systems: These systems produce recommendations using collaborative filtering, content-based filtering, or a combination of both called the hybrid approach. Collaborative filtering methods collect and analyze a large amount of information on users’ behaviors, activities, or preferences and predict what users will like based on their similarity to other users. Content-based filtering methods are based on a description of the item and a profile of the user’s preferences. Hybrid approaches can be implemented by making content-based and collaborative predictions separately and then combining them, by adding content-based capabilities to a collaborative approach, or by unifying the approaches into one model
  • Association Rule Learning: Predictive analytics as a mining tool also seeks to discover hidden relationships among items in your data, known as association rules. Some well-known algorithms are Apriori, Eclat, and FP-Growth, but these only do half the job, since they are algorithms for mining frequent itemsets; a further step is needed to generate rules from the frequent itemsets found in a database
  • Graph Analysis: While graph analysis is most commonly used to identify clusters of friends, uncover group influencers or advocates, and make friend recommendations on social media networks, it has other use cases such as graph-based search, master data management, identity & access management, entity identification & linkage, identifying hidden patterns and insights, exploring causes and effects, etc.
  • Trend Analysis: One of the most basic yet very powerful exploratory analytics, trend analysis can quickly uncover trends and events that tend to happen together or at some interval. It is a fundamental visualization technique for spotting patterns, trends, relationships, and outliers across a time series of data, and it yields mathematical models for each of the trend lines that can be flagged for further investigation
  • Geo Spatial Analysis: This includes techniques for analyzing geographical activities and conditions using a business entity’s topological, geometric, or geographic properties. Any data that can be associated with latitude and longitude enables interesting use cases such as whitespace analysis, market and sales penetration, geographical reach, competition analysis, saturation analysis, etc. Geographical analysis can be combined with trend analysis and external sources such as BLS and Census data to identify changes in market patterns across the organization’s key markets
  • Text Analytics & Natural Language Processing: There is huge potential in unstructured data, as it accounts for an estimated 80% of enterprise data. Text analytics & NLP are powerful techniques for mining text data and gleaning insights from the wealth of internal customer, product, social, and operational data. Typical text mining techniques and algorithms include text categorization, text clustering, concept/entity extraction, taxonomy production, sentiment analysis, document summarization, and entity relation modeling
  • Entity identification, linkage & disambiguation: This is one of the most challenging tasks when normalizing entities that come from multiple data silos to construct a 360-degree view, as different systems refer to the same thing, such as a customer or member, in different formats. It becomes even more challenging when this information is hidden in unstructured formats such as text, and it requires complex machine learning techniques to identify, disambiguate, and link the entities during the normalization process
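To make the first item above concrete, here is a minimal sketch of simple linear regression using the closed-form least-squares solution, in plain Python with made-up numbers. This is an illustration only; production work would typically use R, scikit-learn, or Spark ML as mentioned above, and the ad-spend/revenue data below is hypothetical.

```python
# Simple linear regression (ordinary least squares), fit by the closed form:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y ~ slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (x) vs. revenue (y)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
```

Multiple, polynomial, and regularized variants follow the same idea with more terms, which is why library implementations are preferred beyond the single-variable case.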
A unique big data strategy and architecture needs custom Big Data infrastructure to be successful. A well-planned infrastructure should be put in place based on the volume, velocity, and variety of the data needs. Treselle has helped clients set up infrastructure using well-known distributions such as Hortonworks, Cloudera, and Amazon AWS, as well as non-Hadoop environments.

  • Hortonworks: HDP (Hortonworks Data Platform) is based entirely on open source projects and is managed by Ambari, which provides an entire Hadoop ecosystem along with controls to choose the necessary components based on business needs. It is possible to set up a pure batch, streaming, or hybrid mode, enabling businesses to set up the required Data Lake architecture
  • Cloudera: Another well-known Hadoop distribution, built on a slightly different stack than HDP, that enables businesses to set up the required Data Lake architecture. It has proprietary products such as Cloudera Manager Enterprise for infrastructure management, with more advanced functionality than Ambari, and Cloudera Navigator for data governance capabilities
  • Amazon AWS: A Data Lake architecture can be built entirely in the cloud using Amazon AWS managed services such as API Gateway, Lambda, Elastic MapReduce (EMR), Spark, Kinesis, DynamoDB, S3, ElasticSearch, etc. Businesses that already use AWS services and don’t require a lot of control over the infrastructure can choose AWS
  • Custom Infrastructure: Not all Big Data needs require the Hadoop ecosystem; many business use cases can be handled with custom infrastructure such as standalone Spark clusters, which satisfy most big data needs including batch, streaming, machine learning, graph processing, and programming interfaces for Python and R. SparkSQL can connect to a variety of data stores in RDBMS, NoSQL, S3, etc. Apache Drill is another interesting technology that can infer schema on the fly and perform joins across multiple data stores such as HDFS, S3, RDBMS, and NoSQL
  • Infrastructure management & migration: Treselle can help businesses manage their existing Big Data infrastructure as well as migrate to the most suitable infrastructure based on their business needs
Visualizing large data is challenging. Along with the increase in volume, new kinds of datasets are becoming more and more mainstream. The need to analyze user comments, sentiments, customer calls, and various other unstructured data has resulted in the use of new kinds of visualizations. Graph databases, streaming data, geo-maps, and machine learning models are some examples of how things are changing because of velocity and variety. Treselle has helped clients with big data visualization needs ranging from custom-built solutions and open source notebooks to well-known commercial visualization products targeted at different user groups such as business analysts, data science teams, and curators.

  • Commercial Products: Tableau & Qlik Sense are some of the most popular visualization products; they can integrate with the Hadoop ecosystem using Hive or Spark connectors and with databases via JDBC/ODBC-capable drivers. Apache Drill can be used to perform cross-join queries across multiple SQL & NoSQL databases and connect with Tableau using a JDBC driver to create interesting visualizations at scale while the data stays in its original sources
  • Open Source Visualizations: The visualization needs of data science teams and curators are quite different, as they are mostly responsible for performing exploratory analysis on large volumes of unrefined data to find insights and patterns. These visualizations are not based on pre-materialized views and have to run on clusters whenever an analysis is performed. Tools such as Zeppelin, Jupyter, and Shiny with R are some of the preferred choices for these tasks
  • Custom Web-based Visualizations: Businesses are building interesting web-based SaaS products with stunning visualizations on huge volumes and varieties of data using web stacks such as Node.js, AngularJS, Java/Python web service APIs, Highcharts, D3.js, Google Charts, plotly, and others
Big Data Engineering is a specialized field that helps businesses ingest, store, and transform data, perform analytics, run machine learning models, identify patterns and insights, manage security and governance, and prepare the data for visualization purposes. Treselle has helped clients build different big data engineering solutions based on the proposed architecture with a wide variety of tools and technologies. Treselle can help businesses either with a complete Big Data Engineering implementation or with any part/layer of the stack below:

  • Ingestion – batch or streaming
  • Data Storage – SQL, NOSQL, Cloud storage
  • Data Processing – wrangling, cleansing, normalizing, filtering, transformations, ETL
  • Analytics – aggregation, correlation, statistical computation, machine learning models
  • Operations – manage & monitor clusters, data governance, data security, information lifecycle management, metadata (business, technical, and operational)
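The Data Processing and Analytics layers above can be sketched end to end in a few lines, using only the Python standard library. The row fields and region values here are hypothetical; real pipelines would implement the same steps with ETL tooling or a distributed engine.

```python
# Minimal processing + analytics sketch: cleanse -> normalize -> aggregate.
from collections import defaultdict

def etl(rows):
    # Cleansing: drop rows with a missing amount; normalize region names
    clean = [{"region": r["region"].strip().lower(), "amount": float(r["amount"])}
             for r in rows if r.get("amount") not in (None, "")]
    # Analytics: aggregate total amount per region
    totals = defaultdict(float)
    for r in clean:
        totals[r["region"]] += r["amount"]
    return dict(totals)

# Hypothetical input rows with inconsistent casing and a missing value
rows = [
    {"region": " West ", "amount": "100"},
    {"region": "west", "amount": "50"},
    {"region": "East", "amount": None},
    {"region": "east", "amount": "75"},
]
totals = etl(rows)
```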

Treselle has experience implementing a variety of Big Data solutions, from non-distributed to distributed computing systems and commercial products.

  • Non-distributed Computing Systems: Businesses that have a lot of variety but not volume or velocity don’t need complicated distributed computing systems such as Hadoop or Spark. Different tools and technologies can be used to create a loosely coupled, highly cohesive data processing pipeline based on open source technologies such as Camel, Talend, Nutch, Java/Python/R/Scala, ActiveMQ/RabbitMQ, OpenRefine, etc.
  • Distributed Computing Systems: Hadoop, Spark, and AWS cloud-based ecosystems fall into this category, which defines specific tools and technologies from ingestion to visualization that can process data in batch, streaming, or hybrid mode by combining both systems
  • Commercial Products: Datameer, Trifacta, and Talend Big Data Enterprise are some well-known commercial products that can be used together to solve common business problems without a lot of engineering and data science effort
IoT doesn’t always have to be sensors producing data from connected devices; it can be anything that sends streaming data, such as a weather station producing min & max temperatures, click events from an AWS IoT button, large volumes of log events generated by machines such as computers and other devices, stock ticker prices, etc. Treselle has worked with clients on different proofs of concept, depending on business needs and use cases, and processed near real-time data using AWS IoT infrastructure and Spark-based streaming infrastructure.

  • Spark Streaming: Spark Streaming is one of the powerful extensions provided by Spark for consuming and processing events produced by various data sources in near real time. It extends the Spark core architecture with an architecture based on microbatching, where live/streaming data is received and collected from various data sources and divided into a series of deterministic microbatches. Data can be ingested from Kafka, Flume, or Kinesis and processed by Spark Streaming to perform the necessary aggregations at the microbatch level
  • AWS IoT: AWS has created a reference model and a platform that enable businesses to connect devices to its managed services, store and process the data, and let applications interact with devices. AWS IoT mainly consists of a rules engine, a registry, and device shadows; it processes data via the Kinesis platform (Kinesis Firehose, Kinesis Streams, and Kinesis Analytics), stores data in a variety of locations such as S3, Aurora, and DynamoDB, and sends event notifications to downstream data pipelines using Lambda functions. The entire data pipeline can be governed by VPC and IAM authorization and authentication mechanisms for security
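The microbatching idea behind Spark Streaming can be illustrated without Spark itself: incoming events are grouped into fixed-size batches, and an aggregation runs per batch. This plain-Python sketch uses hypothetical sensor events; in real Spark Streaming, batches are defined by a time interval rather than a count, and ingestion comes from sources like Kafka or Kinesis.

```python
# Plain-Python simulation of microbatching: split a stream into batches,
# then run an aggregation (max temperature) per microbatch.
from itertools import islice

def microbatches(stream, batch_size):
    """Yield successive lists of up to batch_size events from an iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical sensor events arriving as a stream
events = [{"sensor": "a", "temp": t} for t in (20, 22, 21, 25, 24, 23, 26)]

# Per-batch aggregation, analogous to a reduce over each microbatch
batch_maxes = [max(e["temp"] for e in batch) for batch in microbatches(events, 3)]
```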


Financial Client Embraces Signal Relationships with Graph Database
The client’s curators and user base had difficulty understanding why the system raised a particular signal from vast amounts of data, and how that signal correlated or related to other signals. They also wanted multiple-depth traversal to find insights into the signals. We migrated the client’s proprietary, disconnected data, previously maintained in a relational database, to a graph database using Neo4j. This enabled a traversable structure to comprehensively evaluate relationships in a highly scalable, real-time fashion. We also implemented a powerful visualization wrapper on top of it with D3.js to visualize complex, unforeseen relationships among signals.
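The multiple-depth traversal the client needed can be illustrated with a depth-limited breadth-first search over a tiny in-memory graph. The node names and edges below are invented for the sketch; in the actual system, Neo4j served these traversals over millions of nodes and relationships.

```python
# Depth-limited BFS over a toy graph of signal relationships (hypothetical data).
from collections import deque

graph = {
    "signal_a": ["signal_b", "signal_c"],
    "signal_b": ["signal_d"],
    "signal_c": ["signal_d", "signal_e"],
    "signal_d": ["signal_f"],
    "signal_e": [],
    "signal_f": [],
}

def related_signals(graph, start, max_depth):
    """Return signals reachable from start within max_depth hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached

within_two_hops = related_signals(graph, "signal_a", 2)
```

A graph database performs this kind of traversal natively at query time, which is why it outperformed the relational, join-heavy equivalent at depth 4 and above.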

Business benefits delivered:

  • Increased the signal identification from 700,000 to 7.5 million nodes and relationships.
  • Improved the signal relationship traversal by more than 500% with depth level 4 and above.
  • Reduced the latent writes across de-normalized stores by 100%.
  • Implemented the ability to create signal nodes and relationships on the streaming data.
  • Reduced curator’s time from hours to seconds to understand and validate why the system raised a signal.
Global Client Performs Statistical Analysis on Large Energy, Oil, & Gas Datasets
Our global client’s analysts were spending between 4 and 5 weeks analyzing raw data to find insights and extract value. It had become a tedious task for the analysts to perform different statistical analyses and data manipulations on these datasets. Further, they were unable to cope with the amount of data coming in from multiple partners. The analysts were using Microsoft Excel to perform these statistical analyses on samples instead of the entire data, and thus were losing a lot of insights. Treselle’s engineering team implemented a web solution with R, Shiny, and D3.js that enabled the analysts to switch between datasets on the fly with ease and perform data manipulation to achieve the desired quality and shape of data for analysis. Our solution enabled our client to perform exploratory data analysis, including numerical summary analysis, graphical analysis for insights, and distribution analysis. It also helped them understand correlations between variables.

Business benefits delivered:

  • Reduced the time taken to find insights from 5 weeks to days.
  • Increased partner dataset processing by more than 200%.
  • Reduced data clutter by 100% by applying various filters and removing irrelevant fields.
  • Included the ability to perform data manipulation and analysis on the web as well as share with other analysts and data partners.
  • Increased confidence in the data analysis by 100% as our client performed the analysis on the entire data instead of just a sample of data.
Global Client Enhances Data Aggregation Capabilities
A global financial services company chose us to help develop a Big Data SaaS application. We enhanced and improved data aggregation capabilities with multiple custom adapters and plugins. We designed the application to ingest data from various sources. These sources have structured, unstructured, and semi-structured data types. We processed HTML, PDF, CSV, JSON, XBRL, JDBC, RSS, Email, Twitter and other formats with deep content pattern algorithms to extract the needed information.

Business benefits delivered:

  • Increased complex ingestion of unstructured data source feeds from 200+ to 600+.
  • Introduced the ability to monitor and react to content feed layout changes at the source from 100 to 1200 domains.
  • Enhanced the range of structured partner data feeds from a mere 400 to 150 million records.
  • Reduced the turnaround time to load partner data feeds from few weeks to just 2 days.
  • Increased the frequency of content feed extraction for large sources from weekly to daily.
  • Added the ability to automatically ingest sources that require user interaction, such as filling in a form or performing keyword searches.
Client Finds Deep Insights with Natural Language Processing
We implemented and enhanced text-engineering capabilities for a client. This involved entity recognition, information extraction, stemming, tokens/tokenizers, part-of-speech tagging, sentence detection, numeric range valuations, and other text analyses to find deep insights in the extracted content and correlate them with the client’s proprietary web ontology and data model.

Business benefits delivered:

  • Increased the entity identification from 700,000 to 7.5 million nodes and relationships.
  • Enhanced the ability to score and rank the content over several thousand significant lexicon words.
  • Introduced the ability to map text-based numerals to numeric for performing numeric range valuations that tremendously improved our client’s platform offerings.
  • Increased the content quality by 50% and reduced the noise by 99%.
Global Healthcare Big Data Client Munges Data with OpenRefine
A client who processes over 4,000 data sources had challenges with entity identification, disambiguation, and linkage, as the same information about healthcare practitioners came in different formats. The client’s data science team used Excel to perform different data manipulations and transformations by exporting ingested data from MongoDB, applying different pre-built macros, and then loading it back into the system in batch mode. This was painful, tedious, slow, and error-prone for large datasets. The client wanted a system that could provide these data munging capabilities and integrate naturally with their data management platform, so that entities would be identified and linked properly within the massive 40 million nodes and relationships managed in Neo4j. On a call with the client, Treselle Systems’ big data engineering team quickly understood the problem and suggested the following technologies and integration points:

  • Use R to perform data manipulation, munging, cleansing, transformation, and other tasks, then integrate with the existing system via Rserve, reusing pre-developed R scripts in real time.
  • Use GATE (General Architecture for Text Engineering) with ANNIE, JAPE Plus, and over a dozen other plugins to perform data manipulation, munging, cleansing, transformation, and other tasks, then integrate with the existing system via the Java Plugin Framework in real time.
  • Use the Hadoop ecosystem with Pig and other User Defined Functions to do the data transformations in batch mode.
  • Use OpenRefine (formerly Google Refine), which is great at working with messy data out of the box.
  • Finally, use bigger products such as Talend and Pentaho data integration.
The client asked us to do a PoC (Proof of Concept) using OpenRefine, as the others were a little overkill at that point. The client’s head of engineering very much liked our implementation and OpenRefine’s out-of-the-box capabilities for working with messy data.

Business benefits delivered:

  • Reduced Data Scientist’s time from days to hours to perform various data munging capabilities with out of the box and custom built GREL Macros.
  • Ability to transition to any number of steps backward and forward was a matchless feature compared to Excel sheets.
  • Reduced the time for saving and reusing of GREL macros on different datasets from several hours to minutes.
  • Ability to enrich the data with the client’s internal APIs reduced 15 manual steps in Excel to 1 automated step.
  • Custom integration of GREL macros with data processing layer saved several thousands of dollars from errors due to manual edits and incorrect macros in Excel, manual upload/download of data into MongoDB, custom integration of new macros or formulas in Excel, and others.
  • This pilot was tested on 10 datasets in the client’s data pipeline from start to end within 2 weeks compared to 2 months using Excel and other manual processes.


  • MongoDB Shard Setup – Part II
    Apr, 2017

    This is the second part of the MongoDB Shard Setup series. In this blog, we discuss the basic concepts of implementing a shard with a replica set on a Windows instance. If you would like to use Linux, change the directory specifications and binary usage (mongod.exe to mongod) accordingly. read more


  • MongoDB Shard – Part I
    Apr, 2017

    Sharding is a method for distributing data across multiple machines. It supports deployments with very large data sets, high throughput operations, and horizontal scaling. MongoDB shards data at the collection level and distributes the collection data across the shards in the cluster. For aggregation operations running on multiple shards, if the operations do not require running on the database’s primary shard, they can route the results to any shard to merge them, thus avoiding overloading the primary shard for that database. Sharding divides the data across multiple servers and reduces the amount of data stored by each server. A sharded cluster can have a mixture of sharded and unsharded collections. read more


How can we help you accomplish your goals?