Treselle systems services

Big Data


Businesses increasingly recognize the benefits of big data but are often unsure how and where to start. A big data strategy tailored to specific business needs, one that links to the organization’s business strategy and supports the business, is crucial. Treselle’s big data strategy service guides the organization through breaking down its business strategy and initiatives into potential big data use cases and the supporting data and analytics requirements. The high-level phases of the big data strategy are as follows:

  • Initiate Engagement: Understand the business mission, vision, pain points, and drivers for big data adoption
  • Identify Big Data use cases & requirements: Identify potential Big Data use cases through stakeholder interviews and define requirements around Big Data governance and security
  • Assess current data capabilities: Assess existing data capabilities, tools, and technologies, and identify gaps in order to propose a strategy that coexists with current data capabilities and investments
  • Define Target state: Define the target state based on organizational drivers and the identified use cases, and assess the gaps between the current and target states
  • Develop Big Data Roadmap: Develop the Big Data roadmap in phases, starting from the analytical foundation and progressing to descriptive, diagnostic, and predictive analytical capabilities
  • Define Big Data Solution: Define the solution approach and propose an appropriate architecture based on the business use cases and effective use of existing technology investments
  • Execute PoC & refine next steps: Execute a Big Data proof of concept on one or two use cases in agile mode, solicit feedback, document lessons learned, and define the next steps in the roadmap

A successful big data strategy requires a tailor-made data lake architecture that makes it possible to store all the data, ask business questions, identify patterns and insights, and uncover new variables and metrics that better predict business performance. Architecting a data lake should be driven by business needs and data strategy so that the appropriate tools and technologies are put in place, rather than forcing the lake to be built on the Hadoop ecosystem.


Treselle’s deep expertise in Data Lake architecture helps clients understand their business needs and proposes a scalable Data Lake architecture while protecting clients’ existing investments. At a high level, a typical Data Lake architecture contains the following layers:

  • Ingestion Layer: This layer is responsible for ingesting data from a variety of sources, including structured, unstructured, and semi-structured data
  • Data Storage Layer: This layer is responsible for storing raw, transient, and refined data as the ingested data goes through multiple transformations
  • Processing Layer: This layer is responsible for processing the data: cleansing, filtering, normalizing, correlating it with other data, performing the necessary aggregations, and finding insights
  • Consumption Layer: This layer allows data consumers to consume the data that is available in the Data Lake
  • Management Layer: This layer is responsible for operational and data management tasks that are common across all layers, including Data Governance & Security, Metadata, and Information Lifecycle Management
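The layered flow above can be sketched in miniature. The following Python sketch models each layer as a plain function over in-memory data; all names here are illustrative and not part of any specific product:

```python
# Minimal sketch of the Data Lake layers: ingestion, raw storage,
# processing, and consumption, using plain Python over in-memory data.

def ingest(sources):
    """Ingestion layer: pull records from heterogeneous sources."""
    records = []
    for source in sources:
        records.extend(source)
    return records

def store_raw(records, lake):
    """Storage layer: land ingested records untouched in a 'raw' zone."""
    lake["raw"] = list(records)

def process(lake):
    """Processing layer: cleanse raw data into a 'refined' zone."""
    lake["refined"] = [r.strip().lower() for r in lake["raw"] if r.strip()]

def consume(lake):
    """Consumption layer: expose refined data to downstream consumers."""
    return lake["refined"]

lake = {}
store_raw(ingest([["  Alice ", ""], ["BOB"]]), lake)
process(lake)
print(consume(lake))  # ['alice', 'bob']
```

In a real lake the zones would be directories or buckets (e.g. HDFS or S3 prefixes) rather than dictionary keys, but the separation of concerns is the same.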

Data science is the art and science of acquiring knowledge through data: taking data, using it to acquire knowledge, and then using that knowledge to make decisions, predict the future, understand the past and present, and create new products and opportunities. Data science is about using data to gain insights that would otherwise be missed. With the sheer volume of data collected in various forms from different sources, often in very unorganized formats, it is impossible for a human or a spreadsheet tool to parse it all and find insights.


Treselle’s expertise in data science & machine learning algorithms across multiple languages such as R, Python, and Spark ML allows us to provide specific solutions that address the unique challenges businesses face. Treselle has helped clients perform machine learning analyses such as those listed below:

    • Regression Analysis: This kind of analysis helps in understanding how the typical value of the dependent variable changes when any one of the independent variables is varied while the others are held fixed. It includes algorithms such as simple linear regression, multiple linear regression, polynomial regression, support vector regression, decision tree regression, and random forest regression, as well as evaluating different regression models for better performance
    • Clustering Analysis: This kind of analysis involves grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups. Cluster analysis can be achieved by various algorithms such as K-means clustering, hierarchical clustering, fuzzy K-means clustering, model-based clustering, and topic modeling using LDA
    • Classification Analysis: Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive analytics is to build automated systems that can make decisions to replicate human judgment. This includes algorithms such as Logistic Regression, K-NN, Fisher’s linear discriminant analysis, Support Vector Machine, Naive Bayes, Decision Tree, and Random Forest Classification
    • Recommendation Systems: These systems produce recommendations using collaborative filtering, content-based filtering and a combination of both called the hybrid approach. Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences and predicting what users will like based on their similarity to other users. Content-based filtering methods are based on a description of the item and a profile of the user’s preference. Hybrid approaches can be implemented by making content-based and collaborative-based predictions separately and then combining them or by adding content-based capabilities to a collaborative-based approach or by unifying the approaches into one model
    • Association Rule Learning: Association rule mining seeks to discover hidden relationships among items in your data. Well-known algorithms such as Apriori, Eclat, and FP-Growth do only half the job, since they mine frequent itemsets; a further step is needed to generate association rules from the frequent itemsets found in a database
    • Graph Analysis: While graph analysis is most commonly used to identify clusters of friends, uncover group influencers or advocates, and make friend recommendations on social media networks, graph analysis has other use cases such as graph-based search, master data management, identity & access management, entity identification & linkage, identify hidden patterns and insights, explore causes and effects, etc
    • Trend Analysis: One of the most basic yet powerful exploratory analytics, trend analysis can quickly uncover trends and events that tend to happen together or at regular intervals. It is a fundamental visualization technique for spotting patterns, trends, relationships, and outliers across time-series data, and it yields mathematical models for each trend line that can be flagged for further investigation
    • Geo Spatial Analysis: This includes techniques for analyzing geographical activities and conditions using a business entity’s topological, geometric, or geographic properties. Anything that can be associated with latitude and longitude enables interesting use cases such as whitespace analysis, market and sales penetration, geographical reach, competition analysis, and saturation analysis. Geographical analysis can be combined with trend analysis and external sources such as BLS and Census data to identify changes in market patterns across the organization’s key markets
    • Text Analytics & Natural Language Processing: There is huge potential in unstructured data, as it accounts for about 80% of enterprise data. Text analytics & NLP are powerful techniques for mining text data and gleaning insights from the wealth of internal customer, product, social, and operational data. Typical text mining techniques and algorithms include text categorization, text clustering, concept/entity extraction, taxonomy production, sentiment analysis, document summarization, and entity relation modeling
    • Entity identification, linkage & disambiguation: This is one of the most challenging tasks when normalizing entities that come from multiple data silos to construct a 360-degree view, as different systems refer to the same entity (such as a customer or member) in different formats. It becomes even more challenging when this information is hidden in unstructured text. Complex machine learning techniques are needed to identify, disambiguate, and link entities during the normalization process
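As a concrete illustration of one of the algorithms above, here is a minimal pure-Python sketch of K-means clustering on one-dimensional points; a real project would use scikit-learn’s KMeans or Spark ML instead, and the data here is made up for the example:

```python
# Toy K-means: repeatedly assign each point to its nearest centroid,
# then move each centroid to the mean of its assigned points.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centroids)

# Two obvious groups: points around 1-2 and points around 10-11.
print(kmeans_1d([1.0, 1.5, 2.0, 10.0, 10.5, 11.0], centroids=[1.0, 10.0]))
# [1.5, 10.5]
```

The same assign-then-recompute loop generalizes to higher dimensions by swapping the absolute distance for Euclidean distance.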

A unique big data strategy and architecture needs a custom Big Data infrastructure to succeed. A well-planned infrastructure should be put in place based on the volume, velocity, and variety of the data. Treselle has helped clients set up infrastructure using well-known distributions such as Hortonworks, Cloudera, and Amazon AWS, as well as non-Hadoop environments.

    • Hortonworks: HDP (Hortonworks Data Platform) is based entirely on open source projects managed by Ambari, which provides the whole Hadoop ecosystem and the controls to choose the necessary components based on business needs. It is possible to set up a pure batch, streaming, or hybrid mode, enabling businesses to build the required Data Lake architecture
    • Cloudera: Another well-known Hadoop distribution, built on a slightly different stack than HDP, that also enables businesses to build the required Data Lake architecture. It offers proprietary products such as Cloudera Manager Enterprise for infrastructure management, with more advanced functionality than Ambari, and Cloudera Navigator for data governance capabilities
    • Amazon AWS: A Data Lake architecture can be built entirely in the cloud using Amazon AWS managed services such as API Gateway, Lambda, Elastic MapReduce (EMR), Spark, Kinesis, DynamoDB, S3, ElasticSearch, etc. Businesses that already use AWS services and don’t require a lot of control over the infrastructure can choose AWS
    • Custom Infrastructure: Not all Big Data needs require the Hadoop ecosystem; many business use cases can be met with custom infrastructure such as standalone Spark clusters, which satisfy most big data needs including batch, streaming, machine learning, graph processing, and programming interfaces for Python and R. SparkSQL can connect to a variety of data stores in RDBMS, NoSQL, S3, etc. Apache Drill is another interesting technology that can infer schema on the fly and perform joins across multiple data stores such as HDFS, S3, RDBMS, NoSQL, etc
    • Infrastructure management & migration: Treselle can help businesses manage their existing Big Data infrastructure as well as migrate to the most suitable infrastructure based on their business needs

Visualizing large data is challenging. Along with the increase in volume, new kinds of datasets are becoming more and more mainstream. The need to analyze user comments, sentiment, customer calls, and other unstructured data has resulted in new kinds of visualizations. Graph databases, streaming data, geo-maps, and machine learning models are some examples of how things are changing because of velocity and variety.


Treselle has helped clients with big data visualization needs ranging from custom-built solutions and open source notebooks to well-known commercial visualization products targeted at different user groups such as business analysts, data science teams, curators, and others.

    • Commercial Products: Tableau and Qlik Sense are among the most popular visualization products; they can integrate with the Hadoop ecosystem using Hive or Spark connectors and with databases via JDBC/ODBC-capable drivers. Apache Drill can be used to perform cross-join queries across multiple SQL and NoSQL databases and connects to Tableau via a JDBC driver to create interesting visualizations at scale while the data remains in its original sources
    • Open Source Visualizations: The visualization needs of data science teams and curators are quite different, as they are mostly responsible for exploratory analysis on large volumes of unrefined data to find insights and patterns. These visualizations are not based on pre-materialized views and have to run on clusters when an analysis is performed. Tools such as Zeppelin, Jupyter, and Shiny with R are some of the preferred options for these tasks
    • Custom Web-based Visualizations: Businesses are building interesting web-based SaaS products with stunning visualizations on huge volumes and varieties of data using web stacks such as Node.js, AngularJS, Java/Python web service APIs, Highcharts, D3.js, Google Charts, Plotly, and others

Big Data Engineering is a specialized field that helps businesses ingest, store, and transform data; perform analytics; run machine learning models; identify patterns and insights; manage security and governance; and prepare data for visualization. Treselle has helped clients build a range of big data engineering solutions based on the proposed architecture, using a wide variety of tools and technologies. Treselle can help businesses with either a complete Big Data Engineering implementation or any part or layer of the stack below:

  1. Ingestion – batch or streaming
  2. Data Storage – SQL, NoSQL, Cloud storage
  3. Data Processing – wrangling, cleansing, normalizing, filtering, transformations, ETL
  4. Analytics – aggregation, correlation, statistical computation, machine learning models
  5. Operations – managing & monitoring clusters, data governance, data security, information lifecycle management, and metadata (business, technical, and operational)
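As a toy illustration of layers 3 and 4 above, the following Python sketch cleanses, normalizes, filters, and aggregates a small record set; the field names and values are made up for the example:

```python
# Data processing (layer 3) and a simple analytic aggregation (layer 4)
# over a toy record set with typical dirty-data problems.

records = [
    {"region": " East ", "sales": "100"},
    {"region": "west",   "sales": "250"},
    {"region": "East",   "sales": ""},     # dirty record: missing value
    {"region": "West",   "sales": "50"},
]

# Cleansing/normalizing: trim whitespace, unify case, cast types;
# filtering: drop records with no sales figure.
clean = [
    {"region": r["region"].strip().title(), "sales": int(r["sales"])}
    for r in records
    if r["sales"].strip()
]

# Analytics: aggregate sales per region.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

print(totals)  # {'East': 100, 'West': 300}
```

The same steps scale out directly: in Spark, for instance, the comprehension becomes a `filter`/`map` and the aggregation a `reduceByKey` or SQL `GROUP BY`.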

Treselle has experience implementing a variety of Big Data systems, from non-distributed to distributed computing systems and commercial products.

    • Non-distributed Computing Systems: Businesses with a lot of variety but little volume or velocity don’t need complicated distributed computing systems such as Hadoop or Spark. Various tools and technologies can be used to create a loosely coupled, highly cohesive data processing pipeline based on open source technologies such as Camel, Talend, Nutch, Java/Python/R/Scala, ActiveMQ/RabbitMQ, OpenRefine, etc.
    • Distributed Computing Systems: Hadoop, Spark, and AWS cloud-based ecosystems fall into this category, which defines specific tools and technologies from ingestion to visualization that can process data in batch, streaming, or hybrid mode by combining both
    • Commercial Products: Datameer, Dataiku, and Talend Big Data Enterprise are some well-known commercial products that can be used together to solve common business problems without a lot of engineering and data science effort
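The loosely coupled pipeline idea behind the non-distributed option can be sketched with an in-process queue standing in for a broker such as ActiveMQ or RabbitMQ; the stage names and the trivial uppercase “processing” step are illustrative:

```python
# Producer/consumer pipeline decoupled by a queue: the producer only
# knows the queue, the consumer only knows the queue, so stages can be
# swapped independently.

import queue
import threading

SENTINEL = None  # marks end of the stream

def producer(q, items):
    for item in items:
        q.put(item)
    q.put(SENTINEL)

def consumer(q, results):
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item.upper())  # stand-in processing stage

q = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(q, results))
worker.start()
producer(q, ["raw", "events"])
worker.join()
print(results)  # ['RAW', 'EVENTS']
```

Replacing `queue.Queue` with a broker client gives the same topology across processes or machines, which is what keeps such pipelines loosely coupled.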

IoT doesn’t always have to mean sensors producing data from connected devices; it can be anything that sends streaming data, such as a weather station reporting minimum and maximum temperatures, click events from an AWS IoT button, large volumes of log events generated by machines such as computers and other devices, or stock ticker prices. Treselle has worked with clients on various proofs of concept, depending on business needs and use cases, and processed near real-time data using AWS IoT infrastructure and Spark-based streaming infrastructure.

    • Spark Streaming: Spark Streaming is one of the powerful extensions provided by Spark for consuming and processing events produced by various data sources in near real-time. It extends the Spark core architecture with a micro-batching model, in which live/streaming data is received from various data sources and divided into a series of deterministic micro-batches. Data can be ingested from Kafka, Flume, or Kinesis and processed by Spark Streaming to perform the necessary aggregations at the micro-batch level
    • AWS IoT: AWS provides a reference model and a platform that enable businesses to connect devices to its managed services, store and process the data, and let applications interact with devices. AWS IoT mainly comprises a rules engine, a registry, and device shadows; data can be processed via the Kinesis platform (Kinesis Firehose, Kinesis Streams, and Kinesis Analytics), stored in a variety of locations such as S3, Aurora, and DynamoDB, and passed to downstream data pipelines by triggering Lambda functions. The entire data pipeline can be governed by VPC and IAM authentication and authorization mechanisms for security
  • Hadoop ecosystem: HDFS, Hive, Pig, YARN
  • Spark ecosystem: Spark core, SparkSQL, Spark Streaming, SparkML
  • Data Stores: RDBMS, MongoDB, Cassandra, Neo4j, OrientDB, ElasticSearch, Solr
  • Visualization: Zeppelin, Shiny + R, Jupyter, Tableau, Qlik
  • Ingestion: Sqoop, Flume, Kafka, Talend, StreamSets, NiFi
  • Governance & Workflows: Ambari, Atlas, Ranger, Falcon, Oozie, Airflow
  • SQL on Hadoop: Drill, SparkSQL, Presto, Dremio
  • Distribution: Hortonworks, Cloudera
  • AWS: EMR, IoT, Athena, Glue, Quicksight, Redshift, Lambda, DynamoDB, AWS ML, DataStage, and other AWS storage, message, and computing services



  • Custom Partitioning and Analysis using Kafka SQL Windowing
    Jan, 2018

    A custom partitioning technique is used to produce a particular type of message to a defined partition and to have the produced messages consumed by a particular consumer. By default, Apache Kafka distributes messages across partitions in round-robin fashion.

  • Crime Analysis Using H2O Autoencoders – Part 2
    Dec, 2017

    This blog is the second part of a two-part series on Crime Analysis using H2O Autoencoders. The H2O Autoencoders model is deployed into a real-time production environment by converting it into POJO objects using H2O functions.

  • Streaming Analytics using Kafka SQL
    Dec, 2017

    Kafka SQL, a streaming SQL engine for Apache Kafka by Confluent, is used for real-time data integration, data monitoring, and data anomaly detection. KSQL is used to read, write, and process Citi Bike trip data in real-time.


How can we help you accomplish your goals?