Over the past few years, businesses across all industries have been collecting massive amounts of data – be it machine, human, sensor, or smart device generated. In addition to these, businesses also have structured data in their enterprise legacy systems as well as in their traditional data stores. When you add social, web, and partner data on top, there is a rich trove of information that businesses can use to come up with meaningful and valuable insights – whether it is predictive analytics, targeted marketing, churn analytics, fraud detection, or others. Lately, newer technologies have come to the fore to help perform the analysis cost-effectively.
We strongly believe that businesses that treat Big Data as a strategic corporate asset and methodically exploit it will thrive. The data is there, and the technology and tools are available. The questions are: what to do, how to do it, and where to start?
We can help. For more than four years, one of our teams has been building a SaaS-based Big Data application for a client in the financial services industry in San Francisco, US. We have the experience, skills, and resources to turn your business data into business insights. Contact us.
- Initiate Engagement: Understand the business mission, vision, pain points and drivers for big data adoption
- Identify Big Data use cases & requirements: Identify potential Big Data use cases based on interviews with stakeholders, and define requirements around Big Data Governance and Security
- Assess current data capabilities: Assess existing data capabilities, tools, and technologies, and identify the gaps in order to propose an appropriate strategy that co-exists with current data capabilities and investments
- Define Target state: Define the target state based on organization drivers and the use cases identified, and assess the gaps between the current and target states
- Develop Big Data Roadmap: Develop Big Data roadmap in phases starting from analytical foundation to descriptive, diagnostic and predictive analytical capabilities
- Define Big Data Solution: Define solution approach and propose appropriate architecture based on business use cases and effective use of existing technology investment
- Execute PoC & refine next steps: Execute Big Data proof of concept based on 1 or 2 use cases in agile mode, solicit feedback, document learnings, and define next steps in the roadmap
Treselle’s deep expertise in Data Lake architecture helps clients understand their business needs and propose a scalable Data Lake architecture while protecting their existing investments. At a very high level, a typical Data Lake architecture contains the following layers:
- Ingestion Layer: This layer is responsible for ingesting data from a variety of sources, including structured, unstructured, and semi-structured data
- Data Storage Layer: This layer is responsible for storing raw, transient, and refined data as the ingested data goes through multiple transformations
- Processing Layer: This layer is responsible for processing the data: cleansing, filtering, normalizing, correlating it with other data, performing necessary aggregations, and finding insights
- Consumption Layer: This layer allows data consumers to consume the data that is available in the Data Lake
- Management Layer: This layer is responsible for operational and data management tasks that are common across all layers, including Data Governance & Security, Metadata, and Information Lifecycle Management
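The flow through these layers can be sketched in a few lines of Python. This is only an illustration of the layer responsibilities, not a real framework; all function names and the sample records below are invented:

```python
# Minimal sketch of the Data Lake layers as a linear pipeline.
# All function names and the sample data are illustrative only.

def ingest(sources):
    """Ingestion layer: pull raw records from heterogeneous sources."""
    return [record for source in sources for record in source]

def store(raw_records):
    """Storage layer: keep the raw copy; downstream steps add refined data."""
    return {"raw": list(raw_records), "refined": []}

def process(storage):
    """Processing layer: cleanse/normalize, then persist the refined view."""
    storage["refined"] = [r.strip().lower() for r in storage["raw"] if r.strip()]
    return storage

def consume(storage):
    """Consumption layer: expose the refined data to consumers."""
    return storage["refined"]

sources = [["  Alice ", "BOB"], ["", "Carol"]]
print(consume(process(store(ingest(sources)))))  # ['alice', 'bob', 'carol']
```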
Treselle’s expertise in data science & machine learning algorithms across multiple languages such as R, Python, and Spark ML allows us to provide specific solutions to address the unique challenges that businesses face. Treselle has helped clients perform machine learning analyses such as those listed below:
- Regression Analysis: This kind of analysis helps understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. It includes algorithms such as simple linear regression, multiple linear regression, polynomial regression, support vector regression, decision tree regression, and random forest regression, as well as evaluating different regression models for better performance
- Clustering Analysis: This kind of analysis involves grouping a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups. Cluster analysis can be achieved by various algorithms such as K-means clustering, hierarchical clustering, fuzzy K-means clustering, model-based clustering, and topic modeling using LDA
- Classification Analysis: Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Classification algorithms are at the heart of what is called predictive analytics. The goal of predictive analytics is to build automated systems that can make decisions to replicate human judgment. This includes algorithms such as Logistic Regression, K-NN, Fisher’s linear discriminant analysis, Support Vector Machine, Naive Bayes, Decision Tree, and Random Forest Classification
- Recommendation Systems: These systems produce recommendations using collaborative filtering, content-based filtering and a combination of both called the hybrid approach. Collaborative filtering methods are based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences and predicting what users will like based on their similarity to other users. Content-based filtering methods are based on a description of the item and a profile of the user’s preference. Hybrid approaches can be implemented by making content-based and collaborative-based predictions separately and then combining them or by adding content-based capabilities to a collaborative-based approach or by unifying the approaches into one model
- Association Rule Learning: Predictive analytics as a mining tool also seeks to discover hidden relationships among items in your data; these hidden relationships are expressed as association rules. Well-known algorithms such as Apriori, Eclat, and FP-Growth only do half the job, since they mine frequent itemsets; a further step is needed to generate rules from the frequent itemsets found in the database
- Graph Analysis: While graph analysis is most commonly used to identify clusters of friends, uncover group influencers or advocates, and make friend recommendations on social media networks, graph analysis has other use cases such as graph-based search, master data management, identity & access management, entity identification & linkage, identify hidden patterns and insights, explore causes and effects, etc
- Trend Analysis: One of the most basic yet very powerful exploratory analytics, trend analysis can quickly uncover trends and events that tend to happen together or at some regular interval. It is a fundamental visualization technique for spotting patterns, trends, relationships, and outliers across a time series of data, and it yields mathematical models for each of the trend lines that can be flagged for further investigation.
- Geo Spatial Analysis: This includes techniques for analyzing geographical activities and conditions using a business entity’s topological, geometric, or geographic properties. Anything that can be associated with latitude and longitude enables interesting use cases such as whitespace analysis, market and sales penetration, geographical reach, competition analysis, saturation analysis, etc. Geographical analysis can be combined with trend analysis and external sources such as BLS and Census data to identify changes in market patterns across the organization’s key markets
- Text Analytics & Natural Language Processing: There is huge potential in unstructured data, as it accounts for 80% of enterprise data. Text analytics & NLP are powerful techniques to mine text data and glean insights from the wealth of internal customer, product, social, and operational data. Typical text mining techniques and algorithms include text categorization, text clustering, concept/entity extraction, taxonomy production, sentiment analysis, document summarization, and entity relation modeling
- Entity identification, linkage & disambiguation: This is one of the most challenging tasks in normalizing entities that come from multiple data silos to construct a 360° view, as different systems refer to the same thing, such as a customer or member, in different formats. It becomes even more challenging when this information is hidden in unstructured formats such as text. It requires complex machine learning techniques to identify, disambiguate, and link the entities during the normalization process
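As a minimal illustration of the regression analysis bullet above, an ordinary least-squares fit of a straight line can be written with the Python standard library alone (the data points below are invented for the example):

```python
# Simple linear regression by ordinary least squares (stdlib only).
# Fits y = a + b*x; the sample data is made up for illustration.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) / variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x          # intercept
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]     # roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))
```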
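Similarly, the K-means algorithm from the clustering bullet reduces to an assignment step and an update step. A toy one-dimensional version (points and starting centers invented, k = 2):

```python
# Tiny K-means on 1-D points (stdlib only) to illustrate clustering.
# The points and initial centers are made up for the example.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
print(kmeans_1d(points, centers=[0.0, 10.0]))  # two centers near 1.0 and 9.1
```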
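For the classification bullet, K-NN is probably the simplest listed algorithm to sketch: classify a query point by majority vote among its k nearest training points (training data invented):

```python
# Minimal k-nearest-neighbors classifier (stdlib only).
# Training points and labels are invented for illustration.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); classify query by majority vote."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (1, 1)))   # → a (nearest three points are all "a")
```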
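The association-rule bullet notes that frequent-itemset mining is only half the job; the sketch below shows both halves on a toy basket dataset (transactions and thresholds are invented):

```python
# Frequent-itemset mining plus rule generation (stdlib only) —
# a toy version of the Apriori idea; the transactions are made up.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Half one: find frequent pairs (support >= 0.5).
items = sorted(set().union(*transactions))
frequent = [set(pair) for pair in combinations(items, 2)
            if support(set(pair)) >= 0.5]

# Half two: generate rules with confidence from each frequent itemset.
rules = []
for itemset in frequent:
    for a in itemset:
        consequent = itemset - {a}
        conf = support(itemset) / support({a})
        rules.append((a, next(iter(consequent)), round(conf, 2)))

print(sorted(rules))
```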
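Full entity disambiguation needs machine learning, as the last bullet says, but the core idea of fuzzy entity linkage can be shown with a string-similarity match using the standard library's difflib (candidate names and threshold are invented):

```python
# Entity-linkage sketch: fuzzy string similarity (stdlib difflib)
# links the same company spelled differently across silos.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(entity, candidates, threshold=0.6):
    """Return the best-matching candidate, or None below the threshold."""
    best = max(candidates, key=lambda c: similarity(entity, c))
    return best if similarity(entity, best) >= threshold else None

candidates = ["Acme Corporation", "Globex Inc", "Initech LLC"]
print(link("ACME Corp.", candidates))   # → Acme Corporation
```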
- Hortonworks: HDP (Hortonworks Data Platform) is based entirely on open source projects and is managed by Ambari, which provides the entire Hadoop ecosystem along with controls to choose the components the business needs. It can be set up in pure batch, streaming, or hybrid mode, enabling businesses to build the required Data Lake architecture
- Cloudera: Another well-known Hadoop distribution, built on a slightly different stack than HDP, that enables businesses to set up the required Data Lake architecture. It includes proprietary products such as Cloudera Manager Enterprise for infrastructure management, which has more advanced functionality than Ambari, and Cloudera Navigator for Data Governance capabilities
- Amazon AWS: A Data Lake architecture can be built entirely in the cloud using Amazon AWS managed services such as API Gateway, Lambda, Elastic MapReduce (EMR), Spark, Kinesis, DynamoDB, S3, ElasticSearch, etc. Businesses that already use AWS services and don’t require a lot of control over the infrastructure can choose AWS
- Custom Infrastructure: Not all Big Data needs require the Hadoop ecosystem, and many business use cases can be met with custom infrastructure such as standalone Spark clusters, which satisfy most big data needs: batch, streaming, machine learning, graph processing, and programming interfaces for Python and R. SparkSQL can connect to a variety of data stores in RDBMS, NoSQL, S3, etc. Apache Drill is another interesting technology that can infer schema on the fly and enables joins across multiple data stores such as HDFS, S3, RDBMS, NoSQL, etc
- Infrastructure management & migration: Treselle can help businesses manage their existing Big Data infrastructure as well as migrate to the most suitable infrastructure based on their business needs
- Commercial Products: Tableau & Qlik Sense are some of the most popular visualization vendors that can integrate with the Hadoop ecosystem using Hive or Spark connectors, and with databases via JDBC/ODBC-capable drivers. Apache Drill can be used to perform cross-join queries across multiple SQL & NoSQL databases and connect with Tableau via its JDBC driver to create interesting visualizations at scale while the data stays in its original sources
- Open Source Visualizations: The visualization needs of Data Science teams and Curators are quite different, as they are mostly responsible for performing exploratory analysis on large volumes of unrefined data to find insights and patterns. These visualizations are not based on pre-materialized views and have to run on clusters when an analysis is performed. Zeppelin, Jupyter, and Shiny with R are some of the preferred tools for these tasks
- Custom Web-based Visualizations: Businesses are building interesting web-based SaaS products with stunning visualizations on huge volumes and varieties of data using web stacks such as Node.js, AngularJS, Java/Python web service APIs, Highcharts, D3.js, Google Charts, Plotly, and others
- Ingestion – batch or streaming
- Data Storage – SQL, NOSQL, Cloud storage
- Data Processing – wrangling, cleansing, normalizing, filtering, transformations, ETL
- Analytics – aggregation, correlation, statistical computation, machine learning models
- Operations – manage & monitor clusters, data governance, data security, information lifecycle management, and metadata (business, technical, and operational)
Treselle has experience implementing a variety of Big Data solutions, from non-distributed to distributed computing systems and commercial products.
- Non-distributed Computing Systems: Businesses that have a lot of variety but not volume or velocity don’t need complicated distributed computing systems such as Hadoop or Spark. Different tools and technologies can be used to create a loosely coupled and highly cohesive data processing pipeline based on open source technologies such as Camel, Talend, Nutch, Java/Python/R/Scala, ActiveMQ/RabbitMQ, OpenRefine, etc
- Distributed Computing Systems: Hadoop, Spark, and AWS cloud-based ecosystems fall into this category, which defines specific tools and technologies, from ingestion to visualization, that can be used to process data in batch, streaming, or hybrid mode by combining both
- Commercial Products: Datameer, Trifacta, and Talend Big Data Enterprise are some of the well-known commercial products that can be used together to solve common business problems without a lot of engineering and data science effort
- Spark Streaming: Spark Streaming is one of the powerful extensions provided by Spark for consuming and processing events produced by various data sources in near real time. Spark Streaming extends the Spark core architecture with a microbatching model, in which live/streaming data is received and collected from various data sources and divided into a series of deterministic microbatches. Data can be ingested from Kafka, Flume, or Kinesis and processed by Spark Streaming to perform the necessary aggregations at the microbatch level
- AWS IoT: AWS has created a reference model and a platform that enable businesses to connect devices to its managed services, store and process the data, and let applications interact with devices. AWS IoT mainly consists of a rules engine, a registry, and a device shadow; it processes data via the Kinesis platform using Kinesis Firehose, Kinesis Streams, and Kinesis Analytics, stores data in a variety of locations such as S3, Aurora, and DynamoDB, and sends event notifications to the downstream data pipeline using Lambda functions. The entire data pipeline can be governed by VPC and IAM authorization and authentication mechanisms for security purposes
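The microbatching model behind Spark Streaming can be illustrated without Spark at all: events are bucketed into fixed time windows, and each window is then processed as one small deterministic batch. This is plain Python, not the Spark API; the events are invented:

```python
# Illustration of the microbatch idea (plain Python, not Spark):
# events are grouped into fixed time windows, and each window is
# then aggregated as one small deterministic batch.
from collections import defaultdict

def microbatch(events, window_secs=10):
    """events: (timestamp_secs, value) pairs; returns per-window sums."""
    batches = defaultdict(list)
    for ts, value in events:
        # Bucket each event by the start of its time window.
        batches[ts // window_secs * window_secs].append(value)
    # Each window is processed as one batch (here, a simple sum).
    return {start: sum(vals) for start, vals in sorted(batches.items())}

events = [(1, 5), (4, 3), (12, 7), (13, 1), (25, 2)]
print(microbatch(events))   # {0: 8, 10: 8, 20: 2}
```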
Business benefits delivered:
- Increased the signal identification from 700,000 to 7.5 million nodes and relationships.
- Improved the signal relationship traversal by more than 500% with depth level 4 and above.
- Reduced the latent writes across de-normalized stores by 100%.
- Implemented the ability to create signal nodes and relationships on the streaming data.
- Reduced curator’s time from hours to seconds to understand and validate why the system raised a signal.
Business benefits delivered:
- Reduced the time taken to find insights from 5 weeks to days.
- Increased partner dataset processing by more than 200%.
- Reduced data clutter by 100% by applying various filters and removing irrelevant fields.
- Included the ability to perform data manipulation and analysis on the web as well as share with other analysts and data partners.
- Increased confidence in the data analysis by 100% as our client performed the analysis on the entire data instead of just a sample of data.
Business benefits delivered:
- Increased complex ingestion of unstructured data source feeds from 200+ to 600+.
- Introduced the ability to monitor and react to content feed layout changes at the source from 100 to 1200 domains.
- Enhanced the range of structured partner data feeds from a mere 400 records to 150 million.
- Reduced the turnaround time to load partner data feeds from a few weeks to just 2 days.
- Increased the frequency of content feed extraction for large sources from weekly to daily.
- Added the ability to automatically ingest sources that require user interaction, such as filling in a form or performing keyword searches.
Business benefits delivered:
- Increased the entity identification from 700,000 to 7.5 million nodes and relationships.
- Enhanced the ability to score and rank the content over several thousand significant lexicon words.
- Introduced the ability to map text-based numerals to numeric for performing numeric range valuations that tremendously improved our client’s platform offerings.
- Increased the content quality by 50% and reduced the noise by 99%.
- Perform R programming to do data manipulation, munging, cleansing, transformation, and more, then integrate with the existing system in real time via Rserve by reusing pre-developed R scripts.
- Use GATE (General Architecture for Text Engineering) with ANNIE, JAPE Plus, and over a dozen other plugins to perform data manipulation, munging, cleansing, and transformation, then integrate with the existing system in real time via the Java Plugin Framework.
- Use the Hadoop ecosystem with Pig and User Defined Functions to do the data transformations in batch mode.
- Use OpenRefine (formerly Google Refine), which is great at working with messy data out of the box.
- Finally, use bigger products such as Talend and Pentaho data integration.
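Whatever the tool, the underlying munging steps look much the same. Below is a minimal plain-Python sketch of cleansing, normalizing, and de-duplicating a small messy dataset (the records are invented; real pipelines would use one of the tools above):

```python
# Minimal data-munging sketch (plain Python): cleanse, normalize,
# and de-duplicate a small messy dataset — the records are invented.

raw = [
    {"name": "  Alice ", "age": "34", "city": "SF"},
    {"name": "BOB",      "age": "",   "city": "la"},
    {"name": "  Alice ", "age": "34", "city": "SF"},   # duplicate
]

def cleanse(records):
    seen, out = set(), []
    for r in records:
        row = (r["name"].strip().title(),               # normalize name
               int(r["age"]) if r["age"] else None,     # cast, keep missing
               r["city"].upper())                       # normalize city code
        if row not in seen:                             # drop duplicates
            seen.add(row)
            out.append(row)
    return out

print(cleanse(raw))   # [('Alice', 34, 'SF'), ('Bob', None, 'LA')]
```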
Business benefits delivered:
- Reduced Data Scientists’ time from days to hours to perform various data munging tasks with out-of-the-box and custom-built GREL macros.
- The ability to move any number of steps backward and forward was a matchless feature compared to Excel sheets.
- Reduced the time for saving and reusing of GREL macros on different datasets from several hours to minutes.
- The ability to enrich the data with the client’s internal APIs reduced 15 manual steps in Excel to 1 automated step.
- Custom integration of GREL macros with the data processing layer saved several thousand dollars in errors from manual edits and incorrect macros in Excel, manual upload/download of data into MongoDB, custom integration of new macros or formulas in Excel, and more.
- This pilot was tested on 10 datasets in the client’s data pipeline, end to end, within 2 weeks, compared to 2 months using Excel and other manual processes.