Apache Spark Performance Tuning – Straggler Tasks

This is the last article of a four-part series about Apache Spark on YARN. Apache Spark distinguishes transformations into two types: “narrow” and “wide”. This distinction matters because it has strong implications for how transformations are evaluated and how their performance can be improved. Spark relies heavily on the key/value pair paradigm for defining and parallelizing operations, especially wide transformations, which require data to be redistributed between machines.
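
As a minimal PySpark sketch of the distinction (illustrative only, not taken from the article), a mapValues call is narrow, while reduceByKey is wide and triggers a shuffle of key/value pairs:

```python
# Minimal illustration (assumed example): a narrow transformation (mapValues)
# versus a wide one (reduceByKey) that shuffles key/value pairs between machines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=4)

doubled = pairs.mapValues(lambda v: v * 2)        # narrow: no data movement
totals = pairs.reduceByKey(lambda x, y: x + y)    # wide: requires a shuffle

print(totals.collect())                           # e.g. [('a', 4), ('b', 2)]
spark.stop()
```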

Apache Spark Performance Tuning – Degree of Parallelism

This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers to run multiple tasks in parallel across machines in a cluster or across multiple cores on a desktop. A partition, also known as a split, is a logical chunk of a distributed data set. For each submitted application, Apache Spark builds a Directed Acyclic Graph (DAG) of jobs, stages, and tasks, and the number of tasks is determined by the number of partitions.
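
For illustration (the DataFrame and partition counts below are assumptions, not the article's example), this minimal PySpark sketch shows how the partition count drives the number of tasks per stage:

```python
# Hedged sketch: the number of partitions determines the number of tasks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism").getOrCreate()

df = spark.range(0, 1_000_000)
print(df.rdd.getNumPartitions())          # partitions chosen by Spark

eight_way = df.repartition(8)             # 8 partitions -> 8 tasks per stage
print(eight_way.rdd.getNumPartitions())

# Parallelism after a shuffle is governed by this setting.
spark.conf.set("spark.sql.shuffle.partitions", "64")
spark.stop()
```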

Apache Spark on YARN – Resource Planning

This is the second article of a four-part series about Apache Spark on YARN. As Apache Spark is an in-memory distributed data processing engine, application performance depends heavily on the resources allocated to it, such as executors, cores, and memory. The right resource allocation depends on application characteristics such as storage and computation.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. One of them was improper use of resources in the YARN cluster and running the application with the default Spark configuration.
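
As a sketch of what explicit resource planning can look like (the executor counts, cores, and memory sizes below are placeholder assumptions, not the article's recommendations), the application can declare its YARN resources instead of relying on the defaults:

```python
# Illustrative only: explicit executor sizing on YARN rather than defaults.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sfo-fire-calls")                      # assumed application name
    .master("yarn")
    .config("spark.executor.instances", "6")        # placeholder values
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "1g")
    .getOrCreate()
)
```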

Apache Spark on YARN – Performance and Bottlenecks

This is the first article of a four-part series about Apache Spark on YARN. Apache Spark 2.x ships with the second-generation Tungsten engine. Built on ideas from modern compilers, this engine emits optimized code at runtime that collapses an entire query into a single function using the “whole-stage code generation” technique. We discuss the performance of the high-level and low-level Spark APIs, using the SFO Fire Department call service dataset and the YARN cluster manager to test and tune application performance.
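
A quick way to see whole-stage code generation at work (a hedged, assumed example, not the article's benchmark) is to inspect the physical plan, where collapsed operators appear under WholeStageCodegen nodes:

```python
# Assumed example: WholeStageCodegen nodes in the physical plan indicate that
# the Tungsten engine collapsed the operators into a single generated function.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

df = spark.range(0, 1_000_000).withColumn("doubled", F.col("id") * 2)
df.filter(F.col("doubled") > 10).explain()   # look for WholeStageCodegen
spark.stop()
```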

Sales Data Analysis using Dataiku DSS

Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data scientists, data analysts, and data engineers to perform data loading, data cleaning, data preparation, data integration, and data transformation when building powerful predictive applications.

Importing and Analyzing Data in Datameer

Datameer, an end-to-end big data analytics platform, is built on Apache Hadoop to perform integration, analysis, and visualization of massive volumes of both structured and unstructured data. It can be rapidly integrated with both new and existing data sources to deliver an easy-to-use, cost-effective, and sophisticated solution for big data analytics.

Kylo Setup for Data Lake Management

Kylo is a feature-rich data lake platform built on Apache Hadoop and Apache Spark. It provides a data lake solution enabling self-service data ingest, data preparation, and data discovery, and it integrates best practices around metadata capture, security, and data quality. It contains many special-purpose routines for data lake operations that leverage Apache Spark and Apache Hive.

Call Detail Record Analysis – K-means Clustering with R

A Call Detail Record (CDR) is the information captured by telecom companies during a customer's call, SMS, and Internet activity. Combined with customer demographics, this information provides greater insight into customer needs. Most telecom companies use CDR information for fraud detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers using RFM analysis.
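
The post itself works in R; purely to illustrate the clustering step, here is a comparable scikit-learn sketch in Python on made-up CDR-style usage aggregates (feature names and values are assumptions):

```python
# Illustrative only: clustering fabricated usage profiles; the article uses R.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# columns per customer: [call_minutes, sms_count, data_mb]
usage = np.array([
    [320, 40, 1500],
    [ 50,  5,  200],
    [400, 60, 2500],
    [ 60, 10,  300],
])

scaled = StandardScaler().fit_transform(usage)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaled)
print(model.labels_)   # cluster assignment per customer profile
```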

Apache Falcon Data Pipeline with Apache Atlas Lineage

In this blog article, Apache Falcon is used to centrally define data pipelines, and a few definitions are used to auto-generate workflows in Apache Oozie. As Apache Falcon dataflows are synced with Apache Atlas through Kafka topics, Atlas can manage Falcon metadata: it provides Falcon feed lineage along with details of the tables and their source tables.

MongoDB Shard – Part I

Sharding is a method for distributing data across multiple machines. It supports deployments with very large data sets, high-throughput operations, and horizontal scaling. MongoDB shards data at the collection level and distributes the collection data across the shards in the cluster. For aggregation operations that run on multiple shards and do not need to run on the database's primary shard, the results can be routed to any shard for merging, which avoids overloading the primary shard for that database. Sharding divides the data across multiple servers and reduces the amount of data stored by each server. A sharded cluster can contain both sharded and unsharded collections without issues.
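
As a hedged sketch of those steps with PyMongo (the database name, collection, and shard key are assumptions chosen for illustration), sharding is enabled per database and then per collection, while unsharded collections in the same database remain on its primary shard:

```python
# Illustrative commands against a mongos router; names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "telemetry")
client.admin.command(
    "shardCollection", "telemetry.events", key={"deviceId": "hashed"}
)

# Unsharded collections in the same database stay on the primary shard.
client.telemetry.devices.insert_one({"deviceId": "d-1", "model": "x"})
```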

Dynamic Jasper Reports Automated in Talend

Dynamic Jasper is a great tool for designing and creating simple or complex dynamic reports. Talend is not only the most common tool for data transformation; it can also be used to generate dynamic Jasper reports through the tJasperInput component.

Automation with context parameters is the most important value-add. It helps resolve many of the challenges involved in dynamic report creation, such as on-the-fly changes to column names, report headers, dates, and so on, and it saves developers report generation time.

Hive Streaming with Kafka and Storm with Atlas

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was released to support continuous data ingestion into Hive tables. This API is intended to help streaming clients like Flume and Storm store data in Hive, which traditionally had batch-oriented storage.

In our use case, we use Kafka with Storm to load streaming data into a bucketed Hive table. Multiple Kafka topics produce data to Storm, which ingests it into a transactional Hive table; data committed in a transaction is immediately available to queries from other Hive clients. Apache Atlas tracks the lineage of the transactional Hive table, the Storm topology (bolt and spout), and the Kafka topics, which helps us understand how data is ingested into the Hive table.
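
For context, the streaming ingest path requires the target Hive table to be bucketed, stored as ORC, and marked transactional. The DDL below is a hedged illustration of those prerequisites (table and column names are assumptions), wrapped in Python only to keep the examples on this page in one language; it would normally be run through a Hive client such as beeline:

```python
# Assumed example of the table prerequisites for Hive streaming ingest.
HIVE_DDL = """
CREATE TABLE call_events (event_time STRING, caller STRING, payload STRING)
CLUSTERED BY (caller) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true')
"""
print(HIVE_DDL)   # execute via a Hive client, e.g. beeline -e "<ddl>"
```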

Microservices – Rules of Engagement

Rules of Engagement is a set of principles and best practices that are valuable to follow to ensure an efficient microservices environment. Treselle Systems captured these rules while going through the book “Microservices From Day One” by Cloves Carneiro Jr. and Tim Schmelmer. Guiding principles like these help the development team spend less time thinking about high-level architecture and more time writing business-related code that provides value to stakeholders. Check out how one of our microservice architecture implementations is in line with these Rules of Engagement.

Loan Application Analytics with CUFX

Credit unions maintain applications for different loan products in multiple sources and spend a lot of engineering and reporting time answering business questions related to loan applications. This makes it challenging to build a unified view of all loan applications and to perform further marketing and predictive analytics. CUFX (Credit Union Financial Exchange) provides a standard schema model so that all business units can use the same nomenclature.

Text Normalization with Spark – Part 1

Numerous methods, such as text mining, Natural Language Processing (NLP), and information retrieval, exist for analyzing unstructured data. Due to the rapid growth of unstructured data in all kinds of businesses, scalable solutions have become the need of the hour. Apache Spark ships with out-of-the-box algorithms for text analytics and also supports custom development of algorithms that are not available by default. In this blog post, our main goal is to perform basic text normalization using a simple regular expression technique with Apache Spark; we decipher Spark stages, jobs, and DAGs in the next post.
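
A minimal sketch of this kind of regex-based normalization in PySpark (the column name, sample text, and patterns are assumptions, not the article's exact pipeline):

```python
# Hedged example: lowercase, strip URLs and punctuation, squeeze whitespace.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("text-normalization").getOrCreate()

df = spark.createDataFrame(
    [("It's GREAT!!  Visit http://example.com now...",)], ["review"]
)

normalized = (
    df.withColumn("review", F.lower(F.col("review")))
      .withColumn("review", F.regexp_replace("review", r"http\S+", " "))
      .withColumn("review", F.regexp_replace("review", r"[^a-z\s]", " "))
      .withColumn("review", F.trim(F.regexp_replace("review", r"\s+", " ")))
)
normalized.show(truncate=False)
spark.stop()
```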

Customer Churn – Logistic Regression with R

In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of revenue. It is therefore very important to predict which customers are likely to churn and which factors affect their decisions.
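
The article builds the model in R; as a language-neutral illustration of the same idea, here is a scikit-learn sketch on made-up features (tenure in months and monthly charges are assumptions):

```python
# Illustrative only: logistic regression on fabricated churn features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: [tenure_months, monthly_charges]; 1 = churned, 0 = retained
X = np.array([[1, 70.0], [24, 30.0], [3, 90.0], [48, 20.0], [2, 80.0], [36, 25.0]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[5, 85.0]])[:, 1])   # churn probability estimate
```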

Big Data Pipeline Architectures

Prior to jumping on the big data adventure, it is important to ensure that all essential architecture components required to analyze every aspect of the big data set are in place. Understanding the high-level view of this reference architecture provides a good background on Big Data and how it complements existing analytics, BI, databases, and systems. Treselle has solved interesting Big Data problems that required different types of architectures, each most apt for a particular business use case.

The Five Layers of Microservice Testing

Microservice architectures enforce clearer, more pronounced internal boundaries between the components of an entire system than monolithic applications typically do. This can be used to the advantage of testing strategies applied to microservices; more options for places and methods to test are available. Because the entirety of the application is composed of services with clear boundaries, it becomes much more evident which layers can be tested in isolation.

Data Matching – Entity Identification, Resolution & Linkage

Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration most commonly refer to people, places, publications or citations, consumer products, or businesses. Besides data matching, the names most prominently used are record or data linkage, entity resolution, object identification, or field matching.
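
As a toy sketch of the matching step (not the techniques covered in the article; the field weights and threshold are assumptions), two records from different sources can be compared field by field with a simple string similarity:

```python
# Standard-library illustration of fuzzy field matching between two records.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_a = {"name": "Jon A. Smith", "city": "New York"}
record_b = {"name": "John Smith",   "city": "new york"}

score = 0.6 * similar(record_a["name"], record_b["name"]) \
      + 0.4 * similar(record_a["city"], record_b["city"])
print(score > 0.8)   # treat as the same entity above an assumed threshold
```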
