MongoDB Shard – Part I

Sharding is a method for distributing data across multiple machines. It supports deployments with very large data sets, high-throughput operations, and horizontal scaling. MongoDB shards data at the collection level and distributes the collection's data across the shards in the cluster. Aggregation operations that run on multiple shards and do not require the database's primary shard can route their results to any shard for merging, which avoids overloading the primary shard for that database. By dividing the data across multiple servers, sharding reduces the amount of data each server has to store. A sharded cluster can contain both sharded and unsharded collections.
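
As a concrete flavor of collection-level sharding, here is a minimal sketch using PyMongo, assuming a client pointed at a mongos router; the database, collection, and shard-key names are illustrative, not from the post:

```python
from pymongo import MongoClient

# Connect to the mongos query router of the sharded cluster
# (host and port are placeholders).
client = MongoClient("mongodb://mongos-host:27017")

# Enable sharding for the database, then shard one collection
# on a hashed key so writes spread evenly across shards.
client.admin.command("enableSharding", "salesdb")
client.admin.command(
    "shardCollection", "salesdb.orders",
    key={"customer_id": "hashed"},
)
```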

Dynamic Jasper Reports Automated in Talend

Dynamic Jasper is a great tool for designing and creating simple or complex dynamic reports. Talend is not only one of the most common tools for data transformation; it can also be used to generate dynamic Jasper reports through the tJasperInput component.

Automation with context parameters is the most important value-add here: it resolves many challenges in dynamic report creation, such as on-the-fly changes to column names, report headers, dates, and so on, and it saves developers considerable report-generation time.

Hive Streaming with Kafka and Storm with Atlas

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was introduced to support continuous data ingestion into Hive tables. The API is intended to help streaming clients like Flume and Storm store data in Hive, whose storage had traditionally been batch oriented.

In our use case, we use Kafka with Storm to load streaming data into a bucketed Hive table. Multiple Kafka topics feed data to Storm, which ingests it into a transactional Hive table; data committed in a transaction is immediately available to Hive queries from other Hive clients. Apache Atlas tracks the lineage across the Kafka topics, the Storm topology (spouts and bolts), and the transactional Hive table, which helps us understand how data is ingested into the Hive table.
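
For reference, the Hive Streaming API can only write to tables that are bucketed, stored as ORC, and marked transactional. Below is a minimal sketch of that DDL, issued here through PyHive; the connection details and the table and column names are assumptions for illustration:

```python
from pyhive import hive  # assumes the PyHive client package is installed

conn = hive.connect(host="hive-server", port=10000)
cursor = conn.cursor()

# The Streaming API requires the target table to be bucketed,
# stored as ORC, and transactional.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id STRING,
        payload  STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (event_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
```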

Microservices – Rules of Engagement

Rules of Engagement are a set of principles and best practices worth following to ensure an efficient microservices environment. Treselle Systems captured these rules while going through the book "Microservices From Day One" by Cloves Carneiro Jr. & Tim Schmelmer. Guiding principles like these help the development team spend less time thinking about high-level architecture and more time writing business-related code that provides value to stakeholders. Check out how one of our microservice architecture implementations aligns with these Rules of Engagement.

Loan Application Analytics with CUFX

Credit unions maintain applications for different loan products in multiple source systems and spend a lot of engineering and reporting time answering business questions about loan applications. That makes it challenging to build a unified view of all loan applications and to perform further marketing and predictive analytics on top of it. CUFX (Credit Union Financial Exchange) provides a standard schema model so that all business units can use the same nomenclature.
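
To illustrate the idea of a shared nomenclature, here is a hypothetical sketch of normalizing records from one source system into a common model; the field names below are simplified stand-ins, not the actual CUFX schema:

```python
from dataclasses import dataclass

@dataclass
class LoanApplication:
    """One unified, CUFX-style loan application record (illustrative fields)."""
    application_id: str
    product_type: str      # e.g. "AUTO", "MORTGAGE"
    requested_amount: float
    status: str            # e.g. "PENDING", "APPROVED", "DECLINED"

def from_legacy_auto_system(row: dict) -> LoanApplication:
    """Map one source system's column names onto the shared model."""
    return LoanApplication(
        application_id=row["app_no"],
        product_type="AUTO",
        requested_amount=float(row["amt"]),
        status=row["app_status"].upper(),
    )

# Each source system gets its own adapter, so downstream reporting
# and analytics query a single, unified nomenclature.
unified = from_legacy_auto_system(
    {"app_no": "A-1001", "amt": "15000", "app_status": "pending"}
)
```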

Text Normalization with Spark – Part 1

Numerous methods, such as text mining, Natural Language Processing (NLP), and information retrieval, exist for analyzing unstructured data. Due to the rapid growth of unstructured data in all kinds of businesses, scalable solutions have become the need of the hour. Apache Spark is equipped with out-of-the-box algorithms for text analytics, and it also supports custom development of algorithms that are not available by default. In this blog post, our main goal is to perform basic text normalization using a simple regular-expression technique with Apache Spark; we will decipher Spark stages, jobs, and DAGs in the next post.
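
As a taste of the approach, here is a minimal sketch of regex-based normalization with PySpark; the sample strings and patterns are made up for illustration and are not the post's actual code:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-normalization").getOrCreate()

lines = spark.sparkContext.parallelize([
    "Hello,   World!!",
    "Spark &  text   MINING...",
])

def normalize(text: str) -> str:
    text = text.lower()                       # case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # strip punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

print(lines.map(normalize).collect())
# ['hello world', 'spark text mining']
```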

Customer Churn – Logistic Regression with R

In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. Per the 80/20 customer-profitability rule, 20% of customers generate 80% of revenue, so it is very important to predict which customers are likely to churn and which factors affect their decisions.
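
The post itself builds the model in R; as a rough equivalent, here is a minimal logistic-regression churn sketch in Python with scikit-learn, using synthetic data and made-up feature names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical features: tenure in months, monthly charges in dollars.
X = rng.uniform([1, 20], [72, 120], size=(200, 2))
# Synthetic label: short tenure plus high charges -> more likely to churn.
y = ((X[:, 0] < 24) & (X[:, 1] > 70)).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Probability that a 6-month customer paying $95/month churns.
print(model.predict_proba([[6, 95]])[0, 1])
```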

Big Data Pipeline Architectures

Before jumping into a big data adventure, it is important to ensure that all the architecture components essential to analyzing every aspect of the big data set are in place. Understanding the high-level view of this reference architecture provides good background on Big Data and how it complements existing analytics, BI, databases, and systems. Treselle has solved interesting Big Data problems, each requiring the type of architecture most apt for that particular business use case.

The Five Layers of Microservice Testing

Microservice architectures enforce clearer, more pronounced internal boundaries between the components of a system than monolithic applications typically do. This works to the advantage of testing strategies applied to microservices: there are more options for where and how to test. Because the entire application is composed of services with clear boundaries, it becomes much more evident which layers can be tested in isolation.
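
As a small illustration of the lowest layer, here is a hypothetical unit test that exercises one service's logic in isolation by stubbing its downstream dependency; the service, function, and identifiers are all made up:

```python
from unittest.mock import Mock

def quote_total(order_id: str, pricing_client) -> float:
    """Business logic under test: sums line prices from a pricing service."""
    prices = pricing_client.get_line_prices(order_id)
    return round(sum(prices), 2)

def test_quote_total_in_isolation():
    # The real pricing service is replaced by a stub at the boundary,
    # so this test exercises only this service's own logic.
    stub = Mock()
    stub.get_line_prices.return_value = [19.99, 5.00]
    assert quote_total("order-42", stub) == 24.99
    stub.get_line_prices.assert_called_once_with("order-42")
```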

Data Matching – Entity Identification, Resolution & Linkage

Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration most commonly refer to people, places, publications or citations, consumer products, or businesses. Besides data matching, the names most prominently used are record or data linkage, entity resolution, object identification, or field matching.
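
As a toy illustration of the matching step, here is a pairwise name-similarity sketch using only Python's standard library; the threshold, names, and records are illustrative assumptions:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_a = [("1", "Jon Smith"), ("2", "Mary Jones")]
source_b = [("A", "John Smith"), ("B", "M. Johnson")]

# Link records whose name similarity clears a (tunable) threshold.
THRESHOLD = 0.85
for id_a, name_a in source_a:
    for id_b, name_b in source_b:
        score = similarity(name_a, name_b)
        if score >= THRESHOLD:
            print(f"match: {id_a} <-> {id_b} (score={score:.2f})")
```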
