Apache Falcon Data Pipeline with Apache Atlas Lineage

In this blog article, Apache Falcon is used to centrally define data pipelines, and a few entity definitions are enough to auto-generate the corresponding workflows in Apache Oozie.
Because Apache Falcon dataflows are synced with Apache Atlas through Kafka topics, Atlas can manage Falcon metadata: it exposes Falcon feed lineage and shows the details of each table along with its source tables.
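
As a rough sketch of what a centrally defined pipeline looks like in practice, the Java snippet below submits a feed entity to Falcon's REST API (POST /api/entities/submit/:entity-type); Falcon generates the Oozie workflows from such definitions. The host, port, user name, and feed.xml path are illustrative placeholders, not values from this article.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Minimal sketch: submit a Falcon feed entity over the REST API.
// Host, port, user name, and the feed.xml path are placeholders.
public class FalconFeedSubmit {
    public static void main(String[] args) throws Exception {
        // feed.xml holds a Falcon feed definition (cluster, frequency, locations).
        byte[] feedXml = Files.readAllBytes(Paths.get("feed.xml"));

        URL url = new URL("http://falcon-host:15000/api/entities/submit/feed?user.name=falcon");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(feedXml);
        }
        System.out.println("Falcon responded with HTTP " + conn.getResponseCode());
    }
}
```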

MongoDB Shard – Part I

Sharding is a method for distributing data across multiple machines. It supports deployments with very large data sets, high-throughput operations, and horizontal scaling. MongoDB shards data at the collection level and distributes a collection's documents across the shards in the cluster. For aggregation operations that run on multiple shards, if the operation does not require the database's primary shard, the results can be routed to any shard for merging, which avoids overloading the primary shard for that database. Sharding divides the data across multiple servers and so reduces the amount of data each server has to store. A sharded cluster can hold sharded and unsharded collections side by side without issue.
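
To make the collection-level part concrete, here is a minimal Java sketch that shards one collection through a mongos router using the standard enableSharding and shardCollection admin commands. The connection URI, database, collection, and shard key are placeholders, not values from this article.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// Minimal sketch: enable sharding for a database, then shard one
// collection on a hashed key. URI, names, and key are placeholders.
public class ShardCollectionExample {
    public static void main(String[] args) {
        // Connect through a mongos query router, not an individual shard.
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            MongoDatabase admin = client.getDatabase("admin");

            // Allow the "loans" database to distribute its collections across shards.
            admin.runCommand(new Document("enableSharding", "loans"));

            // Shard "loans.applications" on a hashed _id key for even distribution.
            admin.runCommand(new Document("shardCollection", "loans.applications")
                    .append("key", new Document("_id", "hashed")));
        }
    }
}
```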

Hive Streaming with Kafka and Storm with Atlas

With the release of Hive 0.13.1 and HCatalog, a new Streaming API was introduced to support continuous data ingestion into Hive tables. This API is intended to let streaming clients such as Flume or Storm write data into Hive, whose storage had traditionally been batch oriented.

In our use case, we use Kafka with Storm to load streaming data into a bucketed Hive table. Multiple Kafka topics feed data to Storm, which ingests it into a transactional Hive table. Data committed in a transaction is immediately visible to Hive queries from other Hive clients. Apache Atlas tracks the lineage of the transactional Hive table, the Storm components (bolts and spouts), and the Kafka topics, which helps us understand how data flows into the Hive table.
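
For readers unfamiliar with the Streaming API itself, the sketch below shows its core calls (HiveEndPoint, StreamingConnection, TransactionBatch) in a standalone Java program; in the use case above, a Storm bolt would drive these calls instead of a main method. The metastore URI, table names, and row contents are illustrative placeholders.

```java
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

import java.util.Arrays;

// Minimal sketch of the HCatalog Streaming API. The metastore URI,
// database, table, partition, and rows are placeholders; the target
// must be a bucketed, transactional (ORC) Hive table.
public class HiveStreamingSketch {
    public static void main(String[] args) throws Exception {
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "default", "events",
                Arrays.asList("2016-01-01"));             // partition values
        StreamingConnection conn = endPoint.newConnection(true); // auto-create partition

        String[] fieldNames = {"id", "msg"};
        DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", endPoint);

        // A batch groups several transactions; once committed, rows are
        // immediately visible to queries from other Hive clients.
        TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("1,hello".getBytes());
        txnBatch.write("2,world".getBytes());
        txnBatch.commit();
        txnBatch.close();
        conn.close();
    }
}
```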

Loan Application Analytics with CUFX

Credit unions maintain applications for different loan products in multiple source systems and spend a lot of engineering and reporting time answering business questions about loan applications. This makes it challenging to build a unified view of all loan applications and to run further marketing and predictive analytics on top of it. CUFX (Credit Union Financial Exchange) provides a standard schema model so that all business units can use the same nomenclature.
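
As an illustration of what a shared nomenclature buys you, the sketch below models a unified loan-application record as a Java record. The field names are hypothetical, only loosely inspired by CUFX concepts, and are not taken from the actual CUFX specification.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Illustrative only: a unified loan-application model in the spirit of CUFX.
// Field names are hypothetical, not from the actual CUFX specification.
public record LoanApplication(
        String applicationId,        // unique id across all source systems
        String sourceSystem,         // origination system that supplied the record
        String productType,          // e.g. AUTO, MORTGAGE, PERSONAL
        BigDecimal requestedAmount,  // amount the member applied for
        String status,               // e.g. SUBMITTED, APPROVED, DECLINED
        LocalDate applicationDate) {
}
```

Once every source system is mapped into a single model like this, reporting and predictive analytics can run against one consistent nomenclature instead of per-source schemas.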
