Protractor with Cucumber

Protractor, an end-to-end testing framework, supports Jasmine and is built specifically for AngularJS applications. It also works flexibly with Behavior-Driven Development (BDD) frameworks such as Cucumber. Cucumber is a BDD framework used to perform acceptance tests on web applications, and it provides a higher-level view of the suite's testing process.

read more

Apache Spark Performance Tuning – Straggler Tasks

This is the last article of a four-part series about Apache Spark on YARN. Apache Spark carefully distinguishes between two types of transformations: “narrow” and “wide”. This distinction matters because it has strong implications for how transformations are evaluated and how their performance can be improved. Spark relies heavily on the key/value pair paradigm for defining and parallelizing operations, especially wide transformations, which require data to be redistributed between machines.
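
As a minimal PySpark sketch (not taken from the article; the data and application name are illustrative), a map is a narrow transformation because each output partition depends on a single input partition, while reduceByKey is a wide transformation because records sharing a key must be shuffled onto the same machine:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
    sc = spark.sparkContext

    # Key/value pairs: (word, 1)
    pairs = sc.parallelize(["spark", "yarn", "spark", "tuning"]).map(lambda w: (w, 1))

    # Narrow transformation: no data movement between machines.
    upper = pairs.map(lambda kv: (kv[0].upper(), kv[1]))

    # Wide transformation: records with the same key are brought together,
    # triggering a shuffle (data redistribution) across the cluster.
    counts = upper.reduceByKey(lambda a, b: a + b)

    print(counts.collect())
    spark.stop()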

read more

Apache Spark Performance Tuning – Degree of Parallelism

This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers to run multiple tasks in parallel across the machines in a cluster or across multiple cores on a desktop. A partition, also known as a split, is a logical chunk of a distributed dataset. For each submitted application, Spark builds a Directed Acyclic Graph (DAG) of jobs, stages, and tasks, and the number of tasks is determined by the number of partitions.
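
As a rough PySpark sketch (illustrative only; the data size and partition counts are assumptions), the partition count of an RDD sets how many tasks each stage schedules, and repartition/coalesce change that degree of parallelism:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallelism").getOrCreate()
    sc = spark.sparkContext

    # Ask for 8 partitions; each stage that touches this RDD runs one task per partition.
    rdd = sc.parallelize(range(1000000), numSlices=8)
    print(rdd.getNumPartitions())    # 8

    wider = rdd.repartition(16)      # full shuffle up to 16 partitions
    narrower = wider.coalesce(4)     # reduces partitions while avoiding a full shuffle
    print(wider.getNumPartitions(), narrower.getNumPartitions())

    spark.stop()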

read more

Apache Spark on YARN – Resource Planning

This is the second article of a four-part series about Apache Spark on YARN. As Apache Spark is an in-memory distributed data processing engine, application performance depends heavily on the resources allocated to it, such as executors, cores, and memory. The right allocation, in turn, depends on the application's characteristics, such as its storage and computation needs.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case running under the YARN cluster manager. One of them was improper usage of the YARN cluster's resources, caused by executing the application with the default Spark configuration.
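
As a hedged sketch of what explicit resource planning looks like in code (the executor, core, and memory values below are placeholders, not recommendations from the article), the same settings can also be passed to spark-submit as --num-executors, --executor-cores, and --executor-memory:

    from pyspark.sql import SparkSession

    # Illustrative values only; suitable sizes depend on the YARN node
    # capacities and on the application's storage/computation profile.
    spark = (
        SparkSession.builder
        .appName("resource-planning")
        .master("yarn")
        .config("spark.executor.instances", "4")   # number of executors
        .config("spark.executor.cores", "4")       # cores per executor
        .config("spark.executor.memory", "6g")     # heap memory per executor
        .config("spark.driver.memory", "2g")
        .getOrCreate()
    )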

read more

Apache Spark on YARN – Performance and Bottlenecks

This is the first article of a four-part series about Apache Spark on YARN. Apache Spark 2.x ships with the second-generation Tungsten engine. Built on ideas from modern compilers, the engine emits optimized code at runtime that collapses an entire query into a single function using the “whole-stage code generation” technique. Let us discuss the performance of the high-level and low-level Spark APIs. The SFO Fire Department call service dataset and the YARN cluster manager are used to test and tune application performance.
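
As a small, assumed PySpark example (not from the article) of how whole-stage code generation surfaces through the high-level DataFrame API, operators fused into a single generated function typically appear with a leading "*" in the physical plan printed by explain():

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

    # High-level DataFrame API: the engine can collapse the filter and the
    # aggregation into generated code via whole-stage code generation.
    df = spark.range(1000000)
    result = df.filter(F.col("id") % 2 == 0).agg(F.sum("id"))

    # Fused operators are marked with "*" in the physical plan output.
    result.explain()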

read more

Sales Data Analysis using Dataiku DSS

Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time data scientists, data analysts, and data engineers spend on data loading, cleaning, preparation, integration, and transformation when building powerful predictive applications.

read more

Importing and Analyzing Data in Datameer

Datameer, an end-to-end big data analytics platform, is built on Apache Hadoop to integrate, analyze, and visualize massive volumes of both structured and unstructured data. It connects rapidly to new and existing data sources alike, delivering an easy-to-use, cost-effective, and sophisticated solution for big data analytics.

read more