Kylo Setup for Data Lake Management


Overview

Kylo is a feature-rich data lake platform built on Apache Hadoop and Apache Spark. It provides a data lake solution enabling self-service data ingest, data preparation, and data discovery. It integrates best practices around metadata capture, security, and data quality, and it contains many special-purpose routines for data lake operations leveraging Apache Spark and Apache Hive.

Furthermore, it provides a flexible data processing framework (leveraging Apache NiFi) for building batch or streaming pipeline templates and for enabling self-service features without compromising governance requirements. It has an integrated metadata server currently compatible with databases such as MySQL and PostgreSQL. It can be integrated with Apache Ranger or Apache Sentry for security, and with Cloudera Navigator or Apache Ambari for cluster monitoring.

Kylo’s web application layer offers features oriented to business users, including data analysts, data stewards, data scientists, and IT operations personnel. It utilizes Apache NiFi as its scheduler and orchestration engine, providing an integrated framework for designing new types of pipelines with over 200 processors (data connectors and transforms).

Prerequisites

  • Install MySQL (root password: hadoop) using the command shown below.
    Optional: change “bind-address” to 0.0.0.0 in the /etc/mysql/my.cnf file and restart MySQL to enable access from outside the server.
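    A minimal sketch of the install on Ubuntu (the root password “hadoop” is set when prompted):
    sudo apt-get update
    sudo apt-get install -y mysql-server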
  • Ensure that “/opt/” has root privileges.
  • Download Java 8 and extract it to /opt/java8.
    Source: wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u92-b14/jdk-8u92-linux-x64.tar.gz -P /opt/
    Commands:
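    A sketch of the extraction, assuming the 8u92 tarball above lands in /opt/:
    tar -xzf /opt/jdk-8u92-linux-x64.tar.gz -C /opt/
    mv /opt/jdk1.8.0_92 /opt/java8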

Note: Ensure that JDK 1.8.0_92 (8u92) or above is configured; otherwise, the “kylo-alerts-default” module will not compile.

  • Download Scala and extract it into /opt/scala2.
    Source: wget https://downloads.lightbend.com/scala/2.12.2/scala-2.12.2.tgz -P /opt/
    Commands:
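    A sketch of the extraction, assuming the tarball above:
    tar -xzf /opt/scala-2.12.2.tgz -C /opt/
    mv /opt/scala-2.12.2 /opt/scala2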
  • Download Spark 2 and extract it into /opt/spark2.
    Source: wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz -P /opt/
    Commands:
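    A sketch of the extraction:
    tar -xzf /opt/spark-2.1.0-bin-hadoop2.7.tgz -C /opt/
    mv /opt/spark-2.1.0-bin-hadoop2.7 /opt/spark2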
  • Download Maven 3 (binary distribution) and extract it into /opt/maven3.
    Source: wget http://mirror.fibergrid.in/apache/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz -P /opt/
    Commands:
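    A sketch of the extraction:
    tar -xzf /opt/apache-maven-3.5.0-bin.tar.gz -C /opt/
    mv /opt/apache-maven-3.5.0 /opt/maven3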


  • For the Maven build, install Alien on Ubuntu so that RPM packages can be produced; the build will then create both RPM and deb packages.
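    A sketch of the install:
    sudo apt-get install -y alien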
  • Set the following environment variables in the ~/.bashrc file and in /etc/profile (for all users):
    • JAVA_HOME=/opt/java8
    • JRE_HOME=/opt/java8/jre
    • SCALA_HOME=/opt/scala2
    • SPARK_HOME=/opt/spark2
    • MAVEN_HOME=/opt/maven3
    • M2_HOME=/opt/maven3
    • PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin:$MAVEN_HOME/bin:$M2_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin
  • Open a new session in PuTTY or execute the below commands to load the added environment variables.
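    For example:
    source ~/.bashrc
    source /etc/profile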

Test Configuration

  • Check whether Java, Scala, and Maven are properly configured.

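    Typical version checks:
    java -version
    scala -version
    mvn -version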

  • Check whether Spark is properly configured.

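    A typical check:
    spark-submit --version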

  • Note: Move all the downloaded tar files into another directory called “tar_files”.
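    For example:
    mkdir -p /opt/tar_files
    mv /opt/*.tar.gz /opt/*.tgz /opt/tar_files/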

Building, Installing, and Setting up Kylo Using a Deb Package on a Linux Ubuntu Machine

Downloading Kylo from GitHub

  • Download Kylo from the GitHub location provided in the References section.
  • Extract zip file: unzip master.zip

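A sketch of the download and extraction, assuming the Teradata kylo repository linked in the References section:
    wget https://github.com/Teradata/kylo/archive/master.zip -P /opt/
    unzip /opt/master.zip -d /opt/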

Executing Maven to Create Deb

  • Create the deb package using the below command:
    It will take around 10-20 minutes to download the required packages.
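    A typical build from the extracted source root:
    cd /opt/kylo-master/
    mvn clean install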
  • This cleans and compiles all class files and packages all modules (core, UI, service, setup) into RPM and deb packages.
  • Skip unit testing for faster Maven builds using the below command:
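    Using the standard Maven flag:
    mvn clean install -DskipTests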
  • If you have already downloaded the packages, run Maven in offline mode using the below command:
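    Using Maven's offline flag:
    mvn clean install -o -DskipTests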
    Note: “mvn clean install” will create both RPM and deb packages. To build only one package type, go to the install module (/opt/kylo-master/install/) and execute the below command after building all other modules:
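    A sketch:
    cd /opt/kylo-master/install/
    mvn clean install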

Copying Deb

Copy deb from “/opt/kylo-master/install/target/deb/kylo-x.x.x-SNAPSHOT.deb” to “/opt/kylo/setup” using the below command:
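A sketch of the copy (keep the x.x.x placeholder as produced by your build):
    mkdir -p /opt/kylo/setup
    cp /opt/kylo-master/install/target/deb/kylo-x.x.x-SNAPSHOT.deb /opt/kylo/setup/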

Creating Users and Their Groups

  • Create the following users:
    useradd -r -m -s /bin/bash nifi
    useradd -r -m -s /bin/bash kylo
    useradd -r -m -s /bin/bash activemq
    useradd -r -m -s /bin/bash elasticsearch


  • Check whether groups were created for the above users in “/etc/group”.

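    For example:
    grep -E "nifi|kylo|activemq|elasticsearch" /etc/group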

  • If not, create groups for the users by executing the below commands:
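    A sketch for one user (repeat for kylo, activemq, and elasticsearch):
    groupadd nifi
    usermod -g nifi nifi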

Installing kylo.deb

Install kylo.deb, which has the whole setup packaged in it, using the below command:

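A sketch (keep the x.x.x placeholder as in your build):
    dpkg -i /opt/kylo/setup/kylo-x.x.x-SNAPSHOT.deb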

Downloading Binary Files

To download all required binary files (JDK, Elasticsearch, ActiveMQ, Apache NiFi) locally, run the below script:

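The script name can vary by Kylo version; as a hypothetical example (check /opt/kylo/setup/ for the actual script shipped with your version):
    /opt/kylo/setup/generate-offline-install.sh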

These files will be added to the below directories with different user privileges.

  • Directory: /opt/kylo/


  • Directory: /opt/kylo/setup

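You can inspect both directories with:
    ls -l /opt/kylo/ /opt/kylo/setup/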

Setting up Binary Files

  • Run the below script to set up JDK, Elasticsearch, ActiveMQ, and NiFi.
  • To run the setup offline (using binaries downloaded earlier), run the script with the offline flag, as shown below.
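    A sketch, assuming the standard Kylo setup layout:
    /opt/kylo/setup/setup-wizard.sh        # interactive setup wizard
    /opt/kylo/setup/setup-wizard.sh -o     # offline mode, using locally available binaries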

Note:

    • Before executing the above script, ensure that SPARK_HOME is set.
    • During setup, perform the following:
      • Choose MySQL and carefully provide connection details (host: localhost, username: root, password: hadoop).
      • Give “y” 3 times to install Elasticsearch, ActiveMQ, and NiFi.
      • Choose Java option [3] and provide home “/opt/java8”.


Once the setup wizard is completed, the below services will be added:

      • ActiveMQ
      • Elasticsearch
      • kylo-services
      • kylo-spark-shell
      • kylo-ui
      • NiFi


Note: Manually install and start the services if any of them is not installed.
For example, NiFi: cd /opt/nifi/current/bin/; ./nifi.sh start

Optional Step: Run the below SQL scripts to create all needed tables and their data.

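A hypothetical example (verify the script path under /opt/kylo/setup/sql/ for your version):
    /opt/kylo/setup/sql/mysql/setup-mysql.sh localhost root hadoop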

  • Check whether the tables are created in the MySQL database using the below command; if not, re-run the script above.
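    A sketch of the check, assuming the default “kylo” schema:
    mysql -u root -phadoop -e "SHOW TABLES IN kylo;"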

Starting Server

To start the server (kylo-ui, kylo-services, kylo-spark-shell), execute the below script:

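A sketch, assuming the standard Kylo layout:
    /opt/kylo/start-kylo-apps.sh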

Checking Service Status

  • Check the status of all Kylo services using the below commands:


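A sketch using the init services registered by the setup:
    service kylo-ui status
    service kylo-services status
    service kylo-spark-shell status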

  • Run the below command to check the NiFi service:
  • Run the below command to check the ActiveMQ service:

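For example:
    service nifi status
    service activemq status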

  • Run the below command to check the Elasticsearch service:

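For example:
    service elasticsearch status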

Accessing UI

  • Open the Kylo UI by accessing the URL: http://{IP}:8400/
  • Provide the login credentials (username: dladmin, password: thinkbig).


Troubleshooting

ActiveMQ is not Running

Problem: ActiveMQ is not running and fails with an error at startup.

The problem occurs because ActiveMQ reads JAVA_HOME only from the first of the below locations in which it finds the setting, even if the variable is defined in /etc/environment.

  • /etc/default/activemq
  • $HOME/.activemqrc
  • $INSTALLDIR/apache-activemq-/bin/env

Solution: Add “JAVA_HOME=/opt/java8” as the first line of the file “/etc/default/activemq” and start ActiveMQ.

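A sketch of the fix:
    sed -i '1i JAVA_HOME=/opt/java8' /etc/default/activemq   # prepend JAVA_HOME as the first line
    service activemq start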

Elasticsearch is not Running

Problem: Elasticsearch is not running and shows an error when trying to start.

This problem occurs because JAVA_HOME is set only in the “root” user’s environment, while Elasticsearch runs as the “elasticsearch” user.

Solution: Add “JAVA_HOME=/opt/java8” near the top of the file “/etc/init.d/elasticsearch” (after the interpreter line) and restart it.

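A sketch of the fix (the line goes after the #! interpreter line, hence 1a):
    sed -i '1a JAVA_HOME=/opt/java8' /etc/init.d/elasticsearch
    service elasticsearch restart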

Alternatively, install Java using apt-get. On Ubuntu or Debian, the package comes with OpenJDK due to licensing issues. To fix the Java path problem, run the below command:
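A sketch using the distribution package:
    sudo apt-get install -y openjdk-8-jdk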

Kylo-spark-shell is not Running

Problem: Kylo-spark-shell is not running, and an error is logged in the file “/var/log/kylo-services/kylo-spark-shell.err”.

Solution: Add environment variables such as JAVA_HOME, SPARK_HOME, and SCALA_HOME (preferably all of the variables listed above) in “/etc/profile” and make sure that they are set for all users.

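A sketch of the additions to /etc/profile:
    cat <<'EOF' >> /etc/profile
    export JAVA_HOME=/opt/java8
    export SCALA_HOME=/opt/scala2
    export SPARK_HOME=/opt/spark2
    export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin
    EOF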

Kylo-alerts-default is not Compiling

Problem: Kylo-alerts-default is not compiling and throws a compilation error.

Solution: Make sure that JDK 1.8.0_92 (8u92) or above is configured; otherwise, the “kylo-alerts-default” module will not compile.

Integrating with Hortonworks

  • Log into the namenode server and execute the below commands to add users into HDFS:
    • For kylo-service node
    • For namenode / masternode
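      A sketch for the namenode / masternode side, assuming the nifi and kylo users need HDFS home directories (adjust to your cluster's policies):
      useradd nifi
      useradd kylo
      sudo -u hdfs hdfs dfs -mkdir -p /user/nifi /user/kylo
      sudo -u hdfs hdfs dfs -chown nifi /user/nifi
      sudo -u hdfs hdfs dfs -chown kylo /user/kylo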
  • Change the metastore configuration in the properties file “/opt/kylo/kylo-services/conf/application.properties”.
    • hive.datasource.url=jdbc:hive2://xxxxxxxx:10000/default
    • hive.datasource.username=hive
    • hive.datasource.password=hive
    • nifi.service.hive_thrift_service.database_connection_url=jdbc:hive2://xxxxxxxx:10000/default
  • Restart the Kylo services.
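    For example:
    service kylo-services restart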

Conclusion

Kylo is a feature-rich data lake platform built on Apache Hadoop and Apache Spark. With the steps above, you can successfully set up Kylo.

In the upcoming blog, we will discuss changing the NiFi component configurations (for example, HiveThriftConnection) of an existing template and creating new templates.

References

  • Kylo source code on GitHub (master.zip): https://github.com/Teradata/kylo