Apache NiFi – Data Crawling from HTTPS Websites

Overview

Apache NiFi is a powerful, scalable platform for building dataflows. It is used to process and distribute data and to automate the flow of data between systems.

In this blog, we discuss how to crawl data from HTTPS websites using Apache NiFi.

Pre-requisites

Download and install the following from the below links:

Use Case

Crawl about 27 years of employment statistics data (from 1990 to date) from an HTTPS website.

Synopsis

  • Setting and Configuring SSL
  • Accessing and Crawling HTTPS Website Using Apache NiFi

Setting and Configuring SSL

To extract data from HTTPS websites, NiFi requires an SSL context to call the sites, and the cacerts file that ships with the JDK is needed as well.

Setting SSL in Local Machine

To set up SSL on the local machine, perform the following:

  • Create the admin-cert.pem and admin-q-user.pfx files using the below command:

After successful execution of the above commands, “admin-cert.pem”, “admin-private-key.pem”, and “admin-q-user.pfx” files will be created as shown in the below diagram:

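The commands themselves are not reproduced here, so the following is a minimal sketch of one common way to produce these three files with OpenSSL in a POSIX shell. The subject name `/CN=admin`, the RSA key size, and the 365-day validity are assumptions made for illustration; SuperSecret is the example password used later in this blog.

```shell
# Hypothetical subject name; replace it with the identity NiFi should recognise.
SUBJECT="/CN=admin"

# 1. Create a self-signed certificate and an unencrypted private key.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout admin-private-key.pem \
  -out admin-cert.pem \
  -days 365 -subj "$SUBJECT"

# 2. Bundle the key and the certificate into a password-protected PKCS#12 file
#    that a browser can import.
openssl pkcs12 -export \
  -inkey admin-private-key.pem \
  -in admin-cert.pem \
  -out admin-q-user.pfx \
  -passout pass:SuperSecret
```

On the Windows command prompt, put each command on a single line instead of using the backslash continuations.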

Adding SSL Certificate to Browser

To add the “admin-q-user.pfx” file to the browser, perform the following:

  • Go to browser Settings –> Show advanced settings –> HTTPS/SSL –> Manage certificates.
    The screen looks similar to the one shown below:

  • Click Import –> Browse and provide the path of the “admin-q-user.pfx” file.
  • Enter the password provided in the above command.
    For example, SuperSecret.
  • Enable the option – “Automatically select the certificate store based on the type of certificate” and click Next as shown in the below diagram:

  • Click Finish.

Creating KeyStore and TrustStore

After successfully adding the certificate to the browser, perform the following:

  • Go to the location path of keytool.exe.
    For example, C:\Program Files\Java\jdk1.8.0_121\bin.
  • Open the command prompt.
  • Create “server_keystore.jks” and “server_truststore.jks” files using the below commands:

After running the above commands, the KeyStore and TrustStore files are created in the same directory as the keytool.exe file, as shown in the below diagram:


Adding SSL Certificate to KeyStore

To add the SSL certificate to the KeyStore, use the below command:


Testing SSL Certificate Added to KeyStore

After adding the certificate, check its details, such as issuer and validity, using the following command:


Configuring SSL in Apache NiFi

After adding the certificate, configure SSL in Apache NiFi by adding the user details to both the “authorizations.xml” and “nifi.properties” files in the conf directory of the NiFi installation.

For example, C:\Apache_NIFI\nifi-1.2.0-bin\nifi-1.2.0\conf as shown in the below diagram:

After adding the user details, the “authorizations.xml” file looks similar to the one below:

After adding the user details, the “nifi.properties” file looks similar to the one below:

Accessing and Crawling HTTPS Website Using Apache NiFi

Configuring GetHTTP

To crawl data from an HTTPS website using Apache NiFi, perform the following:

  • Open Apache NiFi.
  • Select Processor –> GetHTTP as shown in the below diagram:

  • Configure “GetHTTP”.

  • Add the HTTPS website link in the URL property so that Apache NiFi can crawl data from the website.
    Note: The website URL alone is enough to crawl the data.
  • Add and configure “StandardSSLContextService” as shown in the below diagram:

  • Click the Go To icon on the SSL Context Service in the above diagram to configure the SSL Context Service.
  • Create a new controller service as shown in the below diagram:

  • Edit the newly added controller service to add the required property details, such as Keystore Filename, Keystore Password, Keystore Type, and SSL Protocol, as shown in the below diagram:

  • Enable the newly added controller service as shown in the below diagram:

Configuring PutFile

To save the output data in the desired location, configure PutFile as shown in the below diagram:

The page looks similar to the one shown below:

After configuring PutFile, run the process to crawl the data from the HTTPS website.
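Before running the flow, it can help to confirm outside NiFi that the URL is reachable over HTTPS. The sketch below uses curl (not NiFi) to mimic GetHTTP followed by PutFile for a single page: it fetches one URL and writes the response into a directory. The function name `crawl_https`, the output file name `page.html`, and the example URL are hypothetical.

```shell
# Fetch one URL and store the response body in a directory,
# roughly what GetHTTP followed by PutFile does for a single page.
crawl_https() {
  url="$1"
  out_dir="$2"
  mkdir -p "$out_dir"
  # --fail: treat HTTP errors (4xx/5xx) as failures instead of saving the error page.
  # --location: follow redirects.
  # curl verifies the server certificate against its default CA bundle.
  curl --fail --silent --show-error --location \
    --output "$out_dir/page.html" "$url"
}

# Example call (hypothetical URL):
# crawl_https "https://example.com/" ./crawled
```

If curl rejects the certificate of a site, NiFi's SSL Context Service is likely to need that site's certificate chain in its truststore as well.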

Crawled Output

The crawled output data looks similar to the one shown below:
