Table of Contents
- 1 Overview
- 2 Pre-requisites
- 3 Use Case
- 4 Setting and Configuring SSL
- 5 Accessing and Crawling HTTPS Website Using Apache NiFi
- 6 References
Overview
Apache NiFi is a powerful, scalable dataflow platform used to process and distribute data and to automate data flow between systems.
In this blog, we discuss crawling data from HTTPS websites using Apache NiFi.
Pre-requisites
Download and install the following:
- OpenSSL for Windows: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/openssl-for-windows/openssl-0.9.8e_X64.zip
- Apache NiFi 1.2.0: https://nifi.apache.org/download.html
Use Case
Crawl employment statistics data (about 27 years, from 1990 to date) from an HTTPS website.
- Setting and Configuring SSL
- Accessing and Crawling HTTPS Website Using Apache NiFi
Setting and Configuring SSL
To extract data from HTTPS websites, NiFi needs an SSL context to call the sites, and a CA certificate store (cacerts) is needed from the JDK.
Setting SSL in Local Machine
To set the SSL in the local machine, perform the following:
- Download OpenSSL from the link listed earlier in this post.
- Install it and open openssl.exe from the installation path:
For example, E:\OpenSSL\openssl-0.9.8e_X64\bin\openssl.exe
- Create the admin-private-key.pem and admin-cert.pem files by running the below command at the OpenSSL prompt:
req -x509 -newkey rsa:2048 -config "E:\OpenSSL\openssl-0.9.8e_X64\openssl.cnf" -keyout admin-private-key.pem -out admin-cert.pem -days 365 -subj "/CN=Admin Q. User/C=US/L=Seattle" -nodes
- Create the admin-q-user.pfx file from the private key and certificate using the below command:
pkcs12 -inkey admin-private-key.pem -in admin-cert.pem -export -out admin-q-user.pfx -passout pass:"SuperSecret"
After both commands run successfully, the “admin-cert.pem”, “admin-private-key.pem”, and “admin-q-user.pfx” files are created, as shown in the diagram below:
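The two OpenSSL commands above can also be scripted so that they are repeatable. The following is a minimal sketch, not part of the original setup: it assumes the example installation path used above, and the function names (`req_command`, `pkcs12_command`, `create_admin_certificate`) are hypothetical helpers introduced only for illustration.

```python
import subprocess

# Paths taken from the example above; adjust them to your installation.
OPENSSL_EXE = r"E:\OpenSSL\openssl-0.9.8e_X64\bin\openssl.exe"
OPENSSL_CNF = r"E:\OpenSSL\openssl-0.9.8e_X64\openssl.cnf"


def req_command(openssl=OPENSSL_EXE, config=OPENSSL_CNF):
    """Arguments that write admin-private-key.pem and admin-cert.pem."""
    return [
        openssl, "req", "-x509", "-newkey", "rsa:2048",
        "-config", config,
        "-keyout", "admin-private-key.pem",
        "-out", "admin-cert.pem",
        "-days", "365",
        "-subj", "/CN=Admin Q. User/C=US/L=Seattle",
        "-nodes",
    ]


def pkcs12_command(password, openssl=OPENSSL_EXE):
    """Arguments that bundle the key and certificate into admin-q-user.pfx."""
    return [
        openssl, "pkcs12",
        "-inkey", "admin-private-key.pem",
        "-in", "admin-cert.pem",
        "-export", "-out", "admin-q-user.pfx",
        "-passout", "pass:" + password,
    ]


def create_admin_certificate(password):
    """Run both commands in order; raises CalledProcessError if one fails."""
    for command in (req_command(), pkcs12_command(password)):
        subprocess.run(command, check=True)
```

Calling `create_admin_certificate("SuperSecret")` from the folder where the files should be written runs the same two steps as the manual procedure. Passing the arguments as a list avoids shell quoting issues with the `-subj` value.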
Adding SSL Certificate to Browser
To add the “admin-q-user.pfx” file to the browser, perform the following:
- Go to browser Settings –> Show advanced settings –> HTTPS/SSL –> Manage certificates.
The screen looks similar to the one shown below:
- Click Import –> Browse and provide the path to the “admin-q-user.pfx” file.
- Enter the password provided in the pkcs12 command above.
For example, SuperSecret.
- Enable the option – “Automatically select the certificate store based on the type of certificate” and click Next as shown in the diagram below:
- Click Finish.
Creating KeyStore and TrustStore
After successfully adding the certificate to the browser, perform the following:
- Go to the folder that contains keytool.exe.
For example, C:\Program Files\Java\jdk1.8.0_121\bin.
- Open the command prompt in that folder.
- Create the “server_keystore.jks” and “server_truststore.jks” files using the below commands:
keytool -genkeypair -alias nifiserver -keyalg RSA -keypass SuperSecret -storepass SuperSecret -keystore server_keystore.jks -dname "CN=Test NiFi Server" -noprompt
keytool -importcert -v -trustcacerts -alias admin -file admin-cert.pem -keystore server_truststore.jks -storepass SuperSecret -noprompt
Adding SSL Certificate to KeyStore
To add the SSL certificate to the KeyStore, use the below command:
keytool -importcert -v -trustcacerts -alias admin -file E:\OpenSSL\openssl-0.9.8e_X64\bin\admin-cert.pem -keystore server_keystore.jks -storepass SuperSecret -noprompt
Testing SSL Certificate Added to KeyStore
After adding the certificate, check its details, such as issuer and validity, using the following command (keytool prompts for the keystore password):
keytool -list -v -keystore server_keystore.jks
Configuring SSL in Apache NiFi
After adding the certificate, add the user details to both the “authorizations.xml” and “nifi.properties” files in the conf folder of Apache NiFi to configure SSL.
For example, C:\Apache_NIFI\nifi-1.2.0-bin\nifi-1.2.0\conf, as shown in the diagram below:
After adding the user details, the “authorizations.xml” file looks similar to the one below:
After adding the user details, the “nifi.properties” file looks similar to the one below:
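As a rough illustration of the SSL-related entries, the sketch below embeds an example fragment and checks that the HTTPS and keystore settings are filled in. The property names are the standard NiFi security properties. The host, port, and `./conf/` locations are assumptions, and the fragment is not the author's actual file. `parse_properties` and `missing_ssl_settings` are hypothetical helper names.

```python
# Properties NiFi reads to serve its UI over HTTPS.
REQUIRED_KEYS = (
    "nifi.web.https.host",
    "nifi.web.https.port",
    "nifi.security.keystore",
    "nifi.security.keystoreType",
    "nifi.security.keystorePasswd",
    "nifi.security.keyPasswd",
    "nifi.security.truststore",
    "nifi.security.truststoreType",
    "nifi.security.truststorePasswd",
)

# Example fragment: host, port, and ./conf/ paths are assumed values;
# file names and password come from the keytool commands in this post.
EXAMPLE_PROPERTIES = """\
nifi.web.https.host=localhost
nifi.web.https.port=8443
nifi.security.keystore=./conf/server_keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=SuperSecret
nifi.security.keyPasswd=SuperSecret
nifi.security.truststore=./conf/server_truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=SuperSecret
"""


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_ssl_settings(text):
    """Return the required keys that are absent or have an empty value."""
    props = parse_properties(text)
    return [key for key in REQUIRED_KEYS if not props.get(key)]
```

Reading the real file with `open(path).read()` and passing the text to `missing_ssl_settings` returns an empty list once every SSL entry has a value.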
Accessing and Crawling HTTPS Website Using Apache NiFi
To crawl data from an HTTPS website using Apache NiFi, perform the following:
- Open Apache NiFi.
- Select Processor –> GetHTTP as shown in the diagram below:
- Configure “GetHTTP”.
- Add the HTTPS website link in the URL property so that Apache NiFi can crawl the data from the website.
Note: The website URL alone is enough to crawl the data from the website.
- Add and configure “StandardSSLContextService” as shown in the diagram below:
- Click the Go To icon on the SSL Context Service in the above diagram to configure the SSL Context Service.
- Create a new controller service as shown in the diagram below:
- Edit the newly added controller service to add the required property details such as Keystore Filename, Keystore Password, Keystore Type, and SSL Protocol as shown in the diagram below:
- Enable the newly added controller service as shown in the diagram below:
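To make the roles of GetHTTP and the SSL context service more concrete, the following is a rough Python analogue of an HTTPS GET that uses an SSL context. It is a sketch under stated assumptions, not how NiFi implements the processor: it verifies the server against the system CA store rather than a JKS truststore, and `build_ssl_context` and `fetch` are hypothetical helper names.

```python
import ssl
import urllib.request


def build_ssl_context(client_cert=None, client_key=None):
    """Create a TLS context that verifies the server certificate.

    Optionally loads a PEM client certificate and key, for example
    admin-cert.pem and admin-private-key.pem created earlier.
    """
    context = ssl.create_default_context()
    if client_cert is not None:
        context.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return context


def fetch(url, context, timeout=30):
    """Download the body of an HTTPS URL using the given SSL context."""
    if not url.lower().startswith("https://"):
        raise ValueError("expected an https:// URL")
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

For example, `fetch(url, build_ssl_context())` returns the page bytes for an HTTPS `url` of your choice, much as GetHTTP emits the response body as a FlowFile.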
To save the output data to the desired location, configure PutFile as shown in the diagram below:
The page looks similar to the one shown below:
After configuring PutFile, run the flow to crawl the data from the HTTPS website.
The crawled output data looks similar to the one shown below:
References
- Configuring Apache NiFi SSL Authentication: https://www.batchiq.com/nifi-configuring-ssl-auth.html
- NiFi System Administrator’s Guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices