Data crawling is the process of extracting data from a source webpage or website. A crawler is a program that visits websites and reads their pages and other information, based on the web page index, in order to extract particular data from those pages.
Crawling with the tHTMLInput component is considered one of the faster crawling methods available in Talend Open Studio (TOS). Many data engineers use the tHTMLInput component to crawl their sites and load the data into a file. This blog explains the tHTMLInput data crawling process, along with sample TOS jobs.
This use case demonstrates the steps for creating a tHTMLInput data crawling process using TOS. Before you begin, complete the following prerequisites:
- Download and install TOS.
- Download and install the tHTMLInput component for TOS (download it from the URL given in the References section).
Data Crawling Steps
To crawl data from the source website, perform the following steps:
- Open TOS and right-click Job Designs.
- Select Create Job and give the TOS job a suitable name.
- Search for tHTMLInput in the component palette.
- Drag and drop the component into the job designer workspace.
Note: To obtain the tHTMLInput component, log in to Talend Exchange (see the References section) and download it. After downloading the component, move it to your default TOS component path. Then refresh and search for the component in the palette by its name.
- Choose a source web URL and check the structure of the webpage. For a sample URL, see the References section.
- Open the URL of the webpage to be crawled in a browser.
- Right-click the page and select the inspect element option to view its HTML structure.
- Return to TOS and double-click the tHTMLInput component.
- Provide the following details:
- Timeout – Set the time-out value for reading the source webpage.
- URL – Enter the source webpage URL within double quotes, for example "Paste your crawling URL".
- User Agent – Enter "Mozilla" as the user agent, because some sites respond differently to requests without a browser-like user agent.
- Max Body Size – Leave it empty or set it to zero to read the page without a size limit.
- Parent Element – Provide the root element whose child elements are to be read.
For example, enter "div.wikitable" in the Parent Element box, where div is the element and wikitable is its class name.
- Mapping – Define the schema and the list of attributes mapped to it.
- Use Jsoup selectors to move to the next level of child elements. For Jsoup selector syntax, see the References section.
- Add an output component, such as a file output or a log row, to print the crawled data.
In this example, the tLogRow component is used so that the output is easy to inspect.
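To make the Timeout, URL, User Agent, and Max Body Size settings more concrete, here is a minimal Python sketch of an equivalent HTTP fetch using only the standard library. It is not how tHTMLInput is implemented. The function names, the example URL, and the choice of seconds as the time-out unit are assumptions made for illustration.

```python
import urllib.request


def build_request(url, user_agent="Mozilla"):
    """Build a GET request that carries the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout_seconds=10.0, user_agent="Mozilla", max_body_size=0):
    """Fetch a page as text.

    A max_body_size of 0 (or None) means "no limit", mirroring the
    empty-or-zero convention described for the component setting.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        if max_body_size:
            body = response.read(max_body_size)  # read at most this many bytes
        else:
            body = response.read()  # read the whole body
        charset = response.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    # Hypothetical URL; replace it with the page you want to crawl.
    html = fetch_html("https://example.org/", timeout_seconds=5.0)
    print(len(html))
```

The request is built in a separate function so that the header setup can be checked without any network access.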
- Decide which div class to use as the root class.
For the sample source link, you can use "div.wikitable".
- Provide a column name and a selector for each value to be mapped.
Let us crawl the table data marked in red in the above diagram using the tHTMLInput component. The class name has already been entered as the parent element. To map the first cell of each row, use the selector "td:eq(0)".
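In Jsoup, ":eq(n)" matches elements whose sibling index equals n, so "td:eq(0)" selects td elements that are the first child of their row. The sketch below imitates a parent element of "div.wikitable" combined with that selector, using only Python's standard html.parser module. It is an illustration of the idea rather than Jsoup's or Talend's implementation. The class and function names and the sample HTML are invented, and the parser assumes markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FirstCellParser(HTMLParser):
    """Collect the text of <td> elements with a given sibling index
    that sit inside an element matching tag.class."""

    def __init__(self, parent_tag, parent_class, cell_tag="td", cell_index=0):
        super().__init__(convert_charrefs=True)
        self.parent_tag = parent_tag
        self.parent_class = parent_class
        self.cell_tag = cell_tag
        self.cell_index = cell_index
        # Each entry: [tag, number of child elements seen, matches parent?]
        self.stack = [["#root", 0, False]]
        self.parent_depth = 0      # how many matching parents are open
        self.capture_depth = None  # stack depth of the cell being captured
        self.buffer = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Sibling index counts element children only, as in Jsoup's :eq(n).
        index = self.stack[-1][1]
        self.stack[-1][1] += 1
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        is_parent = tag == self.parent_tag and self.parent_class in classes
        if is_parent:
            self.parent_depth += 1
        self.stack.append([tag, 0, is_parent])
        if (self.capture_depth is None and self.parent_depth > 0
                and tag == self.cell_tag and index == self.cell_index):
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(e[0] == tag for e in self.stack):
            return
        # Pop until the matching open tag is closed.
        while True:
            depth = len(self.stack)
            open_tag, _, is_parent = self.stack.pop()
            if is_parent:
                self.parent_depth -= 1
            if self.capture_depth == depth:
                # Normalize whitespace in the captured cell text.
                self.values.append(" ".join("".join(self.buffer).split()))
                self.capture_depth = None
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)


def extract_first_cells(html, parent_selector="div.wikitable"):
    """Return the text of every first-column cell under the parent element."""
    parent_tag, parent_class = parent_selector.split(".", 1)
    parser = FirstCellParser(parent_tag, parent_class)
    parser.feed(html)
    parser.close()
    return parser.values


# Invented sample markup, used only to exercise the sketch.
SAMPLE = """
<div class="wikitable">
  <table>
    <tr><td>Alpha</td><td>1</td></tr>
    <tr><td>Beta</td><td>2</td></tr>
  </table>
</div>
<div class="other"><table><tr><td>Ignored</td></tr></table></div>
"""

if __name__ == "__main__":
    print(extract_first_cells(SAMPLE))  # ['Alpha', 'Beta']
```

Changing cell_index to 1 would correspond to "td:eq(1)", the second cell of each row.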
- Save and run the job to get the data from the source webpage, as shown in the below diagram:
- Save the crawled data to files or other storage.
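For readers who want to see the final save step outside TOS, the following is a minimal Python sketch that writes crawled rows to a CSV file. In a TOS job, a file output component would normally do this. The function name, the column name "name", and the output path are hypothetical.

```python
import csv


def save_rows(rows, path, column_names):
    """Write a list of dict rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=column_names)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical crawled values and output file name.
    crawled = [{"name": "Alpha"}, {"name": "Beta"}]
    save_rows(crawled, "crawled_data.csv", ["name"])
```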
In this blog, we discussed crawling data from a target webpage using TOS with the tHTMLInput component. We hope you have gained a better understanding of crawling with this TOS component.
References
- Download and install TOS:
- Download and install the TOS tHTMLInput component:
- Sample URL for crawling:
- Jsoup selectors: