Web Crawler – Python with Scrapy

Introduction

Python is a powerful and efficient programming language that is friendly and easy to learn. Scrapy is a fast, high-level screen scraping and web crawling framework; it is written entirely in Python and runs on Linux, Windows, Mac, and BSD. Some basic features of Scrapy are given below:

  • Simple – Scrapy was designed with simplicity in mind, providing just the essential features.
  • Productive – We only have to write the rules to extract the data from web pages, and Scrapy crawls the entire website.
  • Extensible – Scrapy provides several mechanisms to plug in new code without having to touch the framework core.

Hence we combine Python with Scrapy for web crawling.

Use Case

Let’s create a web crawler with Scrapy that crawls one or more websites. We will extract the list of products, their titles, and their respective prices from the crawled websites. Finally, we will show the combined product list, with prices, from all the crawled websites using Python.

What we need to do:

  • Create a project in Scrapy.
  • Communicate between the Python script and Scrapy.
  • Create web content tag patterns (all products, product title, product price).
  • Start a Scrapy reactor service.
  • Create a spider in Scrapy and get the return values in Python.

Solution

Before solving our use case, let’s satisfy some pre-requisites.

Pre-requisites:

Create a new project in Scrapy:

  • Before you start scraping, you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and then run:

        scrapy startproject tutorial

  • This will create a tutorial directory with the following contents:

    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

  • Each file is described below:

    File name              Description
    scrapy.cfg             The project configuration file.
    tutorial/              The project’s Python module; we will later import our code from here.
    tutorial/items.py      The project’s items file.
    tutorial/pipelines.py  The project’s pipelines file.
    tutorial/settings.py   The project’s settings file.
    tutorial/spiders/      A directory where we will later put our spiders.

Communicate between the Python script and Scrapy:

  • Create a config file

This is a configuration file. All the common variables, such as strings, numbers, lists, dictionaries, file names, etc., live here so that they can be shared between the Python script and Scrapy.
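A minimal sketch of such a config file, assuming it is a plain Python module (the file name config.py and all variable names here are illustrative, not from the original project):

    # config.py – common variables shared by the Python script and the spider
    WEB_LIST_FILE = 'web_list.xml'      # input list of websites to crawl
    PATTERN_FILE = 'web_patterns.xml'   # tag patterns for each website
    OUTPUT_FILE = 'products.csv'        # where the crawled products are written

Both sides can then simply import config and read the same values.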

  • Create a web list xml file
    • This is the input website list file, which tells the crawler which websites to crawl.
    • XML attribute details
      • name – name of the website to crawl
      • url – URL of the website to crawl
  • Load a web list file
    • The input web list file is loaded and its entries are stored in list format; a loader sketch (using the standard library instead of NLTK) is given below.
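A sample web list file and the loader sketch are given below; the file name web_list.xml and the tag names are assumptions:

    <!-- web_list.xml: one <web> entry per website to crawl -->
    <webs>
        <web name='Site One' url='http://domain1.com/products' />
        <web name='Site Two' url='http://domain2.com/products' />
    </webs>

    # load_web_list.py – parse web_list.xml into a list of (name, url) pairs
    import xml.etree.ElementTree as ET

    def load_web_list(path='web_list.xml'):
        tree = ET.parse(path)
        return [(web.get('name'), web.get('url'))
                for web in tree.getroot().findall('web')]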

Create web content tag patterns:

  • Create a website content tag patterns file
    • We need tag patterns for the crawled websites; a sample pattern XML is sketched after this list.
    • XML attribute details
      • url – domain name of the website
      • products_all – tag pattern that selects all the products on a page
      • products_title – tag pattern that selects the product title
      • products_price – tag pattern that selects the product price
  • Load a web content tag patterns file
    • The XML attribute values are stored in dictionary format, which makes it easy to look up the tag patterns for each crawled website (see the loader sketch after this list).
  • Start a Scrapy reactor service file
    • This file communicates with Scrapy, so the Scrapy service is started from here. Once the crawling functionality is completed, we get the Scrapy values back in this file (see the reactor sketch after this list).
      Note: We should stop the reactor service once crawling is complete.
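A sample pattern file and a dictionary loader might look like the sketch below; the file name, tag names, and XPath expressions are illustrative assumptions:

    <!-- web_patterns.xml: one <pattern> entry per crawled domain -->
    <patterns>
        <pattern url='domain1.com'
                 products_all="//div[@class='product']"
                 products_title=".//h2/a/text()"
                 products_price=".//span[@class='price']/text()" />
    </patterns>

    # load_patterns.py – map each domain to its tag patterns
    import xml.etree.ElementTree as ET

    def load_patterns(path='web_patterns.xml'):
        tree = ET.parse(path)
        return {p.get('url'): {'products_all': p.get('products_all'),
                               'products_title': p.get('products_title'),
                               'products_price': p.get('products_price')}
                for p in tree.getroot().findall('pattern')}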
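A minimal sketch of the reactor service file, using Scrapy’s CrawlerRunner together with the Twisted reactor; the spider module path is an assumption that matches the project layout above:

    # run_crawler.py – start the Twisted reactor, crawl, then stop the reactor
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner

    from tutorial.spiders.my_spider import MySpider  # hypothetical spider module

    runner = CrawlerRunner()
    deferred = runner.crawl(MySpider)
    # the reactor can only be started once per process, so stop it when done
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks until reactor.stop() is called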

Create a spider in Scrapy and get the return values in Python:

We can crawl multiple websites simultaneously in Scrapy.

Set tag patterns for the currently crawled website:

The code to set the tag patterns for the currently crawled website is sketched below.
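A sketch of such a spider, where the XPath tag patterns come from the pattern dictionary loaded earlier; the spider name, domains, and pattern values are illustrative:

    # tutorial/spiders/my_spider.py – hypothetical product spider
    import scrapy

    # domain -> tag patterns, as loaded from the pattern file (illustrative values)
    PATTERNS = {
        'domain1.com': {'products_all': "//div[@class='product']",
                        'products_title': ".//h2/a/text()",
                        'products_price': ".//span[@class='price']/text()"},
        'domain2.com': {'products_all': "//li[@class='item']",
                        'products_title': ".//a/text()",
                        'products_price': ".//span[@class='cost']/text()"},
    }

    class MySpider(scrapy.Spider):
        name = 'products'
        allowed_domains = ['domain1.com', 'domain2.com']
        start_urls = ['http://domain1.com/products',
                      'http://domain2.com/products']

        def parse(self, response):
            # pick the tag patterns for the website currently being crawled
            domain = response.url.split('/')[2]
            p = PATTERNS[domain]
            for product in response.xpath(p['products_all']):
                yield {'title': product.xpath(p['products_title']).get(),
                       'price': product.xpath(p['products_price']).get()}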

  • The Scrapy functionalities used in this use case are given below:

Spider – The base class that crawls the websites in Scrapy.
xpath – Selects a particular tag in the web content.
start_urls – Lists the single or multiple website URLs where the crawl starts.
allowed_domains – Restricts the crawl to the listed domains.
Reactor – The Twisted event loop that connects the Python script with Scrapy.

  • Show the output code
  • Sample output
  • Challenges
    • Scrapy reactor service:

We have used the Scrapy reactor service from the Python script to crawl multiple websites. For this, we would have to start the Scrapy reactor service multiple times, but the drawback is that the reactor can only be started once per process, so a loop like the following fails:

    for domain in ['domain1.com', 'domain2.com']:
        setup_crawler(domain)

Solution: We used the start_urls functionality in the Scrapy spider file. Multiple website URLs can be given in start_urls, so we need not use multiple reactors in the Python script:

    start_urls = ['domain1.com', 'domain2.com']
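As an alternative end-to-end sketch, the whole crawl can be driven with Scrapy’s CrawlerProcess, which starts and stops the reactor internally, so all the start URLs are crawled in a single process (the module path is a hypothetical that matches the project layout above):

    # run_all.py – crawl every site listed in start_urls in one process
    from scrapy.crawler import CrawlerProcess
    from tutorial.spiders.my_spider import MySpider  # hypothetical module path

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(MySpider)  # MySpider.start_urls lists every site to crawl
    process.start()          # starts the reactor once and blocks until done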

Conclusion

  • It is simple to create crawl functionality in Scrapy.
  • Scrapy produces output very quickly; it is very fast compared to BeautifulSoup.
  • We can easily communicate between Python and Scrapy.
  • Scrapy is comfortable to use for web crawling.

  • Guest

    Hi, one problem though: I couldn’t find any package named spider anywhere?

    ImportError: No module named spiders.my_spider

    • http://www.treselle.com/ Treselle Systems Blog

      Make sure the import path under the Scrapy reactor service and the spider directory file name match. If they do not, you will get a “No module named” error.

  • amg

    Hi, how do I use the Twisted reactor with Scrapy? I read the documentation, but it is not clear for a beginner like me. Please help me.

    • http://www.treselle.com/ Treselle Systems Blog

      Scrapy uses Twisted under the hood, a Python library used for networking. Using Twisted allows Scrapy to do the following:

      • resolve hostnames,

      • handle events (e.g. starting/stopping a crawler),

      • send mails,

      • use the crawler within a Python console,

      • monitor and control a crawler using a web service.

      Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

      For more information on the Twisted reactor, refer to the link below:

      http://twistedmatrix.com/documents/current/core/howto/reactor-basics.html

      • amg

        hi,

        I read the tutorial, thanks. Would you please upload the code?

  • http://potentpages.com/themeSearch/ David Selden-Treiman

    Hello. I really enjoyed the tutorial. I added it to my list of Scrapy-based website crawler tutorials. Thank you for the great resource!


  • dhamotharan ec

    Hi, it is very useful for crawling multiple websites. Can you share the entire code, please?

  • Webshare.io

    If you get blocked by websites, you can use a free proxy with thousands of IP addresses for Scrapy at https://proxy.webshare.io/