Web Crawler Python Tutorial!
Coding web crawlers simplified.
This web crawler Python tutorial provides a simple, plainly explained introduction to creating your first web crawler. Web scraping, performed by a program variously called a web spider, web crawler, bot, or web scraper, is a powerful way to pull data from websites. It lets you programmatically collect information for data analysis, archiving, or anything else that requires obtaining information from the web. A spider's strength is mining the internet for specific data, and it can be an essential tool for acquiring large data sets: product info, text, large amounts of image data, and much more.
In this web crawler in Python tutorial, we will use the Scrapy library to build a web scraper. You will learn to quickly make an effective web scraper that pulls entries from the Python subreddit.
This web crawler in Python tutorial is meant to be a simple introduction to building web crawlers in Python and should provide a good starting point for your spider-building skills.
Note: You must have Python 3 and the scrapy library installed.
Note: We will use web crawler, spider, bot, and web scraper as synonyms for each other.
Web crawler in python tutorial: web scraping with scrapy
A web crawler is built on two main principles: our spider must download a webpage, and the spider must pull useful data from it. Once those two steps work, it is just a matter of repeating them until the crawl is complete. (A minimal sketch of this loop follows the list below.)
- Find the webpage you want to crawl/scrape
- Download it
- Pull Information from the downloaded page
- ????
- Crawl the entire internet!!
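To make these principles concrete before we hand the work to Scrapy, here is a minimal, hypothetical sketch of the download-then-extract loop using only the Python standard library. The function and class names are made up for illustration, it only grabs each page's title, and it is not how Scrapy works internally.

from urllib.request import urlopen, Request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Pulls the <title> text out of a downloaded page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def crawl(urls):
    for url in urls:
        # 1. Download the webpage.
        request = Request(url, headers={"User-Agent": "simple-example-bot"})
        html = urlopen(request).read().decode("utf-8", errors="ignore")
        # 2. Pull useful data (here, just the page title).
        parser = TitleParser()
        parser.feed(html)
        print(url, "->", parser.title.strip())
        # 3. Repeat for the next URL until complete.


if __name__ == "__main__":
    crawl(["https://www.python.org/"])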
We will break this up into a few sections to allow the logic to be explained as we progress. We will use Scrapy for our web crawler in python tutorial but the principles can be applied using any library and language.
It is entirely possible to build a web crawler in Python without a dedicated scraping library. If you decide to build a spider without a library like Scrapy, though, you will need to handle throttling, concurrency, robots.txt, redirections, and many more details that crop up as your spider's complexity increases.
NOTE: Rewriting functionality that Scrapy takes care of for us is outside the scope of this web crawler in python tutorial.
Scrapy handles these complexities and offers advanced functionality that comes in handy when building more complex spiders. Its built-in functionality reduces the headaches you encounter when building spiders; it takes an everything-and-the-kitchen-sink approach to provide a very functional and practical library.
Therefore, this web crawler in python tutorial will use the scrapy library and we recommend this library as part of your spider building tools.
Web crawler in python tutorial part 1 – Building the foundation of our spider
First, let’s make our scrapy web crawler in python project.
$scrapy startproject redditSpider
Now navigate to the directory that we just made.
$cd redditSpider/redditSpider/spiders
Next, we create the base class for our spider. We can use the touch command inside a Linux/Mac terminal to make our base file, but you can also create the file with a text editor or from your operating system's user interface.
$touch redditSpider.py
Now that we have our basic file set up, we will create a class that inherits from (is based on) Scrapy's scrapy.Spider class. This gives our new RedditSpider class the ability to use the Scrapy library.
import scrapy
class RedditSpider(scrapy.Spider):
    name = 'redditSpider'
    start_urls = ['https://www.reddit.com/r/Python']
We now break down the spider code above line by line.
- import scrapy – is used to bring in a reference to the scrapy library for use in our python program.
- class RedditSpider(scrapy.Spider): – Creates a class called RedditSpider that inherits from scrapy’s base spider class
- name – is a variable defined within scrapy’s base class and is a name for the spider
- start_urls – is a list of URLs that your spider will begin crawling from
Creating a subclass takes the functionality of the base spider class but creates a specialized spider for our use.
Overview: We created a basic spider called RedditSpider by deriving from Scrapy's base class; the spider will begin crawling at our start_urls = ['https://www.reddit.com/r/Python']
Simple enough, huh?
Web crawler in python tutorial part 2 – How to run our basic scrapy web crawler
Normal Linux/Mac commands can be executed from a command line/terminal, and Scrapy is no exception; it provides its own command to execute a spider.
Navigate to the top directory of the spider and run the following command.
$ scrapy crawl redditSpider
You’ll see output similar to the following:
2019-04-10 23:03:43 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
...
2019-04-10 23:03:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/Python> (referer: None)
2019-04-10 23:03:44 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.reddit.com/r/Python> (referer: None)
NotImplementedError: RedditSpider.parse callback is not defined
2019-04-10 23:03:44 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-10 23:03:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 224,
 …
 'spider_exceptions/NotImplementedError': 1,
 …}
2019-04-10 23:03:44 [scrapy.core.engine] INFO: Spider closed (finished)
Now, this may seem like a lot of output, but don't worry, we are about to break it down together.
- Our spider initialized and loaded the additional components and extensions it will need later in the program
- It requests the first url in our start_urls list
- It downloads the web content and passes the html to the parse method.
- The parse method results in an error because we never wrote an explicit parse method for our spider, and Scrapy's default parse method raises a NotImplementedError (an optional placeholder fix is sketched after this list)
- The web crawler in python spider is then closed and statistics of the crawl are displayed
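If you want the spider to finish cleanly before we write the real parsing logic in the next part, one option (purely optional, and only a placeholder) is to add a minimal parse method that just logs what was fetched:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'redditSpider'
    start_urls = ['https://www.reddit.com/r/Python']

    def parse(self, response):
        # Placeholder: log the URL and status so the spider finishes without
        # the NotImplementedError. We replace this with real parsing logic
        # in part 3 of this tutorial.
        self.log(f"Fetched {response.url} with status {response.status}")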
Now, we are ready to pull data from the webpage using the web crawler in python spider.
Overview: So far our web crawler in python tutorial has gone over creating a basic spider, requesting, and receiving webpages.
Web crawler in python tutorial part 3 – How to pull data from our scrapy crawled webpage
Our web crawler in python tutorial has created a spider that simply downloads webpages but does nothing with them. We are only crawling and not pulling any useful data yet. Next, we have to instruct our spider on how it will pull useful data from the web content.
We must first look at the web page's source to understand its structure and create logic that can pull out the data we want. Iterating on and tweaking our web crawler project is how we build a better and better crawler.
At the time of this writing, a simplified reddit post includes the title, the upvote count, the username of the poster, and a flag indicating if the post is promoted or normal.
Before we finish our web crawler in python spider we need to create an item to contain the logical information of a post. We specify the item within the items.py file that is located at the same level as our spiders directory.
Our web crawler project's current directory structure should look similar to this:
- redditSpider
  - scrapy.cfg
  - redditSpider
    - __init__.py
    - __pycache__
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
    - spiders
      - __init__.py
      - __pycache__
      - redditSpider.py
The file we want is items.py.
The generated items.py will already contain an import statement and a premade item with a single field:
name = scrapy.Field()
We want to change this to look like the following:
import scrapy
class RedditPost(scrapy.Item):
    username = scrapy.Field()
    title = scrapy.Field()
    upvotes = scrapy.Field()
    isPromoted = scrapy.Field()
The file above is broken down as follows:
This line imports the scrapy library so we can create an object based on scrapy Item
import scrapy
This line defines a new class called RedditPost that inherits from (is a child of) the scrapy.Item class.
class RedditPost(scrapy.Item):
These lines define the fields that a RedditPost object contains. A Field is a simple way for Scrapy to hold any piece of information; it lets each item be used like a dictionary to set and get the value of each field.
username = scrapy.Field()
title = scrapy.Field()
upvotes = scrapy.Field()
isPromoted = scrapy.Field()
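To get a feel for how these fields behave, here is a small sketch that treats a RedditPost like a dictionary. It assumes the project layout shown above (so redditSpider.items is importable), and the values are made up for illustration.

from redditSpider.items import RedditPost

# Construct an item with example values (the values here are made up).
post = RedditPost(username='exampleUser', title='Example title',
                  upvotes='42', isPromoted=False)

print(post['title'])      # dictionary-style read -> Example title
post['upvotes'] = '43'    # dictionary-style update of a declared field
print(dict(post))         # the whole item viewed as a plain dict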
Now that we have created our item, we can continue toward finishing our web crawler and complete our redditSpider.py file to finish creating our spider.
Website Structure
We must look at the structure of the webpage https://reddit.com/r/python
Each reddit post in the feed has this structure:
There is an outer div whose class attribute contains the text scrollerItem.
Everything within this div is part of its respective post.
Within the post div we can find the username because it is an element similar to the following:
<a class="randomly generated " href="/user/userName/">u/userName</a>
We can also find if the post is promoted by looking for a span that contains the text ‘promoted’ as follows:
<span class="randomly generated id">promoted</span>
The post title structure within each post depends on if the post is promoted or a normal post.
If the post is promoted the structure is similar to the following:
<span class="randomly generated id">
<h2 class="random id">"Title text."</h2>
</span>
If the post is not promoted the structure is similar to the following:
<a data-click-id="body" class="randomly generated id" href="/r/Python/comments /relative path to post">
<h2 class="random generated id">Non promoted title text</h2>
</a>
The big difference, at the time of this writing, between the promoted post structure and the non-promoted post structure is that the promoted post lacks an anchor tag with a data-click-id="body" attribute.
Lastly, the upvote count structure is similar to the following:
<div class="”scrollerItem”" id="”randomly" generated="" id”="" tabindex="”-1”">
<div class="random id" style="width:40px;border-left:4px solid transparent">
<div class="random id">
<button class="random id" aria-label="upvote" aria-pressed="false" data-click-id="upvote" id="upvote-button-random-id ">
<div class="random id ">
<i class="icon icon-upvote random id"></i>
</div>
</button>
<div class="random id" style="color:#1A1A1B">Number</div>
<button class="ranomd id " aria-label="downvote" aria-pressed="false" data-click-id="downvote">
<div class="random id”>
<i class=" icon="" icon-downvote="" random="" id"="">
</div>
</button>
</div>
</div>
</div>
Building the crawler
The element we want is the number value, which sits in a div nested three divs deep inside the post div (the one whose class contains scrollerItem).
This is the current structure we will use to create the basic parsing logic for our spider.
Our current web crawler runs and throws an error because it lacks a parse method. We will now import our Scrapy item and add the parse method, along with the parsing logic, to create Scrapy items and turn the scraped page into usable data objects.
Our web crawler in python spider will now look like this. As before we will explain each line in detail.
import scrapy
from redditSpider.items import RedditPost

class RedditSpider(scrapy.Spider):
    name = 'redditSpider'
    start_urls = ['https://www.reddit.com/r/Python']

    def parse(self, response):
        posts = response.xpath("//div[contains(@class,'scrollerItem')]")
        for post in posts:
            user = post.xpath(".//a[contains(@href, 'user')]/text()").get()
            if user is not None:
                promoted = post.xpath('.//span/h2/text()').get()
                if promoted is not None:
                    postTitle = post.xpath('.//span/h2/text()').get()
                else:
                    postTitle = post.xpath('.//a[@data-click-id="body"]/h2/text()').get()
                upvoteCount = post.xpath('.//div/div/div/text()').get()
                promoted = post.xpath('.//span[contains(text(),"promoted")]').get()
                redditPost = RedditPost(username=user[2:], title=postTitle, upvotes=upvoteCount, isPromoted=promoted is not None)
                yield redditPost
Crawler Breakdown
The above program can be broken down as follows:
This line imports the Scrapy item we defined earlier in items.py:
from redditSpider.items import RedditPost
Now we define the parse method that scrapy will call once it receives the response from requesting the web page.
def parse(self, response):
Next, we pull out the div elements, including everything nested within them, whose class attribute contains the text scrollerItem. Each of these is the logical entity of a reddit post, which we will use to extract data and construct our reddit post items.
posts = response.xpath("//div[contains(@class,'scrollerItem')]")
Once we have a collection of all posts, we go through each one and pull out the data to construct our logical reddit post item. The following syntax can be read as: go through each object in posts and place it into a variable called post.
for post in posts:
Now we look for any anchor tag that contains an href attribute with the text user and get the text nested within the tag.
user = post.xpath(".//a[contains(@href, 'user')]/text()").get()
If the user object was populated, we continue pulling more info from the current post. This is a precaution to ensure we actually have a username and do not create an object for an element that does not pertain to a post.
if user is not None:
Once we verify the user object exists, we check whether the post's title is nested inside a span, which is the promoted-post structure described above; if this selector returns something, the post follows the promoted layout.
promoted = post.xpath('.//span/h2/text()').get()
NOTE: the xpath above uses the './/' syntax. This syntax means: look within the current element. Our current element is the post that was pulled out earlier using the div whose class contains scrollerItem.
We have now determined whether the post we are parsing follows the promoted structure, so we can pick the correct title selector.
if promoted is not None:
    postTitle = post.xpath('.//span/h2/text()').get()
else:
    postTitle = post.xpath('.//a[@data-click-id="body"]/h2/text()').get()
Lastly, we have all the data except the upvote count. So we now call the upvote count selector.
upvoteCount = post.xpath('.//div/div/div/text()').get()
Next, we look for a span containing the text 'promoted'; whether this selector returns anything is what we record as isPromoted. With that last piece of information, we create our RedditPost object containing the data.
promoted = post.xpath('.//span[contains(text(),"promoted")]').get()
redditPost = RedditPost(username=user[2:], title=postTitle, upvotes=upvoteCount, isPromoted=promoted is not None)
We now yield the object to scrapy to be used later in the program cycle.
yield redditPost
The output will look similar to the following:
2019-05-20 21:08:52 [scrapy.core.engine] INFO: Spider opened
2019-05-20 21:08:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reddit.com/r/Python> (referer: None)
2019-05-20 21:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/Python>
{'isPromoted': False,
 'title': 'Non-promoted title',
 'upvotes': '22',
 'username': 'username'}
2019-05-20 21:08:54 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reddit.com/r/Python>
....
2019-05-20 21:08:54 [scrapy.core.engine] INFO: Closing spider (finished)
This is now a working basic reddit web crawler that pulls the first few posts from the reddit feed. You may have noticed that we only pull a few posts. As this is a simple web crawler in python tutorial, we will not discuss how to bypass anti-crawler measures, handle infinite scrolling, or deal with the other obstacles a crawler will encounter in the wild. We will create more web crawler in python tutorials to address these issues in an advanced tutorial.
Overview: So far our web crawler in python tutorial is now able to request a webpage, parse data out of the page, and construct our post items we created earlier. This is the basic principle of any web crawler in python spider. It is important to iterate and build more to truly get a grasp. We will cover more on selectors, the shell, items, and saving results in the next step.
Web crawler in python tutorial step 4 – ???????
This section is a reference and a view into more functionality for our web crawler. A web crawler developer will always keep learning and growing.
This section in our web crawler in python tutorial will cover Selectors, Scrapy Shell, Items, and Saving Results to csv.
Web crawler in python tutorial step 4 – Selectors
Selectors are very important when writing a strong web crawler spider. They tell Scrapy what to select from the HTTP responses to the requests it sends out. Selectors in Scrapy are generally created with 'css' or 'xpath' expressions to select a specific element from the response object.
Our XPath selectors use both relative and absolute paths. XPath finds an element by treating the HTML document as XML.
The following html can be parsed using either xpath or css.
<h1>Title</h1>
Body text
The title text above can be accessed by using the xpath //h1/text().
response.xpath('//h1/text()').get()
On the other hand, using the css selector we can access the ‘Title’ text by using h1::text when calling the css path.
response.css('h1::text').get()
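The same two selector styles can also reach attributes and multiple matches. The following standalone sketch (using a small made-up HTML string, so you can run it without crawling anything) shows the XPath and CSS equivalents side by side:

from scrapy.selector import Selector

html = '<div><a href="/r/Python" class="link">Python subreddit</a></div>'
sel = Selector(text=html)

# XPath: attributes are addressed with @, text with text()
print(sel.xpath('//a/@href').get())       # /r/Python
print(sel.xpath('//a/text()').get())      # Python subreddit

# CSS: attributes use ::attr(name), text uses ::text
print(sel.css('a::attr(href)').get())     # /r/Python
print(sel.css('a.link::text').getall())   # ['Python subreddit']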
Overview: Our web crawler in python spider can crawl and the text can be reached by using either xpath or css style selectors. The difference matters depending on how the website is structured. I recommend getting familiar with both xpath and css selectors and finding which one you’re the most comfortable with.
Web crawler in python tutorial step 4 – Scrapy Shell
Scrapy Shell is a great way to understand what your web crawler spider is doing: it runs an interactive shell in which you enter the same kinds of commands your spider would be composed of. For example:
We can mimic our web crawler spider by asking the shell to fetch a page for us and then parsing it. First launch the shell with $ scrapy shell, then use fetch, which requests the URL with the same underlying machinery our spider will use.
>>> fetch('https://www.reddit.com/r/python')
Once we have the response of the fetch we can view it, inspect it, and get a better idea of what selector our web crawler in python spider can take advantage of.
>>> view(response)
The view command allows us to pull up the page that was received during our fetch command. It is opened in your local web browser. Once the web browser opens up the web page we scraped we can use developer tools to view the source and find good selectors to use.
The useful objects available when using the Scrapy Shell are as follows:
- request : the request object for the last fetched page
- response : the response object for the last fetched page
- settings : the settings object holding the current settings for the shell
Our web crawler in python spider’s logic can be built by using the interactive shell to understand what is being requested and getting instant feedback on your selectors.
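As an example, a shell session for our reddit spider might look like the following. The number of matches and the username shown here are purely illustrative, since the page content changes constantly:

$ scrapy shell 'https://www.reddit.com/r/Python'
>>> response.status
200
>>> posts = response.xpath("//div[contains(@class,'scrollerItem')]")
>>> len(posts)
25
>>> posts[0].xpath(".//a[contains(@href, 'user')]/text()").get()
'u/someUser'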
Web crawler in python tutorial step 4 – Items
Our web crawler spider can pull information, but converting crawled data into useful data through Items is part of what makes Scrapy a powerful Python library.
Items are created and then used when parsing our crawler’s response. In general the sequence for using the Items is as follows:
- Create the item class and inherit from scrapy.Item
- Import the item in your crawler
- Crawl and parse using your spider
- Construct an item and return it to scrapy for saving, displaying, or further processing
You define an object using the following syntax:
class MyItem(scrapy.Item):
    firstField = scrapy.Field()
You create an item in your spider using the following syntax:
myItem = MyItem(firstField = valueYouWant)
Once an item is created you can access values using the following syntax:
myItem['firstField']
or
myItem.get('firstField')
Lastly, you can update and set the field individually using the following syntax:
myItem['firstField'] = newValue
NOTE: if you attempt to use a field that is not defined on the item, you will get a KeyError saying the item does not support the specified field.
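Pulling those fragments together, here is a small self-contained sketch. The undeclaredField name is made up, and the exact error message wording may vary by Scrapy version:

import scrapy

class MyItem(scrapy.Item):
    firstField = scrapy.Field()

myItem = MyItem(firstField='hello')   # create the item
print(myItem['firstField'])           # dictionary-style access -> hello
print(myItem.get('firstField'))       # .get() access -> hello
myItem['firstField'] = 'updated'      # set/update a declared field

try:
    myItem['undeclaredField'] = 'oops'   # this field was never declared
except KeyError as error:
    # Scrapy rejects fields that were not declared on the item class.
    print(error)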
Overview: A web crawler in python spider can use Scrapy Items to structure data and create a logical representation of objects you are crawling for.
Web crawler in python tutorial step 4 – Saving Results
Our web crawler spider currently parses pages, creates items, and then closes without saving any of the items we created.
Scrapy does have built-in ways to save our items using its Feed Exporter settings.
To export the data as a CSV modify the settings file to include the following:
FEED_FORMAT = 'csv'
NOTE: The settings file is the settings.py located on the same level as the items.py.
Now that we have specified the Feed Export as csv we can modify the way we call our spider to specify the output file. We used to call our web crawler in python spider using the following call:
$scrapy crawl redditSpider
This works, but we can also save our data to a file as follows:
$scrapy crawl redditSpider -o fileNameYouSpecify.csv
We can modify our settings to export to CSV, but be warned that very large files require a more complex workaround. As it currently stands, if you are able to create the CSV you can always use it as input for further analysis in different scripts. This web crawler in python tutorial will not go into more advanced exporting because it is outside the scope of a simple crawler.
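As a rough sketch of that hand-off, the CSV produced by the command above (here assumed to be named fileNameYouSpecify.csv, matching the call shown earlier) can be read back in a separate analysis script with the standard library:

import csv

# Read the file produced by: scrapy crawl redditSpider -o fileNameYouSpecify.csv
with open('fileNameYouSpecify.csv', newline='', encoding='utf-8') as csv_file:
    for row in csv.DictReader(csv_file):
        # Each row is keyed by the item fields we defined earlier:
        # username, title, upvotes, isPromoted.
        print(row['username'], row['title'], row['upvotes'], row['isPromoted'])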
Web crawler in python tutorial step 5 – Crawl The Internet!
Congratulations! You have completed this web crawler in python tutorial! You have now learned some basics in web crawling using scrapy. The next step is to start building another web crawler in python so you can grow your skills. Iteration and learning is the only way to grow as a developer.