Build your dataset with Scrapy

Dilusha Dasanayaka
Oct 31, 2020

Ayubowan, readers! Welcome back!

Recently I had to create my own dataset for an experiment on the fluctuation of vehicle prices during the COVID crisis. Data from classified-ads websites is a good source of information for scenarios like this. So in this article I'm going to explain how to create your own dataset by web scraping in Python using the Scrapy framework.

Prerequisites

First you should have Python on your computer. You can get it by simply installing Python or Anaconda. Then you have to install Scrapy by running the following command in the terminal.

pip install scrapy

You can check whether the installation went well by running the following command in the terminal.

scrapy
If the installation went well, the output will show the installed Scrapy version followed by the list of available commands.

Create a new project

First we have to create a new Scrapy project by executing the following command.

scrapy startproject scraper

It will create a whole set of files, but for now we only care about the files inside the "spiders" directory.

File Structure
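For reference, the project layout that scrapy startproject generates looks like this:

scraper/
    scrapy.cfg
    scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py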

Before writing anything, we first have to identify the basic structure of the website. Let's take the website "https://ikman.lk/" and create a spider for it. Looking at the structure of the website, it has a listing view for ads.

Ad listing view structure

There is a list of ads in the listing view, and we need to navigate to each individual advertisement page to collect more data. So we have to visit every ad listed here and gather its data. The HTML format for this list is something like this.
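A simplified sketch of that markup (the exact classes, nesting, and the href shown are assumptions, inferred from the selectors we use below):

<ul>
  <li>
    <div class="list-item">
      <!-- the href points to the ad's detail page -->
      <a href="/en/ad/example-vehicle-ad">...</a>
    </div>
  </li>
  <!-- ...more <li> items, one per ad... -->
</ul>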

Here we can see that each <li> element contains an <a> element which has a link to more details. First let's extract this link from the anchor tag and ask Scrapy to extract more data from the linked page.

First, let's create a file inside the "spiders" directory called "ikman-spider.py" and instruct it to extract the set of links.
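A minimal sketch of that spider, assuming the vehicle listings page as the starting URL (adjust it to whichever category you want to crawl):

import scrapy

base_url = "https://ikman.lk"

class IkmanSpider(scrapy.Spider):
    name = "ikman"
    # assumed entry point; replace with the listing page you want to crawl
    start_urls = ["https://ikman.lk/en/ads/sri-lanka/vehicles"]

    def parse(self, response):
        # each ad card is a <div class="list-item"> wrapping an <a> tag;
        # grab the href of that anchor for every card on the page
        links = response.xpath("//div[contains(@class,'list-item')]//a/@href")
        for link in links:
            yield {"link": link.get()}  # we will swap this for scrapy.Request next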

In the above code, you can see that I'm extracting the value of the "href" attribute of an <a> tag located inside a <div> which has a class named "list-item".

Then we can instruct Scrapy to visit each of these pages by using the scrapy.Request method.

yield scrapy.Request(base_url + link.get(), dont_filter=True, callback=self.scrapeSinglePage)

Here, "scrapeSinglePage" is a callback function, which contains the steps to extract data from the loaded page. Now let's code that function.

Scrape single page

Now we have to write the logic to extract data from each page. The structure of a single advertisement page looks as follows.

Single ad page

I need to extract all the details as key-value pairs inside the ".details" div. In Scrapy we can extract a text value directly, for example the text inside the <h2> tag that holds the title. But there is no direct identifier for this element, and on a large web page there may be several <h2> tags. So first we have to identify the nearest parent element that has an identifier, and ask Scrapy to query only inside that element.

title = response.xpath("//*[contains(@class,'advertiesment')]//h2//text()").extract_first()

Likewise, we can easily extract all the required data from a given element. In the case of extracting data from the ".details" div, we first query and select it as the parent element,

parent = response.xpath("//div[contains(@class,'details')]")

and then query the inner elements. This avoids collisions with other elements that share the same classes, and it makes querying faster.

Now, we can select all “.info-item” elements and loop through them.

info_list = parent.xpath(".//div[contains(@class,'info-item')]")
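Each ".info-item" presumably wraps a label and a value. With assumed inner class names ("label" and "value" below are illustrative, so check the real markup in your browser's inspector), the loop could look like this:

item = {"title": title}
for info in info_list:
    # 'label' and 'value' are assumed class names for the key and value elements
    key = info.xpath(".//*[contains(@class,'label')]//text()").extract_first()
    value = info.xpath(".//*[contains(@class,'value')]//text()").extract_first()
    if key:
        item[key.strip()] = value.strip() if value else None
yield item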

You can start scraping by executing the following commands in the terminal.

To write the output to a JSON file,

scrapy crawl ikman -o data.json

To write the output to a CSV file,

scrapy crawl ikman -o data.csv

Here, "ikman" comes from the name we gave the spider in the first place. Just look at the line right after we define the "IkmanSpider" class 😋. You can find more about Scrapy here.

You can find the complete code for the ikman-spider here 🙂
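Putting the pieces together, the whole spider sketch reads as follows (the start URL and the inner class names in scrapeSinglePage are still assumptions):

import scrapy

base_url = "https://ikman.lk"

class IkmanSpider(scrapy.Spider):
    name = "ikman"
    # assumed entry point; replace with the listing page you want to crawl
    start_urls = ["https://ikman.lk/en/ads/sri-lanka/vehicles"]

    def parse(self, response):
        # collect the detail-page link from every ad card in the listing
        links = response.xpath("//div[contains(@class,'list-item')]//a/@href")
        for link in links:
            yield scrapy.Request(base_url + link.get(),
                                 dont_filter=True,
                                 callback=self.scrapeSinglePage)

    def scrapeSinglePage(self, response):
        # the ad title lives in the <h2> inside the advertisement container
        title = response.xpath("//*[contains(@class,'advertiesment')]//h2//text()").extract_first()
        item = {"title": title}
        # the key-value details live inside the '.details' div
        parent = response.xpath("//div[contains(@class,'details')]")
        info_list = parent.xpath(".//div[contains(@class,'info-item')]")
        for info in info_list:
            # 'label' and 'value' are assumed class names
            key = info.xpath(".//*[contains(@class,'label')]//text()").extract_first()
            value = info.xpath(".//*[contains(@class,'value')]//text()").extract_first()
            if key:
                item[key.strip()] = value.strip() if value else None
        yield item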

See you again in a corona-free world 🤞🤞.
