How to scrape a dynamic website using Scrapy
Hi, today I am going to share my approach to scraping websites with dynamic webpages, AJAX responses and JavaScript without using an external webdriver such as Selenium or Splash, which slows down the Scrapy process tremendously. The one advantage Scrapy has out of the box over Selenium is speed: Selenium can only send one request at a time, because it was not made for web scraping, it was made for web-automation testing. Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it is implemented with non-blocking (asynchronous) code for concurrency. The usual advice is that to scrape a dynamic website you have to use Selenium or some other webdriver.
Instead, I suggest handling dynamic webpages in the following steps. I first go with the sitemap spider, which I am going to explain right now. If that fails, I go with Splash (Splash is a JavaScript rendering service with an HTTP API. It is a lightweight browser, implemented in Python 3 using Twisted and QT5. It is fast, lightweight and stateless, which makes it easy to distribute). But remember that although Splash is a fast, lightweight browser, it cannot handle pages with heavy rendering; Splash often gets force-stopped when a webpage is too heavy for it. In that case I go with Selenium (Selenium automates browsers). Selenium can handle any kind of website and lets us do all sorts of things, but it is very slow compared with Scrapy because it renders the webpage and executes all the JavaScript events, which Scrapy alone cannot do.
OK, let's get into the tutorial on how to scrape a dynamic website using a sitemap.
What is sitemap?
A site map is a model of a website’s content designed to help both users and search engines navigate the site. A site map can be a hierarchical list of pages (with links) organized by topic, an organization chart, or an XML document that provides instructions to search engine crawl bots.
A sitemap lists the URLs on a site and allows webmasters to specify additional information about each URL:
- When it was last updated
- How often the content changes
- How important the URL is in relation to others
A sitemap file has the following structure:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Each URL in the site will be represented with a <url></url> tag, with all those tags wrapped in an outer <urlset></urlset> tag. There will always be a <loc></loc> tag specifying the URL. The other three tags are optional.
Sitemap files can be incredibly large, so they are often broken into multiple files and then referenced by a single sitemap index file. This file has the following format:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2014-10-01T18:23:17+00:00</lastmod>
  </sitemap>
</sitemapindex>
In most cases, the sitemap.xml file is found at the root of the domain. As an example, for nasa.gov it is https://www.nasa.gov/sitemap.xml. But note that this is not a standard, and different sites may have the map, or maps, at different locations.
A sitemap for a particular website may also be referenced in the site's robots.txt file. As an example, the robots.txt file for example.com ends with the following:
Sitemap: https://www.example.com/sitemap.xml
Sitemap: http://www.example.com/image-sitemap.xml
Therefore, to get example.com’s sitemaps, we would first need to read the robots.txt file and extract that information.
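That step can also be automated. Here is a minimal sketch, assuming Python 3.8+ (RobotFileParser.site_maps() was added in that version) and using the same example.com placeholder as above:
# Read a site's robots.txt and pull out any Sitemap entries using only the standard library.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser('https://www.example.com/robots.txt')
parser.read()

# site_maps() returns a list of sitemap URLs, or None if robots.txt declares none.
print(parser.site_maps())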
After manually extracting the sitemap URLs from the site's robots.txt, let's start writing the sitemap spider.
# import the scrapy module
import scrapy
# import the Scrapy SitemapSpider
from scrapy.spiders import SitemapSpider
# import pymongo to store the data in MongoDB
import pymongo

# initialize the MongoClient
client = pymongo.MongoClient('localhost', 27017)
# create (or get) the database
db = client['database_name']

class Myspider(SitemapSpider):
    name = 'spidername'
    # set the sitemap URLs in the predefined sitemap_urls variable
    sitemap_urls = ['https://www.example.com/sitemap.xml',
                    'http://www.example.com/image-sitemap.xml']
A sitemap has all the links of the website, so we can set specific URL path patterns as rules for the spider and omit the unwanted or junk links. In my case I set the rule '/carddetails/', so only URLs whose path follows '/carddetails/' get scraped. For example, www.example.com/aboutus and www.example.com/homepage are omitted, while www.example.com/carddetails/card1, www.example.com/carddetails/card2 and www.example.com/carddetails/card1/price are scraped; Scrapy only sends requests to URLs that follow the /carddetails/ path.
    sitemap_rules = [
        ('/carddetails/', 'parse'),
    ]
*** You can also use a regex (regular expression) to match multiple patterns in the URL. For example, if I do not want paths that contain contact_us, such as www.example.com/carddetails/card1/contact_us/, they should be omitted; we can write a regular expression for this kind of situation (see the sketch below). In the future I will do a tutorial on regular expressions to constrain the URLs more accurately. ***
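As a hedged sketch of that idea (not part of the original spider): Scrapy treats the first element of each sitemap_rules tuple as a regular expression, so a negative lookahead can keep the /carddetails/ pages while skipping anything containing contact_us.
    # Hypothetical rule: follow /carddetails/ URLs, but skip any that also contain contact_us
    sitemap_rules = [
        (r'/carddetails/(?!.*contact_us)', 'parse'),
    ]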
If the spider finds a URL matching the rules, it requests that URL, gets the page response, and as usual we can use the parse method to extract the required data from the page source and store it in a database or in a CSV/JSON file.
    def parse(self, response):
        title = response.css('h1.productlabel.hidden-xs::text').extract_first()
        category = response.css('body > div.page > section:nth-child(4) > ol > li:nth-child(3) > a::text').extract_first()
        describe = response.xpath('/html/body/div[3]/div[2]/div[1]/section[1]/div/div[2]/div[1]/div[1]/p/text()').extract()
        try:
            # drop the currency prefix from the price text
            price = response.css('.productprice::text').extract_first()[4:]
        except TypeError:
            price = response.css('.productprice::text').extract_first()
        image_url = ['https://www.parekhcards.com' + str(i) for i in response.css('#thumb ul li a::attr(href)').extract()]
        # the "more info" table is a flat list of label/value cells; pair them into a dict
        lists = response.css('.moreinfo tr td *::text').extract()
        lists = [i.strip().replace(':', '') for i in lists]
        lists_data = []
        for j in range(0, len(lists) - 1, 2):
            lists_data.append((lists[j], lists[j + 1]))
        description = dict(lists_data)
        description['describe'] = describe
        # pair each reviewer name with the matching testimonial
        name = response.css('ul.commentslist .comment-title::text').extract()
        testimonial = response.css('ul.commentslist .comment-text::text').extract()
        review = dict(zip(name, testimonial))
        if title == '' or title == 'None' or title is None:
            pass
        else:
            db[category].insert_one({'title': title,
                                     'price': price,
                                     'description': description,
                                     'review': review,
                                     'image_url': image_url,
                                     'ref_links': response.url})
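To try the spider out without creating a full Scrapy project, one option (a minimal sketch, assuming the code above lives in a single script and MongoDB is running locally) is to run it with Scrapy's CrawlerProcess:
from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    # Run the spider from this script; project settings can be passed as a dict.
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(Myspider)
    process.start()  # blocks until the crawl finishes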
I have shared the link to the full code here.