Web scraping faster without being blocked, using Scrapy
Hi, this is my first blog post, on how to scrape websites faster without being blocked or blacklisted. It describes an easy way to speed up scraping by controlling Scrapy from the outside. I have been through many tutorials on how to pause Scrapy for a while and then resume scraping, but because Scrapy is asynchronous it cannot be controlled that easily from within. The method described here works for most websites. Before you start scraping a website, you have to analyse it.
Say I am going to scrape www.examplewebsite.com. First I analyse the website: what content is there, which parts of it I want, and how many pages there are. After analysing the website, I write a Scrapy spider for it. When the spider starts running, it gets blocked by the website after a finite number of requests. Now here is the main part: I check after how many requests the spider gets blocked. For example, it may be blocked after 200 requests or 300 requests; the maximum threshold depends on the website you scrape. Then I write a Scrapy controller that stops the scraping once the spider reaches that threshold, waits for a delay of 4 to 5 seconds (again, this depends on the website, so analyse it to find the right delay and threshold for your case), and restarts the spider from where it left off, repeating until all the data is scraped.
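The stop-at-threshold, wait, and resume loop above can be sketched in plain Python like this (a minimal outline, not the full implementation: `scrape_batch` is a placeholder for whatever launches your spider on a batch of links, and the threshold of 300 and 5-second delay are example values you would tune for your target site):

```python
import time

def chunked(links, size):
    """Yield successive batches of at most `size` links."""
    for i in range(0, len(links), size):
        yield links[i:i + size]

def controlled_scrape(links, scrape_batch, threshold=300, delay=5, sleep=time.sleep):
    """Scrape `links` in threshold-sized batches, pausing `delay` seconds
    between batches so the site never sees one long unbroken burst."""
    for batch in chunked(links, threshold):
        scrape_batch(batch)  # run the spider on this batch only
        sleep(delay)         # back off before resuming with the next batch
```

The `sleep` function is injectable only so the loop is easy to test; in real use the defaults apply.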
(Note: the Scrapy spider script and the controller script should not be in the same folder.)
Using this method I scraped 70,000 items (around 200,000 requests in total) from a website in 2 hours. Scraping it the normal way, with the Scrapy settings (download delay, concurrent requests) turned down low enough that the site would not block the spider, would have taken 2 to 3 days. With Scrapy's default settings the spider got blocked after 400 requests, so I set the request threshold to 400 − 100 = 300 requests. The margin of 100 is there to stay safely below the block point; it depends on your use case, but I recommend 100. Once the spider hits the threshold, the controller stops it, waits for a while, and starts it again.
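The threshold arithmetic, and the slow throttled alternative it avoids, can be written down like this. `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS` are real Scrapy setting names, but the values shown are only illustrative, and the observed block point of 400 is specific to the site in this story:

```python
# Observed during analysis: the site blocks the spider after roughly
# this many requests under Scrapy's default settings.
OBSERVED_BLOCK_POINT = 400
SAFETY_MARGIN = 100  # stay well below the block point; tune per site

REQUEST_THRESHOLD = OBSERVED_BLOCK_POINT - SAFETY_MARGIN  # 400 - 100 = 300

# The slow alternative: throttle Scrapy itself in settings.py so the
# site never blocks you (illustrative values; this is what would have
# taken 2-3 days instead of 2 hours).
SLOW_SETTINGS = {
    "DOWNLOAD_DELAY": 3,       # seconds to wait between requests
    "CONCURRENT_REQUESTS": 1,  # no parallelism at all
}
```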
I wrote two Python scripts: one for the Scrapy spider and one for the controller. The controller holds the full list of links to be scraped and passes them to the spider in batches that stay below the 300-request threshold. The spider scrapes the batch of links it was given and stops; the controller then waits for a while, passes the next batch, starts the spider again, and so on until every link has been scraped.
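One way to sketch the controller side of that two-script setup is to launch the spider as a separate process for each batch. `scrapy crawl` and its `-a` flag for passing spider arguments are real Scrapy CLI features, but the spider name `myspider` and the `urls` argument are placeholders: the spider's `__init__` would be written to split the comma-joined string back into its `start_urls`:

```python
import subprocess
import time

def run_spider(batch):
    """Launch the spider on one batch of links as a separate process.
    The spider (hypothetical name 'myspider') would receive the batch
    via its `urls` argument and split it on commas."""
    subprocess.run(
        ["scrapy", "crawl", "myspider", "-a", "urls=" + ",".join(batch)],
        check=True,
    )

def controller(all_links, threshold=300, delay=5, runner=run_spider, sleep=time.sleep):
    """Feed the spider batches of at most `threshold` links, waiting
    `delay` seconds between runs so the site is not hit continuously."""
    for i in range(0, len(all_links), threshold):
        runner(all_links[i:i + threshold])  # spider scrapes this batch, then exits
        sleep(delay)                        # controller waits before the next batch
```

Running the spider as a fresh process per batch is also what lets the two scripts live in separate folders, as noted above.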
Here's the link to the code: https://github.com/vigneshgig/scrapingfaster