How to scrape a dynamic website using the API method

vignesh amudha
8 min read · Feb 22, 2020


Hi, in this blog we are going to see how to scrape a dynamic website, which is difficult and time-consuming to do with Selenium. Before getting to the topic, I will briefly cover the most important methods for scraping a website.

A static website is non-event-based: it serves a fixed data page that does not load any new data when a user event is triggered. The only exception is a next-page click event, which loads new data on a new page.

Static website examples: http://quotes.toscrape.com/ and www.google.com. Every Google page is static, but the Google Images page is dynamic: it loads new image data into the same page when we scroll down, not onto a new page.

A dynamic website loads new data into the same page in response to user events.

Dynamic website example: http://quotes.toscrape.com/scroll. If a user scrolls down to the end, an event is triggered and new quotes are loaded onto the same page below the previous quotes. On the Google Images page, scrolling down (auto-trigger) or clicking the view-more button (manual trigger) loads new image data into the same page alongside the images already loaded.
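To see what this looks like, open the Network tab on the scroll page: each scroll fires an XHR request that returns plain JSON. Below is a minimal sketch of calling that endpoint directly; the /api/quotes path, the page parameter, and the field names are what the page appears to use, so treat them as assumptions and verify them in DevTools.

import requests

# Endpoint observed in the browser's Network tab for quotes.toscrape.com/scroll.
# Treat the path, parameter, and field names as assumptions and verify them yourself.
response = requests.get("http://quotes.toscrape.com/api/quotes", params={"page": 1})
data = response.json()

for quote in data["quotes"]:
    print(quote["author"]["name"], "-", quote["text"][:60])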

If you are planning to scrape a website, I recommend following these steps.

Step_1: Check whether the website is static or dynamic, and also analyze the website structure.

Step_2: Select the scraping method

  • Scrapy: If the website is static. Scraping is fast and consumes little memory and processing. Official Link
  • Scrapy-Splash: If the dynamic website has a simple structure. Scraping is fast and consumes less memory and processing compared to Selenium. Official Link
  • Selenium: If the dynamic website's structure is complex and the amount of data to scrape is small or medium. Scraping is slower and consumes more memory and processing. Official Link
  • Selenium-Grid: For concurrent and distributed dynamic web scraping. Scraping is fast, but it consumes a lot of memory and processing, so a higher-end system is required to increase concurrency. Link
  • Selenium-SubProcess library: For concurrent dynamic web scraping. Scraping is faster because it does not wait for one Selenium Chrome driver instance to finish, as Selenium-Grid does, and if one instance fails, the rest do not stop or get stuck on it. In Selenium-Grid, by contrast, if one instance is stuck in a loop or fails, the entire system gets stuck. It also consumes a lot of memory and processing. The code is migrated from Selenium-Grid. Link
  • API: Every website has an API, which, to my knowledge, is the communication channel between the front end and the back end. Use this method if the website is dynamic, its structure is very complex, and the amount of data to scrape is very large. Scraping is faster and consumes less memory and processing, because we get the data in structured JSON format, so in most cases no rendering is needed.
  • Hybrid: I recommend mixing the above methods according to the website structure, to scrape faster and also reduce code complexity. For example, on most websites the page listing the items is dynamic while each item page is static. First I use one of Selenium, Selenium-Grid, Selenium-SubProcess, or the API method to scrape the full list of items and store it as a CSV file or in a database; then I use Scrapy or Scrapy-Splash to scrape the individual item pages. A sketch of this flow follows the list below.
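To make the hybrid idea concrete, here is a minimal sketch under assumed conditions: the item list comes from a hypothetical JSON API (the /api/items endpoint, its page parameter, and the items/has_next/url field names are all placeholders), and a small Scrapy spider then crawls the static item pages listed in the CSV.

import csv
import requests
import scrapy

# Phase 1 (API method): collect the item URLs into a CSV file.
# The endpoint and JSON field names below are hypothetical placeholders.
def collect_item_urls(path="items.csv"):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        page = 1
        while True:
            data = requests.get("https://www.example.com/api/items",
                                params={"page": page}).json()
            for item in data["items"]:
                writer.writerow([item["url"]])
            if not data.get("has_next"):
                break
            page += 1

# Phase 2 (Scrapy): crawl the static item pages listed in the CSV.
class ItemSpider(scrapy.Spider):
    name = "items"

    def start_requests(self):
        with open("items.csv") as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row["url"], callback=self.parse)

    def parse(self, response):
        # The selector is a placeholder -- adapt it to the real page structure.
        yield {"title": response.css("h1::text").get()}

Run the first phase as a plain script, then run the spider with scrapy runspider.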

I strongly recommend reading a tutorial on API web scraping first. Check out this Link.

Let's say we follow the above blog to scrape the quotestoscrape data. If we then try the same method on some other, more secured website, we may get an error response, because the request headers are not properly ordered or formatted when sent from Python code, or because we did not send the provisional request headers.

I have taken a random website while writing this blog, because I have forgotten the tough website that gave me a headache while scraping it.

Step_1: Go to www.********.com, then right-click and select Inspect, or press Ctrl+Shift+I or F12.

Step_2: Select the Network tab in the inspect panel.

Note: In the Network tab, select XHR in the filter menu. On quotestoscrape there are no junk files, but on other websites there will be a lot of files, so select XHR to filter out the junk files we do not need. Even after applying the XHR filter you will still find some junk files, so we have to find the right ones by monitoring the network when a click or scroll event happens, and by checking each API response to see whether it contains the data we want to scrape.

Step_3: Trigger a user event, such as clicking view more or scrolling down, and monitor the network. You will find new files being loaded in the network list. In the example below we get a single file only, but you may get multiple new files.

Step_4: Now we have to check which file contains the data we need to scrape. Click on that file, then go to the Response or Preview tab and check the data.

Step_5: If the data is correct, recreate the same request headers and query parameters in Postman and check whether the response data comes back.

Step_6: Then duplicate the request manually with the Python requests library, or use the Scrapy library directly, to scrape the data.

import requests

r = requests.get("http://www.example.com/", headers={"content-type": "text"})

But I recommend generating the Python request code with Postman, because some request headers can be complex to write correctly by hand.

Click on Code, then type "python" and select the Python requests snippet.
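The generated snippet will look roughly like the following; the URL and header values here are placeholders, and in practice you paste in the exact values captured from the browser.

import requests

url = "https://www.example.com/api/items?page=2"

# Placeholder headers -- replace with the exact values Postman captured.
headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "application/json, text/plain, */*",
    "Referer": "https://www.example.com/items",
    "X-Requested-With": "XMLHttpRequest",
}

response = requests.get(url, headers=headers)
print(response.status_code)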

Step_7: Finally, extract the data from the JSON-formatted response.
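For the quotes example, extraction is just walking the parsed JSON and writing out the fields you need. A minimal sketch follows; the endpoint and the quotes/has_next/author field names are assumptions taken from quotes.toscrape.com/scroll, so verify them in DevTools.

import csv
import requests

# Paginate through the JSON API and write the wanted fields to a CSV file.
with open("quotes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["author", "text", "tags"])
    page, has_next = 1, True
    while has_next:
        data = requests.get("http://quotes.toscrape.com/api/quotes",
                            params={"page": page}).json()
        for quote in data["quotes"]:
            writer.writerow([quote["author"]["name"], quote["text"],
                             ",".join(quote["tags"])])
        has_next = data.get("has_next", False)
        page += 1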

Step_8: If the response status is not 200, or the data is not returned, check that the request headers and query parameters are correct, and also check that the header fields are given in the same order as in the original request. If everything looks right but it still does not work, it means the request headers shown in DevTools are not the original headers: for security purposes, the browser modifies the request headers as soon as the request is finished. To overcome this problem, we take a screenshot of the original (provisional) request headers that the browser used to get the data from the backend server, before the browser changes them.
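Before reaching for the screenshot trick below, it can help to print what requests actually sent and compare it against the browser's headers. A small debugging sketch, with a placeholder URL and header value:

import requests

# Placeholder request -- substitute the real URL and the headers you copied.
response = requests.get("https://www.example.com/api/items",
                        headers={"User-Agent": "Mozilla/5.0 ..."})

print(response.status_code)      # anything other than 200 needs a second look
print(response.request.headers)  # the headers requests actually sent
# Compare these against the provisional headers captured from the browser; any
# mismatch (missing Cookie, wrong Referer, etc.) is the first suspect.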

Now click on the throttling dropdown and select the custom Add option.

Then click Add custom profile, type a profile name, set download to 10 and upload to 10, click Add, and close the dialog.

Select the custom throttling profile. We are now slowing down the request by limiting the internet speed.

Now click the view-more button or scroll down, depending on your event. As soon as you trigger it, click on the newly generated XHR file; you will see the provisional request headers shown with a warning icon. Immediately press the Print Screen button before they disappear! Bingo: these provisional request headers are the original ones, which the browser normally modifies as soon as the request is finished. But now the request takes some time to finish, and in that window we take the screenshot, then build the Python request again with those headers and hope for the best result.

Bonus:

In this small section, I am going to talk about how to find a proper website to scrape data from. Let's say I want to scrape a lot of real-world visiting card data, and I have a sample image of a visiting card. First I use the sample image in Google Image search to get related images of the sample visiting card. Then I look for the websites that are repeated most often in the Google Image search results. After that, I take one or more of those highly repeated websites, check that they contain the relevant data, and start scraping. I cannot guarantee that this works every time. You can also scrape the Google Image search results using this Chrome extension.

Or, if you do not have a sample image, you can go with a Google image downloader Python library. But I recommend using a sample image, because it gives much more closely related images compared to a normal text-based Google Image search.

Disclaimer: For the record, I scraped the data from the website for educational purposes only; I am not promoting any illegal activity here. This is purely for educational purposes, and I am not responsible if you use it for illegal purposes.

Thanks,

If you liked it, please share and give a clap.

Thanks to Siva Prakash (LinkedIn: https://www.linkedin.com/in/sivaprakash-desingu/), who gave me the idea of provisional headers.
