Parallel Scraping of Dynamic Websites Using Selenium
Hi, in this blog I am going to explain how to scrape a website in parallel using Selenium. I got this idea from Selenium Grid, which is also used for concurrent scraping with Selenium. The problem I found with Selenium Grid is that if any one of the Chrome instances fails to scrape, or a page takes too long to load, the grid ends up waiting for all the Chrome instances to finish. For example, say I start Selenium Grid with 10 Chrome instances using Docker Swarm and send it 1,000 URLs to scrape. The grid takes 10 URLs at a time and runs them concurrently in the 10 Chrome instances; after the first 10 finish it takes the next 10, and so on. If any one Chrome instance fails and gets stuck, the upcoming URLs get stuck behind it as well.

I had already done multiprocessing in other Python scripts using the subprocess library to run tasks in parallel and independently, so here I used subprocess to run multiple instances of the Selenium Chrome driver concurrently. We can use this technique for any Python script.
I would recommend reading about Selenium Grid and getting some hands-on experience with it first.
OK, let's jump straight into the code. I have written two scripts: one is the main script, which is the Selenium script (or any script you want to run in multiple instances), and the other is the control script, which controls the main script.
The main script has four functions (a sketch follows the list):
- get_driver() - initializes the Selenium Chrome driver and sets the Chrome driver options.
- connect_to_base() - checks the connection to the given URL and returns the rendered HTML page source.
- parse_html() - parses the HTML page source returned by connect_to_base().
- run_process() - the main function that ties all of the above together according to the website structure or the scraping data flow.
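To make those four functions concrete, here is a minimal sketch of the main script. The Chrome options, the fields extracted in parse_html(), and the per-URL output file are my assumptions for illustration; the original script from the Selenium Grid tutorial differs in its details.

```python
# main_scraper.py - minimal sketch of the main (Selenium) script.
# The selectors, options and output handling are illustrative assumptions.
import csv
import sys

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def get_driver():
    # Initialize a headless Chrome driver with a few common options.
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    return webdriver.Chrome(options=options)


def connect_to_base(driver, url):
    # Load the URL and return the rendered page source, or None on failure.
    try:
        driver.get(url)
        return driver.page_source
    except Exception as exc:
        print(f"Failed to load {url}: {exc}")
        return None


def parse_html(html):
    # Parse the rendered HTML; the field pulled out here is a placeholder.
    soup = BeautifulSoup(html, "html.parser")
    return {"title": soup.title.string if soup.title else ""}


def run_process(row_id, url):
    # Glue the steps together for a single URL and write the result to a file.
    driver = get_driver()
    try:
        html = connect_to_base(driver, url)
        if html:
            data = parse_html(html)
            with open(f"output_{row_id}.csv", "w", newline="") as f:
                csv.writer(f).writerow([row_id, url, data["title"]])
    finally:
        driver.quit()


if __name__ == "__main__":
    # The control script passes the id and the URL as command-line arguments.
    run_process(sys.argv[1], sys.argv[2])
```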
Finally, the control script (a sketch follows the list):
- First, it reads the CSV file which contains the id and the website we want to scrape.
- Second, it iterates over the list in ranges (the number of URLs we want to run concurrently), which should be set by the user according to the system memory and processing speed.
- Third, it iterates over a set of URLs one by one, passes each one as an argument to the Selenium main script, and runs it independently with subprocess.Popen. Because Popen launches each process asynchronously and the process is closed as soon as it finishes, there are no dependencies between the subprocesses. If instead you make the launch synchronous and wait on each process, already finished processes end up waiting for the others to complete, just like in Selenium Grid.
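Here is a minimal sketch of the control script under those assumptions. The CSV is assumed to have id and url columns, the main script is assumed to be called main_scraper.py, and the per-process timeout is my addition so that one stuck Chrome instance cannot hold up the URLs that come after it.

```python
# control.py - minimal sketch of the control script. The CSV layout
# (columns "id" and "url"), the batch size, the main-script filename and the
# per-process timeout are assumptions for illustration.
import csv
import subprocess

BATCH_SIZE = 10        # how many URLs to run concurrently; tune to your machine
TIMEOUT_SECONDS = 120  # kill a stuck chromedriver instead of waiting forever


def read_rows(path):
    # Read (id, url) pairs from the input CSV.
    with open(path, newline="") as f:
        return [(row["id"], row["url"]) for row in csv.DictReader(f)]


def main():
    rows = read_rows("websites.csv")
    # Iterate over the list in ranges of BATCH_SIZE.
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        # Popen returns immediately, so every URL in the range runs in its own
        # chromedriver process, concurrently and independently.
        procs = [
            subprocess.Popen(["python", "main_scraper.py", row_id, url])
            for row_id, url in batch
        ]
        # Collect the batch; a stuck process is killed after the timeout so it
        # cannot block the next range of URLs indefinitely.
        for p in procs:
            try:
                p.wait(timeout=TIMEOUT_SECONDS)
            except subprocess.TimeoutExpired:
                p.kill()


if __name__ == "__main__":
    main()
```

Because each URL gets its own driver process, a crash in one process never takes down the others; at worst the slowest member of a batch delays the next range, and the timeout bounds that delay.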
Note: the Selenium main script is taken from the Selenium Grid tutorial mentioned above; I have just modified it a bit, so I recommend checking out that tutorial.
Thanks,
If you liked it, please share it and give it a clap.