How to scrape a highly secured website using a robotic process automation tool

vignesh amudha
3 min read · Mar 17, 2019


Hi all. In this post I explain how to scrape a website using pywinauto, a Python library that can be used for RPA (robotic process automation). When I needed to scrape a website for my project, I started with the Scrapy framework, but the site blocked me even though I tried several techniques. Then I tried Selenium, the well-known web application testing tool, and the site blocked that too. Surprisingly, not even a single page loaded.

Most website backends block scraping frameworks by counting the requests coming from each IP address: if an IP address crosses the threshold number of requests, the server blocks it for an hour or a day. So most of the time we either slow down the scraping requests, or hide our IP address behind proxies and rotate them so that we can scrape without compromising speed. When a site blocks by IP address, the server's security takes some time to react, so the scraper can run for a while until it crosses the request threshold.

This website, however, did not allow even a single page to load. So I opened the Chrome browser that Selenium drives and loaded the website manually. It was blocked again. That is when I realised the site was somehow detecting the Scrapy and Selenium tools themselves. After reading a couple of blogs, I learned that website security software may collect the open source scraping tools and analyse how their requests differ from normal requests, so that when a scraping tool hits the website, the security bot can tell the difference. I am not sure about this, though. That is how I came to know about robotic process automation tools.
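The request-threshold blocking described above can be illustrated with a small simulation. This is only a toy model, not the logic of any real site: the `IpRateLimiter` class, the limits (5 requests per 10 seconds, a one-hour block) and the IP addresses are all assumptions made up for the example.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Toy per-IP limiter: block an IP that sends more than
    max_requests within `window` seconds (hypothetical values)."""

    def __init__(self, max_requests, window, block_time):
        self.max_requests = max_requests
        self.window = window
        self.block_time = block_time
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blocked_until = {}         # ip -> time at which the block expires

    def allow(self, ip, now):
        # An IP that is still blocked is rejected immediately.
        if self.blocked_until.get(ip, 0) > now:
            return False
        recent = self.hits[ip]
        # Forget requests that have fallen out of the time window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Threshold crossed: block this IP for block_time seconds.
            self.blocked_until[ip] = now + self.block_time
            recent.clear()
            return False
        return True


limiter = IpRateLimiter(max_requests=5, window=10, block_time=3600)

# A fast scraper: one request per second from the same IP.
fast = [limiter.allow("10.0.0.1", now=t) for t in range(8)]

# A throttled scraper: one request every 3 seconds from another IP.
slow = [limiter.allow("10.0.0.2", now=t * 3) for t in range(8)]

print(fast)  # [True, True, True, True, True, False, False, False]
print(slow)  # [True, True, True, True, True, True, True, True]
```

The fast client gets a few pages through before the block kicks in, while the throttled one never crosses the threshold. That matches the behaviour described above for IP-based blocking, and it is different from a site that refuses even the very first page.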

What is Robotic Process Automation?

Robotic Process Automation (RPA) allows organizations to automate tasks just as a human being would do them across applications and systems. Robotic automation interacts with the existing IT architecture, with no complex system integration required.

In my project, I wanted to scrape the CEO details for a given list of company names. To use RPA, the task has to be a repeated process.

For each company, I first search for the company name, then click the first link in the search results, then save the page, and so on for the next company. This process is coded in Python using pywinauto, a library for RPA. We also have to tell pywinauto where to click, so you need the mouse coordinates, which you can get with any software that reports the x and y screen coordinates. After the RPA process has finished, use Scrapy to extract the particular data from the locally saved web page files.

import os
import time

import pymongo
import pywinauto.keyboard
import pywinauto.mouse
from pywinauto import Application

# Path of the Chrome executable (quoted because it contains spaces)
chrome_dir = r'"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"'

# Initialize the MongoDB client
client = pymongo.MongoClient('localhost', 27017)
db = client['ausceo']
new = client['ausceonew']  # second database; not used in this snippet

# Start Chrome maximized with accessibility enabled (optional flag: --incognito)
chrome = Application(backend='uia')
chrome.start(chrome_dir + ' --force-renderer-accessibility --start-maximized https://domain.com')

# Pages saved by earlier runs, identified by the first 4 characters of the file name
path = 'D:\\Download\\'
html_files = [f for f in os.listdir(path) if f.endswith('.html')]
html_files = [d[:4] for d in html_files]

cursor = db['ausceo'].find(no_cursor_timeout=True)
i = 0
for data in cursor:
    print(data["company_name"])
    if data['company_name'][:4] in html_files:
        continue  # this company's page is already saved
    time.sleep(15)
    # (830, 129): click the search box
    pywinauto.mouse.click(button='left', coords=(830, 129))
    time.sleep(5)
    # Type the company name; spaces are sent as {VK_SPACE}
    company_name = data["company_name"].replace(' ', r"{VK_SPACE}")
    pywinauto.keyboard.send_keys(company_name)  # named SendKeys in older pywinauto releases
    time.sleep(7)
    # (811, 160): click the first search result
    pywinauto.mouse.click(button='left', coords=(811, 160))
    time.sleep(11)
    # (1150, 50): third click, presumably the one that saves the page
    pywinauto.mouse.click(button='left', coords=(1150, 50))
    i = i + 1
    print('--------------', i, '----------------------------')
cursor.close()
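One thing to watch in the script above: pywinauto's key-sequence syntax gives special meaning to the characters `+ ^ % ~ ( ) { }`, so a company name containing any of them would not be typed literally. A minimal sketch of an escaping helper follows. The name `to_key_sequence` and the sample company name are hypothetical; the brace-escaping rule is the one described in the pywinauto keyboard documentation.

```python
# Characters with special meaning in pywinauto's key-sequence syntax.
SPECIAL_CHARS = set("+^%~(){}")


def to_key_sequence(text):
    """Convert plain text into a key sequence that types it literally."""
    parts = []
    for ch in text:
        if ch == " ":
            parts.append("{VK_SPACE}")    # same substitution as the script above
        elif ch in SPECIAL_CHARS:
            parts.append("{" + ch + "}")  # e.g. "(" becomes "{(}"
        else:
            parts.append(ch)
    return "".join(parts)


print(to_key_sequence("A+B (Holdings)"))
# prints: A{+}B{VK_SPACE}{(}Holdings{)}
```

The result can be passed to the key-sending call in place of the plain `replace(' ', ...)` string.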

One disadvantage of pywinauto is that we can't use the system while this process is running. If you disturb the system or the mouse, a click can land on a different link, and you have to start the process from the beginning.
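Once the pages are saved, the extraction step mentioned earlier runs entirely offline. The sketch below uses the standard library's `html.parser` instead of Scrapy so that it runs without extra packages. The class name `ceo-name`, the sample markup and the helper names are invented for illustration, since the real site's HTML is not shown here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested tag inside the target
        elif self.target_class in classes:
            self.depth = 1           # entering the target element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True         # only the first match is kept

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

    def text(self):
        # Join the pieces and normalise whitespace.
        return " ".join("".join(self.chunks).split())


def extract_from_file(path, target_class):
    """Parse one locally saved page and return the matching text."""
    with open(path, encoding="utf-8") as fh:
        parser = ClassTextExtractor(target_class)
        parser.feed(fh.read())
    return parser.text()


# Hypothetical markup standing in for a saved page.
sample_page = """
<html><body>
  <h1>Example Pty Ltd</h1>
  <div class="profile">
    <span class="label">CEO</span>
    <span class="ceo-name">Jane <b>Doe</b></span>
  </div>
</body></html>
"""

parser = ClassTextExtractor("ceo-name")
parser.feed(sample_page)
ceo = parser.text()
print(ceo)  # prints: Jane Doe
```

For the real files, `extract_from_file` would be called once for each `.html` file in the download folder, with the class or tag adjusted to whatever the saved pages actually contain.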

One more tip: with Selenium you can use an extension like hotshield if your IP address is blocked for a long time. In hotshield you can disable cookies and other JavaScript injection so that the website can’t identify you as a bot. Some websites use cookies and JavaScript injection to check whether Chrome is being driven by a human or by a Selenium bot.

RPA + machine learning can do any human task that is repeated, with machine learning handling some of the randomness in the process. So if anyone knows an open source RPA + machine learning library, please leave it in the comments.

GitHub link: https://github.com/vigneshgig/pywinauto.git
