How to Extract an OCR Training Dataset Using Selenium

vignesh amudha
3 min read · Apr 14, 2020

Hi everyone, recently I have been working on text extraction. While doing that, I got the idea to prepare a dataset for training an OCR model. In this blog, I will explain how to extract a text dataset from different websites using Selenium.

To get a generalized OCR model, we need a variety of text font styles and a variety of text lengths. So I searched for an OCR dataset and found a couple of good synthetic OCR datasets, but none of them had longer text. So I planned to prepare my own dataset, and then it struck me that websites already have many different font styles and many different text or sentence lengths. I know the basics of web applications: in HTML, text is rendered in the browser with different styles, and every text element (like any other element) has coordinate points that determine where it is rendered on the page. Using these coordinates, we can crop the image of a particular piece of text, extract the corresponding raw text, and store it in a CSV file.
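To make that concrete, here is a minimal sketch of the coordinate-cropping idea, assuming Chrome, a placeholder URL, and a paragraph element that sits inside the initial viewport (this is not the exact code from my scripts):

# Crop a text element's image out of a viewport screenshot using
# the element's rendered coordinates.
from io import BytesIO

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

element = driver.find_element(By.TAG_NAME, "p")
left, top = element.location["x"], element.location["y"]
width, height = element.size["width"], element.size["height"]

# Screenshot the visible viewport; assumes the element is inside it.
page = Image.open(BytesIO(driver.get_screenshot_as_png()))
crop = page.crop((left, top, left + width, top + height))
crop.save("text_image.png")

raw_text = element.text  # the ground-truth label for this crop
driver.quit()

One caveat: on high-DPI screens the screenshot can be larger than the CSS coordinates, so the crop box may need to be scaled by the device pixel ratio.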

The code consists of three Python files:

  • Selenium_website_link_extraction.py — extracts every link on a website and saves them as a text file. Input: Domain_url.txt; output: url.txt, containing all the links found under each domain. (A sketch of this step follows the list.)
  • Selenium_website_text_extraction.py — extracts the text images and the raw text, using the coordinate-cropping idea sketched above. Input: the url.txt file from the previous step; output: the cropped text images and a CSV file that maps each cropped image's path to its raw text.
  • Preprocessing.py — eliminates a text sample if its length is greater than a certain threshold. Input: the CSV file and a threshold set according to your purpose; output: a pruned CSV file. (A sketch of this also follows the list.)
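Here is a minimal sketch of what Selenium_website_link_extraction.py does, assuming Chrome and one starting URL per line in Domain_url.txt (the file names come from the list above; the rest is my simplification):

# Collect the href of every anchor tag on each starting page.
from urllib.parse import urljoin

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
collected = set()

with open("Domain_url.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

for domain in domains:
    driver.get(domain)
    for anchor in driver.find_elements(By.TAG_NAME, "a"):
        href = anchor.get_attribute("href")
        if href:
            collected.add(urljoin(domain, href))

with open("url.txt", "w") as f:
    f.write("\n".join(sorted(collected)))

driver.quit()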
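And a minimal sketch of the Preprocessing.py step, assuming pandas and a CSV with image_path and text columns (the column names are my assumption, not from the original code):

# Drop samples whose raw text exceeds the length threshold.
import pandas as pd

MAX_LEN = 100  # threshold; set it according to your purpose

df = pd.read_csv("dataset.csv")
pruned = df[df["text"].str.len() <= MAX_LEN]
pruned.to_csv("pruned_dataset.csv", index=False)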

Note: Some text samples will be mismatched for the following reasons:

  • When taking a full-page screenshot, the stitching may not come out well, because some elements are not rendered properly or the stitching algorithm itself falls short. I took this stitching code from elsewhere, and it works fine in most cases; if you can come up with a more robust algorithm, please let me know in the comments. (A rough sketch of the idea follows this list.)
  • Sudden notifications or pop-up messages. It is very difficult to control these kinds of notifications. To avoid them, you have to go through each website and adapt the code accordingly, or avoid that kind of website altogether.
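For reference, here is a rough sketch of the scroll-and-stitch approach (the stitching code I actually used came from elsewhere; this only illustrates the idea, again assuming Chrome and a placeholder URL):

# Capture the viewport repeatedly while scrolling and paste each
# capture into one tall image.
from io import BytesIO

from PIL import Image
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

total_height = driver.execute_script("return document.body.scrollHeight")
view_height = driver.execute_script("return window.innerHeight")
view_width = driver.execute_script("return window.innerWidth")

stitched = Image.new("RGB", (view_width, total_height))
offset = 0
while offset < total_height:
    driver.execute_script(f"window.scrollTo(0, {offset});")
    shot = Image.open(BytesIO(driver.get_screenshot_as_png()))
    # When the remaining page is shorter than the viewport, the
    # browser clamps the scroll, so the last chunk overlaps the
    # previous one: one reason mismatches show up at the bottom.
    stitched.paste(shot, (0, offset))
    offset += view_height

stitched.save("full_page.png")
driver.quit()

Lazy-loaded elements that render only after scrolling are another source of the mismatches described above.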

If you have a small dataset, you should definitely check for this kind of slightly mismatched data; otherwise it will collapse your model.

If you have a larger dataset, say 40 GB, it is okay to neglect the small amount of mismatch; that is, you can train on the dataset with the mismatches left in. I would not recommend training on mismatched data if you do not have enough information about the model and enough experience.

In this extraction, most of the mismatches occur at the top or at the bottom of the page, and almost never in the middle. My guess is that mismatches at the top are caused by notifications and sudden changes in the elements, and mismatches at the bottom by the stitching algorithm; look at the stitching algorithm and you will see what I mean. I did not work on rectifying this mismatch problem because I already have a large amount of data, nearly 150 GB, combining open-source synthetic datasets and my own scraped dataset. So I did not spend time on a problem that happens on only some websites, and only at the top and bottom of the page.

Code:

GitHub:

LinkedIn:

If you like this blog, please give it a clap and share it. Thanks.
