Tesseract and Pytesseract for OCR
Learn about Optical Character Recognition and how Tesseract can help you to perform OCR on an image.
Introduction to OCR
The term OCR stands for Optical Character Recognition. OCR deals with the problem of recognizing handwritten and printed characters in an image and converting them into a machine-readable, digital format. To do this efficiently and accurately, OCR is usually broken down into several sub-processes:
- Preprocessing of the image
- Text localization
- Character segmentation
- Character recognition
- Post-processing
The exact steps can differ on a case-by-case basis, but these are the ones generally needed to perform OCR on printed and handwritten characters.
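To make the five sub-processes concrete, here is a minimal, pure-Python sketch that runs them on a tiny hand-built image. It is a toy illustration, not how Tesseract works internally: the 3x3 glyph templates, the function names, and the exact-match recognizer are all assumptions made for this example.

```python
# Toy OCR pipeline. Everything here is illustrative: the 3x3 templates and
# the function names are hypothetical, and real engines are far more robust.

# Glyph templates: 1 = ink, 0 = background.
TEMPLATES = {
    "H": ((1, 0, 1), (1, 1, 1), (1, 0, 1)),
    "I": ((1, 1, 1), (0, 1, 0), (1, 1, 1)),
}


def preprocess(gray, threshold=128):
    """Preprocessing: binarize so that dark pixels become ink (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def localize(binary):
    """Text localization: keep only the band of rows that contains ink."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []
    return binary[rows[0]:rows[-1] + 1]


def segment(region):
    """Character segmentation: split the region on blank columns."""
    if not region:
        return []
    width = len(region[0])
    glyphs, start = [], None
    for c in range(width + 1):
        has_ink = c < width and any(row[c] for row in region)
        if has_ink and start is None:
            start = c  # a new character begins here
        elif not has_ink and start is not None:
            glyphs.append(tuple(tuple(row[start:c]) for row in region))
            start = None  # the character ended at the previous column
    return glyphs


def recognize(glyph):
    """Character recognition: exact template match, '?' if unknown."""
    for label, template in TEMPLATES.items():
        if glyph == template:
            return label
    return "?"


def postprocess(text):
    """Post-processing: here, just trim surrounding whitespace."""
    return text.strip()


def run_ocr(gray):
    """Chain the five stages and return the recognized text."""
    binary = preprocess(gray)
    region = localize(binary)
    glyphs = segment(region)
    return postprocess("".join(recognize(g) for g in glyphs))


# A 5x9 grayscale image spelling "HI": 0 is black ink, 255 is white paper.
W, B = 255, 0
image = [
    [W, W, W, W, W, W, W, W, W],
    [W, B, W, B, W, B, B, B, W],
    [W, B, B, B, W, W, B, W, W],
    [W, B, W, B, W, B, B, B, W],
    [W, W, W, W, W, W, W, W, W],
]

print(run_ocr(image))
```

Each function maps to one item in the list above, which is why the stages can be swapped out independently. For example, a real system would replace the exact template match with a trained classifier while leaving the other stages in place.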
Introduction to Tesseract
Tesseract is an open-source OCR engine that has gained popularity among OCR developers. Despite sometimes being painful to implement and modify, Tesseract was for a long time one of the best free OCR options available. Tesseract began as a Ph.D. research project at HP Labs, Bristol, and was developed by HP between 1984 and 1994. In 2005, HP released Tesseract as open-source software. Since 2006, it has been developed and maintained by Google. Tesseract can be used from a variety of programming languages and frameworks through wrappers that can be found here.
Pytesseract
From the link mentioned above, you can find that pytesseract is a Python wrapper for Tesseract OCR. Pytesseract cannot perform OCR on its own: it calls the Tesseract engine, so the Tesseract software must be installed on our system before we can perform OCR on images.
If you want to install Tesseract on your local system, please check out the Appendix section.
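Because pytesseract depends on the Tesseract executable, it can help to check for the executable before attempting OCR. The sketch below assumes that Tesseract is exposed on the system PATH as `tesseract` and that the `pytesseract` and `Pillow` packages are installed. The helper names `tesseract_available` and `ocr_image` are hypothetical and chosen only for this example.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None


def ocr_image(path):
    """Run OCR on the image at `path` and return the extracted text.

    Raises RuntimeError if the Tesseract executable cannot be found.
    """
    if not tesseract_available():
        raise RuntimeError("Tesseract is not installed or not on the PATH")
    # Imported lazily so this module still loads when the Python
    # packages are missing; both are third-party dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


print("Tesseract found on PATH:", tesseract_available())
```

With Tesseract installed, a call such as `ocr_image("sample.png")` would return the recognized text as a string, where `sample.png` is a placeholder file name. If the executable lives outside the PATH, pytesseract also lets you point to it by setting `pytesseract.pytesseract.tesseract_cmd`.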