Thursday, May 17, 2007

PyTesser: OCR for Python

In two years of using Python, I have never once searched for "Python X" without finding PyX, someone's labor of love that makes X easy to understand and use. Many examples come to mind. That's what hooked me when I started out: any tool I wanted was at my fingertips in five minutes, and it just worked.

Optical character recognition was just about the only exception. So, I got excited when Google released Tesseract OCR, a straightforward, relatively accurate OCR package written in C++. Tesseract didn't have Python bindings, but it didn't take much work with PIL and subprocess to make it act like it did.

Behold! PyTesser.

>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract executable on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
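
For the curious, the wrapping trick boils down to something like the sketch below. This is not PyTesser's actual source: the names build_command and ocr_image_file are made up for illustration, it is written for Python 3, and it assumes a tesseract executable on the PATH that is called as "tesseract <image> <output base>" and writes its result to "<output base>.txt". Starting from a PIL image instead of a file would simply add a save-to-scratch-file step first.

```python
import os
import subprocess
import tempfile


def build_command(image_path, output_base):
    """Return the argument list for one tesseract run (hypothetical helper)."""
    # tesseract appends ".txt" to the output base on its own.
    return ["tesseract", image_path, output_base]


def ocr_image_file(image_path):
    """Run the tesseract executable on an image file and return its text."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        output_base = os.path.join(scratch_dir, "ocr_result")
        # check=True raises CalledProcessError if tesseract exits non-zero;
        # stderr is captured so the startup banner stays off the console.
        subprocess.run(
            build_command(image_path, output_base),
            check=True,
            stderr=subprocess.PIPE,
        )
        with open(output_base + ".txt", encoding="utf-8") as result_file:
            return result_file.read()
```

With Tesseract installed, ocr_image_file('fnord.tif') should return whatever text the engine recognizes in that file.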

20 comments:

clofresh said...

This is both awesome and frightening. Does this mean that captchas are now obsolete?

Michael J.T. O'Kelly said...

Captchas are usually designed specifically to defeat OCR, and they're still way ahead in the arms race. Even if OCR got better, captchas could turn to bigger guns.

Anonymous said...

Captchas are overrated. Even gocr, which is not a good OCR engine, can defeat many of them. Serious software can break basically every one of them.

david.baird said...

How do captchas help blind people? ;-)

Awesome.

Steven said...

About your approach of wrapping an API around a binary: do you have any second thoughts about doing this? I am in a similar situation trying to make an API out of a few command line tools, but I feel 'bad' about it because I could technically dive into the C++ code and use SWIG or ctypes or something to get 'real' Python bindings.

I guess my uneasiness comes from the fact that there are more potential areas for it to break (e.g. different/incompatible versions of the binary, different executing environment, relying on the user to provide the binary if you can't package it with your API, etc.) when compared to using SWIG or something.

I guess I just want some more justification for this approach :) Any thoughts?

Michael J.T. O'Kelly said...

Hey, Steven. I had your thoughts exactly, and it's true that PyTesser isn't beautiful right now.

But I figure, it's better to publish something immediately that works and people will use than hold back for an elegance I might never reach.

I'd like to ctype it eventually (the new .dll is nice for this). Having people already interested and using it makes that feel a lot more worth spending time on.

So, go wrap your command line tools! Which ones are you thinking of using?

Steven Kryskalla said...

I wrote a few Python scripts to drive an SVG template -> generated SVGs -> multi-page PDF workflow using the command line options of Inkscape and Ghostscript. Since it works nice for me and could potentially be useful for others I thought of expanding on it and releasing it... which prompted me to think about making 'real' Python bindings (AFAIK there aren't any for Inkscape or GS).

You're right that it's better to have something working and released than waiting for something better. I guess I'll get started then!

BTW, have you added PyTesser to the Cheeseshop? That way people searching the package index can come across it and easy_install the package.

Michael J.T. O'Kelly said...

That's great! Let me know when you have something published.

I added PyTesser to PyPI. Don't I need to make it into a .egg for easy_install to work?

Steven Kryskalla said...

Yes, you'll need to make an .egg (and write a setup.py file) for easy_install to work. Titus Brown's 30 seconds to create an egg tutorial will cover the bare minimum for that :) Creating and uploading the egg to your Google code download page should be sufficient for it to be easy_installed.

Philip said...

image_file_to_string didn't work for me at all; I haven't looked into why yet.
I also have a minor problem with image_to_string: it doesn't work with RGBA images (PIL can't save RGBA to .bmp).
Easy fix: add
if im.mode=='RGBA': im=im.convert('RGB')
before the call to im.save in image_to_scratch in util.py

Another minor problem: Tesseract prints "Tesseract Open Source OCR Engine" to standard error when you run it. Not good if you're going to be running this as a CGI process or something.
Fixed by changing the subprocess call to:
proc = subprocess.Popen(args,stderr=subprocess.PIPE)
So we're quietly discarding anything printed on stderr.
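
A small, self-contained sketch of that stderr handling, written for Python 3. To keep it runnable without Tesseract installed, a tiny Python child process stands in for the tesseract executable, so the banner and the "fnord" output here are simulated. One detail worth adding when PIPE is used: calling communicate() drains the pipe, so the child can't stall on a full stderr buffer.

```python
import subprocess
import sys

# Stand-in for the tesseract executable (an assumption for this demo):
# it writes a banner to stderr and the "recognized" text to stdout.
fake_tesseract = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Tesseract Open Source OCR Engine\\n'); print('fnord')",
]

proc = subprocess.Popen(
    fake_tesseract,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,  # decode the pipes to str instead of bytes
)
# communicate() reads both pipes to the end and waits for the child to exit.
stdout_text, stderr_text = proc.communicate()

# The banner is held in stderr_text and never reaches the console.
print(stdout_text.strip())
```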

Michael J.T. O'Kelly said...

Steven: I'll have to find 30 seconds somewhere and do that. :)

Philip: Thanks for the bug report, your fixes will be part of the next release.
If you want to give more info on the image_file_to_string problem (there's an Issues tab on the PyTesser page), I'll look into it as well.

Philip said...

Got a chance to look into it.
The problem is that the check_for_errors function is assuming that a logfile always gets written on an error.
That's not true on Linux, where Tesseract simply prints the error to standard error.
So the check_for_errors function needs to be modified to check stderr for error text as well.
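
Roughly what that could look like, as a sketch rather than PyTesser's real code. The signature, the TesseractError exception, the default log file name, and the rule "a non-zero exit status, a log file, or a stderr line containing 'error' means failure" are all assumptions made for illustration.

```python
import os


class TesseractError(Exception):
    """Hypothetical exception for a failed tesseract run."""


def check_for_errors(returncode, stderr_text, logfile="tesseract.log"):
    """Raise TesseractError if the log file or stderr reports a problem."""
    messages = []
    # Assumption: a log file only appears when something went wrong.
    if os.path.exists(logfile):
        with open(logfile) as log:
            messages.append(log.read().strip())
    # Elsewhere errors only go to stderr. The startup banner lands there
    # too, so keep just the lines that mention an error.
    messages.extend(
        line.strip()
        for line in stderr_text.splitlines()
        if "error" in line.lower()
    )
    messages = [message for message in messages if message]
    if returncode != 0 or messages:
        raise TesseractError(
            "\n".join(messages)
            or "tesseract exited with status %d" % returncode
        )
```

The banner alone passes silently, while an error line on stderr or a non-zero exit status raises.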

Michael J.T. O'Kelly said...

OK, I have a fix in the works for 0.0.2. Instead of using PIPE, I now send stderr to a StringIO object. check_for_errors then uses it in the same way as it would otherwise use the logfile.

alex said...

You just made my book translation project so much easier for me. I swear I searched for Python bindings to Tesseract, because I was also thinking of doing what you did. It seems I may have overlooked something. Either that, or Google didn't come through for me this one time. I used other OCR software to do it for me instead. Now I'm thinking of building a translation framework (in the works) to simplify translating books. That way all you have to do is scan the books in and run them through the framework, and they get translated from any language to any language, word for word...

Guest said...

Check out http://wiki.github.com/hoffstaetter/python-tesseract . This project looks quite similar, but is more actively maintained.

jonathan lim said...

Thank you for posting this. I just started searching for something Python- and OCR-related and found this.

Anonymous said...

Does anyone know how to use pytesser to recognize a string or image so that python can evaluate it?
Or, to solve this captcha for posting?

Anonymous said...

Sorry, I mean on the internet.

Anonymous said...

import cv
import pytesser
im=Image.open('f.tif')
print image_to_string(im)


I am using Python 2.7 and I get the following error:

Traceback (most recent call last):
  File "G:\OpenCV2.2\samples\python\pyteresa1.py", line 3, in <module>
    im=Image.open('f.tif')
NameError: name 'Image' is not defined



Can someone please help me out?

Thanks in advance.

Pdub said...

Make sure you have PIL installed and also make sure tesseract.exe is in your PATH.
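
The NameError itself looks like a namespace issue: "import pytesser" binds only the name pytesser, whereas the example in the post uses "from pytesser import *", which is what brings Image and image_to_string into scope. Here is a runnable illustration of the same behavior (Python 3), using the standard library's math module as a stand-in so it works without pytesser installed:

```python
import math  # binds only the name `math`, just as `import pytesser` would

try:
    sqrt(16.0)  # the bare name `sqrt` was never bound
except NameError as exc:
    error_message = str(exc)

# Importing the name itself makes the bare call work;
# `from math import *` would do the same for every public name.
from math import sqrt

result = sqrt(16.0)
print(error_message)
print(result)
```

For the script above, that would mean either switching to "from pytesser import *" as in the post, or qualifying the calls as pytesser.Image.open(...) and pytesser.image_to_string(...).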