Thursday, May 24, 2007

ABT

Today I passed 8.901, Astrophysics I, my final breadth requirement. That makes me officially ABD (All But Dissertation). Expect "Silencing and Recombination in Yeast Ribosomal DNA" at your local bookstore around December. A movie deal with Burger King tie-in should follow shortly thereafter.

Monday, May 21, 2007

CAPTCHAs and Tesseract

CAPTCHAs always seem to come up in discussions of OCR projects like Tesseract, so I decided to test Tesseract to see if it was actually the next big thing in spammer technology.

A commenter pointed me to a Berkeley effort to defeat CAPTCHAs that handled 92% of a selection of text-images from 2002 using a tailored OCR algorithm. I used PyTesser and difflib to run through the images and check the results.

Tesseract read the image correctly for 36 out of 191 images (19%), and was close (within one character) for 5 more. Here are a few of the harder images it was able to crack:





Tesseract had trouble most frequently on text that was more skewed, had lots of distracting dots, or white or black lines crisscrossing the words. With a small amount of image preprocessing (removing speckles and narrow lines), it might do much better on this old set. On modern CAPTCHAs, though, it's probably SOL.

Thursday, May 17, 2007

PyTesser: OCR for Python

In two years using Python, I never once searched for "Python X" without finding PyX, someone's labor of love making X easy to understand and use. Many examples come to mind. That's what hooked me when I started out: any tool I wanted was at my fingertips in 5 minutes, and just worked.

Optical character recognition was just about the only exception. So, I got excited when Google released Tesseract OCR, a straightforward, relatively accurate OCR package written in C++. Tesseract didn't have Python bindings, but it didn't take much work with PIL and subprocess to make it act like it did.

Behold! PyTesser.

>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract executable on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord