Monday, May 21, 2007

CAPTCHAs and Tesseract

CAPTCHAs always seem to come up in discussions of OCR projects like Tesseract, so I decided to test Tesseract to see if it was actually the next big thing in spammer technology.

A commenter pointed me to a Berkeley effort to defeat CAPTCHAs that, using a tailored OCR algorithm, handled 92% of a selection of text images from 2002. I used PyTesser and difflib to run through the images and check the results.
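A minimal sketch of what the checking step could look like, counting exact reads and near misses (at most one character off) with difflib. The original script is not shown in this post, so the function names `char_errors` and `score` are made up for illustration, the OCR call itself is left out, and the way character errors are counted is only one reasonable assumption.

```python
import difflib


def char_errors(expected, actual):
    """Rough count of characters that differ between two strings.

    Each non-matching region reported by difflib costs as many
    characters as its longer side. This approximates an edit distance;
    it is not guaranteed to be the minimal one.
    """
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def score(pairs):
    """Return (exact, close) counts for (expected, ocr_output) pairs."""
    exact = close = 0
    for expected, actual in pairs:
        # OCR output often ends with newlines, so strip before comparing.
        errors = char_errors(expected.strip(), actual.strip())
        if errors == 0:
            exact += 1
        elif errors == 1:
            close += 1
    return exact, close
```

Here `pairs` would hold the known answer for each image next to whatever text the OCR engine returned for it.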

Tesseract read 36 of the 191 images correctly (19%) and was close (within one character) on 5 more. Here are a few of the harder images it was able to crack:





Tesseract most often had trouble with text that was heavily skewed, covered in distracting dots, or crisscrossed by white or black lines. With a small amount of image preprocessing (removing speckles and narrow lines), it might do much better on this old set. On modern CAPTCHAs, though, it's probably SOL.
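The kind of preprocessing mentioned above (removing speckles and narrow lines) could be sketched like this. It was not part of the test described here. It assumes the image has already been thresholded into a list of rows of 0 (background) and 1 (ink), and the function name and the `min_neighbors` cutoff are hypothetical choices.

```python
def remove_speckles_and_thin_lines(bitmap, min_neighbors=3):
    """Return a copy of bitmap with weakly supported ink pixels cleared.

    An ink pixel is kept only if at least min_neighbors of its eight
    surrounding pixels are also ink. Isolated dots have 0 such
    neighbors and one-pixel-wide lines have at most 2, so both are
    erased, while the body of a thicker stroke survives.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += bitmap[ny][nx]
            # Counts are taken from the original bitmap, so the result
            # does not depend on the order in which pixels are visited.
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned
```

A single pass like this would also nibble at thin parts of real letters, so the cutoff would need tuning against actual CAPTCHA images.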

4 comments:

djinn said...

Doubt that this would be understood by OCR. numcaptcha

Anonymous said...

I was pretty inspired when I saw the first PyTesser announcement a week or so back. Sadly I'm not going to have time to give it a go... but solving CAPTCHAs in Python is neat :-)

I'm wondering if you wanted to give a wider Python audience a chance to look at what you've done by making a short ShowMeDo screencast on the topic?

Solving CAPTCHAs is a very visual thing; it'd come across really well in a tutorial video.

If you're interested, do please email me. On Windows it is really easy (as in: 30 minutes easy), and we have a very large Python audience coming to us every day to learn new skills.

Cheers,
Ian (co-founder ShowMeDo.com)
ian AT showmedo DOT com

Michael J.T. O'Kelly said...

Hey, Ian. Interesting offer. Let me get the 0.1.0 release out (bounding boxes! confidence estimates!) (you heard it here first!), and then I'll look into making a screencast.

Bob said...

I did a test of your "pytesser_v0.0.1" on digits; the results are very disappointing...

Then I changed it a bit, using the information found in the FAQ: "How do I recognize only digits?"
http://code.google.com/p/tesseract-ocr/wiki/FAQ

The program always gives an error. Here is the change:

def call_tesseract(input_filename, output_filename):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename, "nobatch", "digits"]

I am also interested in your next release, with a configuration for digits only. :-)