mokelly's devlog: CAPTCHAs and Tesseract

Monday, May 21, 2007

CAPTCHAs and Tesseract

CAPTCHAs always seem to come up in discussions of OCR projects like Tesseract, so I decided to test Tesseract to see if it was actually the next big thing in spammer technology.

A commenter pointed me to a Berkeley effort to defeat CAPTCHAs that handled 92% of a selection of text-images from 2002 using a tailored OCR algorithm. I used PyTesser and difflib to run through the images and check the results.
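A minimal sketch of that kind of check, not the script actually used: it scores OCR guesses against known answers with difflib. The names `char_errors` and `score`, the `ocr` callable, and the lowercasing and stripping of the guess are all assumptions made for illustration. Any function that maps an image to a string can stand in for `ocr`.

```python
import difflib


def char_errors(expected, actual):
    """Count differing characters between two strings using difflib opcodes."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace/insert/delete span costs as many characters
            # as its longer side.
            errors += max(i2 - i1, j2 - j1)
    return errors


def score(samples, ocr):
    """Tally exact reads and near misses (one character off).

    samples: iterable of (image, expected_text) pairs.
    ocr: hypothetical callable taking an image and returning a string.
    Returns (exact, close, total).
    """
    exact = close = total = 0
    for image, expected in samples:
        guess = ocr(image).strip().lower()
        errors = char_errors(expected.lower(), guess)
        if errors == 0:
            exact += 1
        elif errors == 1:
            close += 1
        total += 1
    return exact, close, total
```

Passing the OCR step in as a callable keeps the scoring testable without Tesseract installed; a dictionary lookup of canned answers works as a stand-in.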

Tesseract read the image correctly for 36 out of 191 images (19%), and was close (within one character) for 5 more. Here are a few of the harder images it was able to crack:

Tesseract had the most trouble with text that was more heavily skewed, had lots of distracting dots, or had white or black lines crisscrossing the words. With a small amount of image preprocessing (removing speckles and narrow lines), it might do much better on this old set. On modern CAPTCHAs, though, it's probably SOL.
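To give an idea of what speckle removal could look like, here is a rough sketch that was not tried on the test set. It assumes the image has already been thresholded into a list of rows where 1 is ink and 0 is background. The function name `despeckle` and the `min_neighbors` parameter are made up for this example.

```python
def despeckle(pixels, min_neighbors=2):
    """Return a copy of a binary image with isolated dark pixels cleared.

    pixels: list of rows, where 1 means dark (ink) and 0 means background.
    A dark pixel survives only if at least min_neighbors of its eight
    neighbors are also dark. The input is not modified.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    cleaned = [row[:] for row in pixels]
    for y in range(height):
        for x in range(width):
            if not pixels[y][x]:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    # Pixels outside the image count as background.
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += pixels[ny][nx]
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned
```

This only handles stray dots. Raising `min_neighbors` would start eating one-pixel-wide lines as well, but it would likely thin out letter strokes too, so the threshold would need tuning against real images.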

4 comments:

djinn said...: Doubt that this would be understood by OCR: numcaptcha; May 22, 2007 1:25 AM
Anonymous said...: I was pretty inspired when I saw the first PyTesser announcement a week or so back. Sadly I'm not going to have time to give it a go... but solving CAPTCHAs in Python is neat :-)

I'm wondering if you wanted to give a wider Python audience a chance to look at what you've done by making a short ShowMeDo screencast on the topic?

Solving CAPTCHAs is a very visual thing, it'd come across really well in a tutorial video.

If you're interested, please do email me. On Windows it is really easy (as in: 30 minutes easy), and we have a very large Python audience coming to us every day to learn new skills.

Cheers,
Ian (co-founder ShowMeDo.com)
ian AT showmedo DOT com; May 22, 2007 7:13 AM
Michael J.T. O'Kelly said...: Hey, Ian. Interesting offer. Let me get the 0.1.0 release out (bounding boxes! confidence estimates!) (you heard it here first!), and then I'll look into making a screencast.; May 23, 2007 3:56 PM
Bob said...: I tested your "pytesser_v0.0.1" on digits, and the results were very disappointing...

Then I changed it a bit using the information from the FAQ entry "How do I recognize only digits?":
http://code.google.com/p/tesseract-ocr/wiki/FAQ

The program now always fails with an error. Here is the change:

def call_tesseract(input_filename, output_filename):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename, "nobatch", "digits"]

I am also interested in your next release, with a configuration option for digits only. :-); July 20, 2009 6:07 AM
