Saturday, March 07, 2009

A Week in Hacker News

I started posting to Hacker News recently. The rise and fall of submissions through the rankings is really hypnotic. I wrote a screen-scraper to track my stories' trajectories, one thing led to another, and suddenly it was a bit of an art project:



(Play this in full-screen to see the text. Watch at Vimeo for HD resolution.)

The x-axis is time since submission, the y-axis is karma. For ranking, stories are sorted by points divided by a power of the time since they were submitted, so the front-page cutoff corresponds to a roughly constant score. That means on this log-log plot a straight line separates the front-page stories (of which there are 30) from the rest.
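
Here is a toy version of that scoring rule; the exponent and cutoff value are placeholders, not HN's actual constants:

def score(points, age_hours, p=1.5):
    """Toy ranking score: karma discounted by a power of the story's age."""
    return points / age_hours ** p

def frontpage_boundary(age_hours, cutoff_score=1.0, p=1.5):
    """Karma needed to sit right at the front-page cutoff at a given age.
    Since points = cutoff * age**p, log(points) is linear in log(age):
    a straight line of slope p on log-log axes."""
    return cutoff_score * age_hours ** p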

Front-page stories are green (with the #1 story brightest). Other stories are red. Each article leaves a dot behind at the moment it leaves the front page and starts to fade away.

You can see roughly where the interface is at any given time. During the day activity is fast and furious, with rapid turnover. Late at night things are quiet, and articles at a given karma last much longer. The slope of the interface looks to be about 1.5. (Pinning it down exactly would take more work, because the scraped submission times are imprecise and the interface moves around.)


Software used: Python, with Matplotlib for plotting.
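
The plotting side is nothing exotic. A minimal sketch of the kind of figure described above, assuming each story has been scraped into a list of (ages, karma) samples (the data format here is mine, not the original script's):

import matplotlib.pyplot as plt

def plot_trajectories(stories):
    """stories: list of (ages_in_hours, karma_values) pairs, one per submission."""
    for ages, karma in stories:
        plt.loglog(ages, karma, alpha=0.5)   # one faint line per story
    plt.xlabel('hours since submission')
    plt.ylabel('karma')
    plt.show()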

Wednesday, June 11, 2008

Python-powered webcam for MIT construction site--with time-lapse video

My office on the 4th floor of the Broad Institute looks down on the construction site for MIT's new cancer research building. Soon after they tore down the beautiful landscaping that used to be outside the Stata Center, I set up a webcam on my windowsill and started idly capturing shots of the construction while I worked.

This week I finally automated the process with a Python script using VideoCapture and pyMedia. Once a day it compiles all the shots, now taken at 30 minute intervals from 7am-5pm, into a 12fps time-lapse video:

Large MPEG (1280x960, 16.6MB and growing)
Small MPEG (640x480)
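
The capture half of that script is simple. Here is a minimal sketch of the 30-minute loop, using VideoCapture's Device.saveSnapshot(); the filenames and scheduling details are illustrative rather than the original code, and the 5-minute live shot and pyMedia encoding are left out:

import time
from VideoCapture import Device

cam = Device()
while True:
    now = time.localtime()
    if 7 <= now.tm_hour < 17:                                # shoot only from 7am to 5pm
        fname = time.strftime('construction_%Y%m%d_%H%M.jpg', now)
        cam.saveSnapshot(fname)                              # grab one frame from the webcam
    time.sleep(30 * 60)                                      # wait 30 minutes between shots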

A live shot is also taken every 5 minutes: [Update 3/7/09: I stopped recording when they covered the whole construction site with a big sheet.]



I used a Logitech Quickcam Pro 9000, which has a glass lens and nice resolution. It deals surprisingly well with the bright sky and dark, shadowed construction site.

Construction is expected to finish in December 2010, so I'll up the frame-rate when the file starts getting huge.

Saturday, March 15, 2008

stored_results(): automatic serialization of expensive calculations

A very common situation I encounter: "Do [expensive calculation] and pickle the result, unless it's been done before, in which case unpickle the stored result." This stored_results() utility function makes that very easy, and provides some convenient options for scripting and memory management.


def stored_results(pkl_filename, function, overwrite=0, verbose=1,
          lazy=False, args=[], kwargs={}):
    """Returns contents of pickled file 'pkl_filename'. If file does not exist
    or is malformed, returns output of function (with args & kwargs) and pickles same
    result to pkl_filename.
    If overwrite>0, runs function regardless of presence of 'pkl_filename'.
    If lazy=True, instead outputs a generator that evaluates the rest of the stored_results()
    call only when its next() method is called."""
    import cPickle
    if lazy:
        # Create a generator that will eventually retrieve stored result
        def result_gen():
            result = stored_results(pkl_filename, function, overwrite, verbose,
                                    False, args, kwargs)
            while 1:
                yield result
        return result_gen()
    else:
        if not overwrite > 0:
            try:
                inf = file(pkl_filename, 'rb')
                result = cPickle.load(inf)
                inf.close()
                if verbose: print pkl_filename, "successfully unpickled"
                return result
            except (cPickle.UnpicklingError, EOFError, IOError):
                # Missing or malformed pickle file: fall through and recompute
                if verbose: print "Result not stored. Running function", function.func_name
        result = function(*args, **kwargs)
        outf = file(pkl_filename, 'wb')
        cPickle.dump(result, outf, -1)
        outf.close()
        if verbose: print "Result pickled to", pkl_filename
        return result


Typical use:
result = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'))

Advanced use:
To perform the calculation/load the result in a just-in-time fashion:
result_gen = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'), lazy=True)
[memory-sensitive tasks that may or may not need the result]
result = result_gen.next()

Victory is mine!

I defended and graduated! PhD in Physics from MIT. Now it gets good.

I crossed the street to the Broad Institute to do cancer research. Now, I don't interact with patients, or even deal with tumor samples. My job is to manage and analyze the huge flow of data coming from the latest microarray experiments. We look at sequence, expression, methylation, copy-number variation, and microRNA data simultaneously to tease out the subtle cellular changes that result in cancer.

When motivation strikes, I'll write about some of the programming and other issues I deal with at work.

Thursday, May 24, 2007

ABD

Today I passed 8.901, Astrophysics I, my final breadth requirement. That makes me officially ABD (All But Dissertation). Expect "Silencing and Recombination in Yeast Ribosomal DNA" at your local bookstore around December. A movie deal with Burger King tie-in should follow shortly thereafter.

Monday, May 21, 2007

CAPTCHAs and Tesseract

CAPTCHAs always seem to come up in discussions of OCR projects like Tesseract, so I decided to test Tesseract to see if it was actually the next big thing in spammer technology.

A commenter pointed me to a Berkeley effort to defeat CAPTCHAs that handled 92% of a selection of text-images from 2002 using a tailored OCR algorithm. I used PyTesser and difflib to run through the images and check the results.
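
The scoring was straightforward, roughly like this sketch (the answer table and the "within one character" test are my reconstruction, not the original script):

import difflib
from pytesser import image_file_to_string

captcha_answers = {'captcha_001.gif': 'viewer'}   # hypothetical image -> true text mapping

def char_errors(guess, truth):
    """Rough count of mismatched characters via difflib."""
    sm = difflib.SequenceMatcher(None, guess, truth)
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal')

exact = close = 0
for fname, truth in captcha_answers.items():
    guess = image_file_to_string(fname).strip()
    if guess == truth:
        exact += 1
    elif char_errors(guess, truth) <= 1:
        close += 1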

Tesseract read the image correctly for 36 out of 191 images (19%), and was close (within one character) for 5 more. Here are a few of the harder images it was able to crack:





Tesseract had trouble most frequently on text that was heavily skewed, had lots of distracting dots, or had white or black lines crisscrossing the words. With a small amount of image preprocessing (removing speckles and narrow lines), it might do much better on this old set. On modern CAPTCHAs, though, it's probably SOL.

Thursday, May 17, 2007

PyTesser: OCR for Python

In two years using Python, I never once searched for "Python X" without finding PyX, someone's labor of love making X easy to understand and use. Many examples come to mind. That's what hooked me when I started out: any tool I wanted was at my fingertips in 5 minutes, and just worked.

Optical character recognition was just about the only exception. So, I got excited when Google released Tesseract OCR, a straightforward, relatively accurate OCR package written in C++. Tesseract didn't have Python bindings, but it didn't take much work with PIL and subprocess to make it act like it did.

Behold! PyTesser.

>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract executable on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

Saturday, April 15, 2006

catachrestic (adj.)

"He went to the store catachrestically."

ITA Software puzzle: Word Rectangle

A few months ago job ads for ITA Software were all over Boston's subways. Each ad said "Solve this, work here!" and gave a non-trivial programming puzzle. I'm not about to do better than the people behind Orbitz.com, but some of the puzzles looked fun to try.

"Write a program to find the largest possible rectangle of letters such that every row forms a word (reading left to right) and every column forms a word (reading top to bottom). Words should appear in this dictionary: WORD.LST (1.66MB). Heuristic solutions that may not always produce a provably optimal rectangle will be accepted: seek a reasonable tradeoff of efficiency and optimality."

So a 3x3 solution would be

cat
ate
tea

My algorithm places letters in the matrix one at a time, checking to make sure each new letter forms part of a legitimate word in both the horizontal and vertical directions. That check would take ridiculously long if I had to search through the whole wordlist each time, so I do some preprocessing. The wordlist gets parsed into a tree, with the nth level branching on the nth letter of the word. The top level divides the tree by letter counts, since we know from the beginning how many letters our word should have. It takes a while to create the word tree, but once I have it, answering the question "What are the possible 4th letters of a 6-letter word that starts with 'tur'?" is very fast:

>>> dictionary[6]['t']['u']['r'].keys()
['a', 'b', 'e', 'g', 'f', 'k', 'n', 'r', 't', 'v']
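
Building that tree takes only a few lines. A minimal sketch, assuming the word list has one word per line (file name from the puzzle statement):

def build_tree(wordlist_path='WORD.LST'):
    tree = {}
    for line in open(wordlist_path):
        word = line.strip()
        if not word:
            continue
        node = tree.setdefault(len(word), {})    # top level branches on word length
        for letter in word:
            node = node.setdefault(letter, {})   # deeper levels branch on successive letters
    return tree

dictionary = build_tree()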

Now if we have a partial 7x7 solution like

acacias
coconut
acqui?

we just find all the letters that can legally follow both 'au' and 'acqui':
>>> dictionary[7]['a']['u'].keys()
['c', 'b', 'd', 'g', 'k', 'l', 'n', 's', 'r', 't', 'x']
>>> dictionary[7]['a']['c']['q']['u']['i'].keys()
['r', 't']
So the ? can be either an 'r' or a 't'.

I tried making a few improvements on the brute-force approach. You might even call some of them 'heuristic'. First I always try Psyco with this sort of project to see if it has any effect. Psyco is a just-in-time compiler of Python into machine code. Python does some things a lot slower than the equivalent C, especially if you don't know much about its implementation. (E.g., string concatenation can be slow because Python makes an entire new string even if you add just one character to the end of an old one. This is a Feature.) Just adding psyco.full() to a program can dramatically speed it up, without having to know anything about the Python interpreter.
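
For reference, turning it on really is just a couple of lines (wrapped in a try in case Psyco isn't installed):

try:
    import psyco
    psyco.full()   # compile as much of the program as possible
except ImportError:
    pass           # fall back to the plain interpreter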

Psyco helped with the first few iterations of my code, but once I optimized dictionary access in the inner loop it wasn't really improving matters. Disable it if you want to save a whole lot of memory.

Next I changed the order in which branches are followed. Before, the letters were tried in whatever order they were kept in the hash table, i.e. arbitrarily. Now I made it follow the "thickest" children first (the ones with the most leaves). That seems to be faster for smaller rectangles.
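
Concretely, "thickest first" just means sorting a node's children by how many words live below them. A sketch in terms of the nested-dict tree above (the helper names are mine):

def count_leaves(node):
    """Number of complete words below this node (an empty dict marks the end of a word)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def children_thickest_first(node):
    """Letters under this node, ordered by the size of their subtrees."""
    return sorted(node, key=lambda letter: count_leaves(node[letter]), reverse=True)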

Eventually I tried paring down the children, keeping only the n thickest at each branch. For n=3 this was really fast up to 7x7, but failed on 8x8. Of course, with n<26 this is no longer guaranteed to find a solution if one exists. Possibly my scoring of the branches could be extended to some sort of A* search, but I think I'll move on to something else first. Anyway, here is the source code:

word_rectangle4.py version 0.1, 4/14/06

For 7x7--my biggest so far--it finds:
strafer
taeniae
resters
antiflu
fiefdom
earlobe
resumes
Taeniae are a sort of ribbon. (I just saved you ten seconds.)

If you have ideas for improving my code (heuristically!) or have tried this yourself, leave a note.

Tuesday, April 11, 2006

Bootstrap statistics in Python

bootstrap.py version 0.1, 3/08/06

Python didn't seem to have a straightforward published bootstrap statistics module, so here is my effort.

Suppose you had 40 measurements

>>> dist
array([ 9.95587335, 11.14893547, 12.07186714, 9.06794986,
10.66503852, 10.7564387 , 11.42936871, 9.72575509, …
…11.41262324, 9.28556988, 8.01210309, 9.98832624])

with standard deviation 0.912. What is the uncertainty in your measurement of std=.912? Might the underlying distribution have sigma=1? (There is an exact formula in this case for Gaussian samples, but that won't always be so; and besides, it's a pain.)

Bootstrap statistics assumes that our 40 data points completely describe the underlying distribution (a "non-parametric" distribution). This assumption gets better the more points we have, but the bootstrap has good statistical properties regardless.

If the 40 points fully characterize our distribution, we can essentially do our experiment over and over again by resampling from those 40 points with replacement. Bootstrap.py does this very quickly using SciPy. Given a distribution dist and an estimation function that takes a distribution and returns a number (such as scipy.std()), bootstrap.py resamples dist many times and returns the mean and standard deviation of the estimate distribution.
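
The core of the resampling loop is only a few lines. Here is a sketch of the idea, using NumPy for the resampling; the function name and resample count are mine, not bootstrap.py's actual API:

import numpy as np

def bootstrap_sketch(estimator, data, n_resamples=10000):
    """Resample `data` with replacement many times; return mean and std of the estimates."""
    data = np.asarray(data)
    indices = np.random.randint(0, len(data), size=(n_resamples, len(data)))
    estimates = np.array([estimator(data[row]) for row in indices])
    return estimates.mean(), estimates.std()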

>>> import bootstrap
>>> print scipy.std(dist)
0.911973669727
>>> print bootstrap.fast_bootstrap(scipy.std, dist)
(0.89697434867588222, 0.10347317195461153)

What does this mean? If we had 1,000,000 data points, rather than just 40, then their standard deviation would most likely be in the interval 0.897 +- 0.103. This is consistent with sigma=1, which, you might have guessed, I used to generate the data.

  • Bootstrap.py also lets you set error tolerance and a few other parameters.
  • Are you enticed? Let me know if this was useful to you, or failed spectacularly.
  • Thanks to gumuz' devlog for making me want to share.