Saturday, March 07, 2009
A Week in Hacker News
I started posting to Hacker News recently. The rise and fall of submissions through the rankings is really hypnotic. I wrote a screen-scraper to track my stories' trajectories, one thing led to another, and suddenly it was a bit of an art project:
(Play this in full-screen to see the text. Watch at Vimeo for HD resolution.)
The x-axis is time since submission, the y-axis is karma. For ranking, stories are sorted by points divided by a power of the time since they were submitted, which means that on this log-log plot a straight line separates the front-page stories (of which there are 30) from the rest.
Front-page stories are green (with the #1 story brightest). Other stories are red. Each article leaves a dot behind at the moment it falls off the front page and starts to fade away.
You can see roughly where the interface is at any given time. During the day activity is fast and furious, with rapid turnover. Late at night things are quiet, and articles at a given karma last much longer. The slope of the interface looks to be about 1.5. (Pinning it down exactly would take more work, because the scraped submission times are imprecise and the interface moves around.)
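To make the geometry explicit, here is a toy version of the ranking math (a sketch only: the gravity exponent and the small age offset below are illustrative assumptions, not Hacker News's actual constants):

# Sketch of an HN-style ranking score. The gravity exponent and the age
# offset are assumptions for illustration, not HN's published constants.
def rank_score(points, age_hours, gravity=1.5):
    return float(points) / (age_hours + 2) ** gravity

# Along a fixed score threshold s, log(points) = log(s) + gravity*log(age_hours + 2),
# so on a log-log plot of karma vs. age the front-page cutoff is roughly a
# straight line whose slope is the gravity exponent (eyeballed above at about 1.5).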
Software used: Python, with Matplotlib for plotting.
Wednesday, June 11, 2008
Python-powered webcam for MIT construction site--with time-lapse video
My office on the 4th floor of the Broad Institute looks down on the construction site for MIT's new cancer research building. Soon after they tore down the beautiful landscaping that used to be outside the Stata Center, I set up a webcam on my windowsill and started idly capturing shots of the construction while I worked.
This week I finally automated the process with a Python script using VideoCapture and pyMedia. Once a day it compiles all the shots, now taken at 30 minute intervals from 7am-5pm, into a 12fps time-lapse video:
Large MPEG (1280x960, 16.6MB and growing)
Small MPEG (640x480)
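For reference, the capture half of the setup is conceptually just a loop around VideoCapture; here is a rough sketch (the file naming, directory, and 7am-5pm check are illustrative, and the once-a-day pyMedia encoding step is omitted):

# Sketch of the capture loop, not the actual script.
import time
from VideoCapture import Device

cam = Device()
while True:
    if 7 <= time.localtime().tm_hour < 17:  # only shoot between 7am and 5pm
        cam.saveSnapshot(time.strftime('shots/%Y%m%d_%H%M.jpg'))
    time.sleep(30 * 60)  # one frame every 30 minutes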
A live shot is also taken every 5 minutes: [Update 3/7/09: I stopped recording when they covered the whole construction site with a big sheet.]
I used a Logitech Quickcam Pro 9000, which has a glass lens and nice resolution. It deals surprisingly well with the bright sky and dark, shadowed construction site.
Construction is expected to finish in December 2010, so I'll up the frame-rate when the file starts getting huge.
Saturday, March 15, 2008
stored_results(): automatic serialization of expensive calculations
A very common situation I encounter: "Do [expensive calculation] and pickle the result; unless it's been done before, in which case unpickle the stored result." This stored_results() utility function makes that very easy, and provides some convenient options for scripting and memory management.
def stored_results(pkl_filename, function, overwrite=0, verbose=1,
                   lazy=False, args=[], kwargs={}):
    """Returns contents of pickled file 'pkl_filename'. If file does not exist
    or is malformed, returns output of function (with args & kwargs) and pickles same
    result to pkl_filename.
    If overwrite>0, runs function regardless of presence of 'pkl_filename'.
    If lazy=True, instead outputs a generator that evaluates the rest of the
    stored_results() call only when its next() method is called."""
    import cPickle
    if lazy:
        # Create a generator that will eventually retrieve stored result
        def result_gen():
            result = stored_results(pkl_filename, function, overwrite, verbose,
                                    False, args, kwargs)
            while 1:
                yield result
        return result_gen()
    else:
        if not overwrite > 0:
            try:
                inf = file(pkl_filename, 'rb')
                result = cPickle.load(inf)
                inf.close()
                if verbose: print pkl_filename, "successfully unpickled"
                return result
            except (cPickle.UnpicklingError, IOError):
                if verbose: print "Result not stored. Running function", function.func_name
        result = function(*args, **kwargs)
        outf = file(pkl_filename, 'wb')
        cPickle.dump(result, outf, -1)
        outf.close()
        if verbose: print "Result pickled to", pkl_filename
        return result
Typical use:
result = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'))
Advanced use:
To perform the calculation/load the result in a just-in-time fashion:
result_gen = stored_results('saved_result.pkl', expensive_function, args=(23, 'sweet'), lazy=True)
[memory-sensitive tasks that may or may not need the result]
result = result_gen.next()
Victory is mine!
I defended and graduated! PhD in Physics from MIT. Now it gets good.
I crossed the street to the Broad Institute to do cancer research. Now, I don't interact with patients, or even deal with tumor samples. My job is to manage and analyze the huge flow of data coming from the latest microarray experiments. We look at sequence, expression, methylation, copy-number variation, and microRNA data simultaneously to tease out the subtle cellular changes that result in cancer.
When motivation strikes, I'll write about some of the programming and other issues I deal with at work.
Thursday, May 24, 2007
ABT
Today I passed 8.901, Astrophysics I, my final breadth requirement. That makes me officially ABD (All But Dissertation). Expect "Silencing and Recombination in Yeast Ribosomal DNA" at your local bookstore around December. A movie deal with Burger King tie-in should follow shortly thereafter.
Monday, May 21, 2007
CAPTCHAs and Tesseract
CAPTCHAs always seem to come up in discussions of OCR projects like Tesseract, so I decided to test Tesseract to see if it was actually the next big thing in spammer technology.
A commenter pointed me to a Berkeley effort to defeat CAPTCHAs that handled 92% of a selection of text-images from 2002 using a tailored OCR algorithm. I used PyTesser and difflib to run through the images and check the results.
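The checking loop was only a few lines; roughly like this (a sketch, not the actual script: the file layout, with each image's true word in a matching .txt file, and the difflib threshold standing in for "within one character" are assumptions):

import glob, difflib
from pytesser import image_file_to_string

exact = close = total = 0
for img_name in glob.glob('captchas/*.jpg'):
    truth = open(img_name[:-4] + '.txt').read().strip().lower()
    guess = image_file_to_string(img_name).strip().lower()
    total += 1
    if guess == truth:
        exact += 1
    elif difflib.SequenceMatcher(None, guess, truth).ratio() > 0.8:
        close += 1  # rough stand-in for "within one character"
print exact, 'exact,', close, 'close, out of', total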
Tesseract read the image correctly for 36 out of 191 images (19%), and was close (within one character) for 5 more. Here are a few of the harder images it was able to crack:
Tesseract had trouble most frequently on text that was heavily skewed, sprinkled with distracting dots, or crisscrossed by white or black lines. With a small amount of image preprocessing (removing speckles and narrow lines), it might do much better on this old set. On modern CAPTCHAs, though, it's probably SOL.
Thursday, May 17, 2007
PyTesser: OCR for Python
In two years using Python, I never once searched for "Python X" without finding PyX, someone's labor of love making X easy to understand and use. Many examples come to mind. That's what hooked me when I started out: any tool I wanted was at my fingertips in 5 minutes, and just worked.
Optical character recognition was just about the only exception. So, I got excited when Google released Tesseract OCR, a straightforward, relatively accurate OCR package written in C++. Tesseract didn't have Python bindings, but it didn't take much work with PIL and subprocess to make it act like it did.
Behold! PyTesser.
>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract executable on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
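Under the hood the glue is conceptually tiny. A sketch (not PyTesser's actual source; the temp filenames and executable path are made up):

import subprocess

def ocr(image, exe='tesseract.exe'):
    image.save('tmp.tif')                     # PIL writes a temporary TIFF
    subprocess.call([exe, 'tmp.tif', 'tmp'])  # tesseract writes tmp.txt
    return open('tmp.txt').read().strip()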
Saturday, April 15, 2006
ITA Software puzzle: Word Rectangle
A few months ago job ads for ITA Software were all over Boston's subways. Each ad said "Solve this, work here!" and gave a non-trivial programming puzzle. I'm not about to do better than the people behind Orbitz.com, but some of the puzzles looked fun to try.
"Write a program to find the largest possible rectangle of letters such that every row forms a word (reading left to right) and every column forms a word (reading top to bottom). Words should appear in this dictionary: WORD.LST (1.66MB). Heuristic solutions that may not always produce a provably optimal rectangle will be accepted: seek a reasonable tradeoff of efficiency and optimality."
So a 3x3 solution would be
cat
ate
tea
My algorithm places letters in the matrix one at a time, checking to make sure each new letter forms part of a legitimate word in both the horizontal and vertical directions. That check would take ridiculously long if I had to search through the whole wordlist each time, so I do some preprocessing. The wordlist gets parsed into a tree, with the nth level branching on the nth letter of the word. The top level divides the tree by letter counts, since we know from the beginning how many letters our word should have. It takes a while to create the word tree, but once I have it, answering the question "What are the possible 4th letters of a 6-letter word that starts with 'tur'?" is very fast:
>>> dictionary[6]['t']['u']['r'].keys()
['a', 'b', 'e', 'g', 'f', 'k', 'n', 'r', 't', 'v']
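The preprocessing is just nested dictionaries; here is a sketch of it (assuming WORD.LST has one lowercase word per line; this is not the actual word_rectangle4.py code):

def build_tree(wordlist='WORD.LST'):
    tree = {}
    for word in open(wordlist):
        word = word.strip()
        node = tree.setdefault(len(word), {})   # top level branches on word length
        for letter in word:
            node = node.setdefault(letter, {})  # one level per letter of the word
    return tree

dictionary = build_tree()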
Now if we have a partial 7x7 solution like

acacias
coconut
acqui?

we just find all the letters that can legally follow both 'au' and 'acqui':

>>> dictionary[7]['a']['u'].keys()
['c', 'b', 'd', 'g', 'k', 'l', 'n', 's', 'r', 't', 'x']
>>> dictionary[7]['a']['c']['q']['u']['i'].keys()
['r', 't']

So the ? can be either an 'r' or a 't'.
I tried making a few improvements on the brute-force approach. You might even call some of them 'heuristic'. First I always try Psyco with this sort of project to see if it has any effect. Psyco is a just-in-time compiler of Python into machine code. Python does some things a lot slower than the equivalent C, especially if you don't know much about its implementation. (E.g., string concatenation can be slow because Python makes an entire new string even if you add just one character to the end of an old one. This is a Feature.) Just adding psyco.full() to a program can dramatically speed it up, without having to know anything about the Python interpreter.
Psyco helped with the first few iterations of my code, but once I optimized dictionary access in the inner loop it wasn't really improving matters. Disable it if you want to save a whole lot of memory.
Next I changed the order in which branches are followed. Before, the letters were tried in whatever order they were kept in the hash table, i.e. arbitrarily. Now I made it follow the "thickest" children first (the ones with the most leaves). That seems to be faster for smaller rectangles.
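In code, "thickest first" just means sorting a node's children by how many complete words sit beneath each one (a sketch against the tree structure above, not the original implementation):

def count_leaves(node):
    # in the word tree, an empty dict marks the end of a word
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())

def thickest_first(node):
    # children with the most completions come first
    return sorted(node, key=lambda letter: count_leaves(node[letter]), reverse=True)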
Eventually I tried paring down the children, keeping only the n thickest at each branch. For n=3 this was really fast up to 7x7, but fails on 8x8. Of course, with n<26 this is no longer guaranteed to find a solution if one exists. Possibly my scoring of the branches could be extended to some sort of A* search, but I think I'll move on to something else first. Anyway, here is the source code:
word_rectangle4.py version 0.1, 4/14/06
For 7x7--my biggest so far--it finds

strafer
taeniae
resters
antiflu
fiefdom
earlobe
resumes

A taeniae is a sort of ribbon. (I just saved you ten seconds.)
If you have ideas for improving my code (heuristically!) or have tried this yourself, leave a note.
Tuesday, April 11, 2006
Bootstrap statistics in Python
bootstrap.py version 0.1, 3/08/06
Python didn't seem to have a straightforward published bootstrap statistics module, so here is my effort.
Suppose you had 40 measurements
>>> dist
array([ 9.95587335, 11.14893547, 12.07186714, 9.06794986,
10.66503852, 10.7564387 , 11.42936871, 9.72575509, …
…11.41262324, 9.28556988, 8.01210309, 9.98832624])
with standard deviation 0.912. What is the uncertainty in your measurement of std=.912? Might the underlying distribution have sigma=1? (There is an exact formula in this case for Gaussian samples, but that won't always be so; and besides, it's a pain.)
Bootstrap statistics assumes that our 40 data points completely describe the underlying distribution–a "non-parametric" distribution. The assumption gets better the more points we have, but the bootstrap estimate has good statistical properties regardless.
If the 40 points fully characterize our distribution, we can essentially do our experiment over and over again by resampling from those 40 points with replacement. Bootstrap.py does this very quickly using SciPy. Given distribution dist and an estimation function that takes a distribution and returns a number–such as scipy.std()–bootstrap.py resamples dist many times and returns the mean and standard deviation of the estimate distribution.
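The core of the idea fits in a few lines (a sketch, not the actual bootstrap.py; it leans on NumPy for the resampling, and the 1000-sample count is an arbitrary illustrative choice):

import numpy

def bootstrap_estimate(estimator, data, n_samples=1000):
    data = numpy.asarray(data)
    n = len(data)
    estimates = []
    for _ in range(n_samples):
        indices = numpy.random.randint(0, n, n)    # resample with replacement
        estimates.append(estimator(data[indices]))
    estimates = numpy.array(estimates)
    return estimates.mean(), estimates.std()       # (bootstrap mean, bootstrap uncertainty)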
>>> import bootstrap
>>> print scipy.std(dist)
0.911973669727
>>> print bootstrap.fast_bootstrap(scipy.std, dist)
(0.89697434867588222, 0.10347317195461153)
What does this mean? If we had 1,000,000 data points, rather than just 40, then their standard deviation would most likely be in the interval 0.897 +- 0.103. This is consistent with sigma=1, which, you might have guessed, I used to generate the data.
- Bootstrap.py also lets you set error tolerance and a few other parameters.
- Are you enticed? Let me know if this was useful to you, or failed spectacularly.
- Thanks to gumuz' devlog for making me want to share.