Saturday, April 15, 2006

catachrestic (adj.)

"He went to the store catachrestically."

ITA Software puzzle: Word Rectangle

A few months ago job ads for ITA Software were all over Boston's subways. Each ad said "Solve this, work here!" and gave a non-trivial programming puzzle. I'm not about to do better than the people behind Orbitz.com, but some of them looked fun to try.

"Write a program to find the largest possible rectangle of letters such that every row forms a word (reading left to right) and every column forms a word (reading top to bottom). Words should appear in this dictionary: WORD.LST (1.66MB). Heuristic solutions that may not always produce a provably optimal rectangle will be accepted: seek a reasonable tradeoff of efficiency and optimality."

So a 3x3 solution would be

cat
ate
tea

My algorithm places letters in the matrix one at a time, checking to make sure each new letter forms part of a legitimate word in both the horizontal and vertical directions. That check would take ridiculously long if I had to search through the whole wordlist each time, so I do some preprocessing. The wordlist gets parsed into a tree, with the nth level branching on the nth letter of the word. The top level divides the tree by letter counts, since we know from the beginning how many letters our word should have. It takes a while to create the word tree, but once I have it, answering the question "What are the possible 4th letters of a 6-letter word that starts with 'tur'?" is very fast:

>>> dictionary[6]['t']['u']['r'].keys()
['a', 'b', 'e', 'g', 'f', 'k', 'n', 'r', 't', 'v']
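A minimal sketch of that preprocessing in modern Python. The nested-dict layout matches the lookups above, but the name build_tree and the toy word list are mine, not the post's actual code:

```python
def build_tree(words):
    """Parse a word list into nested dicts: keyed first by word
    length, then branching letter by letter."""
    tree = {}
    for word in words:
        node = tree.setdefault(len(word), {})
        for letter in word:
            node = node.setdefault(letter, {})
    return tree

# Toy word list; the real one is WORD.LST from the puzzle.
tree = build_tree(["cat", "ate", "tea", "turban", "turkey", "turret"])

# "What are the possible 4th letters of a 6-letter word starting 'tur'?"
print(sorted(tree[6]['t']['u']['r']))  # -> ['b', 'k', 'r']
```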

Now if we have a partial 7x7 solution like

acacias
coconut
acqui?

we just find all the letters that can legally follow both 'au' (reading down the column) and 'acqui' (reading across the row):

>>> dictionary[7]['a']['u'].keys()
['c', 'b', 'd', 'g', 'k', 'l', 'n', 's', 'r', 't', 'x']
>>> dictionary[7]['a']['c']['q']['u']['i'].keys()
['r', 't']

So the ? can be either an 'r' or a 't'.
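The same intersection check can be written as a small helper. This is a hypothetical function (candidates is my name for it, on the same nested-dict tree as before), not a reproduction of the post's code:

```python
def build_tree(words):
    """Nested-dict word tree, keyed by length then letter by letter."""
    tree = {}
    for word in words:
        node = tree.setdefault(len(word), {})
        for letter in word:
            node = node.setdefault(letter, {})
    return tree

def candidates(tree, size, row_prefix, col_prefix):
    """Letters that legally extend both the current row and column."""
    def extensions(prefix):
        node = tree.get(size, {})
        for letter in prefix:
            node = node.get(letter, {})
        return set(node)
    return extensions(row_prefix) & extensions(col_prefix)

tree = build_tree(["cat", "ate", "tea"])
# Filling the middle cell of a 3x3 where both row and column read "at":
print(sorted(candidates(tree, 3, "at", "at")))  # -> ['e']
```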

I tried making a few improvements on the brute-force approach. You might even call some of them 'heuristic'. First I always try Psyco with this sort of project to see if it has any effect. Psyco is a just-in-time compiler of Python into machine code. Python does some things a lot slower than the equivalent C, especially if you don't know much about its implementation. (E.g., string concatenation can be slow because Python makes an entire new string even if you add just one character to the end of an old one. This is a Feature.) Just adding psyco.full() to a program can dramatically speed it up, without having to know anything about the Python interpreter.
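The opt-in really is that small. A guarded version, so the program still runs where Psyco is unavailable (it is Python 2-only and long unmaintained), looks like:

```python
# Try to enable the Psyco JIT if present; fall back silently otherwise.
try:
    import psyco
    psyco.full()
    using_psyco = True
except ImportError:
    using_psyco = False

print(using_psyco)
```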

Psyco helped with the first few iterations of my code, but once I optimized dictionary access in the inner loop it wasn't really improving matters. Disable it if you want to save a whole lot of memory.

Next I changed the order in which branches are followed. Before, the letters were tried in whatever order they were kept in the hash table, i.e. arbitrarily. Now I made it follow the "thickest" children first (the ones with the most leaves). That seems to be faster for smaller rectangles.
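A sketch of that "thickest child first" ordering. The names leaf_count and ordered_children are illustrative; the tree is the same nested-dict structure as above:

```python
def build_tree(words):
    tree = {}
    for word in words:
        node = tree.setdefault(len(word), {})
        for letter in word:
            node = node.setdefault(letter, {})
    return tree

def leaf_count(node):
    """An empty dict is a completed word; otherwise sum the subtrees."""
    if not node:
        return 1
    return sum(leaf_count(child) for child in node.values())

def ordered_children(node):
    """Branches with the most words under them come first."""
    return sorted(node, key=lambda k: leaf_count(node[k]), reverse=True)

tree = build_tree(["cat", "cab", "cot", "tea"])
print(ordered_children(tree[3]))  # 'c' leads to 3 words, 't' to 1
```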

Eventually I tried paring down the children, keeping only the n thickest at each branch. For n=3 this was really fast up to 7x7, but fails on 8x8. Of course, with n<26 this is no longer guaranteed to find a solution if one exists. Possibly my scoring of the branches could be extended to some sort of A* search, but I think I'll move on to something else first. Anyway, here is the source code:
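One way the paring could look, again as an illustrative sketch rather than the post's code: recursively keep only the n thickest children at each branch.

```python
def build_tree(words):
    tree = {}
    for word in words:
        node = tree.setdefault(len(word), {})
        for letter in word:
            node = node.setdefault(letter, {})
    return tree

def leaf_count(node):
    if not node:
        return 1
    return sum(leaf_count(child) for child in node.values())

def prune(node, n):
    """Keep only the n thickest children at each branch.
    Heuristic: with n < 26, a solvable rectangle may be missed."""
    if not node:
        return node
    keep = sorted(node, key=lambda k: leaf_count(node[k]), reverse=True)[:n]
    return {k: prune(node[k], n) for k in keep}

tree = build_tree(["cat", "cab", "can", "cot", "tea"])
pruned = prune(tree[3], 1)
print(sorted(pruned))  # only the thickest top-level branch survives
```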

word_rectangle4.py version 0.1, 4/14/06

For 7x7 (my biggest so far) it finds
strafer
taeniae
resters
antiflu
fiefdom
earlobe
resumes
A taenia is a sort of ribbon; taeniae is the plural. (I just saved you ten seconds.)

If you have ideas for improving my code (heuristically!) or have tried this yourself, leave a note.

Tuesday, April 11, 2006

Bootstrap statistics in Python

bootstrap.py version 0.1, 3/08/06

Python didn't seem to have a straightforward published bootstrap statistics module, so here is my effort.

Suppose you had 40 measurements

>>> dist
array([ 9.95587335, 11.14893547, 12.07186714, 9.06794986,
10.66503852, 10.7564387 , 11.42936871, 9.72575509, …
…11.41262324, 9.28556988, 8.01210309, 9.98832624])

with standard deviation 0.912. What is the uncertainty in your measurement of std=.912? Might the underlying distribution have sigma=1? (There is an exact formula in this case for Gaussian samples, but that won't always be so; and besides, it's a pain.)

Bootstrap statistics assumes that our 40 data points completely describe the underlying distribution: a "non-parametric" assumption. The approximation gets better the more points we have, but the bootstrap has good statistical properties even for small samples.

If the 40 points fully characterize our distribution, we can essentially redo our experiment over and over by resampling from those 40 points with replacement. Bootstrap.py does this very quickly using SciPy. Given a distribution dist and an estimation function that takes a distribution and returns a number (such as scipy.std()), bootstrap.py resamples dist many times and returns the mean and standard deviation of the resulting distribution of estimates.
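Bootstrap.py itself isn't reproduced here, but a minimal sketch of the same resampling idea in modern NumPy (function and parameter names are mine) is:

```python
import numpy as np

def bootstrap(estimator, data, n_resamples=2000, seed=0):
    """Resample `data` with replacement n_resamples times, apply
    `estimator` to each resample, and return the mean and standard
    deviation of the resulting estimates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Each row of idx picks len(data) points, with replacement.
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    estimates = np.array([estimator(sample) for sample in data[idx]])
    return estimates.mean(), estimates.std()

# 40 points drawn from a Gaussian with sigma=1, as in the post's example.
rng = np.random.default_rng(1)
dist = rng.normal(10, 1, size=40)
mean_est, err = bootstrap(np.std, dist)
print(mean_est, err)
```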

>>> import bootstrap
>>> print scipy.std(dist)
0.911973669727
>>> print bootstrap.fast_bootstrap(scipy.std, dist)
(0.89697434867588222, 0.10347317195461153)

What does this mean? If we had 1,000,000 data points, rather than just 40, then their standard deviation would most likely be in the interval 0.897 +- 0.103. This is consistent with sigma=1, which, you might have guessed, I used to generate the data.

  • Bootstrap.py also lets you set error tolerance and a few other parameters.
  • Are you enticed? Let me know if this was useful to you, or failed spectacularly.
  • Thanks to gumuz' devlog for making me want to share.