Saturday, April 15, 2006

ITA Software puzzle: Word Rectangle

A few months ago job ads for ITA Software were all over Boston's subways. Each ad said "Solve this, work here!" and gave a non-trivial programming puzzle. I'm not about to do better than the people behind Orbitz.com, but some of them looked fun to try.

"Write a program to find the largest possible rectangle of letters such that every row forms a word (reading left to right) and every column forms a word (reading top to bottom). Words should appear in this dictionary: WORD.LST (1.66MB). Heuristic solutions that may not always produce a provably optimal rectangle will be accepted: seek a reasonable tradeoff of efficiency and optimality."

So a 3x3 solution would be

cat
ate
tea

My algorithm places letters in the matrix one at a time, checking to make sure each new letter forms part of a legitimate word in both the horizontal and vertical directions. That check would take ridiculously long if I had to search through the whole wordlist each time, so I do some preprocessing. The wordlist gets parsed into a tree, with the nth level branching on the nth letter of the word. The top level divides the tree by letter counts, since we know from the beginning how many letters our word should have. It takes a while to create the word tree, but once I have it, answering the question "What are the possible 4th letters of a 6-letter word that starts with 'tur'?" is very fast:

>>> dictionary[6]['t']['u']['r'].keys()
['a', 'b', 'e', 'g', 'f', 'k', 'n', 'r', 't', 'v']

Now if we have a partial 7x7 solution like

acacias
coconut
acqui?

we just find all the letters that can legally follow both 'au' and 'acqui':
>>> dictionary[7]['a']['u'].keys()
['c', 'b', 'd', 'g', 'k', 'l', 'n', 's', 'r', 't', 'x']
>>> dictionary[7]['a']['c']['q']['u']['i'].keys()
['r', 't']
So the ? can be either an 'r' or a 't'.

I tried making a few improvements on the brute-force approach. You might even call some of them 'heuristic'. First I always try Psyco with this sort of project to see if it has any effect. Psyco is a just-in-time compiler of Python into machine code. Python does some things a lot slower than the equivalent C, especially if you don't know much about its implementation. (E.g., string concatenation can be slow because Python makes an entire new string even if you add just one character to the end of an old one. This is a Feature.) Just adding psyco.full() to a program can dramatically speed it up, without having to know anything about the Python interpreter.

Psyco helped with the first few iterations of my code, but once I optimized dictionary access in the inner loop it wasn't really improving matters. Disable it if you want to save a whole lot of memory.

Next I changed the order in which branches are followed. Before, the letters were tried in whatever order they were kept in the hash table, i.e. arbitrarily. Now I made it follow the "thickest" children first (the ones with the most leaves). That seems to be faster for smaller rectangles.

Eventually I tried paring down the children, keeping only the n thickest at each branch. For n=3 this was really fast up to 7x7, but fails on 8x8. Of course, with n<26 this is no longer guaranteed to find a solution if one exists. Possibly my scoring of the branches could be extended to some sort of A* search, but I think I'll move on to something else first. Anyway, here is the source code:

word_rectangle4.py version 0.1, 4/14/06

For 7x7--my biggest so far-- it finds
strafer
taeniae
resters
antiflu
fiefdom
earlobe
resumes
A taeniae is a sort of ribbon. (I just saved you ten seconds.)

If you have ideas for improving my code (heuristically!) or have tried this yourself, leave a note.

3 comments:

Anonymous said...

I use different strategy for this calculation. here is what I have

Computation approach:
Some Heuristic strategies were used in the program. The program will first group all the words by their length. That mean all words with 20 characters are in one group, all words with 19 characters are in another group and so on. The program will then determine if two groups can form a rectangle. For example, if one group has 15 characters and another group has 10 characters then a rectangle of 150 units of area can be form. But if the group that has 15 characters has only two words. Then these two groups cannot form a rectangle. The program will maintain a list of two groups and the program will sort this list by the possible area that it can form. Starting from the largest possible area, the program will try to find if there are possible rectangles in it.
Within these two groups, say group A and B, the program read each word (let’s call this word C) from group B. Consider the following two groups

loc A B
-----------------------------------------------------
0 chcactt cathys
1 csasers hiucha
2 ctueati letter
3 eaicreh oceans
4 eyhnler oracle
5 schools seccec
6 shitrs

If group A has words with starting letter for each letter in C, the program will maintain the locations of all these words in A. In this case, the program starts scanning B from “cathys” until it reaches “seccec”. Then it maintains location 5, 3-4, 0-2, 0-2, 3-4, and 0-2 for group A. Because the combination for all the possible words in these locations can be very large, heuristic strategy is used to avoid the unnecessary combination and speed up the computational time. In this case, we have a testing word “seccec” and the associate locations of the words in group A. We start from the first two letters of “seccec”, which is “se”. For “s”, we have “schools” in group A. For “e”, we have “eaicreh” and “eyhnler”. So one entry for “s” and two entries for “e”, we have a combination of 2 x 1 = 2. We then form test groups of two. The first group we use “schools” and “eaicreh” to form “se”, “ca”, “hi”, “oc”, or, “le”, and “sh”. The second group, we use “schools” and “eyhnler” to form “se”, “cy”, “hh”, “on”, “ol”, “le”, and “sr”. For first group, the program will found that “se”, “ca”, “hi”, “oc”, or, “le”, and “sh” are all prefix of one of the words in group B. but for second group, only “se” is prefix of a word in group B for “se”, “cy”, “hh”, “on”, “ol”, “le”, and “sr”. Therefore, the program will delete second group and using only first group to start comparing with the third letter, which is “sec”. Then a set of new group will form and all the above steps are repeated until the last letter reached.
When calculate all the possible combination, it take very long time for large input. To speed up the time, the program always put the group with words that have more characters in group A and group with words that have fewer characters in group B. This can reduce the amount of combination calculation. That also means less CPU usage and memory usage. This can significantly reduce the computation time.

jto said...

Heh. Anonymous's strategy is like yours, but with worse data structures. :)

Interesting - the problem has disappeared from ITA's web site.

It's an interesting problem, though. I approached it form the opposite direction: start with large rectangles, and work down. This way you get the one best answer, rather than uncertainty. I ended up with a very similar data structure and algorithm, but in C++.

The good news (for me) is that there are few long words, so this strategy very quickly eliminates 15x15 and 14x15, and gets down into the neighborhood of 120-letter rectangles. The bad news is, you have to search the entire solution space if you want to be sure. Heuristics can help find a solution quickly where one exists--but they don't help you quickly prove that there's no solution. To do that, you have to find better ways to prune the tree (not the tree of words, but the tree of possible solutions that you're searching). And, you know, search fast.

The next optimization is to add information at each node in the tree to help detect conflicts early.

With WORD.LST gone from ITA's site, it's hard to compare results. I'm using the Scrabble TWL06, which is around 170,000 words, all 15 letters or fewer. There are no rectangles in this data set that contain over 81 letters. But my program has taken over a day to figure this out. I think (hope) it'll probably finish before I find a spare moment to optimize it further.

Michael J.T. O'Kelly said...

I tried going from big rectangles to small, but it took too long to run even on rectangles with relatively few corresponding words of the right length. 7x7 would have taken forever if it weren't so dense with solutions. 8x8 never returned anything--though now I have a much faster computer! Maybe I should get back into the game.

If you want to compare results, here is a saved copy of WORD.LST. It is more heavy on words like "aarrghh" and "zyzzyvas" than I imagine the Scrabble dictionary to be.