2015-10-18

Crossword Puzzle Compiler

Techniques demonstrated:

  • Solving an intractable problem using a scoring heuristic
  • Super-indexing data structures for ultra-fast domain-specific queries

Summary (just gimme the TL;DR)


This is old-style artificial intelligence in action, solving within seconds a problem that would normally take eons to complete.

You give it a crossword grid, and a long list of words, and it finds ways to mesh words into the grid so as to form a complete crossword puzzle.  The final working version was done in 2003 using C# version 1.2 with a minimalistic UI in WinForms. 

The following 30-second video shows the crossword compiler in action, filling multiple successive crosswords using a word list taken from actual crosswords that have been published on the interwebz by various sources through the years.  The video is in real time, showing that the crossword compiler is, in most cases, extremely fast.


A longer description for those who like reading


For my Graduation Project at California State University, San Bernardino in 1993 I chose to develop a Crossword Puzzle Compiler.  That's a program that you give it a crossword grid and a long list of words, and it weaves the words into the grid to produce a Crossword Puzzle. I wrote the program in C, on the killer machine that I had at home, an IBM PS/2 Model 80 running DOS, with an Intel 80386 clocked at 16Mhz, and equipped with an entire megabyte of RAM. I had not taken any Artifical Intelligence courses, so poor me was under the impression that I could just brute force it and it would work, since the machine was so incredibly fast.  Afterall, DOS directory listings scrolled by so much faster than on my old 8Mhz Intel 8088!

The program did work, but only in the sense that it would begin to fill the grid, and it would keep working on it without any errors. However, it would never actually complete the grid, because this is not one of those problems that you can just brute force: you need to be more clever than that.
 
The test grid was American-style 15 by 15 symmetrical, divided into 9 compartments that were roughly 4 by 4 each, each one of them communicating with the rest via only one or two longer words.  Back then we had text-mode VGA displays, so storing a single byte in the video RAM caused a character to appear on the screen, which made it very easy to have the program display its progress.  The program would fill the first 4 by 4 compartment within a fraction of a second, but then it would get stuck in the second compartment, unable to find a way to mesh the words. It would back-track to the first compartment, undo a word, and try the second compartment again for several seconds, only to back-track again. Hours would pass, and it was not getting any further than the second compartment. I once let it work overnight, and by the next day it had completed the second compartment and was now struggling with the third.  As I was watching it, it back-tracked from the third to the second, and to my great dismay after a few seconds it back-tracked to the first. I was sure that it would complete all 9 compartments one day, but I understood that the day was several centuries in the future.

I admitted failure to my professor, and he agreed to let me pick another subject, which I promptly did and graduated.  However, the desire to get that Crossword Puzzle Compiler to work stayed with me.

By the time I was back in Greece in the late nineties I hard learned about heuristics and search tree pruning, so I thought I would give it another go. I decided that I would first rewrite the whole thing in C++, and then I would try to find some heuristic to make it cut corners and complete crosswords in a timely manner.  When I had just gotten it to roughly the same point where the older version was, and before I had the chance to try any heuristics, my apartment was broken into and my laptop was stolen, so I lost all of the work that I had done.

In early 2003 I decided to give it another go, this time in a brand-new language called C#. (That was .NET Framework 1.1 on Visual Studio .NET 2003.) Either because the language made it so much easier to be productive, or because by that time I had learned so much more about Object Oriented Programming, I was able to re-create the program very quickly, and then I used a simple technique to cut corners, and thus avoid brute-forcing: when it had a list of words to consider, it would first assign a score to each word based on how many words could be found perpendicular to it, then it would sort the words by score, and then it would start trying each word in the list starting from the word with the highest score.  So, one day I hit the "Run" key on my keyboard, and after only a few seconds of waiting, a completely filled Crossword Puzzle appeared in front of me for the first time.  That must have been the most magic moment in my life. (Well, excluding certain moments with girls.)

The generated crosswords were not of a very good quality, because the Crossword Compiler would keep picking words that have high perpendicular counts, like "ADD" and "DAD". So, in 2011 I revisited the project with the intention improving the heuristic.  However, I knew that would be a time consuming process, involving a lot of trial and error, so I decided to first improve the linear running time of the program.  I did many optimizations, all of them structural and algorithmic, without any small-scale hacking and tweaking of the kind which shaves clock cycles but renders the code unmaintainable.

By far the greatest optimization that I did was what I called the Super Index.  The Crossword Dictionary Model of the Crossword Puzzle Compiler needs to be able to very quickly answer queries like "give me all 5-letter words with a the letter 'G' on the 3rd position."  A simplistic approach is to have words already grouped by word length, and to just enumerate all words of the requested length, looking for ones that have the requested letter in the requested position. However, we can do a lot better than that.  The Super Index is an index which has a sub-index for each word length, which has a sub-index for each position along that word length, which has a sub-index for each letter of the alphabet, which has a list of all words of that length, that contain that letter, at that position. Therefore, search through words is completely eliminated.  If I remember correctly, the introduction of the super index cut the running time of the Crossword Compiler to one twentieth of what it used to be.

No comments:

Post a Comment