From gtoal Sun Jun 6 19:21:54 1999
Received: (from root@localhost) by gtoal.com (8.8.5/8.8.5) id TAA12933; Sun, 6 Jun 1999 19:21:36 -0500 (CDT)
Date: Sun, 6 Jun 1999 19:21:36 -0500 (CDT)
From: Graham Toal
Message-Id: <199906070021.TAA12933@gtoal.com>
To: word-game-programmers@mit.edu
Subject: CALL FOR PARTICIPATION: word square algorithms
Cc: clong@eden.rutgers.edu, darrylfrancis@yahoo.com, eric@puzzlers.org, mail@comcal.softnet.co.uk, richardh@cwcom.net, wordways@juno.com
Status: R

This mail is going out to two gentlemen who have been active in
programming solutions to word squares, to a couple of people interested
in following any new developments in that area, and to the
word-game-programmers@mit.edu mailing list, where I hope we'll find some
new blood with fresh ideas to help us out with some difficult problems.

[Also I'm mailing this to Richard Hooker, reigning Computer Olympiad
Scrabble program author, whose colleague Tony Guilfoyle is currently in
hospital, as the latter may appreciate receiving a copy of this as a
diversion to occupy him until he recovers :-)  I believe Tony was
instrumental in the original dawg-building algorithm that Richard
designed, so he should be well versed in dawg stuff, which is almost a
prerequisite to efficient word-square generation.]

Since not everyone is as immersed in this problem as a couple of us are,
I'll go over some of the basics; apologies to those who know this
already.  Here are some word squares.  The first is the most common form
of word square, where the words across are the same as the words down -
symmetric across the major diagonal:

   CROSSES      \-----
   RICHEST      |\
   OCTETTE      | \
   SHEATHE      |
   SETTLER
   ESTHETE
   STEERED

A more prized square is a 'double' square, where the words down are not
the same as the words across, such as:

   ABACK
   CUNHA
   HIKER
   ELLEN
   STEPS

Next we have a rather fancy kind of square where every word can be read
both forwards and backwards, such as this one:

   kramer     (all these examples were generated by my program, though it
   romane      appears this latter one is a case of independent rediscovery,
   amunam      as Darryl Francis reports that Dmitri Borgmann, possibly
   manuma      among others, has already published this one!)
   enamor
   remark

Again, you'll note the symmetry down the main diagonal, and this time
also the other three diagonals!  A square like this which was *not*
symmetrical would be a rare find, for example:

   alala
   nagas
   amaga
   samen
   arara

I've also seen semi-double squares where the symmetry was down the
opposite diagonal, eg

   S   /
   P  /
   E /
   T/
   STEPS

although I don't have one of these to hand,

                     \ |
and of this form:     \|
                   ___\

   anana
   simar     Looking up all the words in this square quite expands your
   agama     vocabulary!  Blue Throated Macaw indeed!
   neper
   arara

I suspect there are also squares that form words down both diagonals as
well, though I haven't looked for these much yet.

--------

Anyway, enough of that.  The question is, how do we generate squares?
Pretty simple, at first glance: just put down a word horizontally, then
try another one underneath it, and another one underneath that, and once
you've filled a square, check that the downward words are also valid.
This is a simple multiple nested for-loop over the whole word list, and
takes D^L steps, where D is the size of the dictionary and L is the
length of a word - ie L nested loops over D words.  For L = 2, 3, or
maybe 4, this might even work.  However, the exponential term rapidly
increases the number of steps beyond any hope of generating squares this
way.  So, obviously some optimisation is needed.
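To make that concrete, here's the naive generator as a minimal C sketch -
an illustration only, not code from my program: the word list is
stand-in data, and L is fixed at 3 so the loop nest stays visible:

    /* Naive D^L generator sketched for L = 3: one nested loop per row,
       then a column check.  The word list here is stand-in data; a
       real run would load a dictionary instead. */
    #include <stdio.h>
    #include <string.h>

    #define L 3
    static const char *words[] = { "ate", "ten", "end", "eta", "net" };
    #define NWORDS (sizeof words / sizeof words[0])

    static int is_word(const char *w) {
        for (size_t i = 0; i < NWORDS; i++)
            if (strcmp(words[i], w) == 0) return 1;
        return 0;
    }

    int main(void) {
        for (size_t a = 0; a < NWORDS; a++)          /* row 1 */
          for (size_t b = 0; b < NWORDS; b++)        /* row 2 */
            for (size_t c = 0; c < NWORDS; c++) {    /* row 3 */
                const char *row[L] = { words[a], words[b], words[c] };
                char col[L + 1];
                int ok = 1;
                for (int x = 0; x < L && ok; x++) {  /* every column... */
                    for (int y = 0; y < L; y++)
                        col[y] = row[y][x];
                    col[L] = '\0';
                    ok = is_word(col);               /* ...must be a word */
                }
                if (ok)
                    printf("%s %s %s\n", row[0], row[1], row[2]);
            }
        return 0;
    }

Fine at L = 3; but the same structure on a ten-thousand-word list at
L = 8 is 10000^8 = 10^32 iterations, which is the wall described above.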
OK, here's the next trick: place a word in the first row.  Place a word
in the second row.  Start looking downwards right away, before you place
any more words.  Some placements of two words will generate the initial
2 letters of a downwards word which cannot form a real word, eg 'x'
below a 'v' as in

   VAGUS
   XENON

- we know there are no words starting in 'VX....', so we don't need to
try anything in the next three rows for that combination - so remove
XENON and try another word.  This of course can be generalised to
checking the downward stems at every row, and as it happens, a
dawg-based spelling checker is an ideal tool for checking that the
initial stem of a word is valid.

Next optimisation: having formed some rows above, is there any way you
can optimise which words you even try in the next horizontal row?  Well,
yes - our Scrabble programmers here will immediately have noticed the
similarity between this and 'cross checks' of vertical words when
playing a horizontal Scrabble move.  Each partially formed vertical word
has only a limited number of options available at the next free
position, so that when placing a word horizontally, the word is
effectively limited by a regular expression such as

   [aei][vblkm][erosp][e][dy]

so that a word such as ABRED could be placed here, but not ABAFT.  This
again is an easy test to construct: you do a dawg prefix check of the
downward stem for any column, and the dawg node you reach at the end of
that stem lists all the allowed letters in that position.

When placing a word horizontally, don't do it by iterating over all the
words in the dictionary, a whole word at a time, but by walking the trie
a letter at a time.  This way we efficiently trim the horizontal
placements the second we find a letter that cannot be placed - for
instance, if we've successfully placed AB in the example above, we know
it can't be followed by several letters (including say A or U), and so
we avoid ever trying ABACK, ABAFT, ABUSE etc etc.

Taken together, these two optimisations make the word square problem
tractable up to about 6x6 squares.

One further optimisation - low level, this time, not algorithmic - is to
compute the vertical cross-check sets incrementally.  Instead of
checking the whole column prefix every time we add a new row, we rather
keep a 2-D array of dawg nodes, so that as we add each row we have only
one step to calculate for each column rather than several.  This reduces
one term of the complexity function for this problem from +(L^2)/2 to
+L.  With this help, it's just barely possible to enumerate all 7x7
squares, if you have a powerful CPU and lots of time.
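For concreteness, here's a rough C sketch of that row-by-row walk - a
toy illustration of the idea, not an excerpt from my code: a plain trie
stands in for the dawg, and colnode[r][c] is the 2-D array of nodes just
described (the node reached by the first r letters of column c):

    /* Sketch of the dawg-walk generator for double squares, with a
       plain trie in place of a dawg (same interface, more memory).
       colnode[r][c] is the trie node reached by the first r letters
       of column c, so extending the square by one row costs one trie
       step per column instead of a re-walk of the column prefix. */
    #include <stdio.h>
    #include <stdlib.h>

    #define L 3                          /* square size */

    typedef struct Node { struct Node *kid[26]; int end; } Node;

    static Node *node_new(void) { return calloc(1, sizeof(Node)); }

    static void insert(Node *n, const char *w) {
        for (; *w; w++) {
            int i = *w - 'a';
            if (!n->kid[i]) n->kid[i] = node_new();
            n = n->kid[i];
        }
        n->end = 1;
    }

    static Node *root;
    static char grid[L][L + 1];
    static Node *colnode[L + 1][L];      /* column prefix nodes, by row */

    static void place(int row, int col, Node *rownode);

    static void next_row(int row) {
        if (row == L) {                  /* square complete: print it */
            for (int r = 0; r < L; r++) printf("%s\n", grid[r]);
            printf("\n");
            return;
        }
        place(row, 0, root);
    }

    static void place(int row, int col, Node *rownode) {
        if (col == L) {                  /* row word must be complete */
            if (rownode->end) next_row(row + 1);
            return;
        }
        for (int i = 0; i < 26; i++) {
            Node *rnext = rownode->kid[i];
            if (!rnext) continue;        /* no word continues this row */
            Node *cnext = colnode[row][col]->kid[i];
            if (!cnext) continue;        /* no word continues this column */
            if (row == L - 1 && !cnext->end) continue;  /* columns must finish */
            grid[row][col] = 'a' + i;
            grid[row][col + 1] = '\0';
            colnode[row + 1][col] = cnext;
            place(row, col + 1, rnext);
        }
    }

    int main(void) {
        static const char *words[] = { "ate", "ten", "end", "net", "eta" };
        root = node_new();
        for (size_t i = 0; i < sizeof words / sizeof *words; i++)
            insert(root, words[i]);
        for (int c = 0; c < L; c++) colnode[0][c] = root;
        next_row(0);
        return 0;
    }

The two 'continue' tests are the downward-stem prune and the horizontal
trie walk from the text; the row == L - 1 test is what forces every
column to finish as a complete word on the last row.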
----

This algorithm is not the only algorithm possible.  I have seen
published, in Tony Augarde's book The Oxford Guide to Word Games, the
following (unsuccessful) attempt at a 10x10 square by Frank Rubin:

   accomplish
   cooperancy
   copatentee
   opalescent
   metenteron
   prestation
   lancetooth
   interiorly
   scenootl
   hyetnnhy

It is clear from the published intermediate result that Frank Rubin must
be using a different algorithm, because the one I explained above would
never have generated the intermediate prefixes 'scenoo' or 'hyetnn',
because of the incremental checks as every letter was placed.  In fact
placement would never have gone beyond trying mententeror.

I suspect what he is doing is assuming the result will be a single
square (ie one which reads down the same as across) and placing each
word vertically and horizontally at the same time, starting with

   accomplish
   c
   c
   o
   m
   p
   l
   i
   s
   h

and going through

   accomplish
   cooperancy
   co
   op
   me
   pr
   la
   in
   sc
   hy

etc.  Making the assumption that you are looking for a single rather
than a double or other general square must shorten the generation time
considerably, though it is not obvious why he didn't trim the search at
the 5th or 6th word, when it was obvious that no further solutions were
possible.

It's only through stubborn pride that I haven't tried this optimisation
myself, I should add!  I definitely will try it soon, however - it seems
the most promising optimisation, if only we're willing to allow the
possibility of generating a subset of squares - but a subset that's
probably 95% of the solution space.  I just felt it would be too
frustrating to rule out any *really good* solutions, and a double 8x8
square has to be a worthy challenge.  I suspect the runtime win from
this simplification is probably orders of magnitude more valuable than
the loss of a few solutions.

-------------------

OK, so that's the basic background.  Why I'm mailing everyone is to take
this a step further, to generate all 8x8 through 10x10 squares, doubles
included.  But first, a little more complexity analysis and some
figures.

It soon becomes clear that the number of squares you can expect to find
for any given word list is a function of the number of words in that
list.  When I first ran the code on a relatively small dictionary, I
generated:

      356   2-letter squares from a list of    64 words
   221756   3-letter squares from a list of   626 words
  3718565   4-letter squares from a list of  2344 words
   314261   5-letter squares from a list of  4138 words
     7121   6-letter squares from a list of  6254 words
       12   7-letter squares from a list of  8517 words
       no   8-letter squares from a list of 10344 words
       no   9-letter squares from a list of  6148 words

I then tried a larger word list (but only on the larger squares, because
there was no point in generating more than the three million or so
4-squares I'd already found).  Here I found 56976 6-squares and 6174
7-squares, but the latter run was aborted before completion as it was
clearly going to take months.

Anyway, just looking at the 6-squares, we see that from a vocabulary
approximately 4 times larger, we build about 8 times as many squares.
So it's clear that there's a more than linear win from a larger word
list.  Intuitively you would expect that, but my intuition isn't good
enough to tell me the algorithmic complexity of it.  Anyone who would
like to suggest a function from first principles, please take a stab at
it.  Meanwhile I'll do some experiments with controlled vocabulary sizes
and determine the relationship empirically.  My guess is that S, the
number of squares, is some function of log_L(W), W being the vocabulary
size - the larger L reducing the number of solutions, and the larger W
increasing them.  Not a guess I'm confident of, however.  Complexity
theory was never my strong suit as a student...  (I'm not ashamed to
admit it.  I muddled through computer science on the strength of good
programming and some degree of intuition ;-) )

So... conclusions from the above:

 1) it's clear that for the larger squares, a MUCH larger vocabulary is
    required; and

 2) there is a function that can determine the vocabulary size needed
    before you can expect to find even one word square, for large word
    sizes.  We need to know what this function is.
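As a crude first stab at 2) - my own back-of-envelope, not Chris Long's
calculation - suppose every letter were uniformly random.  Each of the
L columns of a filled-in square is then a word with probability about
W/26^L, so the expected number of double squares is roughly
W^L * (W/26^L)^L = W^(2L) / 26^(L^2).  A few lines of C tabulate it in
log space:

    /* Back-of-envelope expected count of LxL double squares from a
       W-word vocabulary, assuming uniformly random letters:
           E = W^(2L) / 26^(L*L)
       Worked in log10 space because the exponents are huge.  The
       uniform-letter assumption is too pessimistic; real letter
       frequencies push these numbers up a long way. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double W = 52000.0;        /* our current 10-letter list */
        for (int L = 4; L <= 10; L++) {
            double log10E = 2.0 * L * log10(W)
                          - (double)L * L * log10(26.0);
            printf("L = %2d   log10(expected squares) = %8.1f\n",
                   L, log10E);
        }
        return 0;
    }

On the uniform model the expectation at L = 10 comes out around 10^-47,
ie hopeless - so whatever the true function is, the letter-frequency
correction must be doing almost all of the work.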
(Actually, I think we can only come up with approximations to such a
function.  I think in reality it depends on the average branching ratio
of each position in a word - eg the first position has 26 branches, but
a middle position may have only 10.  When you sum all these terms you
will get something approximating to, but somewhat less than, the
worst-case analysis of a vocabulary consisting of aaaaaaaaaa through
zzzzzzzzzz.)

It was reported to me that Chris Long estimates that the size of
vocabulary needed to have a fifty-fifty chance of finding a square is
about 250,000 words.  I'm cc'ing him on this mail in the hope that he'll
help us, and also let us know the details of that calculation.

----

One optimisation I'd like to make, but can't quite work out the details
of yet, is how to take advantage of the x-y symmetry that causes us to
generate essentially the same result twice for every square - reducing
the runtime by a factor of 2 isn't significant algorithmically, but it
sure would speed up my development loop...

Which brings us to... run times.  Although the algorithms used will
eventually determine *all* word squares for a given vocabulary, the
length of time it takes them to do so is critical.  It was suggested to
me by Darryl Francis that perhaps by building the squares bottom up
rather than top down, we could make more trimming decisions early in the
process, and therefore run times would be shorter.  This was inspired by
the fact that most human generators of word squares start bottom-up.
Intuitively it is plausible: the initial letters of words are spread
more or less equally across the alphabet, but the ends of words include
many common affixes such as -ed and -ing, so there is less choice at the
ends of words and consequently more pruning.  However, it turned out
that this is not the case, and that in fact generating word squares from
the bottom up takes much longer - between maybe 2 and 5 times longer on
some of the ones I tried.

[Aside: quick & dirty algorithm for bottom-up generation: just reverse
the letters of the words in your word list, use the same top-down code,
then print the result backwards.  A code sketch of this appears below.]

It *may* be true that this is a good optimisation for a version of the
program that attempts to find just *one* such square - somehow
increasing the probability of finding a good contender early.  However
I'm greedy and want to enumerate all possible squares :-)

[There's a good reason for this: there are many algorithms where doing X
once is of complexity N, but doing X N times is N^2 - whereas doing all
N X's at once is O(N log N).  Selecting the lowest-valued item in a
list, vs sorting the whole list, is a prime example.  Consequently I am
100% sure that calculating all word squares at once, though slower than
finding one at random, is MUCH more effective in the long run at finding
multiple squares.  And for a problem such as 10x10, where there may be
only one solution, it's effectively the same.]

However, my suspicion about why bottom-up is turning out to be so bad is
that there are so many fewer squares of the form

   \ ^     \--->
   \ |     |\
   \|

than there are of the form

   | \
   <--\
   v \

and I suspect the top-down approach favours the latter more easily.
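To make the reversal aside concrete, here it is end to end in C - a
sketch only: the word list is stand-in data, the call to the generator
is elided, and a hand-planted "found" square stands in for its output:

    /* The bottom-up trick: reverse every word in the list, run the
       ordinary top-down generator over the reversed list, then emit
       each solution rotated 180 degrees (row order and letter order
       both flipped), which turns it back into a square of real
       words, effectively built bottom-up. */
    #include <stdio.h>
    #include <string.h>

    #define L 3

    static void reverse_in_place(char *w) {
        for (size_t i = 0, j = strlen(w); i + 1 < j; i++, j--) {
            char t = w[i]; w[i] = w[j - 1]; w[j - 1] = t;
        }
    }

    /* Print a square found over the reversed word list as a square
       of real words: rows reversed, letters within rows reversed. */
    static void print_rotated(char grid[L][L + 1]) {
        for (int r = L - 1; r >= 0; r--) {
            for (int c = L - 1; c >= 0; c--) putchar(grid[r][c]);
            putchar('\n');
        }
    }

    int main(void) {
        /* Stand-in word list; a real run loads the dictionary. */
        char words[][L + 1] = { "ate", "ten", "end", "net", "eta" };
        for (size_t i = 0; i < sizeof words / sizeof words[0]; i++)
            reverse_in_place(words[i]);      /* "ate" -> "eta", ... */

        /* ...run the unchanged top-down generator on `words` here...
           Suppose it finds this square of reversed words:          */
        char found[L][L + 1] = { "dne", "net", "eta" };
        print_rotated(found);                /* prints ate/ten/end  */
        return 0;
    }

The 180-degree rotation is the "print the result backwards" step: a
plain row-by-row reversal alone would fix the rows but scramble the
columns.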
So, undaunted, I looked further to see if the bottom-up approach had any
other merit.  It turns out it may have.

Consider an algorithm which builds partial word squares, ie it stops as
soon as it has worked half way down a square.  This *ought* to have a
run time roughly equivalent to that of a square of size L/2, plus a
little more because we're generating longer rectangles rather than
squares (eg half a 10x10 is a 10x5, slightly more complex than a 5x5 but
nowhere near as complex as a 10x10).

If we could therefore generate all top halves of a 10x10 with one
program and all bottom halves with another, then marry them up
efficiently, we could remove orders of magnitude from the algorithmic
complexity.  A naive marrying up would be of order S^2 (where S is the
number of partial solutions).  Still, S^2 really should turn out to be
better than D^10.

But... we have another problem here.  We know from the early
experiments that the 4x4 squares generated 150Mb of results.  An 8x8
square decomposed into two 8x4's would likely generate something almost
as large.  That's a lot of disk space, and an S^2 solution iterating
over all those partial solutions would be very slow if you think of it
in terms of disk IO.  And keeping all those partial solutions in RAM is
not an option, at least not for the old 32Mb 486 that I'm doing this
work on.

And I would want to start on 8x8 before 10x10, because we know there are
several 8x8 solutions and we ought to find some within days, not months.
Experimenting with 10x10, not being sure we'd even find one, and
expecting the run to take weeks or months, is not an attractive
proposition when you're not even sure of your algorithm.  With a 10x10,
though, we would expect the intermediate storage problem and the S^2
term to be smaller; with anything smaller than 8x8, the storage factor
would be impossible to handle.

Fortunately... there's a solution!  Somewhat counter-intuitively, if we
generate 10x6 "top half"s and 10x6 "lower half"s rather than 10x5's, we
cut down the space requirement considerably.  (Remember 6x6 was
tractable and generated about 56 thousand squares - a manageable number
compared to 300 thousand 5x5's - with a small word list.  We would then
have to deal with 56000^2 rather than 300000^2, ie 3*10^9 vs 9*10^10 - a
factor of about 30, or the difference between a day of runtime and a
month.)

Well, if we do that, we get an extra benefit.  We can easily match up
top and bottom 'halves' by looking at the two words where they overlap.
Only pairs where those words are the same need to be checked more
carefully, to see whether, when joined together, the vertical columns
form real words.  This can be done efficiently for all top and bottom
halves at once by a simple system sort procedure, and is order S log S
(S being the expected number of half-squares as a function of L and W -
perhaps linear in L and exponential in W?).  A toy sketch of this
sort-and-join step appears below.

So we have reduced D^10 to D^6 + S log S, with the space requirements
increasing but remaining tractable.  It would appear intuitive that this
binary-split optimisation could be continued (eg find the 8x4 rectangles
by merging tops and bottoms of 8x2 rectangles), but then we hit the
exponential space requirement.  Clearly the space-time complexity of the
problem is exponential, and we just have to juggle the tradeoffs to find
ones that fit our patience and computing resources.

Results so far: I'm testing a 10x10 decomposed into 10x6's, but after a
week I have no partial solutions yet.  This may still turn out to be the
right thing to do - it may just be that with the wordlist I have, there
*are* no solutions.  Still, I would have expected some partials by now.
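Aside: for concreteness, here's roughly how that sort-and-join step
could look in C, scaled down to 4x4 squares split into 4x3 halves with a
two-row overlap (toy data, with qsort standing in for the system sort;
the real case would be 10x6 halves of a 10x10):

    /* Sketch of the sort-and-join step.  Top halves are keyed on
       their last two rows, bottom halves on their first two; sort
       both sets, join equal keys, then verify the columns. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define W  4                     /* word length / square size */
    #define H  3                     /* rows per half             */
    #define OV 2                     /* overlapping rows          */

    typedef struct { char rows[H][W + 1]; } Half;

    static const char *wordlist[] = { "card", "area", "rear", "dart" };

    static int is_word(const char *w) {
        for (size_t i = 0; i < sizeof wordlist / sizeof *wordlist; i++)
            if (strcmp(wordlist[i], w) == 0) return 1;
        return 0;
    }

    static int cmp_top(const void *a, const void *b) {  /* key: last OV rows */
        return memcmp(((const Half *)a)->rows[H - OV],
                      ((const Half *)b)->rows[H - OV], OV * (W + 1));
    }
    static int cmp_bot(const void *a, const void *b) {  /* key: first OV rows */
        return memcmp(((const Half *)a)->rows[0],
                      ((const Half *)b)->rows[0], OV * (W + 1));
    }

    int main(void) {
        Half tops[] = { { { "card", "area", "rear" } } };
        Half bots[] = { { { "area", "rear", "dart" } } };
        size_t nt = sizeof tops / sizeof *tops;
        size_t nb = sizeof bots / sizeof *bots;

        qsort(tops, nt, sizeof(Half), cmp_top);
        qsort(bots, nb, sizeof(Half), cmp_bot);

        /* With both sets sorted, one linear merge over equal keys
           gives the S log S behaviour; the double loop below is
           only for brevity with toy data. */
        for (size_t i = 0; i < nt; i++)
          for (size_t j = 0; j < nb; j++) {
            if (memcmp(tops[i].rows[H - OV], bots[j].rows[0],
                       OV * (W + 1)) != 0)
                continue;                        /* overlap words differ */
            char sq[W][W + 1];
            for (int r = 0; r < H; r++)
                strcpy(sq[r], tops[i].rows[r]);
            for (int r = OV; r < H; r++)
                strcpy(sq[r + H - OV], bots[j].rows[r]);
            int ok = 1;
            for (int c = 0; c < W && ok; c++) {  /* columns must be words */
                char col[W + 1];
                for (int r = 0; r < W; r++) col[r] = sq[r][c];
                col[W] = '\0';
                ok = is_word(col);
            }
            if (ok)
                for (int r = 0; r < W; r++) printf("%s\n", sq[r]);
          }
        return 0;
    }

The overlap rows appear in both halves, so equality of the two overlap
words is the whole join key; only matching pairs pay for the full column
check.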
It's also clear that the program will never terminate within the average
uptime of my machine, so I need to find a new approach (or rewrite the
program to save its state and restart it, losing the two weeks of
running I've already done :-( ).

So, that in a nutshell is the problem and the work done so far.  I'd
like to encourage the word-game-programmers mailing list to think about
the problem and come up with new suggestions; and I'd like to appeal to
the few people who over the years have specialised in generating word
squares to band together and see if we can pull off the big one - the
first 10x10 square of real words.  At the very least, I think we ought
to document all the different tricks we know and put the information on
the web for the benefit of future attempts.

Darryl Francis has already put me in touch with Ted Clarke and Eric
Albert, and I'm trying to contact Frank Rubin.  If you know of any other
word-square aficionados with email, do let me know.

Currently our 10-letter vocabulary is 52000 words, all from respectable
sources (ie no junk words from Unix spelling checkers).  However I
suspect we may have to resort to adding words from specialist sources
(eg Latin terms in medicine, botany etc) and/or foreign words, if in
fact no solution is likely for a list that size, except by sheer
odds-against luck.

----

If anyone reading this wants to join in public discussion rather than
private mail, and is not already a member of the mailing list
word-game-programmers@mit.edu, you're welcome to join us.  Send email to
Sherrie St John at saint@mit.edu and she'll add you manually; I don't
think we have an automatic subscription address.  Traffic on this list
is generally quite low.

Anyone who wants to experiment and doesn't already have code to work
with is entirely welcome to steal mine, from

   http://gtoal.com/~gtoal/scrabble/wordsquare/ws-stable.c

I recommend against using any of the other versions in that directory,
as they are all 'work in progress'.  Only files with "-stable" can be
trusted.

Graham