From gtoal Sun Jun  6 19:21:54 1999
Received: (from root@localhost)
	by gtoal.com (8.8.5/8.8.5) id TAA12933;
	Sun, 6 Jun 1999 19:21:36 -0500 (CDT)
Date: Sun, 6 Jun 1999 19:21:36 -0500 (CDT)
From: Graham Toal <gtoal>
Message-Id: <199906070021.TAA12933@gtoal.com>
To: word-game-programmers@mit.edu
Subject: CALL FOR PARTICIPATION: word square algorithms
Cc: clong@eden.rutgers.edu, darrylfrancis@yahoo.com, eric@puzzlers.org,
        mail@comcal.softnet.co.uk, richardh@cwcom.net, wordways@juno.com
Status: R

This mail is going out to two gentlemen who have been active in
programming solutions to word squares, to a couple of people interested
in following any new developments in that area, and to the 
word-game-programmers@mit.edu mailing list where I hope
we'll find some new blood with fresh ideas to help us out with
some difficult problems.  [Also I'm mailing this to Richard Hooker,
reigning Computer Olympiad Scrabble program author, whose colleague
Tony Guilfoyle is currently in hospital, as the latter may appreciate 
receiving a copy of this as a diversion to occupy him until he
recovers :-) I believe Tony was instrumental in the original
dawg-building algorithm that Richard designed so he should be
well versed in dawg stuff which is almost a prerequisite to efficient
wordsquare generation]

Since not everyone is as immersed in this problem as a couple of us are,
I'll go over some of the basics; apologies to those who know this
already.

Here are some word squares:

The first is the most common form of word square, where the words across
are the same as the words down, symmetric across the major diagonal.

CROSSES     \-----
RICHEST     |\
OCTETTE     | \
SHEATHE     |
SETTLER
ESTHETE
STEERED


A more prized square is a 'double' square, where the words down are
not the same as the words across, such as:

ABACK
CUNHA
HIKER
ELLEN
STEPS

Next we have a rather fancy kind of square where every word can be
read both forwards and backwards, such as this one:

kramer    (all these examples were generated by my program, though it
romane     appears this latter one is a case of independent rediscovery
amunam     as Darryl Francis reports that Dmitri Borgmann, possibly
manuma     among others, has already published this one!)
enamor
remark

Again, you'll note the symmetry down the main diagonal, and this time
also the other three diagonals!  A square like this which was *not*
symmetrical would be a rare find, for example:

alala
nagas
amaga
samen
arara


I've also seen semi-double squares where the symmetry was down the
opposite diagonal, eg 

S   /
P  /
E /
T/
STEPS

although I don't have one of these to hand.

                     \ |
and of this form:     \|
                    ___\
anana
simar    Looking up all the words in this square quite expands your
agama    vocabulary!  Blue Throated Macaw indeed!
neper
arara


I suspect there are also squares that form words down both diagonals
as well, though I haven't looked for these much yet.

--------


Anyway, enough of that.  The question is, how do we generate squares?

Pretty simple, at first glance:

Just put down a word horizontally, then try another one underneath it,
and another one underneath that and once you've filled a square, check
that the downward words are also valid.

This is a simple multiple nested for-loop over the whole word list, and
takes D^L steps, where D is the size of the dictionary and L is the length
of a word - i.e. L nested loops over D words.

For L = 2, 3, or maybe 4, this might even work.  However, the exponential
term rapidly increases the number of steps beyond any hope of generating
squares this way.
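As a concrete illustration, here's that naive scheme in Python - a
deliberately hopeless brute force, with a toy 2-letter word list of my
own invention (all names here are mine):

```python
from itertools import product

def naive_squares(words):
    """Brute force: try every L-tuple of words as rows (D^L combinations
    for D words of length L), then verify every column is also a word."""
    wordset = set(words)
    n = len(words[0])
    for rows in product(words, repeat=n):
        cols = ("".join(row[c] for row in rows) for c in range(n))
        if all(col in wordset for col in cols):
            yield rows

# Toy 2-letter vocabulary; AB/NO is a double square (AN and BO read down).
squares = list(naive_squares(["ab", "no", "an", "bo"]))
```

Even this tiny list yields four squares; a real dictionary makes the
D^L loop count astronomical long before L = 7.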

So, obviously some optimisation is needed.  OK, here's the next trick -
place a word in the first row.  Place a word in the second row.  Start
looking downwards right away before you place any more words.  Some
placements of two words will generate the initial 2 letters of a downwards
word which cannot form a real word, eg 'x' below a 'v' as in

VAGUS
XENON

- we know there are no words starting in 'VX....', so we don't need
to try anything in the next three rows for that combination
- so remove XENON and try another word.

This of course can be generalised to checking the downward stems
at every row, and as it happens, a dawg-based spelling checker is
an ideal tool for checking that the initial stem of a word is valid.
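As a sketch of this in Python, with a set of valid word prefixes standing
in for the dawg (the dawg answers the same question far more compactly;
the function names and toy word list are mine):

```python
def squares_with_pruning(words):
    """Row-by-row backtracking: after tentatively adding each row, every
    column's downward stem must still be a valid word prefix, otherwise
    the whole subtree of placements below it is skipped."""
    n = len(words[0])
    prefixes = {w[:i] for w in words for i in range(n + 1)}
    results = []

    def place(rows):
        if len(rows) == n:
            results.append(tuple(rows))
            return
        for w in words:
            stems = ("".join(r[c] for r in rows) + w[c] for c in range(n))
            if all(stem in prefixes for stem in stems):
                place(rows + [w])

    place([])
    return results

found = squares_with_pruning(["ab", "no", "an", "bo"])
```

The VX example above is exactly what the prefix test catches: XENON
under VAGUS dies immediately, so the three rows below are never tried.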


Next optimisation:  having formed some rows above, is there any way
you can optimise which words you even try in the next horizontal row?

Well, yes... - our Scrabble programmers here will immediately have
noticed the similarity between this and 'cross checks' of vertical
words when playing a horizontal Scrabble move.

Each partially formed vertical word has only a limited number of
options available at the next free position, so that when placing
a word horizontally, the word might be limited effectively by a
regular expression such as: [aei][vblkm][erosp][e][dy], so that
a word such as ABRED could be placed here, but not ABAFT.

This again is an easy test to construct.  You do a dawg prefix check
of the downward stem for any column, and the dawg node you reach
at the end of that stem will list all the allowed letters in that
position.  When placing a word horizontally, don't do it by iterating
over all the words in the dictionary, a whole word at a time, but
do it by walking the trie a letter at a time.  This way we efficiently
trim the horizontal placements the second we find a letter that
cannot be placed - for instance, if we've successfully placed AB in the
example above, we know it can't be followed by several letters (including,
say, A or U) and so we avoid ever trying ABACK, ABAFT, ABUSE etc etc.
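To make the letter-at-a-time walk concrete, here's a sketch in Python
with a plain dict-of-dicts trie standing in for the dawg (names and the
toy word list are mine):

```python
def make_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True          # end-of-word marker
    return root

def trie_squares(words):
    """Letter-at-a-time search: col_nodes[c] is the trie node reached by
    column c's downward stem, so the letters allowed in the next cell are
    just the intersection of the horizontal and vertical trie children -
    the 'cross check' described above."""
    n = len(words[0])
    root = make_trie(words)
    results = []

    def place_row(rows, col_nodes):
        if len(rows) == n:
            if all("$" in node for node in col_nodes):
                results.append(tuple(rows))
            return
        extend(root, col_nodes, 0, rows, [])

    def extend(row_node, col_nodes, c, rows, letters):
        if c == n:
            if "$" in row_node:                    # horizontal word complete
                place_row(rows + ["".join(letters)], col_nodes)
            return
        for ch, nxt in row_node.items():
            if ch != "$" and ch in col_nodes[c]:   # the cross check
                stepped = col_nodes[:c] + [col_nodes[c][ch]] + col_nodes[c+1:]
                extend(nxt, stepped, c + 1, rows, letters + [ch])

    place_row([], [root] * n)
    return results

squares = trie_squares(["ab", "no", "an", "bo"])
```

Because the horizontal walk is shared letter by letter, rejecting one
letter after a prefix kills every word continuing that prefix in a
single step, instead of one dictionary entry at a time.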

Taken together these two optimisations make the word square problem
tractable up to about 6x6 squares.  One further optimisation - low
level this time, not algorithmic - is to compute the vertical cross-check
sets incrementally.  Instead of re-checking the whole downward prefix
every time we add a new row, we keep a 2-D array of dawg nodes,
so that as we add each row, we have only one step to calculate for
each column rather than several.  This reduces the cross-check cost
per column from about (L^2)/2 steps over the whole square to L.

With this help, it's just barely possible to enumerate all 7x7 squares,
if you have a powerful CPU and lots of time.
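That incremental bookkeeping can be sketched like so, with a
dict-of-dicts again standing in for the dawg (the toy trie and names
are mine):

```python
def step(node, ch):
    """One trie/dawg transition -- a single memory lookup."""
    return node.get(ch)

def advance_row(col_nodes, new_row):
    """Advance every column's saved node by one letter: L transitions per
    new row, instead of re-walking each column stem from the root (which
    would cost 1 + 2 + ... + L, i.e. about L^2/2 transitions per column
    over the whole square)."""
    return [step(node, ch) for node, ch in zip(col_nodes, new_row)]

# Toy trie holding the single column word "ab":
trie = {"a": {"b": {"$": True}}}
cols = advance_row([trie], "a")   # after row 1, the column stem is "a"
cols = advance_row(cols, "b")     # after row 2, the column stem is "ab"
```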

----

This algorithm is not the only one possible.  Tony Augarde's book,
The Oxford Guide to Word Games, includes the following (unsuccessful)
attempt at a 10x10 square by Frank Rubin:

accomplish
cooperancy
copatentee
opalescent
metenteron
prestation
lancetooth
interiorly
scenootl
hyetnnhy

It is clear from this published intermediate result that Frank Rubin
must be using a different algorithm, because the one I explained above
would never have generated the intermediate prefixes 'scenoo' or
'hyetnn', thanks to the incremental checks as every letter was placed.
In fact placement would never have gone beyond trying metenteron.

I suspect what he is doing is assuming the result will be a single square
(ie one which reads down the same as across) and he is placing each
word vertically and horizontally at the same time, starting with

accomplish
c
c
o
m
p
l
i
s
h

going through

accomplish
cooperancy
co
op
me
pr
la
in
sc
hy

etc.  Making the assumption that you are looking for a single rather
than a double or other general square must shorten the generation time
considerably, though it is not obvious why he didn't trim the search
at the 5th or 6th word, when it was obvious that no further solutions
were possible.
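Here's a sketch of that guessed-at single-square strategy (my
reconstruction, not necessarily what Frank Rubin actually does):
because word i serves as both row i and column i, its first i letters
are forced by the words already placed, and all candidates extending
that forced prefix sit in one contiguous run of a sorted word list:

```python
from bisect import bisect_left

def symmetric_squares(words):
    """Single (symmetric) squares only: word i is both row i and column i,
    so its first i letters are determined by the earlier words.  Candidates
    are located by binary search on the sorted list."""
    words = sorted(words)
    n = len(words[0])
    results = []

    def place(chosen):
        if len(chosen) == n:
            results.append(tuple(chosen))
            return
        i = len(chosen)
        prefix = "".join(w[i] for w in chosen)   # forced by symmetry
        k = bisect_left(words, prefix)
        while k < len(words) and words[k].startswith(prefix):
            place(chosen + [words[k]])
            k += 1

    place([])
    return results
```

Nothing here ever verifies columns separately: in a symmetric square
the columns *are* the rows, which is exactly where the speedup comes
from - and why it can't find double squares.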

It's only through stubborn pride that I haven't tried this optimisation
myself, I should add!  I definitely will try it soon, however - it
seems the most promising optimisation, if only we're willing to
allow the possibility of generating a subset of squares - but a
subset that's probably 95% of the solution space.  I just felt
it would be too frustrating to rule out any *really good* solutions,
and a double 8x8 square has to be a worthy challenge.  I suspect
the runtime win from this simplification is probably orders of
magnitude more valuable than the loss of a few solutions.

-------------------

OK, so that's the basic background.  Why I'm mailing everyone is to
take this a step further, to generate all 8x8 through 10x10 squares,
doubles included.

But first, a little more complexity analysis and some figures:

It soon becomes clear that the number of squares you can expect
to find for any given word list is a function of the number of
words in that list.  When I first ran the code on a relatively
small dictionary, I generated:

356 2-letter squares from a list of 64 words
221756 3-letter squares from a list of 626 words
3718565 4-letter squares from a list of 2344 words
314261 5-letter squares from a list of 4138 words
7121 6-letter squares from a list of 6254 words
12 7-letter squares from a list of 8517 words
no 8 letter squares from a list of 10344 words
no 9 letter squares from a list of 6148 words

I then tried a larger word list (but only on the larger squares,
because there was no point in generating more than the three million
or so 4-squares I'd already found).

Here I found 56976 6-squares, and 6174 7-squares but the latter run
was aborted before completion as it was clearly going to take months.

Anyway, just looking at the 6-squares, we see that from a vocabulary
that was approximately 4 times larger, we build about 8 times as
many squares.  So it's clear that there's a more than linear win from
a larger word list.  Intuitively you would expect that, but my intuition
isn't good enough to tell me the algorithmic complexity of it.  Anyone
who would like to suggest a function from first principles, please
take a stab at it.  Meanwhile I'll do some experiments with controlled
vocabulary sizes and determine the relationship empirically.

My guess is that S, the expected number of squares, is a function of
log_L(W) - larger L reducing the number of solutions, and larger W
increasing them.  Not a guess I'm confident of, however.  Complexity
theory was never my strong suit
as a student... (I'm not ashamed to admit it.  I muddled through
computer science on the strength of good programming and some degree
of intuition ;-) )

So... conclusions from the above: 1) it's clear that for
the larger squares, a MUCH larger vocabulary is required; and 
2) there is a function that can determine the vocabulary size needed
before you can expect to find even 1 word square, for large word
sizes.  We need to know what this function is.

(Actually, I think we can only come up with approximations to such
a function.  I think in reality it depends on the average branching
ratio of each position in a word - eg the first position has 26
branches, but the middle positions may have only 10.  When you sum
all these terms you will get something approximating to, but somewhat
less than, the worst-case analysis of a vocabulary running
from aaaaaaaaaa through zzzzzzzzzz.)

It was reported to me that Chris Long estimates that the size of
vocabulary needed to have a fifty-fifty chance of finding a square
is about 250,000 words.  I'm cc'ing him on this mail in the hope
that he'll help us and also let us know the details of that
calculation.
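I don't know how that 250,000 figure was derived, but here is one
crude back-of-envelope model that lands in the same ballpark (my
sketch, not necessarily Chris Long's calculation; the branching
factor of 12 is a number I've picked for illustration, not measured):

```python
def vocab_for_one_square(length, branching):
    """Random-text model: W words behave like random strings over an
    'effective alphabet' of b letters per position.  There are W^L ways
    to choose the rows, and each of the L columns is then a word with
    probability about W / b^L, so E[squares] = W^L * (W / b^L)^L
    = W^(2L) / b^(L^2).  Setting E = 1 gives W = b^(L/2)."""
    return branching ** (length / 2)

needed = vocab_for_one_square(10, 12)   # b = 12: an illustrative guess
```

For L=10 and b=12 this gives 12^5 = 248,832 words, suspiciously close
to the quoted 250,000.  With the full b = 26 it would demand 26^5,
nearly 12 million words - a measure of how much the non-randomness of
English letter sequences helps us.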

----

One optimisation I'd like to make but can't quite work out the
details of yet is how to take advantage of the x-y symmetry that causes
us to generate essentially the same result twice for every square -
reducing the runtime by a factor of 2 isn't significant algorithmically
but it sure would speed up my development loop...

Which brings us to... run times.  Although the algorithms used will
eventually determine *all* word squares for a given vocabulary, the length
of time it takes them to do so is critical.  It was suggested to me
by Darryl Francis that perhaps by building the squares bottom up rather
than top down, we could make more trimming decisions early in the process
and therefore run times would be shorter.  This was inspired by the
fact that most human generators of word squares start bottom-up.
Intuitively it is plausible, as the initial letters of all words
are spread equally across the alphabet, more or less, but the ends
of words include many common affixes such as -ed -ing etc, and therefore
there is less choice at the ends of words and consequently more
pruning.

However, it turned out that this is not the case: in fact generating
word squares from the bottom up takes much longer - maybe 2 to 5 times
longer on some of the ones I tried.

(Aside: a quick & dirty algorithm for bottom-up generation: just reverse
the letters of the words in your word list, use the same top-down
code, then print the result backwards - rows in reverse order, each
row's letters reversed.)
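The reversal trick in full, as a sketch (the minimal top-down
enumerator here is just the brute force again, purely for
illustration; names are mine):

```python
from itertools import product

def top_down(words):
    """Minimal top-down enumerator (brute force, for illustration only)."""
    wordset, n = set(words), len(words[0])
    for rows in product(words, repeat=n):
        if all("".join(r[c] for r in rows) in wordset for c in range(n)):
            yield rows

def bottom_up(words):
    """Bottom-up via the reversal trick: reverse every word, run the
    unchanged top-down code, then rotate each solution 180 degrees
    (rows in reverse order, each row's letters reversed)."""
    rev = [w[::-1] for w in words]
    for square in top_down(rev):
        yield tuple(row[::-1] for row in reversed(square))
```

Both directions enumerate exactly the same set of squares; only the
order of discovery - and hence the pruning behaviour - differs.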

It *may* be true that this is a good optimisation for a version of
the program that attempts to find just *one* such square - increasing
the probability somehow of finding a good contender.  However I'm greedy
and want to enumerate all possible squares :-)  [There's a good
reason for this: there are many algorithms where doing X once
is of complexity N, but doing X N times is N^2 - whereas doing
all N X's at once is O(N log N).  Selecting the lowest-valued
item in a list, vs sorting the list, is a prime example.  Consequently
I am 100% sure that calculating all word squares at once, though
slower than finding one at random, is MUCH more effective in the
long run at finding multiple squares.  And for a problem such as
10x10, where there may only be 1 solution, it's effectively the same.]

However my suspicion about why this is turning out to be so bad is
because there are so many fewer squares of the form

\  ^                               \--->
 \ |                               |\
  \|   than there are of the form  | \
<--\                               v  \

and I suspect the top-down approach favours the latter more easily.

So, undaunted, I looked further to see if the bottom-up approach had any
other merit.  It turns out it may have:

Consider an algorithm which builds partial word squares, ie it stops
as soon as it has worked half way down a square.  This *ought* to have
a run time roughly equivalent to that of a square of size (L/2), plus
a little more because we're generating longer rectangles rather than
squares (eg half a 10x10 is a 10x5, slightly more complex than a
5x5 but nowhere near as complex as a 10x10).

If we could therefore generate all top-halves of a 10x10 with one
program and all bottom halves with another, then marry them up
efficiently, we could remove orders of magnitude from the algorithmic
complexity.  A naive marrying up would be of order S^2 (where S is the
number of partial solutions).  Still, S^2 really should turn out to
be better than D^10.

But... we have another problem here: we know from the early experiments
that the 4x4 run generated 150Mb of results.  An 8x8 square decomposed
into two 8x4's would likely generate something almost as large.  That's
a lot of disk space, and an S^2 solution iterating over all of it would
be very slow if you think of it in terms of disk IO.  And keeping all
those partial solutions in RAM is not an option, at least not for my
old 32Mb 486 that I'm doing this work on.

And I would want to start on 8x8 before 10x10 because we know there
are several 8x8 solutions and we ought to find some within days, not
months.  Experimenting with 10x10, not being sure we'd even find one,
and expecting to take weeks or months is not an attractive proposition
when you're not even sure of your algorithm.

With a 10x10, though, we would expect the intermediate storage problem
and the S^2 term to be smaller; with anything smaller than 8x8,
the storage factor would be impossible to handle.

Fortunately... there's a solution!  Somewhat counter-intuitively, if
we generate 10x6 top halves and 10x6 bottom halves rather than 10x5's,
we cut down the space requirement considerably.  (Remember 6x6 was
tractable and generated about 56 thousand squares with a small word
list - a manageable number compared to 300 thousand 5x5's; we'd have
to deal with 56000^2 vs 300000^2, ie 3*10^9 vs 9*10^10, a factor of
about 30.  That's the difference between a day of runtime and a month.)

Well, if we do that, we get an extra benefit.  We can easily match up
top and bottom 'halves' by looking at the two words which overlap.  Only
squares where those words are the same need to be checked more carefully,
to see if when joined together, the vertical columns form real words.

This can be done efficiently for all top and bottom halves at once
by a simple system sort procedure, and is of order S log S.

(S being the expected # of squares as a function of L and W - perhaps
linear in L and exponential in W?)

So we have reduced D^10 to D^6 + S log S, with the space requirements
increasing but remaining tractable.
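A sketch of the marrying-up step, with an in-memory dict doing the
grouping that a disk-based system sort would do at real data volumes
(all names here are mine; `is_word` stands in for the dawg lookup):

```python
from collections import defaultdict

def marry(tops, bottoms, is_word):
    """Match half-squares on their two overlapping rows.  Grouping the
    bottoms by that overlap key replaces the naive S^2 all-pairs
    comparison with roughly S log S work."""
    by_overlap = defaultdict(list)
    for b in bottoms:
        by_overlap[b[:2]].append(b)            # first two rows of a bottom
    for t in tops:
        for b in by_overlap.get(t[-2:], ()):   # last two rows of a top
            square = t + b[2:]
            n = len(square)
            cols = ("".join(row[c] for row in square) for c in range(n))
            if all(is_word(col) for col in cols):
                yield square
```

Only pairs whose overlapping words agree ever reach the expensive
full-column check.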

It would appear intuitive that this binary split optimisation could
be continued (eg find the 8x4 rectangles by merging top and bottom of
8x2 rectangles) but then we hit the exponential space requirement.
Clearly the space-time complexity of the problem is exponential, and
we just have to juggle the tradeoffs to find ones that fit our
patience and computing resources.

Results so far: I'm testing a 10x10 decomposed into 10x6 halves, but
after a week I have no partial solutions yet.  This may still turn out to
be the right thing to do - it may just be that with the wordlist I have,
there *are* no solutions.  Still, I would have expected some partials
by now.  It's also clear that the program will never terminate within
the uptime average of my machine, so I need to find a new approach
(or rewrite the program to save state, and restart it, losing the
two weeks I've already been running :-( )

So, that in a nutshell is the problem and the work done so far.  I'd
like to encourage the word-game-programmers mailing list to think about
the problem and come up with new suggestions; and I'd like to appeal
to the few people who over the years have specialised in generating
word squares to band together and see if we can pull off the big one -
the first 10x10 square of real words.  At the very least, I think
we ought to document all the different tricks we know and put the
information on the web for the benefit of future attempts.
Darryl Francis has already put me in touch with Ted Clarke and Eric Albert,
and I'm trying to contact Frank Rubin.  If you know of any other
word-square aficionados with email, do let me know.

Currently our 10-letter vocabulary is 52000 words, all from respectable
sources (ie no junk words from Unix spelling checkers).  However I
suspect we may have to resort to adding words from specialist
sources (eg Latin terms from medicine, botany etc) and/or foreign
words, if in fact no solution is likely for a list that size,
except by sheer odds-against luck.

----

If anyone reading this wants to join in public discussion rather than
private mail, and is not already a member of the mailing list 
'word-game-programmers@mit.edu', you're welcome to join us.  Send
email to Sherrie St John at saint@mit.edu and she'll add you manually;
I don't think we have an automatic subscription address.  Traffic
on this list is generally quite low.


Anyone who wants to experiment and doesn't already have code to
work with is entirely welcome to steal mine, from

http://gtoal.com/~gtoal/scrabble/wordsquare/ws-stable.c

I recommend against using any of the other versions in that directory,
as they are all works in progress.  Only files with "-stable" in the
name can be trusted.

Graham


=============================================================

This is a very short description of my wordsquare program.
To understand the implementation details as opposed to the
big picture you will also need to read the document on
DAWGs (Directed Acyclic Word Graphs) which I'll be writing next.

The high-level part of the algorithm is actually very obvious,
basic and simple:  take an empty square then put down a word
in the top row.  Place a word underneath it, and if the downward
columns look like they can also form valid words (checked by
looking for the starts of the word in the dictionary), then
continue to place words on subsequent rows until the square is
full.  If at any time some partially formed word in a downward
column is nonsense, remove the last word you placed and try another
one.  Sometimes you may backtrack more than one row before you
find another word that can be placed which allows sensible partially
formed vertical words.

That's the basic algorithm, but in practice it is refined considerably
because as it stands it would take forever to finish.

To make the description easier, let's assume we've placed 5 full
rows successfully, and are looking at placing the 6th row.  We could
simply try each word in the dictionary sequentially, and as we place
each word, check the 6-letter downward stems, to confirm that we have
not blocked placement of a 7th row by some nonsensical downward
stem, such as "peacez....".  But by placing whole words at a time,
we might find ourselves repeating the first several vertical checks
every time - eg if we place "superhuman" on row 6 and the vertical
check fails on the 9th column, then place "supersonic" on row 6,
we end up checking the same 5 vertical stems again, the ones that
intersect with the prefix "super".

So, rather than placing whole words at a time and backtracking whole
words when one doesn't allow further placements, we actually place
the horizontal words one letter at a time, and when a letter fails,
we backtrack just one letter.  So, after placing "supersoni" and
realising that the "i" blocked a downward placement in column 9, we
would then look for something else beginning with "superson..".  Not
finding any, we backtrack to "superso..." etc until we eventually
get to a position "super....." which allows another letter to be
placed, in this case "superh...." which is the start of "superhuman".

Next, the big optimisation comes from never actually checking
the vertical prefixes twice.  We use the 'dawg' structure to perform
an incremental spelling check down the vertical columns, BUT as
we build the square up row by row, we store the intermediate
stages of each vertical cross-check in another square array the
same size as the wordsquare, and so when we add one more row to the
square, for each column we only have to check that one more letter
can be added to the partial word at that stage.  Let's say we had
already formed the partial word 'peace' on one of the downward
columns, then if the horizontal letter being placed in the 6th
row at that column is say 'm', we know from our partially checked
word 'peace' that it can be followed at that point by 'maker'
and so 'm' is valid.

Finally, having this mechanism in place means that we know that
immediately below any partially formed vertical word, only a small
number of letters are valid - eg with 'peace.....' we may have only
"peacemaker" or "peacetimes" so only 'm' or 't' are valid at that
point.  This means we no longer have to try every word in the
dictionary in turn - because we can limit the letters in words we
are placing horizontally to only those that allow vertical words
to be formed.  This cuts down the iterations tremendously.  The only
extra trick needed here is to mark the first row as allowing any
letter from 'a' to 'z', so that all words are valid in the first row;
from the second row onwards, only horizontally placed letters that
allow valid downward words to be formed are ever tried.
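The 'peace' example above, as a minimal sketch with a dict-of-dicts
trie standing in for the dawg (names are mine):

```python
def make_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True                  # end-of-word marker
    return root

def allowed_below(trie, stem):
    """The set of letters that may legally extend a partial downward word:
    walk the stem, then read off the children of the node reached."""
    node = trie
    for ch in stem:
        node = node.get(ch)
        if node is None:                  # stem is not a valid prefix
            return set()
    return {ch for ch in node if ch != "$"}

trie = make_trie(["peacemaker", "peacetimes"])
letters = allowed_below(trie, "peace")    # only 'm' and 't' are allowed
```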

This whole strategy of trimming what is placed horizontally by
the partially-formed vertical words works only because the data
structure that we use to represent the word list is a trie (folded
into a dawg), and for any position in any word, we have instant
access to the suffixes that can follow that prefix, for the cost
of a single memory operation rather than any complex database lookup.

The next iteration of this document will include some examples
and partially formed squares, and a link to the description of the
dawg data structure.

Graham Toal
Texas, 15Aug99


