From Steve.Thomas@isltd.insignia.com Thu Mar 13 04:46:22 1997 Received: from MIT.EDU (SOUTH-STATION-ANNEX.MIT.EDU [18.72.1.2]) by gtoal.com (8.6.12/8.6.12) with SMTP id EAA05800 for ; Thu, 13 Mar 1997 04:46:22 -0600 Received: from mailgate.isltd.insignia.com by MIT.EDU with SMTP id AA24318; Thu, 13 Mar 97 06:05:11 EST X-Address: Insignia Solutions plc., High Wycombe, Bucks, HP11 1JU, UK X-Telephone: +44 1494 459426 X-Fax: +44 1494 459720 Received: from tricity.isltd.insignia.com by mailgate.isltd.insignia.com (5.65c8/UK-2.1.ISL) with SMTP id AA10803; Thu, 13 Mar 1997 11:04:56 GMT Received: from localhost by tricity.isltd.insignia.com (1.37.109.4/UK-2.1.ISL) with SMTP id AA04393; Thu, 13 Mar 97 11:05:03 GMT Sender: Steve.Thomas@isltd.insignia.com Message-Id: <3327DF5E.D5D@insignia.co.uk> Date: Thu, 13 Mar 1997 11:05:02 +0000 From: Steve Thomas Organization: Insignia Solutions Ltd X-Mailer: Mozilla 3.01Gold (X11; I; HP-UX A.09.01 9000/735) Mime-Version: 1.0 To: "WODRICH, M, , WDRMAR002" Cc: word-game-programmers@MIT.EDU Subject: Re: Dawg References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Status: RO WODRICH, M, , WDRMAR002 wrote: > All this talk about DAWG's has been very interesting, but no-one has > mentioned whether there is a "standard" file format for DAWGs. I take > it every coder invents his own storage mechanism ? I don't know whether there's a standard format, though Appel & Jacobson gave some pretty string hints in their original description which probably means most DAWGs are rather similar. > Another question : How fast are the standard DAWG builders? I'd like > to know how my method compares... On a P100, I get around 20 thousand > words per second (5 secs for 100,000 words), which seems good. (An > early version of my builder took ages to build large DAWGs). I've just timed mine, and it takes 2.1s CPU time (about 5s elapsed) for OSW's 140000 words, running on a reasonably swift HP workstation. There's been no attempt to make the I/O swift, but then there's no real incentive to optimise a program which takes a few seconds and only need be run once in a great while. Of course, if you go the wrong way about it DAWG-builders can take many hours, as we discovered by experiment. I did spend a while building GADDAGs (a data structure due to Steven Gordon) and there the most time-consuming bit was sorting the modified word list. (A GADDAG is a DAG containing, not words as such, but all patterns rev(X)*Y, where XY is a word and X is non-empty, or some such. The details can probably be dragged out of long-term storage with sufficient incentive. A GADDAG is about 3-4 times as big as a DAWG with the same lexicon, and can be used to reduce the relatively aimless searching in the spaces to the left of (or above) a hook square. Gordon suggests that this approximately doubles the speed of a Scrabble engine.) Steve From gtoal@admin.vt.com Thu Mar 13 11:24:17 1997 Received: from MIT.EDU (SOUTH-STATION-ANNEX.MIT.EDU [18.72.1.2]) by gtoal.com (8.6.12/8.6.12) with SMTP id LAA15205 for ; Thu, 13 Mar 1997 11:24:17 -0600 Received: from admin.vt.com by MIT.EDU with SMTP id AA06161; Thu, 13 Mar 97 12:36:33 EST Received: (from root@localhost) by admin.vt.com (8.7.3/8.7.3) id LAA08800 for word-game-programmers@mit.edu; Thu, 13 Mar 1997 11:36:28 -0600 (CST) Date: Thu, 13 Mar 1997 11:36:28 -0600 (CST) From: Graham Toal Message-Id: <199703131736.LAA08800@admin.vt.com> To: word-game-programmers@MIT.EDU Subject: Re: Dawg Status: RO > I did spend a while building GADDAGs (a data structure due to Steven > Gordon) and there the most time-consuming bit was sorting the modified > word list. (A GADDAG is a DAG containing, not words as such, but all I may be able to help there. I rewrote GNU Sort much more efficiently so that it used a large continuous array of memory instead of piecemeal malloc allocations. If you are sorting a file larger than the available ram (less of a problem nowadays but a major problem when I wrote the code) it sorts as large chunks as it can in ram then does a balanced merge on all the chunks. The source is at http://www.gtoal.com/wordgames/sort/ The one thing you cannot afford in a sort program is to stray outside your working set, or the program is slowed down by factors of a thousand as soon as you start swapping head movement for direct ram access. Unlike some of my programs, this one is extremely well commented (to excess) so should not be at all daunting to hack on. > patterns rev(X)*Y, where XY is a word and X is non-empty, or some > such. The details can probably be dragged out of long-term storage > with sufficient incentive. A GADDAG is about 3-4 times as big as a > DAWG with the same lexicon, and can be used to reduce the relatively > aimless searching in the spaces to the left of (or above) a hook > square. Gordon suggests that this approximately doubles the speed of > a Scrabble engine.) I generalised this structure to a keyed trie database lookup, by writing code which assumed that the contents of the DAWG were "lhs=rhs" ie it treats the "=" as a word terminator, and returns the right hand side after a successful match. The dawg building is identical (the same program, same options) to ordinary dawg building. G From gtoal@gtoal.com Fri Mar 14 12:15:10 1997 Received: from MIT.EDU (SOUTH-STATION-ANNEX.MIT.EDU [18.72.1.2]) by gtoal.com (8.6.12/8.6.12) with SMTP id MAA24743 for ; Fri, 14 Mar 1997 12:15:07 -0600 Received: from central.vt.com by MIT.EDU with SMTP id AA12120; Fri, 14 Mar 97 13:18:08 EST Received: from gtoal.com (root@gtoal.com [167.114.209.10]) by central.vt.com (8.7.3/8.7.3) with SMTP id MAA14308 for ; Fri, 14 Mar 1997 12:26:26 -0600 (CST) Received: (from root@localhost) by gtoal.com (8.6.12/8.6.12) id LAA23640 for word-game-programmers@mit.edu; Fri, 14 Mar 1997 11:49:02 -0600 Date: Fri, 14 Mar 1997 11:49:02 -0600 From: Graham Toal Message-Id: <199703141749.LAA23640@gtoal.com> To: word-game-programmers@MIT.EDU Subject: Re: Dawg Status: RO : Gordon) and there the most time-consuming bit was sorting the modified : word list. (A GADDAG is a DAG containing, not words as such, but all : patterns rev(X)*Y, where XY is a word and X is non-empty, or some : such. The details can probably be dragged out of long-term storage : with sufficient incentive. A GADDAG is about 3-4 times as big as a : DAWG with the same lexicon, and can be used to reduce the relatively : aimless searching in the spaces to the left of (or above) a hook : square. Gordon suggests that this approximately doubles the speed of : a Scrabble engine.) Steven mailed me his algorithm last night, and although I'm having a little trouble understanding his description, it sounds very similar to an algorithm I came up with myself some years ago but didn't implement at the time due to lack of RAM: Variation 1) generate a KWIC-style rotation of the words in it at each position in the word. For example: word: trailers build dawgs for: A1: trailers= A2: railers=t A3: ailers=tr A4: ilers=tra A5: lers=trai A6: ers=trail A7: rs=traile A8: s=trailer Now, if you have a slot on the board of the form ....L.R. then you can use the list anchored on the 5th letter (A5 above), match the remaining squares with "lers" (from "lers=trai", hit the end of the word, and then use the rest of what follows the '=' ("=trai") to check against the available tiles. (You could use a pointer to a 'bag' of tiles here (the data structure used in Ars Magna) rather than the "=trai" text itself if you wanted, which could improve checking speed for the remainder of the word, since it has no adjacent hook tiles (by definition, since the first letter position examined was the first hook or anchor square) Now, here's Variation 2: Using the same data structure, but reversing the text after the '=', we could start at position 6 (a suitable location near to and between some anchor tiles) so that it can prune both forwards and backwards against the check tiles, R and L. A1: trailers= A2: railers=t A3: ailers=rt A4: ilers=art A5: lers=iart A6: ers=liart A7: rs=elairt A8: s=relairt ie we use dawg A6 and we match "ers=" then go backwards on "=liart" against "L...." A little experimentation will tell whether its better to start between tiles or on a tile. Finally (Variation 3), do the same for the reversed words: A1: srelairt= A2: reliart=s A3: eliart=rs A4: liart=ers A5: iart=lers A6: art=ilers A7: rt=ailers A8: t=railers I think this variation is the GADDAG that Steven Gordon has written up in his paper. He puts all the dawgs into one big dawg, which is probably the right way to implement it; I just split them up to clarify my thinking on the concept. Then choose whether to use the forward or reverse dawg according to how many spaces to the left or right of the anchor square are open. In fact, if we're going to throw hardware at the problem, it is a program begging for a parallel decomposition. Simply throw jobs at processors in a round-robin fashion, starting with the jobs that expect to take longest to get a simple load-balancing. Graham Note: since the special data format is just a textual manipulation of strings, you can use any old dawg building program (like the one I posted) to build these structures; you don't need anything fancy, as long as your dawg structure has room for the equivalent of the '=' character above and isn't just limited to alpha. From MAGORDON@ECUVAX.CIS.ECU.EDU Fri Mar 14 19:07:57 1997 Received: from ECUVAX.CIS.ECU.EDU (ecuvax.cis.ecu.edu [150.216.1.239]) by gtoal.com (8.6.12/8.6.12) with SMTP id TAA09118 for ; Fri, 14 Mar 1997 19:07:51 -0600 From: MAGORDON@ECUVAX.CIS.ECU.EDU Date: Fri, 14 Mar 1997 20:37:04 -0500 (EST) To: gtoal@gtoal.com Message-Id: <970314203704.de6c@ECUVAX.CIS.ECU.EDU> Subject: Re: Dawg Status: RO Graham, You appear to understand the concept correctly. The rest is just implementation details. The fact that it minimizes so nicely was a bit of a surprise to me. Putting it in one DAWG does slightly increase the number of states that can be merged. Thanks for taking the time to validate it. Steven Gordon From gtoal@archive.vt.com Wed May 26 01:00:24 1999 Received: from archive.vt.com (archive.vt.com [204.117.188.217]) by gtoal.com (8.8.5/8.8.5) with ESMTP id BAA05633 for ; Wed, 26 May 1999 01:00:18 -0500 (CDT) Received: (from root@localhost) by archive.vt.com (8.8.5/8.8.5) id WAA16998 for gtoal@gtoal.com; Tue, 25 May 1999 22:56:17 -0500 (CDT) Date: Tue, 25 May 1999 22:56:17 -0500 (CDT) From: Graham Toal Message-Id: <199905260356.WAA16998@archive.vt.com> To: gtoal@gtoal.com Subject: gaddag Status: R >From MAGORDON@ECUVAX.CIS.ECU.EDU Thu Mar 13 18:10:06 1997 Received: from ECUVAX.CIS.ECU.EDU (ecuvax.cis.ecu.edu [150.216.1.239]) by admin.vt.com (8.7.3/8.7.3) with SMTP id SAA14096 for ; Thu, 13 Mar 1997 18:10:01 -0600 (CST) From: MAGORDON@ECUVAX.CIS.ECU.EDU Date: Thu, 13 Mar 1997 19:09:49 -0500 (EST) To: gtoal@vt.com Message-Id: <970313190949.8661@ECUVAX.CIS.ECU.EDU> Subject: Re: Dawg Status: RO Graham, A DAWG for lhs=rhs would not have the advantages of my GADDAG (i.e., a DAWG for rev(lhs)#rhs. I describe the advantages in great detail in my Software Practice and Experience article that I emailed a summary of to you a few weeks ago (please, tell me if that email never got to you). The main advantage is that non-determinism is reduced significantly by working both directions from anchor squares, more than doubling the speed of move generation. Of course, it would have no utility for spell-checking. Preallocating all the memory needed for the entire array at the beginning of the program is so obvious, I never thought to mention it here or in my paper. Also, one of the nice things about the bit map that I use to represent Letter Sets (all the letters that complete an acceptable word from a given node) in the GADDAG is that for English the number of distinct Letter Sets is < 2048 for every lexicon I have encountered. This makes it easy to compress them. I think we are all interested in your old work, but please look at the accomplishments of others before writing up your ideas as if you already solved all the issues optimally years ago. Steven Gordon >From MAGORDON@ECUVAX.CIS.ECU.EDU Thu Mar 13 22:38:46 1997 Received: from ECUVAX.CIS.ECU.EDU (ecuvax.cis.ecu.edu [150.216.1.239]) by admin.vt.com (8.7.3/8.7.3) with SMTP id WAA17436 for ; Thu, 13 Mar 1997 22:38:38 -0600 (CST) From: MAGORDON@ECUVAX.CIS.ECU.EDU Date: Thu, 13 Mar 1997 23:38:28 -0500 (EST) To: gtoal@vt.com Message-Id: <970313233828.92a3@ECUVAX.CIS.ECU.EDU> Subject: Re: Dawg Status: RO Graham, Sorry, perhaps I only sent this to the person whose inquiry about dawgs appeared at the same time that you started participating, although I thought I had sent it to both of you. I am sending you the following recycled summary of my Scrabble(R) background and work. I can only send you summaries rather than verbatim copies of my papers. The move generation paper in SP&E should be available in almost any university library. The second paper is much more speculative, and only available from the AAAI. Let me warn you that my algorithm trades significant memory space for time speed up. This is mostly of theretical interest, since in practice Apell & Jacobsen's algorithm is very fast. The only time the added speed is of practical use is in doing simulations (which the second paper explores as a method of improving play). I got my algorithm to work on VAXes, SUNs, and MACs, but found the memory limitations of PCs to be a problem. BTW, if you are looking to compare a program to a commercial product, Maven (available from Brian Sheppard - sheppardco@aol.com) is in my opinion the best program available today. There is also a couple of programs running on the crossword game servers mentioned in the FAQ. I have written two articles on algorithms for the game: 1. A Faster Scrabble Move Generation Algorithm, which has been accepted for publication by Software Practice and Experience, presents an improvement in the generation of every possible move. 2. A Comparison Between Probabilistic Search And Weighted Heuristics in a Game with Incomplete Information, which was presented at the AAAI Fall 1993 Symposium on Games: Playing and Learning, compares two approaches for picking the "best" move. The procedings of this symposium is available from the AAAI, whose address is not handy at the moment. Those of you with a general interest in game-playing algorithms should find this entire publication of great interest. I include higlights below: Appel and Jacobson ['The world's fastest Scrabble program', Commun. ACM, 31,(5), (May 1988)] presented a fast algorithm for generating every possible move using a directed acyclic word graph (DAWG). Words can only be generated from left to right. The DAWG algorithm builds every prefix shorter than a context- dependent length that can be composed from the given rack and is the prefix of at least one word in the lexicon. It then extends each such prefix into complete words as constrained by the board and the remaining tiles in the rack. Many prefixes are generated that have no chance of being completed, because the prefix cannot be completed with any of the remaining tiles in the rack, the prefix cannot be completed with the letter(s) on the board that the play must go through, or the only hookable letters were already consumed in building the prefix. They suggest eliminating this non-determinism with a "two-way" DAWG. A practical variation is the DAWG for the language L = {REV(x)*y | xy is a word and x is not empty}, where * is just a delimiter. Each word in the lexicon can be generated starting from each letter in that word by placing tiles leftward upon the board starting at an anchor square while traversing the corresponding arcs in the structure until encountering the *, and then placing tiles rightward from square to the right of the anchor square while still traversing correspond- ing arcs until acceptance. Being the reverse of the directed acyclic graph for prefixes followed by the directed acyclic graph for suffixes, I called it a GADDAG. A backtracking, depth-first search for every possible path through the GADDAG given the rack of tiles and board constraints generates every legal move. Reversing the prefixes allows them to be played just like suffixes, one tile at a time, moving away from anchor squares. The location of each tile in the prefix is known, so board constraints can be considered, eliminating unworkable prefixes as soon as possible. In extensive tests using a lexicon of every 2 through 8 letter word in the OSPD, the GADDAG algorithm generated moves more that twice as fast as the DAWG algorithm, although the structure required 5 times as much memory. Below are the basic algorithms for generating plays with a GADDAG and for building the structure in a "semi-minimized form": Gen(pos, word, rack, arc): {pos = offset from anchor square} IF a letter, L, is already on this square THEN GoOn(pos, L, word, rack, NextArc(arc, L), arc) ELSE IF letters remain on the rack THEN FOR each letter on the rack, L, allowed on this square GoOn(pos, L, word, rack - L, NextArc(arc, L), arc) IF the rack contains a BLANK THEN FOR each letter the BLANK could be, L, allowed on this square GoOn(pos, L, word, rack - BLANK, NextArc(arc, L), arc) GoOn(pos, L, word, rack, NewArc, OldArc): IF pos <= 0 THEN {moving left:} word <-- L || word IF L on OldArc & no letter directly left THEN RecordPlay IF NewArc <-- 0 THEN IF room to the left THEN Gen(pos-1, word, rack, NewArc) NewArc <-- NextArc(NewArc, *) {shift direction:} IF NewArc <> 0, no letter directly left & room to the right THEN Gen(1, word, rack, NewArc) ELSE IF pos > 0 THEN {moving right:} word <-- word || L IF L on OldArc & no letter directly right THEN RecordPlay IF NewArc <> 0 & room to the right THEN Gen(pos+1, word, rack, NewArc) The GADDAG Move Generation Algorithm. ______________________________________________________________________ FOR each word a1a2...an in the lexicon: st <-- initialState {create path for anan-1...a1:} FOR i FROM n DOWNTO 3 AddArc(st, ai) AddFinalArc(st, a2, a1) st <-- initialState {create path for an-1*a1...an:} FOR i FROM n-1 DOWNTO 1 AddArc(st, ai) AddFinalArc(st, *, an) FOR m FROM n-2 DOWNTO 1 {partially minimize the remaining paths:} forceSt <-- st st <-- initialState FOR i FROM m DOWNTO 1 AddArc(st, ai) AddArc(st, *) ForceArc(st, am+1, forceSt) AddArc(st, ch) adds an arc from st for ch (if one does not already exist) and resets st to the node this arc leads to. AddFinalArc(st, c1, c2) adds an arc from st for c1 (if one does not already exist) and adds c2 to the letter set on this arc. ForceArc(st, ch, fst) adds an arc from st for ch to fst (an error occurs if an arc from st for ch already exists going to any other state). The GADDAG Construction Algorithm. ________________________________________________________ A program that just plays the highest scoring legal move (i.e., the greedy evaluation function) has enough of a lexical advantage to beat most people, but would not fare well against good tournament players. The greedy evaluation function frequently chooses a move that uses a BLANK or S, when a more effective strategy would be to play a slightly lower scoring alternative that saves the valuable tile for a future move. A related defect is making a move that retains a Q or J when a more effective strategy would be to play a slightly lower scoring alternative that gets rid of the hard to use tile. A heuristic that addresses both defects is to evaluate a play by adding to its score an estimate of the future utility of the tiles left in the rack, assigning positive values to tiles like BLANK or S and negative values to tiles like Q or J. Ballard ['Renaissance of Scrabble theory 2', Games Medleys, 17, 4:7 (May 1992)] presents the values listed as Heuristic1 as a rule of thumb for human players. The sum of these values for the letters in a rack estimates its future utility. This also provides a heuristic for trading tiles (with loss of turn): trade in a subset of your tiles if the resulting rack leave has a higher utility than the sum of the score and rack evaluation of any other play. Duplicate tiles reduce the number of unique combinations of tiles in a rack as well as the probability of playing all seven letters, and should therefore reduce the estimated utility of a rack. The contribution of each letter in a rack would be evaluated more accurately by weighing both the utility of the letter by itself and the decrease in utility if the letter is duplicated (Heuristic2). Instead of encoding separate penalties for triplication, quadruplication, etc., the duplication penalty is simply reapplied for each additional duplication. For instance, Heuristic1 values the rack IIISS at -1.5 + -1.5 + -1.5 + 7.5 + 7.5 = +10.5, whereas Heuristic2 values this rack at -0.5 + (-0.5 - 4.0) + (-0.5 - 4.0 - 4.0) + 7.5 + (7.5 - 4.0) = -2.5. Heuristic3 includes an additional factor accounting for vowel-consonant mix. Letter Heuristic1 Heuristic2 Letter Heuristic1 Heuristic2 A +0.5 +1.0 -3.0 B -3.5 -3.5 -3.0 C -0.5 -0.5 -3.5 D -1.0 0.0 -2.5 E +4.0 +4.0 -2.5 F -3.0 -2.0 -2.0 G -3.5 -2.0 -2.5 H +0.5 +0.5 -3.5 I -1.5 -0.5 -4.0 J -2.5 -3.0 K -1.5 -2.5 L -1.5 -1.0 -2.0 M -0.5 -1.0 -2.0 N 0.0 +0.5 -2.5 O -2.5 -1.5 -3.5 P -1.5 -1.5 -2.5 Q -11.5 -11.5 R +1.0 +1.5 -3.5 S +7.5 +7.5 -4.0 T -1.0 0.0 -2.5 U -4.5 -3.0 -3.0 V -6.5 -5.5 -3.5 W -4.0 -4.0 -4.5 X +3.5 +3.5 Y -2.5 -2.0 -4.5 Z +3.0 +2.0 BLANK +24.5 +24.5 -15.0 Weights for Rack Evaluation Heuristic1 and Heuristic2. Average Winning Sacrifices Secs/ Scores % Moves Trades Move Greedy vs. Greedy 388.8 388.8 50.0 253,563 0 0 0.489 RackH1 vs. Greedy 402.8 396.4 53.4 257,278 43,617 179 0.606 RackH2 vs. Greedy 411.9 388.6 59.5 251,743 41,834 207 0.624 RackH3 vs. Greedy 417.0 386.7 63.4 251,975 45,792 254 0.653 Performance of Rack Heuristics in 10,000 Random Games on a VAX4300. The above heuristics are effective because the weights have been set to those that optimized performance over the course of a large number of games. They cannot be sensitive to differences in positions, so given the same rack and choice of moves, it will pick the same move no matter the score, whether the opponent is likely to have an S or not, which letters remain to be drawn, and the quality of the openings already on the board. Although such a strategy wins many games, it would intermittently make obvious errors in judgment. In other words, it would play like one of the strong commercial programs currently available: effectively, but not intelligently. One way to try to circumvent the limitations of weighted heuristics would be to simulate what might happen after each reasonable alternative. Moves that might be appropriate in most positions, but would be a mistake in the given position, should perform worse than a more appropriate move over a sufficiently large random sample of scenarios. Unfortunately, tests results were very inconsitent, due to the computational inefficiency of probabilistic search, limiting the number of candidates and scenarios that can be simulated within practical time constraints This undermines not only its potential effectiveness as a Scrabble strategy, but also the investigation into its potential effectiveness. Further investigation into the conditions under which probabilistic search is effective and how many scenarios are required for reliability could yield a practical way to increase the intelligence of the current effective, but unintelligent Scrabble strategies base on weighted heuristics. Steven Gordon >From MAGORDON@ECUVAX.CIS.ECU.EDU Thu Mar 13 22:45:07 1997 Received: from ECUVAX.CIS.ECU.EDU (ecuvax.cis.ecu.edu [150.216.1.239]) by admin.vt.com (8.7.3/8.7.3) with SMTP id WAA17537 for ; Thu, 13 Mar 1997 22:45:03 -0600 (CST) From: MAGORDON@ECUVAX.CIS.ECU.EDU Date: Thu, 13 Mar 1997 23:44:56 -0500 (EST) To: gtoal@vt.com Message-Id: <970313234456.92a3@ECUVAX.CIS.ECU.EDU> Subject: Re: Dawg Status: RO G, It also occurs to me that for all purposes I can imagine, the part after the = is just data, and should not be part of the dawg itself, but just stored at the nodes (or more likely pointed to by the terminal nodes). I am sure this is what you mean. Steven Gordon