A CROSSWORD PUZZLE GENERATOR FOR TURKISH Ilan Berker and A. C. Cem Say Computer Engineering Department Bogazici University Bebek 80815, Istanbul, Turkey berker@trboun.bitnet, say@trboun.bitnet Abstract: The generation of crossword puzzles forms a very interesting search problem with backtracking and constraint satisfaction for computers. After considering many different approaches we formulated an algorithm based on one of the approaches and then implemented it to generate puzzles of various sizes, containing the least possible number of black squares in the language of the dictionary provided, which is Turkish in our case. The choice of data structure for the dictionary together with the search and backtracking strategies are critical for the performance of the program. 1. INTRODUCTION On December 21, 1913, the first crossword puzzle was published in the Sunday supplement of the New York World by Arthur Wayne. Although many variations have been developed, the traditional crossword puzzle is still the most popular kind. Crossword puzzle generation is generally viewed as a task requiring intelligence. We developed a puzzle generator, which is capable of producing ten by ten puzzles which are similar to those published in most Turkish newspapers, only a little harder since the dictionary used is very large, containing about 20,000 entries. In the following sections, a description of the problem, possible approaches to solve it, and our approach with the details of the algorithm are presented. 2. PROBLEM STATEMENT The aim of the project is to write a computer program to generate puzzles of various sizes, containing the least possible number of black squares, with no word appearing more than once. The puzzles generated contain the grid of words and not the definitions of the words, since they are not available on magnetic media. The puzzles are to be in Turkish since a Turkish dictionary is available on magnetic media in the Computer Engineering Department. Linguistic properties of the Turkish language are not used in the project. The problem of filling a grid with meaningful words is a search problem with backtracking. The search space is extremely large. Each time a word is to be placed on the grid a decision has to be made to select between approximately 20,000 words which is the size of our dictionary, so the branching factor is very large at each decision step. There are also many constraints involved when placing words. This exponential explosion of branching and the imposed restrictions make crossword puzzle generation an artificial intelligence problem. 3. POSSIBLE SOLUTIONS 3.1. Random Placing In this first consideration, we thought of placing the words on the grid in a random fashion. There are many problems with such an approach. Firstly, when words are placed randomly, the restrictions that are formed are also random. This necessitates the dictionary to be indexed according to every letter. The second problem arises in the backtracking algorithm. If chronological backtracking is used, the word placed most recently is changed first, but it may not be imposing any restrictions on the word currently being placed. 3.2. Divide and Conquer The idea is to fill the middle row and the middle column first. This will divide the grid into four smaller portions and the problem at hand reduces significantly. Then the same method may be used to fill these smaller puzzles. Though seemingly promising, this approach was then rejected because the resulting four smaller problems are not independent. 3.3. Circular Fill The idea here is to fill the grid circularly in a clockwise or counterclockwise manner beginning with the outer edges of the grid and moving towards the center. This approach is nice in that the edges of the grid are filled first and therefore will contain the longest words, but there are problems with this approach too. As the grid is filled, the words imposing the restrictions on a word to be placed are words that have been placed not successively but with three other edges filled in between. Also, some restrictions are formed on the last letters of the words to be placed. This necessitates the use of a dictionary sorted with the last letters being the most significant. 3.4. Row-Column Fill The different approaches discussed above were all rejected for various reasons, and the approach to be discussed shortly was chosen to be used in the program. The verification of our decision is based mainly on the fact that experienced crossword puzzle creators that we have talked to, and whose puzzles are published everyday in various newspapers and magazines use methods very similar to this one. With the row-column fill algorithm, the words are placed to fill rows and columns in an alternating sequence, row one, column one, row two, column two, and so on. In this way only the beginning letters of a word to be placed are defined by previously placed words. This enables us to use an alphabetically sorted dictionary to search for the required words. 4. THE ALGORITHM 4.1. Pseudo-Code A brief outline of the algorithm to fill the grid is given here with the detailed explanations in the following sections. In the pseudo-code, line refers to a row or a column. while (puzzle not full) do if unable to place previous word then begin pop the most recently placed word from stack; if unable to place the first word of a line then delete previous line, make that the current line else delete current line; fill current line using next possible word to start it end else fill next line; {next line is perpendicular to the previous one} 4.2. Filling the Grid The row-column fill approach is used to fill the grid. Words are placed on the line to be filled, a row or a column, separated by black squares. The first word to be placed on the line has its beginning letters fixed, the following words do not have such a restriction. To place the words the dictionary is searched, starting from the longest words that may fit to that line according to the number of blank squares left. In order to decided if it is reasonable to place that word on the line, a function checks to see if a word exists in our dictionary beginning with the newly imposed restrictions. Here we do not have a continuous value of goodness for the placing of a word but we have an accept or reject answer from the function. 4.3. Data Structures The program relies heavily on fast access to the dictionary. Since using file access would reduce performance considerably, it was preferable to read the data to main memory. The structure for the dictionary is a collection of balanced search trees, one tree for words of each length. In these trees each node contains one data field to hold the word as a string, one boolean field to show if this word has been used in the puzzle, and two pointers to two sub trees. The tree structure enables the search to be conducted in logarithmic time complexity. The data structure used to represent the grid is a two dimensional array of characters. 4.4. Backtracking When no word fits into a position a black square is tried but even that may not fit since placing a black square usually forms one or two words, one vertical and one horizontal. Then we need to backtrack to a previous state. Backtracking is always done by deleting a whole row or column. So if a word cannot be placed, we try to place instead of the first word of the line, the next word that matches the restrictions. If the word that could not be placed was the first word of a line, then we backtrack to the beginning of the previous line. The reason for using such an approach instead of strict chronological backtracking is to make sure that the newly placed word has an effect on the word we were trying to place as the first word of the previous line always imposes restrictions on the first word of the current line. Since we backtrack to beginnings of lines, we only need to keep track of the first words placed on the lines on a stack. 4.5. The Search There are two different types of search that must be conducted on the data. The first is a complete match, where we look to see if the words formed by placing a black square are actually in the dictionary. The second is when a string is popped from the stack in backtracking and the string immediately following it is needed to be placed on the grid. When the first word of a line is to be placed and a prefix is fixed, again the string immediately following the prefix is required since any string beginning with that prefix follows that prefix in alphabetical order. 5. CONCLUSION Our program produces puzzles of high quality, containing an average of 17 black squares on a ten by ten puzzle, similar to those published in newspapers. An example of the generated puzzles is given in Figure 1. The only problem is that it is hard to estimate the time it will take the computer to generate a puzzle given the first word to start it. On the VAX 4000 - 200 / VMS system where the program was implemented, the time for the generation of one ten by ten puzzle varied from a few minutes to around 40 hours, with 15 percent of the puzzles being generated in less than one hour. J I N E K O L O J I E K O D E M E S I Z O R T A C A G # G A B A R # E C E # O F O H # A # A N E L E T # A M A # # B O T A B D E S T # C # # N A A T # A B E C E I B L A G # I D A M K A I L # B R # M # Figure 1. A ten by ten crossword puzzle generated by our program On smaller puzzles the performance was excellent. One hundred different five by five puzzles are generated in less than ten minutes, and 15 seven by seven puzzles are generated in about an hour. These results look very promising for producing crossword puzzles on computers. Further work is required on more efficient search and backtracking strategies for our application in order to be able to generate larger puzzles like 20 by 20. Many variations of the generated puzzles may be achieved by separating the dictionary into groups according to the meanings of the words. This would allow puzzles to be generated in certain subjects. The addition of the names of some celebrities to the dictionary would generate puzzles that would interest many magazines. In order to be able to generate non-rectangular puzzles, more work has to be done so that the program allows initial restrictions on the shape of the puzzle to be specified by the user. REFERENCES 1. Berker, I. "Crossword Puzzle Generation", unpublished project report, Bogazici University, 1993