A CROSSWORD PUZZLE GENERATOR FOR TURKISH

                    Ilan Berker and A. C. Cem Say

                   Computer Engineering Department
                        Bogazici University
                   Bebek 80815, Istanbul, Turkey
               berker@trboun.bitnet, say@trboun.bitnet


Abstract: The generation of crossword puzzles forms a very interesting search 
problem with backtracking and constraint satisfaction for computers. After 
considering many different approaches we formulated an algorithm based on one 
of the approaches and then implemented it to generate puzzles of various sizes,
containing the least possible number of black squares in the language of the 
dictionary provided, which is Turkish in our case. The choice of data 
structure for the dictionary together with the search and backtracking 
strategies are critical for the performance of the program.


1.  INTRODUCTION

On December 21, 1913, the first crossword puzzle was published in the Sunday 
supplement of the New York World by Arthur Wayne. Although many variations 
have been developed, the traditional crossword puzzle is still the most popular
kind. Crossword puzzle generation is generally viewed as a task requiring 
intelligence. We developed a puzzle generator, which is capable of producing 
ten by ten puzzles which are similar to those published in most Turkish 
newspapers, only a little harder since the dictionary used is very large, 
containing about 20,000 entries. In the following sections, a description of 
the problem, possible approaches to solve it, and our approach with the details
of the algorithm are presented.


2.  PROBLEM STATEMENT

The aim of the project is to write a computer program to generate puzzles of 
various sizes, containing the least possible number of black squares, with no 
word appearing more than once. The puzzles generated contain the grid of words
and not the definitions of the words, since they are not available on magnetic 
media. The puzzles are to be in Turkish since a Turkish dictionary is available
on magnetic media in the Computer Engineering Department. Linguistic properties
of the Turkish language are not used in the project.

The problem of filling a grid with meaningful words is a search problem with 
backtracking. The search space is extremely large. Each time a word is to be 
placed on the grid a decision has to be made to select between approximately 
20,000 words which is the size of our dictionary, so the branching factor is 
very large at each decision step. There are also many constraints involved 
when placing words. This exponential explosion of branching and the imposed 
restrictions make crossword puzzle generation an artificial intelligence 
problem.


3.  POSSIBLE SOLUTIONS

3.1.  Random Placing

In this first consideration, we thought of placing the words on the grid in a 
random fashion. There are many problems with such an approach. Firstly, when 
words are placed randomly, the restrictions that are formed are also random. 
This necessitates the dictionary to be indexed according to every letter. The 
second problem arises in the backtracking algorithm. If chronological 
backtracking is used, the word placed most recently is changed first, but it 
may not be imposing any restrictions on the word currently being placed.


3.2.  Divide and Conquer

The idea is to fill the middle row and the middle column first. This will 
divide the grid into four smaller portions and the problem at hand reduces 
significantly. Then the same method may be used to fill these smaller puzzles.
Though seemingly promising, this approach was then rejected because the 
resulting four smaller problems are not independent.


3.3.  Circular Fill

The idea here is to fill the grid circularly in a clockwise or counterclockwise
manner beginning with the outer edges of the grid and moving towards the 
center. This approach is nice in that the edges of the grid are filled first 
and therefore will contain the longest words, but there are problems with this 
approach too. As the grid is filled, the words imposing the restrictions on a 
word to be placed are words that have been placed not successively but with 
three other edges filled in between. Also, some restrictions are formed on the 
last letters of the words to be placed. This necessitates the use of a 
dictionary sorted with the last letters being the most significant.


3.4.  Row-Column Fill

The different approaches discussed above were all rejected for various reasons,
and the approach to be discussed shortly was chosen to be used in the program. 
The verification of our decision is based mainly on the fact that experienced 
crossword puzzle creators that we have talked to, and whose puzzles are 
published everyday in various newspapers and magazines use methods very 
similar to this one.

With the row-column fill algorithm, the words are placed to fill rows and 
columns in an alternating sequence, row one, column one, row two, column two, 
and so on. In this way only the beginning letters of a word to be placed are 
defined by previously placed words. This enables us to use an alphabetically 
sorted dictionary to search for the required words.


4.  THE ALGORITHM

4.1.  Pseudo-Code

A brief outline of the algorithm to fill the grid is given here with the 
detailed explanations in the following sections. In the pseudo-code, line 
refers to a row or a column.

while (puzzle not full) do
	if unable to place previous word then
		begin
		    pop the most recently placed word from stack;
		    if unable to place the first word of a line
			then delete previous line, make that the current line
		    else delete current line;
		    fill current line using next possible word to start it
		end
	else fill next line; {next line is perpendicular to the previous one}


4.2.  Filling the Grid

The row-column fill approach is used to fill the grid. Words are placed on the 
line to be filled, a row or a column, separated by black squares. The first 
word to be placed on the line has its beginning letters fixed, the following 
words do not have such a restriction. To place the words the dictionary is 
searched, starting from the longest words that may fit to that line according 
to the number of blank squares left.

In order to decided if it is reasonable to place that word on the line, a 
function checks to see if a word exists in our dictionary beginning with the 
newly imposed restrictions. Here we do not have a continuous value of goodness 
for the placing of a word but we have an accept or reject answer from the 
function.


4.3.  Data Structures

The program relies heavily on fast access to the dictionary. Since using file 
access would reduce performance considerably, it was preferable to read the 
data to main memory. The structure for the dictionary is a collection of 
balanced search trees, one tree for words of each length. In these trees each 
node contains one data field to hold the word as a string, one boolean field to
show if this word has been used in the puzzle, and two pointers to two sub 
trees. The tree structure enables the search to be conducted in logarithmic 
time complexity. The data structure used to represent the grid is a two 
dimensional array of characters.


4.4.  Backtracking

When no word fits into a position a black square is tried but even that may 
not fit since placing a black square usually forms one or two words, one 
vertical and one horizontal. Then we need to backtrack to a previous state.

Backtracking is always done by deleting a whole row or column. So if a word 
cannot be placed, we try to place instead of the first word of the line, the 
next word that matches the restrictions. If the word that could not be placed 
was the first word of a line, then we backtrack to the beginning of the 
previous line. The reason for using such an approach instead of strict 
chronological backtracking is to make sure that the newly placed word has an 
effect on the word we were trying to place as the first word of the previous 
line always imposes restrictions on the first word of the current line.

Since we backtrack to beginnings of lines, we only need to keep track of the 
first words placed on the lines on a stack.


4.5.  The Search

There are two different types of search that must be conducted on the data. 
The first is a complete match, where we look to see if the words formed by 
placing a black square are actually in the dictionary. The second is when a 
string is popped from the stack in backtracking and the string immediately 
following it is needed to be placed on the grid. When the first word of a line 
is to be placed and a prefix is fixed, again the string immediately following 
the prefix is required since any string beginning with that prefix follows that
prefix in alphabetical order.


5.  CONCLUSION

Our program produces puzzles of high quality, containing an average of 17 black
squares on a ten by ten puzzle, similar to those published in newspapers. An 
example of the generated puzzles is given in Figure 1. The only problem is that
it is hard to estimate the time it will take the computer to generate a puzzle 
given the first word to start it. On the VAX 4000 - 200 / VMS system where the 
program was implemented, the time for the generation of one ten by ten puzzle 
varied from a few minutes to around 40 hours, with 15 percent of the puzzles 
being generated in less than one hour.


                      J I N E K O L O J I
                      E K O D E M E S I Z
                      O R T A C A G # G A
                      B A R # E C E # O F
                      O H # A # A N E L E
                      T # A M A # # B O T
                      A B D E S T # C # #
                      N A A T # A B E C E
                      I B L A G # I D A M
                      K A I L # B R # M #

                          Figure 1. 
        A ten by ten crossword puzzle generated by our program


On smaller puzzles the performance was excellent. One hundred different five by
five puzzles are generated in less than ten minutes, and 15 seven by seven 
puzzles are generated in about an hour. These results look very promising for 
producing crossword puzzles on computers. Further work is required on more 
efficient search and backtracking strategies for our application in order to 
be able to generate larger puzzles like 20 by 20. Many variations of the 
generated puzzles may be achieved by separating the dictionary into groups 
according to the meanings of the words. This would allow puzzles to be 
generated in certain subjects. The addition of the names of some celebrities 
to the dictionary would generate puzzles that would interest many magazines. 
In order to be able to generate non-rectangular puzzles, more work has to be 
done so that the program allows initial restrictions on the shape of the 
puzzle to be specified by the user.


REFERENCES
1. Berker, I. "Crossword Puzzle Generation", unpublished project report, 
Bogazici University, 1993