PREPROCESSING


OVERVIEW

In our preprocessing stage, we screen the dictionary so that the search never wastes time on words that cannot be played.  All invalid words are removed from the dictionary.

Formal definition: An invalid word is a word that cannot be legally played in the game of Upwords.

Invalid words can be organized into six categories:

  1. Words whose length is one
  2. Words whose length is greater than eight
  3. Words which have a ‘q’ followed by a letter which is not ‘u’
  4. Words which end in ‘q’
  5. Words which contain characters other than letters of the alphabet
  6. Words which need more tiles of some letter than are available
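
The six categories can be sketched as a single screening function.  This is a minimal illustration in Python, not the code used in this project.  The function name, the order in which the categories are tested, and the tile_counts table are all assumptions.  In particular, the number of tiles per letter is not listed here, so the caller must supply it, and the sketch counts every letter separately.

```python
import re
from collections import Counter

MAX_LENGTH = 8  # category 2: longer words are rejected


def invalid_category(word, tile_counts):
    """Return the first category (1-6) that rejects word, or None if it is legal.

    tile_counts maps each lowercase letter to the number of tiles assumed
    to be available; it is a hypothetical input, not the real tile set.
    """
    w = word.lower()
    if len(w) == 1:
        return 1
    if len(w) > MAX_LENGTH:
        return 2
    if re.search(r"q[a-tv-z]", w):      # 'q' followed by a letter other than 'u'
        return 3
    if w.endswith("q"):
        return 4
    if not re.fullmatch(r"[a-z]+", w):  # anything besides the 26 letters
        return 5
    for letter, needed in Counter(w).items():
        if needed > tile_counts.get(letter, 0):
            return 6
    return None
```

With a made-up table of two tiles per letter, invalid_category("eerie", tiles) returns 6 because the word needs three E tiles, while invalid_category("queen", tiles) returns None.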


ANALYSIS

The UNIX dictionary contains 25133 words, and our screening found that 7836 of them were illegal.  The breakdown by category is:
 
 
Category   Number of words   Percent of dictionary
1          25                0.0994
2          7189              28.604
3 and 4    9                 0.0358
5          111               0.4417
6          502               1.9974
Total      7836              31.178

If we assume that the UNIX dictionary is representative of a typical dictionary, we would expect approximately 30% of the words in such a dictionary to be illegal.

Each legal word is written to the file “upwords.dict”, which is our preprocessed dictionary.  Preprocessing also converts every letter to uppercase.
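
The filtering and uppercasing pass could look like the sketch below.  It assumes the source dictionary holds one word per line.  The names preprocess and is_legal are hypothetical, and the legality test is passed in as a function rather than fixed here.

```python
def preprocess(source_path, output_path, is_legal):
    """Copy every legal word from source_path to output_path in uppercase.

    is_legal is any function that takes a word and returns True when the
    word may be kept.  Returns the number of words written.
    """
    kept = 0
    with open(source_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in src:
            word = line.strip()
            if word and is_legal(word):   # skip blank lines and illegal words
                out.write(word.upper() + "\n")
                kept += 1
    return kept
```

A call such as preprocess(path_to_unix_dictionary, "upwords.dict", is_legal) would then produce the preprocessed file, where path_to_unix_dictionary stands for wherever the system keeps its word list.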
 

COMMENTS ON ALTERNATIVE PREPROCESSING METHODS

Instead of only removing illegal words, we could have used more elaborate techniques in the preprocessing step.  One popular method is to create many separate files, each containing words which share some characteristic.  Storing the dictionary in a trie makes this unnecessary.  Looking up a word in a trie takes time proportional to the length of the word, regardless of how many words are stored, so cleverer ways of storing the dictionary would not be helpful.
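
To illustrate why trie lookups are cheap, here is a minimal trie in Python.  It is a generic sketch, not the trie used in this project, and the class and method names are assumptions.  Each lookup follows one child link per letter, so its cost depends only on the length of the word.

```python
class Trie:
    """A node in a trie; the root node represents the empty prefix."""

    def __init__(self):
        self.children = {}    # letter -> child Trie node
        self.is_word = False  # True if the path to this node spells a word

    def insert(self, word):
        node = self
        for letter in word:
            node = node.children.setdefault(letter, Trie())
        node.is_word = True

    def _walk(self, text):
        """Follow text letter by letter; return the final node or None."""
        node = self
        for letter in text:
            node = node.children.get(letter)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None
```

After inserting "CAT" and "CART", contains("CAT") is true, contains("CA") is false, and has_prefix("CA") is true.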