OVERVIEW
In our preprocessing stage, we run a screening algorithm over the dictionary and remove every invalid word, so that the search never wastes time on words that cannot be played.
Formal definition: An invalid word is a word that cannot be legally played in the game of Upwords.
Invalid words can be organized into six categories:
ANALYSIS
When we removed the illegal words from the UNIX dictionary, which contains 25133 words, we found that 7836 of them were illegal. Here is the breakdown by category:
Category | Number of words | Percent of dictionary (%) |
1        | 25              | 0.0995                    |
2        | 7189            | 28.604                    |
3 and 4  | 9               | 0.0358                    |
5        | 111             | 0.4417                    |
6        | 502             | 1.9974                    |
Total    | 7836            | 31.178                    |
If we assume that the UNIX dictionary is representative of a typical dictionary, we would expect roughly 31% of the words in such a dictionary to be illegal.
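The percentages in the table can be rechecked from the raw counts. The following is a minimal sketch of that arithmetic; the counts and the dictionary size come from the table, while the names `DICTIONARY_SIZE`, `ILLEGAL_COUNTS`, and `percent_of_dictionary` are ours and purely illustrative.

```python
# Counts of illegal words per category, taken from the table above.
DICTIONARY_SIZE = 25133
ILLEGAL_COUNTS = {
    "1": 25,
    "2": 7189,
    "3 and 4": 9,
    "5": 111,
    "6": 502,
}


def percent_of_dictionary(count: int, total: int = DICTIONARY_SIZE) -> float:
    """Return count as a percentage of the whole dictionary."""
    return 100.0 * count / total


if __name__ == "__main__":
    total_illegal = sum(ILLEGAL_COUNTS.values())
    for category, count in ILLEGAL_COUNTS.items():
        print(f"{category:8} {count:5d} {percent_of_dictionary(count):8.4f}")
    print(f"{'Total':8} {total_illegal:5d} {percent_of_dictionary(total_illegal):8.3f}")
```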
Each legal word is written to the file “upwords.dict”, which is our preprocessed dictionary. Preprocessing also converts all characters to uppercase.
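The filtering step can be sketched roughly as follows. This is not our actual screening code: the legality test is passed in as a function, and `placeholder_is_legal` is only a stand-in (it accepts purely alphabetic words) that does not implement the six categories. The function names, and the choice to test a word before converting it to uppercase, are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator


def preprocess(words: Iterable[str], is_legal: Callable[[str], bool]) -> Iterator[str]:
    """Yield the legal words in uppercase, skipping blank lines."""
    for raw in words:
        word = raw.strip()
        # Assumption: legality is judged on the word as it appears in the
        # source dictionary, and uppercasing happens afterwards.
        if word and is_legal(word):
            yield word.upper()


def placeholder_is_legal(word: str) -> bool:
    """Hypothetical stand-in for the real test; NOT the six categories."""
    return word.isalpha()


def write_dictionary(src_path: str, dst_path: str,
                     is_legal: Callable[[str], bool] = placeholder_is_legal) -> int:
    """Filter src_path into dst_path, one word per line; return words kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for word in preprocess(src, is_legal):
            dst.write(word + "\n")
            kept += 1
    return kept
```

A call such as `write_dictionary("/usr/share/dict/words", "upwords.dict")` would produce the preprocessed file; the source path shown here is an assumption and varies between systems.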
COMMENTS ON ALTERNATIVE PREPROCESSING METHODS
Instead of just removing illegal words, we could have used more elaborate techniques in the preprocessing step. One popular method is to create many separate files, each containing words that share some characteristic. Our use of a trie makes this unnecessary. Looking up words stored in a trie is very efficient, so clever ways of storing the dictionary on disk would not be helpful.
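To illustrate why trie lookup is cheap, here is a minimal trie sketch. It is a generic textbook version, not our implementation; the class and method names (`Trie`, `insert`, `contains`, `has_prefix`) are illustrative. A lookup follows one link per character, so its cost depends on the length of the word rather than on the size of the dictionary.

```python
class TrieNode:
    """One node per prefix; is_word marks the end of a complete word."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add a word, creating nodes for any prefixes not yet present."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the final node or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if word was inserted as a complete word."""
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        """True if at least one stored word starts with prefix."""
        return self._walk(prefix) is not None
```

Because the preprocessed dictionary is entirely uppercase, a trie built from it only ever needs to branch on the letters A to Z.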