OVERVIEW
Formal definition: Trie Node
Each node of the trie is contains
the following fields:
A trie is a multiway tree where each node is a character. If you follow a path from the root to some point in the trie, depending on the node, the path will form a word. Nodes which end a word are marked as such. So basically, a trie is a finite automata. Here is an example of a small trie, which stores the words car, cart, cat, and dog.
Root
/ \
C D
/
\
A
O
/ \
\
R T
G
/
T
ANALYSIS
One of the reasons a trie is a good choice for dictionary searches is that it is space conservative. Words that share prefixes use the same nodes, such as the words “cart” and “cat” do in the above example. Checking to see if a word is in the dictionary is very simple, just start at the root and attempt to find a path which spells out the word that ends in a terminal node. If there is such a path, the word is in the dictionary, otherwise, it’s not.
For example, if we were looking to see if the word “door” is in the above trie, we would first go to the ‘d’, then to the ‘o’, and then try to go to another ‘o’. But we see that the second ‘o’ does not exist. So we can’t go any further. We check whether the first ‘o’ is marked as being a terminal node, and since it’s not, we conclude that “door” is not in the above trie.
Another advantage of using a trie
is that it complements our searching algorithm nicely. The searching
algorithm relies heavily on prefixes, and those are easy to generate with
tries.
TESTING
We calculated the number of nodes the preprocessed UNIX dictionary required, and it turned out to be 42709. This was for 17295 words. That’s approximately 2.47 nodes per word. So if we were dealing with a 200,000 word dictionary, there would be around 494,000 nodes.
If we had not used a trie and instead
stored each word separately (with common prefixes not being taken advantage
of), we would need about 69180 nodes (assuming the average length of a
word is 4 letters.) 69180 = 17295 * 4. The trie required only
42709 nodes, which is better by a factor of 1.6.
ALTERNATIVE WAYS OF STORING THE DICTIONARY
Another way we could have stored the dictionary is with a directed acyclic word graph (DAWG). A DAWG has the same advantages of a trie, but conserves space greatly. It removes all redundancy from the trie. A DAWG can be obtained from a trie by applying a minimization algorithm. This algorithm is exactly the one which is used to minimize the number of states in a finite automata.
In future implementations, it is
recommended that a DAWG be used. It is superior to a trie in that
it is even more space efficient. We decided not to use it for this
project because we felt that a trie performed well considering the pre-defined
time constraints.