DICTIONARY REPRESENTATION: TRIE


OVERVIEW

Formal definition: Trie Node
Each node of the trie is contains the following fields:

  1. Character
  2. Valid bit
  3. An array of 26 pointers, one for each letter.
The valid bit indicates if the node is a terminal or not. If it is, the value is 1, otherwise it is 0.

A trie is a multiway tree where each node is a character.  If you follow a path from the root to some point in the trie, depending on the node, the path will form a word.  Nodes which end a word are marked as such.  So basically, a trie is a finite automata. Here is an example of a small trie, which stores the words car, cart, cat, and dog.

                                           Root
                                          /        \
                                       C          D
                                       /             \
                                     A              O
                                    /   \               \
                                  R    T             G
                                 /
                               T
 

ANALYSIS

One of the reasons a trie is a good choice for dictionary searches is that it is space conservative.  Words that share prefixes use the same nodes, such as the words “cart” and “cat” do in the above example.  Checking to see if a word is in the dictionary is very simple, just start at the root and attempt to find a path which spells out the word that ends in a terminal node.  If there is such a path, the word is in the dictionary, otherwise, it’s not.

For example, if we were looking to see if the word “door” is in the above trie, we would first go to the ‘d’, then to the ‘o’, and then try to go to another ‘o’.  But we see that the second ‘o’ does not exist.  So we can’t go any further.  We check whether the first ‘o’ is marked as being a terminal node, and since it’s not, we conclude that “door” is not in the above trie.

Another advantage of using a trie is that it complements our searching algorithm nicely.  The searching algorithm relies heavily on prefixes, and those are easy to generate with tries.
 

TESTING

We calculated the number of nodes the preprocessed UNIX dictionary required, and it turned out to be 42709.  This was for 17295 words.  That’s approximately 2.47 nodes per word.  So if we were dealing with a 200,000 word dictionary, there would be around 494,000 nodes.

If we had not used a trie and instead stored each word separately (with common prefixes not being taken advantage of), we would need about 69180 nodes (assuming the average length of a word is 4 letters.)  69180 = 17295 * 4.  The trie required only 42709 nodes, which is better by a factor of 1.6.
 

ALTERNATIVE WAYS OF STORING THE DICTIONARY

Another way we could have stored the dictionary is with a directed acyclic word graph (DAWG).  A DAWG has the same advantages of a trie, but conserves space greatly.  It removes all redundancy from the trie.  A DAWG can be obtained from a trie by applying a minimization algorithm.  This algorithm is exactly the one which is used to minimize the number of states in a finite automata.

In future implementations, it is recommended that a DAWG be used.  It is superior to a trie in that it is even more space efficient.  We decided not to use it for this project because we felt that a trie performed well considering the pre-defined time constraints.