Trigram Scoring


Description

A trigram is a sequence of three consecutive characters. Trigram scoring assigns a value to a text by comparing the trigrams it contains with their distribution in a sample text.
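
The distribution itself has to be collected from the sample text first. The following is a minimal sketch of how that might be done, not a routine from our toolkit: the name CountTrigrams and the constant ALPHABET_SIZE are hypothetical, and the sketch assumes the table is indexed directly by character values, in the same way the Score routine indexes it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET_SIZE 256  /* assumption: one table entry per byte value */

/* Hypothetical helper: adds the count of every trigram in `sample`
   to `dist`. The caller must supply a zero-initialised table. */
void CountTrigrams(int dist[ALPHABET_SIZE][ALPHABET_SIZE][ALPHABET_SIZE],
                   const char *sample)
{
    assert(dist != NULL && sample != NULL);

    /* Treat characters as unsigned so they are valid array indices. */
    const unsigned char *s = (const unsigned char *)sample;
    size_t length = strlen(sample);

    /* Slide a three-character window over the sample text. */
    for (size_t i = 0; i + 2 < length; i++)
        dist[s[i]][s[i + 1]][s[i + 2]]++;
}
```

With 4-byte integers a table of this size occupies about 64 MB, so it is better declared static or allocated on the heap than placed on the stack. Restricting the alphabet to the 26 letters would shrink it considerably, at the cost of a mapping step.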

If the text contains trigrams that did not appear in the sample text, it is probably not valid text and should receive a low score.

Our algorithm introduces a "penalty" to take care of that. The penalty should not be too high, though, because many texts contain abbreviations that might not be found elsewhere. We recommend choosing the penalty to be the negative of the frequency of the trigram "the" in the sample text.
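
As a small illustration of that recommendation, the penalty could be read straight out of the distribution table. This is only a sketch under assumptions: the name RecommendedPenalty and the constant ALPHABET_SIZE are hypothetical, the table is assumed to be indexed by character values, and "the" is looked up in lower case.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHABET_SIZE 256  /* assumption: one table entry per byte value */

/* Hypothetical helper: returns the negative of the count of the
   trigram "the" in the sample distribution. */
int RecommendedPenalty(int dist[ALPHABET_SIZE][ALPHABET_SIZE][ALPHABET_SIZE])
{
    assert(dist != NULL);
    return -dist['t']['h']['e'];
}
```

If "the" occurred 57 times in the sample text, for example, every unseen trigram would subtract 57 from the score.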

Trigram scoring has proven to be extremely helpful and is our major tool for breaking ciphers.

Algorithm

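#include <string.h>

#define ALPHABET_SIZE 256   /* one table entry per possible character value */

int Score(int TrigramDistribution[ALPHABET_SIZE][ALPHABET_SIZE][ALPHABET_SIZE],
          const char Text[], int NonOccurPenalty)
{
    /* Index with unsigned characters so the subscripts are never negative. */
    const unsigned char *t = (const unsigned char *)Text;
    size_t length = strlen(Text);   /* sizeof(Text) would only give the pointer size */
    int score = 0;

    /* Slide a three-character window over the text. */
    for (size_t i = 0; i + 2 < length; i++) {
        int count = TrigramDistribution[t[i]][t[i+1]][t[i+2]];
        if (count)
            score += count;             /* trigram seen in the sample text */
        else
            score += NonOccurPenalty;   /* trigram never seen: apply the penalty */
    }

    return score;
}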

Last Update: 15.04.96 (Format: DD.MM.YY)