This project is read-only.
Basic Concept: Levenshtein Distance

It is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

If we look at the word “tha” as an example for a misspelled word, we can change the “a” to “e” so the word becomes “the”, we can delete t and add d so the word becomes “had”.

This means that the Levenshtein distance between “tha” and “the” is 1 since one edit operation was needed to transform between the two words, and the Levenshtein distance between “tha” and “had” is 2 since two edit operations were needed to transform between the two words.

Bayesian Spellchecking

Bayesian spellchecking takes two factors into consideration when determining the best correction for a misspelled word:

1- The probability that the suggest correction is what the user really meant when he misspelled the word he wrote. This is equal to the Levenshtein distance between the two words. The less the Levenshtein distance the more probable the user meant this word when writing the misspelled word.

2- The probability of the suggested correction to appear in the English language. For example, the word “the” and “that” are both a correction for the misspelled word “tha”, and both have a Levenshtein distance of 1 from it, but “the” is the most common word in English, so it should be the best suggested correction.

DB4O

I used DB4O as a database to store the words. DB4O is good for this application for the following reasons:

1-It is an embedded database (formed of some dlls), no need for a database server.
2-It is an object database which can store and retrieve objects.
3-It supports Linq so no need for learning special query language.

Implementation

The spell corrector is a dll written in 142 lines of C# code and one public class, two internal classes.

The implementation has two parts, forming a dictionary of words to check against and then searching in these words.

When building a dictionary, you take a paragraph, separate it into words and add these words by calling the Method “AddWord”. When adding a word that already exists, the dictionary saves the number of times this word has been seen before to take it into consideration when returning correction results. The number of times a word has been seen is a measure of the probability of this word to appear in the English language.

This means that the results of the dictionary depend on the text added to it and how much does it represent a good sample of the English language overall.
When searching for possible corrections for a word, a problem appears: The application should measure the Levenshtein distance between the input word and all the words in the dictionary to determine the probability that the user meant a given correction.

To make the search fast, a data structure called BK Tree is used that allows us to quickly find all the words at a specific distance from a given word. A good explanation of the BK tree can be found here.

When searching, all the words in the dictionary are read and added to an in memory BK Tree which is used in the search operations to return results fast.
Searching is done by calling the method “CorrectWord” which returns a list of suggested corrections for an input word, with a given distance and with a maximum number of results.

The results are sorted based on the Bayes’ algorithm, taking into consideration the number of times the word has been seen before and the Levenshtein distance between the input word and the suggested correction.

Solution structure
The solution folder contains three projects
1- A WPF trainer application that is used to add words to the dictionary.
2- A WPF tester application.
3- The main application itself which is a class library application.

Last edited Mar 14, 2010 at 1:00 PM by muhammadadel, version 1

Comments

No comments yet.