主要内容

Create Custom Spelling Correction Function Using Edit Distance Searchers

This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.

Lemmatization with normalizeWords and word2vec requires correctly spelled words to work. To easily correct the spelling of words in text, use the correctSpelling function. To learn how to create a spelling correction function from scratch using edit distance searchers, use this example as a guide.

If you have misspelled words in a collection of text, then you can use edit distance searchers to find the nearest correctly spelled words to a given vocabulary. To correct the spelling of misspelled words in documents, replace them with the nearest neighbors in the vocabulary. Use edit distance searchers to find the nearest correctly spelled word to misspelled words according to an edit distance. For example, the number of adjacent grapheme swaps and grapheme insertions, deletions, and substitutions.

Load Data

Create a vocabulary of known words. Download and extract the Spell Checking Oriented Word Lists (SCOWL) from https://sourceforge.net/projects/wordlist/ into a folder in the current directory. Import the words from the downloaded data using the supporting function scowlWordList. Update the folder name to match the version of the word list you download.

folderName = "scowl-2020.12.07";
maxSize = 60;
vocabulary = scowlWordList(folderName,"english",maxSize);

View the number of words in the vocabulary.

numWords = numel(vocabulary)
numWords = 98285

Create Simple Spelling Corrector

Using the imported vocabulary, create an edit distance searcher with a maximum distance of 2. For better results, allow for adjacent grapheme swaps by setting the SwapCost option to 1. For large vocabularies, this can take a few minutes.

maxDist = 2;
eds = editDistanceSearcher(vocabulary,maxDist,SwapCost=1);

This edit distance searcher is case sensitive which means that changing the case of characters contributes to the edit distance. For example, the searcher can find the neighbor "testing" for the word "tseting" because it has edit distance 1 (one swap), but not of the word "TSeTiNG" because it has edit distance 6.

Correct Spelling

Correct the spelling of misspelled words in an array of tokenized documents by selecting the misspelled words and finding the nearest neighbors in the edit distance searcher.

Create a tokenized document object containing typos and spelling mistakes.

str = "An exmaple dccoument with typos and averyunusualword.";
document = tokenizedDocument(str)
document = 
  tokenizedDocument:

   8 tokens: An exmaple dccoument with typos and averyunusualword .

Convert the documents to a string array of words using the string function.

words = string(document)
words = 1×8 string
    "An"    "exmaple"    "dccoument"    "with"    "typos"    "and"    "averyunusualword"    "."

Find the words that need correction. To ignore words that are correctly spelled, find the indices of the words already in the vocabulary. To ignore punctuation and complex tokens such as email addresses, find the indices of the words which do not have the token types "letters" or "other". Get the token details from the document using the tokenDetails function.

tdetails = tokenDetails(document);
idxVocabularyWords = ismember(tdetails.Token,eds.Vocabulary);

idxComplexTokens = ...
    tdetails.Type ~= "letters" & ...
    tdetails.Type ~= "other";

idxWordsToCheck = ...
    ~idxVocabularyWords & ...
    ~idxComplexTokens
idxWordsToCheck = 8×1 logical array

   1
   1
   1
   0
   0
   0
   1
   0

Find the numeric indices of the words and view the corresponding words.

idxWordsToCheck = find(idxWordsToCheck)
idxWordsToCheck = 4×1

     1
     2
     3
     7

wordsToCheck = words(idxWordsToCheck)
wordsToCheck = 1×4 string
    "An"    "exmaple"    "dccoument"    "averyunusualword"

Notice that the word "An" is flagged as a word to check. This word is flagged because the vocabulary does not contain the word "An" with an uppercase "A". A later section in the example shows how to create a case insensitive spelling corrector.

Find the nearest words and their distances using the knnsearch function with the edit distance searcher.

[idxNearestWords,d] = knnsearch(eds,wordsToCheck)
idxNearestWords = 4×1

         165
        1352
        1151
         NaN

d = 4×1

     1
     1
     2
   Inf

If any of the words are not found in the searcher, then the function returns index NaN with distance Inf. The word "averyunusualword" does not have a match within edit distance 2, so the function returns the index NaN for that word.

Find the indices of the words with positive finite edit distances.

idxMatches = ~isnan(idxNearestWords)
idxMatches = 4×1 logical array

   1
   1
   1
   0

Get the indices of the words with matches in the searcher and view the corresponding corrected words in the vocabulary.

idxCorrectedWords = idxNearestWords(idxMatches)
idxCorrectedWords = 3×1

         165
        1352
        1151

correctedWords = eds.Vocabulary(idxCorrectedWords)
correctedWords = 1×3 string
    "an"    "example"    "document"

Replace the misspelled words that have matches with the corrected words.

idxToCorrect = idxWordsToCheck(idxMatches);
words(idxToCorrect) = correctedWords
words = 1×8 string
    "an"    "example"    "document"    "with"    "typos"    "and"    "averyunusualword"    "."

To create a tokenized document of these words, use the tokenizedDocument function and set TokenizedMethod to "none".

document = tokenizedDocument(words,TokenizeMethod="none")
document = 
  tokenizedDocument:

   8 tokens: an example document with typos and averyunusualword .

The next section shows how to correct the spelling of multiple documents at once by creating a custom spelling correction function and using docfun.

Create Spelling Correction Function

To correct the spelling in multiple documents at once, create a custom function using the code from the previous section and use this function with the docfun function.

Create a function that takes an edit distance searcher, a string array of words, and the corresponding table of token details as inputs and outputs the corrected words. The correctSpelling function, listed at the end of the example, corrects the spelling in a string array of words using the corresponding token details and an edit distance searcher.

To use this function with the docfun function, create a function handle that takes a string array of words and the corresponding table of token details as the inputs.

func = @(words,tdetails) correctSpelling(eds,words,tdetails);

Correct the spelling of an array of tokenized documents using docfun with the function handle func.

str = [
    "Here is some reallyu badly wrirten texct."
    "Some moree mitsakes here too."];
documents = tokenizedDocument(str);
updatedDocuments = docfun(func,documents)
updatedDocuments = 
  2×1 tokenizedDocument:

    8 tokens: here is some really badly written text .
    6 tokens: come more mistakes here too .

Note that uppercase characters can get corrected to different lowercase characters. For example, the word "Some" can get corrected to "come". If multiple words in the edit distance searcher vocabulary have the same edit distance to the input, then the function outputs the first result it found. For example, the words "come" and "some" both have edit distance 1 from the word "Some".

The next section shows how to create an spelling corrector that is case insensitive.

Create Case Insensitive Spelling Corrector

To prevent differences in case clashing with other substitutions, create an edit distance searcher with the vocabulary in lower case and convert the documents to lowercase before using the edit distance searcher.

Convert the vocabulary to lowercase. This operation can introduce duplicate words, remove them by taking the unique values only.

vocabularyLower = lower(vocabulary);
vocabularyLower = unique(vocabularyLower);

Create an edit distance searcher using the lowercase vocabulary using the same options as before. This can take a few minutes to run.

maxDist = 2;
eds = editDistanceSearcher(vocabularyLower,maxDist,SwapCost=1);

Use the edit distance searcher to correct the spelling of the words in tokenized document. To use the case insensitive spelling corrector, convert the documents to lowercase.

documentsLower = lower(documents);

Correct the spelling using the new edit distance searcher using same steps as before.

func = @(words,tdetails) correctSpelling(eds,words,tdetails);
updatedDocuments = docfun(func,documentsLower)
updatedDocuments = 
  2×1 tokenizedDocument:

    8 tokens: here is some really badly written text .
    6 tokens: some more mistakes here too .

Here, the word "Some" in the original text is converted to "some" before being input to the spelling corrector. The corresponding word "some" is unaffected by the searcher as the word some occurs in the vocabulary.

Spelling Correction Function

The correctSpelling function corrects the spelling in a string array of words using the corresponding token details and an edit distance searcher. You can use this function with docfun to correct the spelling of multiple documents at once.

function words = correctSpelling(eds,words,tdetails)

% Get indices of misspelled words ignoring complex tokens.
idxVocabularyWords = ismember(tdetails.Token,eds.Vocabulary);

idxComplexTokens = ...
    tdetails.Type ~= "letters" & ...
    tdetails.Type ~= "other";

idxWordsToCheck = ...
    ~idxVocabularyWords & ...
    ~idxComplexTokens;

% Convert to numeric indices.
idxWordsToCheck = find(idxWordsToCheck);

% Find nearest words.
wordsToCheck = words(idxWordsToCheck);
idxNearestWords = knnsearch(eds,wordsToCheck);

% Find words with matches.
idxMatches = ~isnan(idxNearestWords);

% Get corrected words.
idxCorrectedWords = idxNearestWords(idxMatches);
correctedWords = eds.Vocabulary(idxCorrectedWords);

% Correct words.
idxToCorrect = idxWordsToCheck(idxMatches);
words(idxToCorrect) = correctedWords;

end

See Also

| | | | | |

Topics