Extract Keywords from Text Data Using RAKE

This example shows how to extract keywords from text data using Rapid Automatic Keyword Extraction (RAKE).

The RAKE algorithm uses a delimiter-based approach to identify candidate keywords and scores them using the word co-occurrences that appear in those candidates. Keywords can contain multiple tokens. The algorithm also merges keywords when they appear multiple times separated by the same merging delimiter.
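The following is a minimal sketch of the candidate and scoring steps, written in base MATLAB for illustration only. It is not the implementation of rakeKeywords. It assumes the degree-to-frequency word score described in the RAKE paper [1], uses a hand-picked delimiter list, and skips the merging step. The variable names are hypothetical.

```matlab
% Simplified RAKE sketch (illustrative only, no merging step).
% Tokens of the sentence pair "Analyze text and images. You can import text and images."
tokens = ["Analyze" "text" "and" "images" "." "You" "can" "import" "text" "and" "images" "."];
delimiters = ["and" "You" "can" "."];   % assumed delimiter list for this sketch

% Split the token stream into candidate keywords at each delimiter.
isDelim = ismember(tokens,delimiters);
candidates = {};
current = strings(1,0);
for k = 1:numel(tokens)
    if isDelim(k)
        if ~isempty(current)
            candidates{end+1} = current; %#ok<SAGROW>
            current = strings(1,0);
        end
    else
        current(end+1) = tokens(k); %#ok<SAGROW>
    end
end
if ~isempty(current)
    candidates{end+1} = current;
end

% Word frequency: number of candidate occurrences of each word.
% Word degree: total length of the candidates in which the word occurs.
words = unique([candidates{:}]);
freq = zeros(size(words));
deg = zeros(size(words));
for c = 1:numel(candidates)
    cand = candidates{c};
    for w = cand
        idx = words == w;
        freq(idx) = freq(idx) + 1;
        deg(idx) = deg(idx) + numel(cand);
    end
end
wordScore = deg ./ freq;

% Candidate score: sum of the scores of the words in the candidate.
candScore = cellfun(@(cand) sum(wordScore(ismember(words,cand))),candidates);
```

For this input, the candidates are "Analyze text", "images", "import text", and "images", with scores 4, 1, 4, and 1. Words that appear in longer candidates accumulate a higher degree, which is why multi-word candidates tend to score higher.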

Extract Keywords

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the keywords using the rakeKeywords function.

tbl = rakeKeywords(documents)
tbl=12×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____

    "MATLAB"        "provides"    "tools"              1             8  
    "MATLAB"        ""            ""                   1             2  
    "scientists"    "and"         "engineers"          1             2  
    "scientists"    ""            ""                   1             1  
    "engineers"     ""            ""                   1             1  
    "Analyze"       "text"        ""                   2             4  
    "import"        "text"        ""                   2             4  
    "images"        ""            ""                   2             1  
    "Analyze"       "text"        ""                   3             4  
    "images"        ""            ""                   3             1  
    "videos"        ""            ""                   3             1  
    "MATLAB"        ""            ""                   3             1  

If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
head(tbl)
             Keyword              DocumentNumber    Score
    __________________________    ______________    _____

    "MATLAB provides tools"             1             8  
    "MATLAB"                            1             2  
    "scientists and engineers"          1             2  
    "scientists"                        1             1  
    "engineers"                         1             1  
    "Analyze text"                      2             4  
    "import text"                       2             4  
    "images"                            2             1  
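The call to strip matters because join inserts a space between every pair of elements, including the empty padding strings. A small sketch in base MATLAB, using a hand-built string array in place of the Keyword variable, shows the intermediate result:

```matlab
% Two keywords padded with "" to a common width, mimicking the Keyword variable.
keywords = ["Analyze" "text" ""; "images" "" ""];

% join combines the elements of each row, separated by a space.
joined = join(keywords);    % ["Analyze text "; "images  "] (trailing spaces)

% strip removes the leading and trailing whitespace left by the padding.
cleaned = strip(joined);    % ["Analyze text"; "images"]
```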

Specify Maximum Number of Keywords Per Document

The rakeKeywords function, by default, returns all identified keywords. To reduce the number of keywords, use the 'MaxNumKeywords' option.

Extract the top three keywords for each document by setting the 'MaxNumKeywords' option to 3.

tbl = rakeKeywords(documents,'MaxNumKeywords',3)
tbl=9×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____

    "MATLAB"        "provides"    "tools"              1             8  
    "MATLAB"        ""            ""                   1             2  
    "scientists"    "and"         "engineers"          1             2  
    "Analyze"       "text"        ""                   2             4  
    "import"        "text"        ""                   2             4  
    "images"        ""            ""                   2             1  
    "Analyze"       "text"        ""                   3             4  
    "images"        ""            ""                   3             1  
    "videos"        ""            ""                   3             1  
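As a quick check, you can count the keywords returned for each document. This sketch assumes that the DocumentNumber variable contains positive integer indices, as in the output shown above. The variable name numKeywords is chosen here for illustration.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

tbl = rakeKeywords(documents,'MaxNumKeywords',3);

% Count the rows of the table that belong to each document.
numKeywords = accumarray(double(tbl.DocumentNumber),1)
```

Each entry of numKeywords should be at most 3. A document can yield fewer keywords than the maximum if the algorithm identifies fewer candidates.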

Specify Delimiters

Notice that in the extracted keywords above, the function extracts the multi-word keyword "scientists and engineers" from the first document, but does not extract the multi-word keyword "text and images" from the second document. This is because the RAKE algorithm uses tokens appearing between delimiters as candidate keywords, and the algorithm only merges keywords with delimiters when the merged phrase appears multiple times.

In this case, the instances of the token "text" appear within two different multi-word keyword candidates, "Analyze text" and "import text". Because the function does not extract "text" as a separate candidate keyword, the algorithm does not consider merging the candidates "text" and "images" across the delimiter "and".

You can specify the delimiters used for extracting keywords using the 'Delimiters' and 'MergingDelimiters' options. To specify delimiters that should not appear in extracted keywords, use the 'Delimiters' option. To specify delimiters that can appear in extracted keywords, use the 'MergingDelimiters' option.

Extract keywords from the same text as before and also specify the words "Analyze" and "import" as merging delimiters.

newDelimiters = ["Analyze" "import"];
mergingDelimiters = [stopWords newDelimiters];

tbl = rakeKeywords(documents,'MergingDelimiters', mergingDelimiters)
tbl=12×3 table
                     Keyword                     DocumentNumber    Score
    _________________________________________    ______________    _____

    "MATLAB"        "provides"    "tools"              1             8  
    "MATLAB"        ""            ""                   1             2  
    "scientists"    "and"         "engineers"          1             2  
    "scientists"    ""            ""                   1             1  
    "engineers"     ""            ""                   1             1  
    "text"          "and"         "images"             2             2  
    "text"          ""            ""                   2             1  
    "images"        ""            ""                   2             1  
    "text"          ""            ""                   3             1  
    "images"        ""            ""                   3             1  
    "videos"        ""            ""                   3             1  
    "MATLAB"        ""            ""                   3             1  

Notice here that the function treats the tokens "text" and "images" as keywords and also extracts the merged keyword "text and images". To learn more about the RAKE algorithm, see Rapid Automatic Keyword Extraction.
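The 'Delimiters' option can be set in a similar way. The sketch below assumes that the option accepts a string array of tokens, as 'MergingDelimiters' does, and that the specified list replaces the default one, so the punctuation tokens are listed explicitly. The choice of "provides" as an extra delimiter is arbitrary and only for illustration.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

% Assumed delimiter list for this sketch: punctuation plus the word "provides".
delimiters = ["." "," "provides"];

tbl = rakeKeywords(documents,'Delimiters',delimiters)
```

Because tokens given in 'Delimiters' do not appear in extracted keywords, the token "provides" should no longer be part of any keyword in the output.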

Alternatives

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the RAKE algorithm uses a delimiter-based approach to extract candidate keywords, the extracted keywords can be very long. Alternatively, you can try extracting keywords using the TextRank algorithm, which starts with individual tokens as candidate keywords and then merges them when appropriate. To extract keywords using TextRank, use the textrankKeywords function. To learn more, see Extract Keywords from Text Data Using TextRank.
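For comparison, a minimal sketch of the TextRank alternative on the same text data might look like this. It uses textrankKeywords with its default options and assumes that the output table has the same Keyword layout as the rakeKeywords output. The variable name tblTextRank is chosen here for illustration.

```matlab
textData = [
    "MATLAB provides tools for scientists and engineers. MATLAB is used by scientists and engineers."
    "Analyze text and images. You can import text and images."
    "Analyze text and images. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

% Extract keywords using TextRank with default options.
tblTextRank = textrankKeywords(documents);

% Join multi-word keywords into single strings for readability.
if size(tblTextRank.Keyword,2) > 1
    tblTextRank.Keyword = strip(join(tblTextRank.Keyword));
end
head(tblTextRank)
```

The two algorithms score keywords differently, so compare the extracted keywords rather than the score values.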

References

[1] Rose, Stuart, Dave Engel, Nick Cramer, and Wendy Cowley. "Automatic keyword extraction from individual documents." Text mining: applications and theory 1 (2010): 1-20.
