
joinSimilarTextChunks

Join semantically similar text chunks

Since R2026a

    Description

    newChunkTable = joinSimilarTextChunks(chunkTable) joins adjacent, semantically similar text chunks from an input text chunk table chunkTable. This function requires Deep Learning Toolbox™. By default, this function also requires the Text Analytics Toolbox™ Model for all-MiniLM-L6-v2 Network support package.


    newChunkTable = joinSimilarTextChunks(chunkTable,Name=Value) specifies additional options using one or more name-value arguments.


    Examples


    Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.

    str = extractFileText("sonnets.txt");

    Split str into text chunks using the splitTextChunks function. Specify the target length as 100.

    chunkTable = splitTextChunks(str,TargetLength=100);

    Count the number of rows in the text chunk table.

    height(chunkTable)
    ans = 
    1277
    

    Join similar text chunks using the joinSimilarTextChunks function. This function requires Deep Learning Toolbox and Text Analytics Toolbox Model for all-MiniLM-L6-v2 Network. If this support package is not installed, then the function provides a download link.

    newChunkTable = joinSimilarTextChunks(chunkTable,MinSimilarity=0.5);

    Count the number of rows in the new text chunk table.

    height(newChunkTable)
    ans = 
    1221
    

    The new text chunk table has fewer rows than the original table because the function joined several text chunks.
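
    With the counts shown above, the row count dropped by 1277 - 1221 = 56 at a minimum similarity of 0.5. To see how the threshold affects the result, one could sweep MinSimilarity over a few values and record the resulting row counts. The following is a minimal sketch that reuses chunkTable from this example; the threshold values and the variable names thresholds, numChunks, and joined are illustrative choices, not part of the original example, and no output is shown because it depends on the data and model.

```matlab
% Illustrative thresholds (assumed values, chosen only for demonstration).
thresholds = [0.3 0.5 0.7];
numChunks = zeros(size(thresholds));

for k = 1:numel(thresholds)
    % Join adjacent chunks whose similarity is at least thresholds(k).
    joined = joinSimilarTextChunks(chunkTable,MinSimilarity=thresholds(k));
    numChunks(k) = height(joined);
end

% List each threshold next to the resulting number of chunks.
table(thresholds',numChunks',VariableNames=["MinSimilarity","NumChunks"])
```

    A lower threshold lets more adjacent pairs qualify for joining, so the row count is expected to shrink as the threshold decreases.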

    First, split an HTML document into sections using the splitHTMLSections function.

    str = "<html><body><head><title>Title</title></head>" + ...
        "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
        "<h2>Section 1</h2><p>This content is similar to the next section.</p>" + ...
        "<h2>Section 2</h2><p>This content is similar to the previous section.</p> + ..." + ...
        "<h1>Chapter 2</h1>" + ...
        "<h2>Section 1</h2><p>This content is very similar to the previous section.</p></body></html>";
    chunkTable = splitHTMLSections(str)
    chunkTable=5×3 table
                                            Text                                            H1             H2     
        ____________________________________________________________________________    ___________    ___________
    
        "<html><body><head><title>Title</title></head>"                                 <missing>      <missing>  
        "<p>Introductory paragraph of chapter 1.</p>"                                   "Chapter 1"    <missing>  
        "<p>This content is similar to the next section.</p>"                           "Chapter 1"    "Section 1"
        "<p>This content is similar to the previous section.</p> + ..."                 "Chapter 1"    "Section 2"
        "<p>This content is very similar to the previous section.</p></body></html>"    "Chapter 2"    "Section 1"
    
    

    Join similar text chunks that are in the same H1 section, but not necessarily in the same H2 section, by using the joinSimilarTextChunks function and specifying the Levels name-value argument as "H1".

    For more information about document similarity, see Information Retrieval with Document Embeddings.

    Specify a minimum similarity of 0.2. The joinSimilarTextChunks function requires Deep Learning Toolbox and Text Analytics Toolbox Model for all-MiniLM-L6-v2 Network. If this support package is not installed, then the function provides a download link.

    newChunkTable = joinSimilarTextChunks(chunkTable,Levels="H1",MinSimilarity=0.2)
    newChunkTable=3×3 table
                                                                                      Text                                                                                       H1             H2     
        _________________________________________________________________________________________________________________________________________________________________    ___________    ___________
    
        "<html><body><head><title>Title</title></head>"                                                                                                                      <missing>      <missing>  
        "<p>Introductory paragraph of chapter 1.</p>↵↵<p>This content is similar to the next section.</p>↵↵<p>This content is similar to the previous section.</p> + ..."    "Chapter 1"    <missing>  
        "<p>This content is very similar to the previous section.</p></body></html>"                                                                                         "Chapter 2"    "Section 1"
    
    

    Input Arguments


    chunkTable: Input table of text chunks, specified as a table. chunkTable must contain a variable named Text that holds the text chunks as strings, one chunk per row.

    Create a table of text chunks from a document or table of documents by using the splitTextChunks, splitHTMLSections, or splitMarkdownSections function.
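
    For quick experiments, a chunk table can also be assembled by hand. The sketch below is a hedged illustration: the chunk strings and the variable name chunks are made up, and it assumes the function accepts a table whose only variable is Text, which the requirement above suggests but does not state outright.

```matlab
% Hypothetical text chunks, one per row (made-up content).
chunks = [ ...
    "The cat sat on the mat."
    "A kitten rested on the rug."
    "Quarterly revenue rose by ten percent."];

% The input table must contain a variable named Text.
chunkTable = table(chunks,VariableNames="Text");

% Confirm the layout before joining.
assert(ismember("Text",chunkTable.Properties.VariableNames))
assert(isstring(chunkTable.Text))

% Requires Deep Learning Toolbox and, by default, the support package.
newChunkTable = joinSimilarTextChunks(chunkTable);
```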

    Name-Value Arguments


    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: joinSimilarTextChunks(chunkTable,TargetLength=100) sets the target length of the output text chunks to 100.

    Embedding: Document embedding used to compute the text chunk similarities, specified as a documentEmbedding object.

    The joinSimilarTextChunks function uses sentence transformer models, specified as documentEmbedding objects, to map the text chunks to vectors, such that similar text chunks have similar embedding vectors.
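
    As a hedged illustration of passing an embedding explicitly, the sketch below creates a documentEmbedding object and hands it to the function. The Model value shown is an assumption that matches the default model named in the description, and the embed call is included only to show the chunk-to-vector mapping; check the documentEmbedding reference page for the exact syntaxes. It reuses chunkTable from the examples above.

```matlab
% Load a sentence transformer model as a documentEmbedding object.
% The model name is an assumption; it matches the default named above.
emb = documentEmbedding(Model="all-MiniLM-L6-v2");

% Map each text chunk to an embedding vector (assumed usage of embed).
vectors = embed(emb,chunkTable.Text);
size(vectors)

% Use the same embedding when joining similar chunks.
newChunkTable = joinSimilarTextChunks(chunkTable,Embedding=emb);
```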

    TargetLength: Total target length of output text chunks, specified as a positive integer that represents the number of characters.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
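
    One way to see the effect of the target length is to inspect the character counts of the joined chunks. This is a minimal sketch that reuses chunkTable from the examples above; the value 500 and the variable names chunkLengths and lengthSummary are arbitrary illustrative choices. The text above does not say whether the target length is a hard limit, so the sketch only reports the lengths and makes no claim about them.

```matlab
% Join similar chunks with a target length of 500 characters
% (500 is an arbitrary illustrative value).
newChunkTable = joinSimilarTextChunks(chunkTable,TargetLength=500);

% Inspect the resulting chunk lengths in characters.
chunkLengths = strlength(newChunkTable.Text);
lengthSummary = [min(chunkLengths) median(chunkLengths) max(chunkLengths)]
```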

    Levels: Section levels within which to join similar text chunks, specified as a string array, character vector, or cell array of character vectors, corresponding to variables in the input table of text chunks.

    If you specify this argument, then the function joins text chunks only if they are part of the same section, that is, if their corresponding entries in the columns specified by Levels are all equal.

    By default, the function uses every table variable except for Text as section levels. For example, if a table has variables Text, H1, and H2, then by default, the function uses Levels = ["H1","H2"].

    If chunkTable does not have a variable specified by Levels, then joinSimilarTextChunks ignores that variable.

    Example: ["H1","H2"]

    Data Types: string | char | cell
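
    The default and the same-section rule described above can be sketched in base MATLAB. This illustrates the documented behavior and is not the function's actual implementation; the table T and its values are made up. Note that missing strings compare as unequal with ==, and the text above does not state how the function treats missing section entries, so the sketch avoids them.

```matlab
% Small table with the same layout as a splitHTMLSections result
% (values are illustrative).
T = table( ...
    ["intro";"part a";"part b";"other"], ...
    ["Chapter 1";"Chapter 1";"Chapter 1";"Chapter 2"], ...
    ["Section 0";"Section 1";"Section 2";"Section 1"], ...
    VariableNames=["Text","H1","H2"]);

% Documented default: every variable except Text.
names = string(T.Properties.VariableNames);
defaultLevels = names(names ~= "Text")

% Rows k and k+1 are in the same section if all chosen level columns match.
levels = "H1";
L = T{:,levels};
sameSection = all(L(1:end-1,:) == L(2:end,:),2)
```

    Here defaultLevels is ["H1","H2"]. With levels set to "H1", the first two adjacent pairs share a section and the third does not, so only the first three rows would be eligible for joining with each other.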

    MinSimilarity: Minimum cosine similarity between text chunks to join, specified as a scalar between 0 and 1.

    The function joins only adjacent text chunks whose embedding vectors have a cosine similarity of at least MinSimilarity. To specify the document embedding, use the Embedding name-value argument.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
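
    The threshold test can be illustrated with toy vectors in base MATLAB. This sketch shows only the cosine similarity check between adjacent rows; the matrix E holds made-up values rather than real embeddings, and the actual function also takes Levels and TargetLength into account in ways this sketch does not model.

```matlab
% Toy embedding vectors, one row per chunk (made-up values).
E = [1 0 1;
     1 0 0.9;
     0 1 0];

minSimilarity = 0.5;   % plays the role of the MinSimilarity argument

% Cosine similarity between each chunk and the next one.
a = E(1:end-1,:);
b = E(2:end,:);
sim = sum(a.*b,2) ./ (vecnorm(a,2,2) .* vecnorm(b,2,2));

% Adjacent pairs that meet the threshold are candidates for joining.
canJoin = sim >= minSimilarity
```

    The first pair has a cosine similarity of about 0.9986 and the second pair has a similarity of 0, so canJoin is [true; false].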

    Output Arguments


    newChunkTable: Output table of text chunks, returned as a table. newChunkTable contains the same variables as the input text chunk table chunkTable.


    Version History

    Introduced in R2026a