joinSimilarTextChunks

Join semantically similar text chunks

Since R2026a

Syntax

newChunkTable = joinSimilarTextChunks(chunkTable)

newChunkTable = joinSimilarTextChunks(chunkTable,Name=Value)

Description

newChunkTable = joinSimilarTextChunks(chunkTable) joins adjacent, semantically similar text chunks from an input text chunk table chunkTable. This function requires Deep Learning Toolbox™. By default, this function also requires the Text Analytics Toolbox™ Model for all-MiniLM-L6-v2 Network support package.


newChunkTable = joinSimilarTextChunks(chunkTable,Name=Value) specifies additional options using one or more name-value arguments.


Examples


Join Similar Text Chunks


Load the example data. The file sonnets.txt contains Shakespeare's sonnets in plain text. Extract the text from sonnets.txt using the extractFileText function.

str = extractFileText("sonnets.txt");

Split str into text chunks using the splitTextChunks function. Specify the target length as 100.

chunkTable = splitTextChunks(str,TargetLength=100);

Count the number of rows in the text chunk table.

height(chunkTable)

ans = 
1277

Join similar text chunks using the joinSimilarTextChunks function. This function requires Deep Learning Toolbox and Text Analytics Toolbox Model for all-MiniLM-L6-v2 Network. If this support package is not installed, then the function provides a download link.

newChunkTable = joinSimilarTextChunks(chunkTable,MinSimilarity=0.5);

Count the number of rows in the new text chunk table.

height(newChunkTable)

ans = 
1221

The new text chunk table has fewer rows than the original table because the function joined several text chunks.

Join Similar Text Chunks Within Section


First, split an HTML document into sections using the splitHTMLSections function.

str = "<html><body><head><title>Title</title></head>" + ...
    "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
    "<h2>Section 1</h2><p>This content is similar to the next section.</p>" + ...
    "<h2>Section 2</h2><p>This content is similar to the previous section.</p> + ..." + ...
    "<h1>Chapter 2</h1>" + ...
    "<h2>Section 1</h2><p>This content is very similar to the previous section.</p></body></html>";
chunkTable = splitHTMLSections(str)

chunkTable=5×3 table
                                        Text                                            H1             H2     
    ____________________________________________________________________________    ___________    ___________

    "<html><body><head><title>Title</title></head>"                                 <missing>      <missing>  
    "<p>Introductory paragraph of chapter 1.</p>"                                   "Chapter 1"    <missing>  
    "<p>This content is similar to the next section.</p>"                           "Chapter 1"    "Section 1"
    "<p>This content is similar to the previous section.</p> + ..."                 "Chapter 1"    "Section 2"
    "<p>This content is very similar to the previous section.</p></body></html>"    "Chapter 2"    "Section 1"

Join similar text chunks that are in the same H1 section, but not necessarily in the same H2 section, by using the joinSimilarTextChunks function and specifying the Levels name-value argument as "H1".

For more information about document similarity, see Information Retrieval with Document Embeddings.

Specify a minimum similarity of 0.2. The joinSimilarTextChunks function requires Deep Learning Toolbox and Text Analytics Toolbox Model for all-MiniLM-L6-v2 Network. If this support package is not installed, then the function provides a download link.

newChunkTable = joinSimilarTextChunks(chunkTable,Levels="H1",MinSimilarity=0.2)

newChunkTable=3×3 table
                                                                                  Text                                                                                       H1             H2     
    _________________________________________________________________________________________________________________________________________________________________    ___________    ___________

    "<html><body><head><title>Title</title></head>"                                                                                                                      <missing>      <missing>  
    "<p>Introductory paragraph of chapter 1.</p>↵↵<p>This content is similar to the next section.</p>↵↵<p>This content is similar to the previous section.</p> + ..."    "Chapter 1"    <missing>  
    "<p>This content is very similar to the previous section.</p></body></html>"                                                                                         "Chapter 2"    "Section 1"

Input Arguments


`chunkTable` — Input table of text chunks
table

Input table of text chunks. chunkTable must contain a variable named Text that holds the text chunks as strings, one chunk per row.

Create a table of text chunks from a document or table of documents by using the splitTextChunks, splitHTMLSections, splitMarkdownSections, or splitCustomSections function.

Name-Value Arguments


Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: joinSimilarTextChunks(chunkTable,TargetLength=100) sets the target length of the output text chunks to 100.

`Embedding` — Document embedding
`documentEmbedding` object

Document embedding used to compute the text chunk similarities, specified as a documentEmbedding object.

The joinSimilarTextChunks function uses sentence transformer models, specified as documentEmbedding objects, to map the text chunks to vectors, such that similar text chunks have similar embedding vectors.
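As a hedged usage sketch, the following passes a non-default embedding to the function. The Model name-value argument and the "all-MiniLM-L12-v2" model name are assumptions taken from the documentEmbedding function, not from this page, and the three-row chunk table is hypothetical. Running the code requires Deep Learning Toolbox and the corresponding model support package.

```matlab
% Hypothetical chunk table with the required Text variable.
Text = ["The cat sat on the mat."; ...
    "A cat was sitting on a mat."; ...
    "Stock prices fell sharply today."];
chunkTable = table(Text);

% Assumption: documentEmbedding accepts Model="all-MiniLM-L12-v2".
emb = documentEmbedding(Model="all-MiniLM-L12-v2");

% Use the chosen embedding to compute the text chunk similarities.
newChunkTable = joinSimilarTextChunks(chunkTable,Embedding=emb,MinSimilarity=0.5);
```

The number of rows in newChunkTable depends on the similarities that the chosen model assigns, so no expected output is shown.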

`TargetLength` — Total target length
4000 (default) | positive integer

Total target length of output text chunks, specified as a positive integer that represents the number of characters.

`Levels` — Section levels
string array | character vector | cell array of character vectors

Section levels within which to join similar text chunks, specified as a string array, character vector, or cell array of character vectors, corresponding to columns in the input table of text chunks.

If you specify this argument, then the function joins text chunks only if they are part of the same section, that is, if their corresponding entries in the columns specified by Levels are all equal.

By default, the function uses every table variable except for Text as section levels. For example, if a table has variables Text, H1, and H2, then by default, the function uses Levels = ["H1","H2"].

If chunkTable does not have a variable specified by Levels, then joinSimilarTextChunks ignores that variable.

Example: ["H1","H2"]

Data Types: string | char | cell
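The section-level rule can be illustrated with basic table operations. This is a minimal sketch of the default Levels value and of the "same section" test for adjacent rows, not the function's implementation. The table contents are hypothetical and contain no missing entries, because this page does not state how missing section entries are compared.

```matlab
% Hypothetical chunk table with two section-level variables.
Text = ["Intro."; "Part A."; "Part B."; "Other."];
H1 = ["Chapter 1"; "Chapter 1"; "Chapter 1"; "Chapter 2"];
H2 = ["Section 1"; "Section 1"; "Section 2"; "Section 1"];
chunkTable = table(Text,H1,H2);

% Default section levels: every table variable except Text.
varNames = string(chunkTable.Properties.VariableNames);
defaultLevels = varNames(varNames ~= "Text");   % ["H1","H2"]

% Adjacent rows are in the same section only if all level entries match.
levels = "H1";
entries = chunkTable{:,levels};
sameSection = all(entries(1:end-1,:) == entries(2:end,:),2);   % [1;1;0]
```

With levels = "H1", the first three rows count as one section, so only those rows are candidates for joining. With levels = ["H1","H2"], only rows 1 and 2 share a section.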

`MinSimilarity` — Minimum cosine similarity
0.9 (default) | scalar between 0 and 1

Minimum cosine similarity between text chunks to join, specified as a scalar between 0 and 1.

The function joins only adjacent text chunks whose embedding vectors have a cosine similarity of at least MinSimilarity. To specify the document embedding, use the Embedding name-value argument.
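The threshold test can be sketched with plain vector arithmetic. The two vectors below are hypothetical stand-ins for the embedding vectors of two adjacent chunks; real embedding vectors have many more elements.

```matlab
% Hypothetical embedding vectors for two adjacent text chunks.
a = [1 0 1];
b = [1 1 0];

% Cosine similarity: dot product divided by the product of the norms.
cosSim = dot(a,b)/(norm(a)*norm(b));   % approximately 0.5 for these vectors

% Chunks qualify for joining only if the similarity reaches the threshold.
minSimilarity = 0.9;
shouldJoin = cosSim >= minSimilarity;   % false for these vectors
```

Lowering minSimilarity to 0.5 or less would make this pair qualify, which is why the examples on this page use smaller thresholds to join more chunks.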

Output Arguments


`newChunkTable` — Output table of text chunks
table

Output table of text chunks. newChunkTable contains the same variables as the input text chunk table chunkTable.

More About


Text Chunks

Many analysis tools, including large language models (LLMs), perform better on small chunks of text than on large documents. Text Analytics Toolbox includes a range of functions that allow you to split large documents into semantically meaningful chunks.

Basic Workflow

The splitTextChunks function splits a document recursively into text chunks of a given target length. The function first splits a document into paragraphs. If any of the paragraphs are longer than the target length, then the function splits those paragraphs into sentences, and so on.

chunks = splitTextChunks(str);
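The paragraph-then-sentence fallback can be sketched with basic string functions. This is a simplified two-level illustration under stated assumptions, not the splitTextChunks implementation: the sample text, the target length of 30, and the sentence-boundary pattern are all hypothetical, and the sketch does not continue below the sentence level.

```matlab
% Hypothetical document with two paragraphs separated by a blank line.
str = "Short paragraph." + newline + newline + ...
    "This paragraph is longer. It has two sentences.";
targetLength = 30;

% First level: split into paragraphs.
paragraphs = split(str,[newline newline]);

chunks = strings(0,1);
for p = paragraphs'
    if strlength(p) <= targetLength
        % The paragraph fits, so keep it as one chunk.
        chunks(end+1,1) = p;
    else
        % Second level: split the long paragraph after sentence punctuation.
        sentences = regexp(p,'(?<=[.!?])\s+','split');
        chunks = [chunks; sentences'];
    end
end
```

For this input, chunks contains three strings: the short paragraph and the two sentences of the long paragraph.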

Advanced Workflow

Split your document into sections and preserve the section metadata using one of these functions:

`splitHTMLSections`	Split an HTML-formatted document into HTML sections according to the section tags `<h1>...</h1>`, `<h2>...</h2>`, …, `<h6>...</h6>`.
`splitMarkdownSections`	Split a Markdown-formatted document into Markdown sections, for example according to ATX section tags `#`, `##`, …, `######`.
`splitCustomSections`	Split a document into custom sections according to custom section delimiters.

Split your documents or your chunks recursively into paragraphs, sentences, and tokens using the splitTextChunks function.
To avoid redundancy, join similar adjacent chunks using the joinSimilarTextChunks function.
Add overlap between adjacent text chunks using the addTextChunkOverlap function. Adding text chunk overlap avoids changing the meaning of sentences by splitting at inopportune points, for example, splitting the sentence "I would never say I love cats" into "I would never say" and "I love cats." Adding overlap in this example results in the two chunks "I would never say I love" and "never say I love cats." You can also add surrounding text to individual chunks as context by using the findTextChunkContext function.
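The overlap in the cat example can be reproduced with plain string operations. This sketch only illustrates the idea of a two-word overlap. It does not call addTextChunkOverlap, and the overlap size n is an assumption chosen to match the chunks quoted above.

```matlab
% The two chunks produced by splitting at an inopportune point.
words1 = split("I would never say")';   % 1-by-4 string array
words2 = split("I love cats.")';        % 1-by-3 string array

% Hypothetical overlap size, in words.
n = 2;

% Extend each chunk with n words borrowed from its neighbor.
chunk1 = join([words1, words2(1:n)]," ");           % "I would never say I love"
chunk2 = join([words1(end-n+1:end), words2]," ");   % "never say I love cats."
```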

For an example showing the advanced workflow, see Split Document Into Semantically Meaningful Text Chunks.

Retrieval-Augmented Generation (RAG)

RAG combines the text generation capabilities of large language models (LLMs) with reliable information contained in a set of source documents. First, retrieve documents relevant to the user prompt from the set of source documents. Then, append the relevant documents to the prompt and use the LLM to generate a response.

To improve the quality of the generated output, split large documents into smaller, semantically meaningful chunks.
Use information retrieval to identify the text chunks that are relevant to the query. For more information, see Information Retrieval with Document Embeddings.
Create a prompt based on the most relevant chunks. To provide the LLM with additional context, you can add text from adjacent chunks within the same section by using the findTextChunkContext function, or you can add overlap between text chunks before information retrieval by using the addTextChunkOverlap function. Create a Markdown-formatted string from the text chunks using the formatTextChunks function. For an example, see Create Large Language Model (LLM) Prompt from Text Chunk.
Generate an answer using an LLM. To connect to large language model APIs using MATLAB, use the Large Language Models (LLMs) with MATLAB add-on.
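The retrieval and prompt-creation steps can be sketched with plain arrays. The chunk texts, the three-element vectors standing in for embeddings, and the prompt layout are all hypothetical; a real workflow would compute the vectors with a document embedding and could format the chunks with formatTextChunks instead.

```matlab
% Hypothetical text chunks and stand-in embedding vectors, one row per chunk.
chunks = ["Cats sleep most of the day."; ...
    "Cats nap for many hours."; ...
    "Stock prices fell today."];
chunkVecs = [1 0 0; 0.8 0.6 0; 0 0 1];

% Hypothetical query and its stand-in embedding vector.
query = "How long do cats sleep?";
queryVec = [1 0.1 0];

% Cosine similarity between the query and every chunk.
sims = (chunkVecs*queryVec') ./ (vecnorm(chunkVecs,2,2)*norm(queryVec));

% Keep the two most similar chunks.
[~,order] = sort(sims,"descend");
topChunks = chunks(order(1:2));

% Assemble a simple prompt from the retrieved chunks and the query.
prompt = "Context:" + newline + join(topChunks,newline) + newline + ...
    "Question: " + query;
```

Here the two cat-related chunks rank highest, so the unrelated third chunk is left out of the prompt.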

Version History

Introduced in R2026a
