splitCustomSections

Split document into custom sections

Since R2026a

Syntax

chunkTable = splitCustomSections(str,levelRegexp)

chunkTable = splitCustomSections(t,levelRegexp)

Description

chunkTable = splitCustomSections(str,levelRegexp) splits an input document str into sections at substrings matching the regular expression specified in levelRegexp.

chunkTable = splitCustomSections(t,levelRegexp) splits a table of documents t into sections.

Examples

Split Document into Custom Sections

Specify an input document as a string.

str = "Section 1. Content of section 1. " + ...
    "Subsection 1. Content of subsection 1. " + ...
    "Subsection 2. Content of subsection 2.";

The top-level sections in str are delineated by the word "Section", followed by a space, a number, and a full stop. Specify this pattern using a regular expression. For more information on regular expressions, see regexp.

sectionDivider = "Section \d+\.";

The second-level sections in str are delineated by the word "Subsection", followed by a space, a number, and a full stop. Specify this pattern using a regular expression.

subsectionDivider = "Subsection \d+\.";

Split str at the substrings defined by sectionDivider and subsectionDivider by using the splitCustomSections function.

chunkTable = splitCustomSections(str,[sectionDivider,subsectionDivider])

chunkTable=3×1 table
               Text           
    __________________________

    "Content of section 1."   
    "Content of subsection 1."
    "Content of subsection 2."
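
Before splitting, it can help to check which substrings each divider pattern actually matches. The following is a minimal sketch that uses the standard regexp function, not splitCustomSections itself, on the same document. The variable names sectionMatches and subsectionMatches are illustrative only.

```matlab
% Same document and divider patterns as in the example above.
str = "Section 1. Content of section 1. " + ...
    "Subsection 1. Content of subsection 1. " + ...
    "Subsection 2. Content of subsection 2.";
sectionDivider = "Section \d+\.";
subsectionDivider = "Subsection \d+\.";

% List the substrings matched by each pattern. Matching is case
% sensitive, so the lowercase "section 1." in the content is not a match.
sectionMatches = regexp(str,sectionDivider,"match");
subsectionMatches = regexp(str,subsectionDivider,"match");
```

If a pattern matches more or fewer substrings than you expect, adjust the regular expression before you pass it to splitCustomSections.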

Capture Custom Section Titles

To capture the section hierarchy and section titles in a chunk table, use named capture groups when you specify the regular expressions that define the section delineation syntax.

First, specify an input document as a string.

str = "Section 1. Title of section 1." + newline + ...
    "Content of section 1." + ...
    "Subsection 1. Title of subsection 1." + newline + ...
    "Content of subsection 1." + newline + ...
    "Subsection 2. Title of subsection 2." + newline + ...
    "Content of subsection 2.";

The top-level sections in str are delineated by these elements in order:

The word "Section"
A space
A number
A full stop
A space
The title
A newline character

Specify this pattern using a regular expression.

To save the section title in a variable called Section, use the named capture group "(?<Section>.*?)" in place of the title in the regular expression, followed by the newline character. This expression assumes that the title itself does not contain any newline characters, and that the first newline character after "Section X." marks the end of the title.

sectionDivider = "Section \d+\. (?<Section>.*?)" + newline;

Similarly, save the second-level section titles in a variable called Subsection.

subsectionDivider = "Subsection \d+\. (?<Subsection>.*?)" + newline;

Split str at the substrings defined by sectionDivider and subsectionDivider by using the splitCustomSections function.

chunkTable = splitCustomSections(str,[sectionDivider,subsectionDivider])

chunkTable=3×3 table
               Text                      Section                  Subsection       
    __________________________    _____________________    ________________________

    "Content of section 1."       "Title of section 1."    <missing>               
    "Content of subsection 1."    "Title of section 1."    "Title of subsection 1."
    "Content of subsection 2."    "Title of section 1."    "Title of subsection 2."
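
To see what a named capture group extracts on its own, you can test the divider pattern with the standard regexp function and its "names" option. This is a minimal sketch on a shortened version of the document above; the variable names tokens and sectionTitle are illustrative and are not part of splitCustomSections.

```matlab
% Shortened version of the example document: one section with a title.
str = "Section 1. Title of section 1." + newline + ...
    "Content of section 1.";
sectionDivider = "Section \d+\. (?<Section>.*?)" + newline;

% The "names" option returns a structure with one field per named group.
tokens = regexp(str,sectionDivider,"names");

% The lazy quantifier .*? stops at the first newline character, so the
% captured text is the title only.
sectionTitle = string(tokens.Section);
```

Here sectionTitle contains the text between "Section 1. " and the first newline character, which is the value that appears in the Section variable of the chunk table.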

Input Arguments

`str` — Input document
string array | character vector | cell array of character vectors

Input document, specified as a string array, character vector, or cell array of character vectors.

Data Types: string | char | cell

`t` — Input table
table

Input table of documents, specified as a table. t must have a variable named Text that contains the documents. Each document must be specified as a string scalar.
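
As a rough sketch of the table syntax, the following code builds a small input table and splits every document with one divider pattern. The document text, the file names, and the extra variable Source are hypothetical and serve only as an illustration.

```matlab
% Hypothetical documents and file names, for illustration only.
documents = [ ...
    "Section 1. Content of the first document."; ...
    "Section 1. Content of the second document."];
fileNames = ["first.txt"; "second.txt"];

% The documents must be in a variable named Text. Source is an
% illustrative extra variable.
t = table(documents,fileNames,VariableNames=["Text","Source"]);

% Split every document at the same divider pattern.
% splitCustomSections requires R2026a or later.
chunkTable = splitCustomSections(t,"Section \d+\.");
```

Based on the chunkTable description, each chunk should carry the Source value of the document it comes from.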

`levelRegexp` — Regular expression describing section break syntax
string array | character vector | cell array of character vectors

Regular expression describing the section break syntax, specified as a string array, character vector, or cell array of character vectors.

If you specify a string array or a cell array of character vectors, then the first element corresponds to the top-level section, the second element corresponds to the next-lower level section, and so on.

For more information on regular expressions, see regexp.

To capture the section hierarchy and titles as separate variables in the text chunk table, use named capture groups. For example, to create a variable named H1, replace the title portion of the section delimiter pattern with the named capture group (?<H1>.*?).

Example: "Section \d+\. (?<H1>.*?)" + newline

Data Types: string | char | cell

Output Arguments

`chunkTable` — Table of text chunks
table

Table of text chunks, returned as a table with one row per chunk. chunkTable always contains the variable Text, which holds each text chunk as a string scalar.

If you specify the input documents as a table, then chunkTable also contains all variables of the input table. For each chunk, the values of these variables are the same as for the document from which the chunk originates.

If you specify a named capture group in the levelRegexp input argument, then the text chunk table contains variables corresponding to the named capture groups. For example, if levelRegexp includes the string "# (?<H1>.*?)\n", then the table also has the variable H1, returned as a string scalar containing the text between "# " and "\n".

For each section defined by a named capture group, if a chunk is not part of that section, then the row for that chunk contains <missing> in the section column. If the input document does not contain any sections described by a named capture group, then the corresponding variable does not exist in the output text chunk table.
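
Because chunks outside a section are marked with <missing>, you can filter the chunk table with the standard ismissing function. In this minimal sketch, the table is built by hand to mirror the output of the Capture Custom Section Titles example, so the code runs without calling splitCustomSections. The variable name topLevelChunks is illustrative.

```matlab
% Hand-built stand-in for the chunk table from the
% Capture Custom Section Titles example.
chunkTable = table( ...
    ["Content of section 1."; "Content of subsection 1."; "Content of subsection 2."], ...
    ["Title of section 1."; "Title of section 1."; "Title of section 1."], ...
    [string(missing); "Title of subsection 1."; "Title of subsection 2."], ...
    VariableNames=["Text","Section","Subsection"]);

% Keep only the chunks that do not belong to any subsection.
isTopLevel = ismissing(chunkTable.Subsection);
topLevelChunks = chunkTable(isTopLevel,:);
```

The same logical indexing with ~isTopLevel selects the chunks that do belong to a subsection.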

More About

Text Chunks

Many analysis tools, including large language models (LLMs), perform better on small chunks of text than on large documents. Text Analytics Toolbox™ includes a range of functions that allow you to split large documents into semantically meaningful chunks.

Basic Workflow

The splitTextChunks function splits a document recursively into text chunks of a given target length. The function first splits a document into paragraphs. If any of the paragraphs are longer than the target length, then the function splits those paragraphs into sentences, and so on.

chunks = splitTextChunks(str);

Advanced Workflow

Split your document into sections and preserve the section metadata using one of these functions:

`splitHTMLSections`	Split an HTML-formatted document into HTML sections according to the section tags `<h1>...</h1>`, `<h2>...</h2>`, …, `<h6>...</h6>`.
`splitMarkdownSections`	Split a Markdown-formatted document into Markdown sections, for example according to ATX section tags `#` , `##` , …, `######` .
`splitCustomSections`	Split a document into custom sections according to custom section delimiters.

Split your documents or your chunks recursively into paragraphs, sentences, and tokens using the splitTextChunks function.
To avoid redundancy, join similar adjacent chunks using the joinSimilarTextChunks function.
Add overlap between adjacent text chunks using the addTextChunkOverlap function. Adding text chunk overlap avoids changing the meaning of sentences by splitting at inopportune points, for example, splitting the sentence "I would never say I love cats" into "I would never say" and "I love cats." Adding overlap in this example results in the two chunks "I would never say I love" and "never say I love cats." You can also add surrounding text to individual chunks as context by using the findTextChunkContext function.

For an example showing the advanced workflow, see Split Document Into Semantically Meaningful Text Chunks.

Retrieval-Augmented Generation (RAG)

RAG combines the text generation capabilities of large language models (LLMs) with reliable information contained in a set of source documents. First, retrieve the documents relevant to the user prompt from the set of source documents. Then, append the relevant documents to the prompt and use the LLM to generate a response.

To improve the quality of the generated output, split large documents into smaller, semantically meaningful chunks.
Use information retrieval to identify the text chunks that are relevant to the query. For more information, see Information Retrieval with Document Embeddings.
Create a prompt based on the most relevant chunks. To provide the LLM with additional context, you can add text from adjacent chunks within the same section by using the findTextChunkContext function, or you can add overlap between text chunks before information retrieval by using the addTextChunkOverlap function. Create a Markdown-formatted string from the text chunks using the formatTextChunks function. For an example, see Create Large Language Model (LLM) Prompt from Text Chunk.
Generate an answer using an LLM. To connect to large language model APIs using MATLAB, use the Large Language Models (LLMs) with MATLAB add-on.

Version History

Introduced in R2026a
