splitHTMLSections

Split HTML document into sections

Since R2026a

Syntax

chunkTable = splitHTMLSections(str)

chunkTable = splitHTMLSections(t)

Description

chunkTable = splitHTMLSections(str) splits an input HTML document str into sections according to the section tags <h1>...</h1>, <h2>...</h2>, ..., <h6>...</h6>.

chunkTable = splitHTMLSections(t) splits a table of HTML documents t into sections.

Examples

Split HTML Document into Sections

To split an HTML document into sections, use splitHTMLSections and specify the HTML code as a string.

str = "<html><body><head><title>Title</title></head>" + ...
    "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
    "<h2>Section 1</h2><p>Content of section 1.</p>" + ...
    "<h2>Section 2</h2><p>Content of section 2.</p></body></html>";
chunkTable = splitHTMLSections(str)

chunkTable=4×3 table
                         Text                              H1             H2     
    _______________________________________________    ___________    ___________

    "<html><body><head><title>Title</title></head>"    <missing>      <missing>  
    "<p>Introductory paragraph of chapter 1.</p>"      "Chapter 1"    <missing>  
    "<p>Content of section 1.</p>"                     "Chapter 1"    "Section 1"
    "<p>Content of section 2.</p></body></html>"       "Chapter 1"    "Section 2"

You can then use the extractHTMLText function to extract the text from the HTML code.

chunkText = extractHTMLText(chunkTable.Text)

chunkText = 4×1 string
    ""
    "Introductory paragraph of chapter 1."
    "Content of section 1."
    "Content of section 2."
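
As a possible follow-up, the sketch below stores the extracted text next to the HTML chunks and drops chunks that contain no visible text, such as the document preamble in the first row. The variable name PlainText is an assumption chosen for illustration; splitHTMLSections does not create it.

```matlab
% Same document as in the example above.
str = "<html><body><head><title>Title</title></head>" + ...
    "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
    "<h2>Section 1</h2><p>Content of section 1.</p>" + ...
    "<h2>Section 2</h2><p>Content of section 2.</p></body></html>";
chunkTable = splitHTMLSections(str);

% Store the plain text alongside the HTML chunks.
% PlainText is a hypothetical variable name chosen for this sketch.
chunkTable.PlainText = extractHTMLText(chunkTable.Text);

% Keep only the chunks that contain visible text.
hasText = strlength(chunkTable.PlainText) > 0;
chunkTable = chunkTable(hasText,:)
```

Given the output shown above, where only the first chunk yields an empty string, this should leave the three chunks that carry content.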

Split Table of HTML Documents into Sections

To split multiple HTML documents into chunks in a single chunk table, first create a table of documents, then call splitHTMLSections with that table as the input. To retain metadata about the source documents, such as their filenames, add the metadata to the table as additional variables.

Create a table from:

A variable Text that contains HTML documents.
A variable DocumentName that contains the names of the documents.

str1 = "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the first document.</p></body></html>";
str2 = "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the second document.</p></body></html>";
str3 = "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the third document.</p></body></html>";
Text = [str1;str2;str3];
DocumentName = ["Document 1";"Document 2";"Document 3"];
t = table(Text,DocumentName)

t=3×2 table
                                                                        Text                                                                        DocumentName
    ____________________________________________________________________________________________________________________________________________    ____________

    "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the first document.</p></body></html>"     "Document 1"
    "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the second document.</p></body></html>"    "Document 2"
    "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the third document.</p></body></html>"     "Document 3"

Split the table of documents into HTML sections using the splitHTMLSections function.

chunkTable = splitHTMLSections(t)

chunkTable=6×3 table
                                      Text                                      DocumentName        H1     
    ________________________________________________________________________    ____________    ___________

    "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body>"                        "Document 1"    <missing>  
    "<p>This is the first chapter of the first document.</p></body></html>"     "Document 1"    "Chapter 1"
    "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body>"                        "Document 2"    <missing>  
    "<p>This is the first chapter of the second document.</p></body></html>"    "Document 2"    "Chapter 1"
    "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body>"                        "Document 3"    <missing>  
    "<p>This is the first chapter of the third document.</p></body></html>"     "Document 3"    "Chapter 1"
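
Because each chunk keeps the DocumentName value of its source document, you can summarize the chunks per document with standard table functions. The following minimal sketch uses shortened versions of the documents above (an assumption made for brevity) and counts the chunks of each document with groupcounts.

```matlab
% Shortened versions of the documents used above.
Text = ["<html><head><title>Document 1</title></head><body><h1>Chapter 1</h1><p>First document.</p></body></html>"; ...
        "<html><head><title>Document 2</title></head><body><h1>Chapter 1</h1><p>Second document.</p></body></html>"; ...
        "<html><head><title>Document 3</title></head><body><h1>Chapter 1</h1><p>Third document.</p></body></html>"];
DocumentName = ["Document 1";"Document 2";"Document 3"];
t = table(Text,DocumentName);

% Split all documents; DocumentName is carried over to every chunk.
chunkTable = splitHTMLSections(t);

% Count how many chunks originate from each document.
counts = groupcounts(chunkTable,"DocumentName")
```

The GroupCount variable of the result holds the number of chunks for each document name.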

Input Arguments

`str` — Input document
string array | character vector | cell array of character vectors

Input document, specified as a string array, character vector, or cell array of character vectors.

Data Types: string | char | cell

`t` — Input table
table

Input table of documents. t must have a variable named Text that contains the documents. Specify each document as a string scalar.

Output Arguments

`chunkTable` — Table of text chunks
table

Table of text chunks. chunkTable has these variables:

Text — Text chunk, returned as a string scalar.
H1, H2, ..., H6 — Title of nth-level HTML section that contains the chunk, delineated in the HTML code by <h1 ...> ... </h1>, <h2 ...> ... </h2>, ..., <h6 ...> ... </h6>. If the chunk is not part of an nth-level HTML section, then the corresponding variable Hn contains <missing>. If there are no nth-level HTML sections in the input document, then no variable Hn exists.

If you specify the input documents as a table, then chunkTable also contains all the variables in the table. For each chunk, the values of the variables are the same as for the document from which the chunk originates.
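
You can use the heading variables to select chunks with ordinary table indexing. The sketch below reuses the document from the first example. ismissing identifies chunks that precede the first <h1> heading, and a comparison with == selects the chunks of a given section. Comparisons against <missing> strings evaluate to false, so chunks outside any <h2> section are excluded from the second selection. Because a variable Hn exists only when the document contains nth-level headings, the last line checks for H3 before any code relies on it.

```matlab
% Same document as in the first example.
str = "<html><body><head><title>Title</title></head>" + ...
    "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
    "<h2>Section 1</h2><p>Content of section 1.</p>" + ...
    "<h2>Section 2</h2><p>Content of section 2.</p></body></html>";
chunkTable = splitHTMLSections(str);

% Chunks that are not part of any first-level section.
preamble = chunkTable(ismissing(chunkTable.H1),:)

% Chunks that belong to the second-level section titled "Section 2".
section2 = chunkTable(chunkTable.H2 == "Section 2",:)

% Check whether a heading level exists before using it.
% This document has no <h3> headings, so no H3 variable is expected.
hasH3 = ismember("H3",chunkTable.Properties.VariableNames)
```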

More About

Text Chunks

Many analysis tools, including large language models (LLMs), perform better on small chunks of text than on large documents. Text Analytics Toolbox™ includes a range of functions that allow you to split large documents into semantically meaningful chunks.

Basic Workflow

The splitTextChunks function splits a document recursively into text chunks of a given target length. The function first splits a document into paragraphs. If any of the paragraphs are longer than the target length, then the function splits those paragraphs into sentences, and so on.

chunks = splitTextChunks(str);

Advanced Workflow

Split your document into sections and preserve the section metadata using one of these functions:

`splitHTMLSections`	Split an HTML-formatted document into HTML sections according to the section tags `<h1>...</h1>`, `<h2>...</h2>`, …, `<h6>...</h6>`.
`splitMarkdownSections`	Split a Markdown-formatted document into Markdown sections, for example according to ATX section tags `#`, `##`, …, `######`.
`splitCustomSections`	Split a document into custom sections according to custom section delimiters.

Split your documents or your chunks recursively into paragraphs, sentences, and tokens using the splitTextChunks function.
To avoid redundancy, join similar adjacent chunks using the joinSimilarTextChunks function.
Add overlap between adjacent text chunks using the addTextChunkOverlap function. Adding text chunk overlap avoids changing the meaning of sentences by splitting at inopportune points, for example, splitting the sentence "I would never say I love cats" into "I would never say" and "I love cats." Adding overlap in this example results in the two chunks "I would never say I love" and "never say I love cats." You can also add surrounding text to individual chunks as context by using the findTextChunkContext function.
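
To make the overlap example concrete, the following sketch reproduces the two overlapping chunks by hand with basic string functions. It only illustrates the idea and does not call addTextChunkOverlap. The split position and the two-word overlap are assumptions chosen to match the sentence above.

```matlab
% Split the example sentence into words (7-by-1 string array).
words = split("I would never say I love cats");

splitIdx = 4;   % assumed split position: after the fourth word
overlap  = 2;   % assumed overlap: two words on each side of the split

% Chunks without overlap.
plain1 = join(words(1:splitIdx));       % "I would never say"
plain2 = join(words(splitIdx+1:end));   % "I love cats"

% Chunks with overlap: each chunk borrows words from its neighbor.
chunk1 = join(words(1:splitIdx+overlap))        % "I would never say I love"
chunk2 = join(words(splitIdx+1-overlap:end))    % "never say I love cats"
```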

For an example showing the advanced workflow, see Split Document Into Semantically Meaningful Text Chunks.

Retrieval-Augmented Generation (RAG)

RAG combines the text generation capabilities of large language models (LLMs) with reliable information contained in a set of source documents. First, retrieve documents relevant to the user prompt from the set of source documents. Then, append the relevant documents to the prompt and use the LLM to generate a response.

To improve the quality of the generated output, split large documents into smaller, semantically meaningful chunks.
Use information retrieval to identify the text chunks that are relevant to the query. For more information, see Information Retrieval with Document Embeddings.
Create a prompt based on the most relevant chunks. To provide the LLM with additional context, you can add text from adjacent chunks within the same section by using the findTextChunkContext function, or you can add overlap between text chunks before information retrieval by using the addTextChunkOverlap function. Create a Markdown-formatted string from the text chunks using the formatTextChunks function. For an example, see Create Large Language Model (LLM) Prompt from Text Chunk.
Generate an answer using an LLM. To connect to large language model APIs using MATLAB, use the Large Language Models (LLMs) with MATLAB add-on.
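
As a rough illustration of the prompt-creation step, the sketch below assembles a prompt from retrieved chunks using plain string operations. The chunk text, the question, and the prompt wording are hypothetical. The sketch does not use formatTextChunks and does not call an LLM.

```matlab
% Hypothetical chunks, assumed to be the result of a retrieval step.
relevantChunks = ["Content of section 1."; "Content of section 2."];

% Hypothetical user question.
question = "What does section 2 contain?";

% Join the chunks into one context block, one chunk per line.
context = join(relevantChunks,newline);

% Assemble the prompt; the instruction wording is an assumption.
prompt = "Answer the question using only the context below." + newline + newline + ...
    "Context:" + newline + context + newline + newline + ...
    "Question: " + question
```

The resulting string scalar can then be passed to whichever LLM interface you use.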

Version History

Introduced in R2026a
