splitHTMLSections

Split HTML document into sections

Since R2026a

    Description

    chunkTable = splitHTMLSections(str) splits the input HTML document str into sections at the heading tags <h1>...</h1>, <h2>...</h2>, ..., <h6>...</h6>.

    chunkTable = splitHTMLSections(t) splits a table of HTML documents t into sections.

    Examples

    To split an HTML document into sections, use splitHTMLSections and specify the HTML code as a string.

    str = "<html><body><head><title>Title</title></head>" + ...
        "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
        "<h2>Section 1</h2><p>Content of section 1.</p>" + ...
        "<h2>Section 2</h2><p>Content of section 2.</p></body></html>";
    chunkTable = splitHTMLSections(str)
    chunkTable=4×3 table
                             Text                              H1             H2     
        _______________________________________________    ___________    ___________
    
        "<html><body><head><title>Title</title></head>"    <missing>      <missing>  
        "<p>Introductory paragraph of chapter 1.</p>"      "Chapter 1"    <missing>  
        "<p>Content of section 1.</p>"                     "Chapter 1"    "Section 1"
        "<p>Content of section 2.</p></body></html>"       "Chapter 1"    "Section 2"
    
    

    You can then use the extractHTMLText function to extract the text from the HTML code.

    chunkText = extractHTMLText(chunkTable.Text)
    chunkText = 4×1 string
        ""
        "Introductory paragraph of chapter 1."
        "Content of section 1."
        "Content of section 2."
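
To keep the extracted text next to its section titles, one option is to store it as an additional table variable and then drop the chunks that contain no visible text. The following is a minimal sketch that rebuilds the document from the example above; the variable name PlainText is a hypothetical choice for this sketch, not something splitHTMLSections creates.

```matlab
% Same document as in the example above.
str = "<html><body><head><title>Title</title></head>" + ...
    "<h1>Chapter 1</h1><p>Introductory paragraph of chapter 1.</p>" + ...
    "<h2>Section 1</h2><p>Content of section 1.</p>" + ...
    "<h2>Section 2</h2><p>Content of section 2.</p></body></html>";
chunkTable = splitHTMLSections(str);

% Store the plain text alongside the HTML chunk and the section titles.
% PlainText is a hypothetical variable name chosen for this sketch.
chunkTable.PlainText = extractHTMLText(chunkTable.Text);

% Keep only the chunks that contain visible text.
chunkTable = chunkTable(strlength(chunkTable.PlainText) > 0,:);
```

After this step, each remaining row pairs the plain text of a chunk with the H1 and H2 titles of the sections that contain it.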
    
    

    To split multiple HTML documents into chunks in a single chunk table, first create a table of documents, and then call splitHTMLSections with that table as the input. To retain metadata about the source documents, such as their filenames, add the metadata to the table as additional variables.

    Create a table from:

    • A variable Text that contains HTML documents.

    • A variable DocumentName that contains the names of the documents.

    str1 = "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the first document.</p></body></html>";
    str2 = "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the second document.</p></body></html>";
    str3 = "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the third document.</p></body></html>";
    Text = [str1;str2;str3];
    DocumentName = ["Document 1";"Document 2";"Document 3"];
    t = table(Text,DocumentName)
    t=3×2 table
                                                                            Text                                                                        DocumentName
        ____________________________________________________________________________________________________________________________________________    ____________
    
        "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the first document.</p></body></html>"     "Document 1"
        "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the second document.</p></body></html>"    "Document 2"
        "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body><h1>Chapter 1</h1><p>This is the first chapter of the third document.</p></body></html>"     "Document 3"
    
    

    Split the table of documents into HTML sections using the splitHTMLSections function.

    chunkTable = splitHTMLSections(t)
    chunkTable=6×3 table
                                          Text                                      DocumentName        H1     
        ________________________________________________________________________    ____________    ___________
    
        "<html><HEAD><TITLE>Document 1</TITLE></HEAD><body>"                        "Document 1"    <missing>  
        "<p>This is the first chapter of the first document.</p></body></html>"     "Document 1"    "Chapter 1"
        "<html><HEAD><TITLE>Document 2</TITLE></HEAD><body>"                        "Document 2"    <missing>  
        "<p>This is the first chapter of the second document.</p></body></html>"    "Document 2"    "Chapter 1"
        "<html><HEAD><TITLE>Document 3</TITLE></HEAD><body>"                        "Document 3"    <missing>  
        "<p>This is the first chapter of the third document.</p></body></html>"     "Document 3"    "Chapter 1"
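
Because each chunk carries the metadata of its source document, you can select chunks with ordinary logical indexing on the table variables. The following is a minimal sketch with two shortened documents; the names isDoc2, inChapter1, and doc2Chapter1 are hypothetical and used only for illustration.

```matlab
% Two short documents with a metadata variable, in the same layout as above.
Text = [ ...
    "<html><body><h1>Chapter 1</h1><p>First document.</p></body></html>"; ...
    "<html><body><h1>Chapter 1</h1><p>Second document.</p></body></html>"];
DocumentName = ["Document 1";"Document 2"];
t = table(Text,DocumentName);
chunkTable = splitHTMLSections(t);

% Select the chunks of one document that belong to a given H1 section.
% Comparing <missing> with a string returns false, so chunks that come
% before the first heading are excluded automatically.
isDoc2 = chunkTable.DocumentName == "Document 2";
inChapter1 = chunkTable.H1 == "Chapter 1";
doc2Chapter1 = chunkTable(isDoc2 & inChapter1,:);
```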
    
    

    Input Arguments

    Input document str, specified as a string array, character vector, or cell array of character vectors.

    Data Types: string | char | cell

    Input table of documents t. The table must have a variable named Text that contains the documents. Each document must be specified as a string scalar.

    Output Arguments

    Table of text chunks. chunkTable has these variables:

    • Text — Text chunk, returned as a string scalar.

    • H1, H2, ..., H6 — Title of the nth-level HTML section that contains the chunk, delineated in the HTML code by <h1 ...> ... </h1>, <h2 ...> ... </h2>, ..., <h6 ...> ... </h6>. If the chunk is not part of an nth-level HTML section, then the corresponding variable Hn contains <missing>. If there are no nth-level HTML sections in the input document, then no variable Hn exists.

    If you specify the input documents as a table, then chunkTable also contains all the variables in the table. For each chunk, the values of the variables are the same as for the document from which the chunk originates.
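
Because a variable Hn exists only when the input contains nth-level sections, code that processes arbitrary documents may need to check for the variable before using it. The following is a minimal sketch of one way to do that; the names hasH2 and inSection are hypothetical and used only for illustration.

```matlab
% A document that contains an <h1> section but no <h2> sections.
str = "<html><body><h1>Chapter 1</h1><p>Text.</p></body></html>";
chunkTable = splitHTMLSections(str);

% Check whether the H2 variable exists before indexing into it.
hasH2 = ismember("H2", chunkTable.Properties.VariableNames);

if hasH2
    % Chunks outside any second-level section contain <missing> in H2.
    inSection = ~ismissing(chunkTable.H2);
else
    % No second-level sections in the input, so no chunk belongs to one.
    inSection = false(height(chunkTable),1);
end
```

The same pattern applies to any of the variables H1 through H6.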

    Version History

    Introduced in R2026a