
splitCustomSections

Split document into custom sections

Since R2026a

    Description

    chunkTable = splitCustomSections(str,levelRegexp) splits an input document str into sections at substrings matching the regular expression specified in levelRegexp.


    chunkTable = splitCustomSections(t,levelRegexp) splits a table of documents t into sections.

    Examples


    Specify an input document as a string.

    str = "Section 1. Content of section 1. " + ...
        "Subsection 1. Content of subsection 1. " + ...
        "Subsection 2. Content of subsection 2.";

    The top-level sections in str are delineated by the word "Section", followed by a space, a number, and a full stop. Specify this pattern using a regular expression. For more information on regular expressions, see regexp.

    sectionDivider = "Section \d+\.";

    The second-level sections in str are delineated by the word "Subsection", followed by a space, a number, and a full stop. Specify this pattern using a regular expression.

    subsectionDivider = "Subsection \d+\.";

    Split str at the substrings defined by sectionDivider and subsectionDivider by using the splitCustomSections function.

    chunkTable = splitCustomSections(str,[sectionDivider,subsectionDivider])
    chunkTable=3×1 table
                   Text           
        __________________________
    
        "Content of section 1."   
        "Content of subsection 1."
        "Content of subsection 2."
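
Before splitting, it can help to check which substrings each divider matches. The following is a minimal sketch that uses the regexp function with the "match" option on the same str and dividers as above. The variable names sectionMatches and subsectionMatches are illustrative and not part of splitCustomSections.

```matlab
% Same document and dividers as in the example above.
str = "Section 1. Content of section 1. " + ...
    "Subsection 1. Content of subsection 1. " + ...
    "Subsection 2. Content of subsection 2.";
sectionDivider = "Section \d+\.";
subsectionDivider = "Subsection \d+\.";

% regexp is case sensitive, so "Section" does not match inside
% "Subsection" or "section".
sectionMatches = regexp(str,sectionDivider,"match")
subsectionMatches = regexp(str,subsectionDivider,"match")
```

Because the match is case sensitive, the top-level divider matches only "Section 1." and the second-level divider matches the two "Subsection" markers.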
    
    

    To capture the section hierarchy and section titles in a chunk table, use named capture groups when you specify the regular expressions that define the section delineation syntax.

    First, specify an input document as a string.

    str = "Section 1. Title of section 1." + newline + ...
        "Content of section 1." + ...
        "Subsection 1. Title of subsection 1." + newline + ...
        "Content of subsection 1." + newline + ...
        "Subsection 2. Title of subsection 2." + newline + ...
        "Content of subsection 2.";

    The top-level sections in str are delineated by these elements in order:

    1. The word "Section"

    2. A space

    3. A number

    4. A full stop

    5. A space

    6. The title

    7. A newline character

    Specify this pattern using a regular expression.

    To save the section title in a variable called Section, use the named capture group "(?<Section>.*?)" in place of the title in the regular expression, followed by the newline character. This expression assumes that the title itself does not contain any newline characters, so that the first newline character after "Section X. " marks the end of the title.

    sectionDivider = "Section \d+\. (?<Section>.*?)" + newline;

    Similarly, save the second-level section titles in a variable called Subsection.

    subsectionDivider = "Subsection \d+\. (?<Subsection>.*?)" + newline;

    Split str at the substrings defined by sectionDivider and subsectionDivider by using the splitCustomSections function.

    chunkTable = splitCustomSections(str,[sectionDivider,subsectionDivider])
    chunkTable=3×3 table
                   Text                      Section                  Subsection       
        __________________________    _____________________    ________________________
    
        "Content of section 1."       "Title of section 1."    <missing>               
        "Content of subsection 1."    "Title of section 1."    "Title of subsection 1."
        "Content of subsection 2."    "Title of section 1."    "Title of subsection 2."
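
To see what the named capture groups extract on their own, independent of splitCustomSections, you can pass the same dividers to the regexp function with the "names" option. This is a sketch for inspection only. The variable names sectionNames, subsectionNames, sectionTitles, and subsectionTitles are illustrative.

```matlab
% Same document and dividers as in the example above.
str = "Section 1. Title of section 1." + newline + ...
    "Content of section 1." + ...
    "Subsection 1. Title of subsection 1." + newline + ...
    "Content of subsection 1." + newline + ...
    "Subsection 2. Title of subsection 2." + newline + ...
    "Content of subsection 2.";
sectionDivider = "Section \d+\. (?<Section>.*?)" + newline;
subsectionDivider = "Subsection \d+\. (?<Subsection>.*?)" + newline;

% The "names" option returns a structure array with one field per
% named capture group and one element per match.
sectionNames = regexp(str,sectionDivider,"names");
subsectionNames = regexp(str,subsectionDivider,"names");

% Collect the captured titles into string arrays.
sectionTitles = string({sectionNames.Section})
subsectionTitles = string({subsectionNames.Subsection})
```

The lazy quantifier .*? stops at the first newline character, so each captured title should correspond to a value in the Section or Subsection variable of the chunk table.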
    
    

    Input Arguments


    Input document str, specified as a string array, character vector, or cell array of character vectors.

    Data Types: string | char | cell

    Input table of documents t. The table must have a variable named Text that contains the documents. Each document must be specified as a string scalar.

    Regular expression levelRegexp describing the section break syntax, specified as a string array, character vector, or cell array of character vectors.

    If you specify a string array or a cell array of character vectors, then the first element corresponds to the top-level section, the second element corresponds to the next-lower level section, and so on.

    For more information on regular expressions, see regexp.

    To capture the section hierarchy and titles as separate variables in the text chunk table, use named capture groups. For example, to create a variable named H1, when specifying the pattern for the section delimiters, replace the title pattern with the string (?<H1>.*?).

    Example: "Section \d+\. (?<H1>.*?)" + newline

    Data Types: string | char | cell

    Output Arguments


    Table of text chunks chunkTable. The table contains a variable named Text, which holds each text chunk as a string scalar.

    If you specify the input documents as a table, then chunkTable also contains all variables in the table. For each chunk, the values of the variables are the same as for the document from which the chunk originates.
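
As a sketch of the table syntax, the following code builds a small table of documents and splits it. The document text and the extra variable Source are hypothetical and chosen only for illustration. The only requirement stated above is that the table has a variable named Text.

```matlab
% Hypothetical two-document table. Text holds the documents;
% Source is an illustrative extra variable.
Text = ["Section 1. First document."; ...
    "Section 1. Second document. Section 2. More text."];
Source = ["a.txt"; "b.txt"];
t = table(Text,Source);

% Split every document in t at the top-level section markers.
chunkTable = splitCustomSections(t,"Section \d+\.")
```

According to the description above, chunkTable carries the Source variable along, so each chunk can be traced back to the document it came from.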

    If you specify a named capture group in the levelRegexp input argument, then the text chunk table contains variables corresponding to the named capture groups. For example, if levelRegexp includes the string "# (?<H1>.*?)\n", then the table also has the variable H1, returned as a string scalar containing the text between "# " and "\n".

    For each section defined by a named capture group, if a chunk is not part of that section, then the row for that chunk contains <missing> in the section column. If the input document does not contain any sections described by a named capture group, then the corresponding variable does not exist in the output text chunk table.


    Version History

    Introduced in R2026a