findElement
Find elements in HTML tree
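Description
subtrees = findElement(tree,selector) finds the elements of the HTML tree tree that match the CSS selector selector and returns them as an htmlTree array.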
Examples
Read HTML code from the URL https://www.mathworks.com/help/textanalytics using the webread function.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
Parse the HTML code using htmlTree.
tree = htmlTree(code);
Find all the hyperlinks in the HTML tree using findElement. The hyperlinks are nodes with element name "A".
selector = "A";
subtrees = findElement(tree,selector);
View the first few subtrees.
subtrees(1:10)
ans = 10×1 htmlTree:
<A class="skip_link sr-only" href="#content_container">Skip to content</A>
<A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
<A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
<A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
<A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
<A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
<A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
<A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
<A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
<A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
Extract the text from the subtrees using extractHTMLText. The result contains the link text from each link on the page. Links that contain only an image, such as the logo links, yield an empty string.
str = extractHTMLText(subtrees);
str(1:10)
ans = 10×1 string
"Skip to content"
""
"Products"
"Solutions"
"Academia"
"Support"
"Community"
"Events"
"Get MATLAB"
""
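As a follow-on sketch, the link targets can be collected alongside the link text. This assumes that getAttribute (mentioned under More About) accepts the htmlTree array returned by findElement. The inline HTML string and the variable names links, hrefs, labels, and linkTable are made up for illustration, so the snippet runs without network access.

```matlab
% Hypothetical inline HTML used in place of a downloaded page.
code = "<html><body>" + ...
    "<a href=""https://www.mathworks.com"">Home</a>" + ...
    "<a href=""https://www.mathworks.com/help/textanalytics"">Docs</a>" + ...
    "</body></html>";

% Parse the HTML and find every hyperlink element.
tree = htmlTree(code);
links = findElement(tree,"a");

% Read the href attribute and the visible text of each link.
hrefs = getAttribute(links,"href");
labels = extractHTMLText(links);

% Pair each link text with its target for inspection.
linkTable = table(labels(:),hrefs(:),'VariableNames',{'Text','Target'});
disp(linkTable)
```

Pairing the two arrays in a table keeps each label next to its target, which can make empty labels (for example, image-only links) easier to spot.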
Input Arguments
tree: HTML tree, specified as a scalar htmlTree object.
selector: CSS selector, specified as a string scalar or a character vector. For more information, see CSS Selectors.
Output Arguments
subtrees: Matching HTML subtrees, returned as an htmlTree array.
More About
A typical HTML element contains the following components:
- Element name – Name of the HTML tag. The element name corresponds to the Name property of the HTML tree.
- Attributes – Additional information about the tag. HTML attributes have the form name="value", where name and value denote the attribute name and value respectively. The attributes appear inside the opening HTML tag. To get the attribute values from an HTML tree, use getAttribute.
- Content – Element content. The content appears between opening and closing HTML tags. The content can be text data or nested HTML elements. To extract the text from an htmlTree object, use extractHTMLText. To get the nested HTML elements of an htmlTree object, use the Children property.
For example, the HTML element <a href="https://www.mathworks.com">Home</a> comprises the following components:
Component | Value | Description |
---|---|---|
Element name | a | Element is a hyperlink |
Attribute name | href | Hyperlink reference |
Attribute value | "https://www.mathworks.com" | Hyperlink reference value |
Content | Home | Text to display |
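The components above can be read back from a parsed element. The following is a minimal sketch, assuming the element is wrapped in a small, hypothetical HTML document; the variable names link, elementName, hrefValue, and content are illustrative only. The element name is compared without regard to case, because the example output earlier on this page displays element names in uppercase.

```matlab
% Hypothetical document containing the single hyperlink from the example.
code = "<html><body><a href=""https://www.mathworks.com"">Home</a></body></html>";
tree = htmlTree(code);

% Locate the hyperlink element and keep the first match.
link = findElement(tree,"a");
link = link(1);

% Element name, from the Name property.
elementName = link.Name;

% Attribute value, from getAttribute.
hrefValue = getAttribute(link,"href");

% Content, as text, from extractHTMLText.
content = extractHTMLText(link);

% Compare the name case-insensitively rather than assuming a letter case.
isHyperlink = strcmpi(elementName,"a");
```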
CSS Selectors
CSS selectors specify patterns to match elements in a tree.
This table shows examples of how to extract different HTML elements from an HTML tree:
Task | CSS Selector | Example |
---|---|---|
Find all paragraph (<p>) elements. | "p" | findElement(tree,"p") |
Find all paragraph (<p>) and list item (<li>) elements. | "p,li" | findElement(tree,"p,li") |
Find all paragraph (<p>) elements that are inside table (<table>) elements. | "table p" | findElement(tree,"table p") |
Find all hyperlink (<a>) elements with hyperlink reference attribute (href) values ending with ".pdf". | "a[href$="".pdf""]" | findElement(tree,"a[href$="".pdf""]") |
Find all paragraph (<p>) elements that are the first child of their parent. | "p:first-child" | findElement(tree,"p:first-child") |
Find all paragraph (<p>) elements that are the first paragraph element of their parent. | "p:first-of-type" | findElement(tree,"p:first-of-type") |
Find all emphasis (<em>) elements where the parent is a paragraph (<p>) element. | "p > em" | findElement(tree,"p > em") |
Find all paragraph (<p>) elements appearing immediately after a heading 1 (<h1>) element. | "h1 + p" | findElement(tree,"h1 + p") |
Find all empty elements. | ":empty" | findElement(tree,":empty") |
Find all nonempty label (<label>) elements. | "label:not(:empty)" | findElement(tree,"label:not(:empty)") |
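To make a few of these selectors concrete, here is a short sketch that applies them to a small, hypothetical HTML document. The document content and the variable names are assumptions chosen for illustration. Note that a double quote inside a MATLAB string is written as two double quotes.

```matlab
% Hypothetical document with a heading, two paragraphs, a list, and two links.
code = "<html><body>" + ...
    "<h1>Title</h1>" + ...
    "<p>First <em>emphasized</em> paragraph.</p>" + ...
    "<p>Second paragraph.</p>" + ...
    "<ul><li>Item one</li><li>Item two</li></ul>" + ...
    "<a href=""report.pdf"">Report</a>" + ...
    "<a href=""index.html"">Index</a>" + ...
    "</body></html>";
tree = htmlTree(code);

% Type selector: all paragraph elements.
paragraphs = findElement(tree,"p");

% Selector list: paragraph and list item elements.
paragraphsAndItems = findElement(tree,"p,li");

% Adjacent sibling combinator: a paragraph immediately after a heading 1.
afterHeading = findElement(tree,"h1 + p");

% Child combinator: emphasis elements whose parent is a paragraph.
emphasized = findElement(tree,"p > em");

% Attribute selector: hyperlinks whose href ends with ".pdf".
pdfLinks = findElement(tree,"a[href$="".pdf""]");

% Report how many elements each selector matched.
fprintf("p: %d, p,li: %d, h1 + p: %d, p > em: %d, pdf links: %d\n", ...
    numel(paragraphs),numel(paragraphsAndItems),numel(afterHeading), ...
    numel(emphasized),numel(pdfLinks));
```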
The findElement function supports all of CSS level 3, except for the selectors ":lang", ":checked", ":link", ":active", ":hover", ":focus", ":target", ":enabled", and ":disabled".
For more information about CSS selectors, see [1].
References
[1] CSS Selector Reference. https://www.w3schools.com/cssref/css_selectors.php
Version History
Introduced in R2018b