Extract data from HTML file stored in C drive of Laptop

Hello Everyone,
I want to extract data from a local HTML file stored on the C drive of my laptop.
Can anyone guide me on how to extract the data from the HTML file, convert it into a char array, and use it further on?
The file format is HTML and the link is something like file:///C:/Users/Pranav/OneDrive/Desktop/.....................................
Commands that I have already used: 1) str=fileread('xxxxxxxxxxxxxxxxx.html') ---> data=extractHTMLString (str)
but this gives the output data as a 1 x 1000000 array in which each letter is a separate element.
I am looking forward to some quality advice.
Thanks in advance!
  1 Comment
Walter Roberson 2022-9-6
Are you using extractHTMLText?
As an experiment, what happens if you fileread() the file directly and process that?
You have two separate issues:
  1. Making sure that the text can be pulled out of a URL;
  2. Processing the text.
Reading the file without the URL will allow you to test the processing part separately from reading from the URL.
To test reading from the URL, you could fileread() from the URL and fileread() from the local file without the URL, and compare the two.
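
A minimal sketch of the "read the local file without the URL" half of this experiment. The URL below is hypothetical, and the conversion assumes a plain Windows file URL with no percent-encoded characters (such as %20); it is one way to get a path that fileread() accepts, not the only one.

```matlab
% Hypothetical file URL; substitute the real one from your machine.
fileUrl = 'file:///C:/Users/someuser/Desktop/page.html';

% Strip the leading 'file:///' scheme to obtain a plain Windows path.
% Assumption: the URL contains no percent-encoded characters.
localPath = regexprep(fileUrl, '^file:///', '');

if isfile(localPath)
    % fileread returns the whole file as a 1-by-N char vector.
    rawHtml = fileread(localPath);
    fprintf('Read %d characters from %s\n', numel(rawHtml), localPath);
    % Show the first few characters to confirm it is the expected HTML.
    disp(rawHtml(1:min(200, numel(rawHtml))));
else
    warning('File not found: %s', localPath);
end
```

If this part works, any remaining problem is in the text processing rather than in locating the file.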


Answers (1)

Saffan 2023-8-30
Hi,
To accomplish this, you can modify your code to add an additional step of creating an HTML tree using the htmlTree function. This function parses the HTML code in the string and returns the resulting tree structure. You can then extract the text from the tree with extractHTMLText, as shown in the following code snippet:
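% Path to the local HTML file (placeholder name taken from the question)
filePath = 'xxxxxxxxxxxxxxxxx.html';
% Read the HTML file into a char vector
htmlContent = fileread(filePath);
% Create an HTML tree from the content (requires Text Analytics Toolbox)
tree = htmlTree(htmlContent);
% Extract the text from the HTML tree
data = extractHTMLText(tree);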
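
Since the question also asks about getting a char array, here is a rough sketch of one possible follow-up step. It uses a small inline HTML sample (hypothetical content) so that it runs without a file, and the variable names textAsChar and textLines are illustrative only. Note that a 1-by-N result is expected for a char vector: each character is one element.

```matlab
% Hypothetical inline HTML so the sketch runs without a file on disk.
htmlContent = ['<html><body><h1>Title</h1>' ...
               '<p>First paragraph.</p><p>Second paragraph.</p></body></html>'];

% Parse and extract the visible text (requires Text Analytics Toolbox).
tree = htmlTree(htmlContent);
data = extractHTMLText(tree);      % returned as a string

% Convert the string to a 1-by-N char vector if a char array is needed.
textAsChar = char(data);

% Alternatively, split the text into one string per line and drop
% empty lines, which is often easier to work with than single characters.
textLines = splitlines(data);
textLines(strlength(textLines) == 0) = [];

disp(textLines);
```

Working with textLines (a string array) instead of indexing individual characters may be more convenient for later processing, depending on what the data is needed for.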
Refer to the MATLAB documentation for htmlTree and extractHTMLText for more information.
