How to read files from a particular website?

Question

Pouya 2022-3-3

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/1662330-how-to-read-files-from-a-particular-website

Answered: VINAYAK LUHA 2024-1-4

Hello,

I'm having a problem with MATLAB not recognizing the files at this link ( https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A )

There should be multiple files, each about 300 MB, with names starting with "SW_OPER_MAGA_HR". Instead, MATLAB reads something else: a "1x136910 char".

Please see the code below:

clc
clear
web='https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A';
str=webread(web); 
fn=regexpi(str,'SW[A-Z_0-9]+.zip','match');
for k=1:size(fn,2)
 file=fn{k};
 unzip([web file(8:9)]);
end

Thank you in advance.

1 Comment

Ive J 2022-3-3

Your URL is protected by cookies. I guess your best chance is to try with Python; MATLAB is quite immature for web scraping.

Answer 1

VINAYAK LUHA 2024-1-4

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/1662330-how-to-read-files-from-a-particular-website#answer_1383086

Attachment: htmlFile.txt

Hello Pouya,

I understand that you want to download the files listed in the table on that website using MATLAB, and that your attempt with the "webread" function returned a character array instead.

The webread function did in fact work as expected: the character array is the HTML content of the page.

However, the table on the website is generated dynamically, so the HTML that webread receives might not contain the file links, and webread alone may not be the right tool for this task. Instead, consider opening the page in a browser, saving it as an HTML file once the table has loaded, and then using htmlTree to extract the download links from the saved HTML source.
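One likely reason the table is missing from the webread result is the "#" in your URL. Everything after "#" is a URL fragment, which the client keeps to itself and does not send to the server, so the request only reaches the site root. That the page then uses the fragment in the browser to build the file table is my assumption about this site, not something I have verified. A small sketch that splits your URL to show the two parts:

```matlab
% URL from the question
web = "https://swarm-diss.eo.esa.int/#swarm/Level1b/Entire_mission_data/MAGx_HR/Sat_A";

% Part before the first '#': this is what an HTTP request addresses
requestPart = extractBefore(web, "#");

% Part after the first '#': the fragment, which stays on the client side
fragmentPart = extractAfter(web, "#");

disp("Requested from the server: " + requestPart)
disp("Fragment (not sent):       " + fragmentPart)
```

If this is what happens here, webread would return the same page for any folder path placed after the "#".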

Here is some code, with explanatory comments, showing how to proceed:

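% Read the HTML content from the saved file (adjust the name to match your
% file; the attachment to this answer is named htmlFile.txt)
html = fileread('htmlFile.html');
% Parse the HTML content to create a tree structure
% (htmlTree is part of Text Analytics Toolbox)
tree = htmlTree(html);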
% Locate all 'a' (anchor) elements within the parsed HTML tree
anchorElements = findElement(tree, "A");
% Retrieve the 'href' attributes from the identified anchor elements
hrefAttributes = getAttribute(anchorElements, "href");
% Identify the 'href' attributes that include the download keyword
downloadLinks = hrefAttributes(contains(hrefAttributes, "?do=download"));
% Iterate over the first 10 download links (or fewer if there are not as many)
for i = 1:min(10, numel(downloadLinks))
    % URL-decode each download link to get a human-readable format
    decodedText = urldecode(downloadLinks(i));
    
    % Split the decoded URL by '/' to isolate the file name
    parts = strsplit(decodedText, '/');
    
    % Extract the file name, which is the last segment after splitting
    lastPart = parts(end);
    
    % Formulate the full download URL by adding the base URL to the relative path
    modifiedLink = "https://swarm-diss.eo.esa.int/" + downloadLinks(i);
    
    % Download the file using websave and name it with the extracted file name
    websave(lastPart{1}, modifiedLink);
end
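The file-name step inside the loop can be tried on its own, without downloading anything. This is only a sketch: the href value below is hypothetical and made up for illustration, and the real links in the saved page may be laid out differently. It also checks the result against the "SW_OPER_MAGA_HR" prefix mentioned in the question, which is one way to skip links that are not the files you want.

```matlab
% Hypothetical download link (illustration only, not taken from the site)
href = "?do=download&file=swarm%2FLevel1b%2FSat_A%2FSW_OPER_MAGA_HR_example.ZIP";

% URL-decode the link so that %2F becomes '/'
decoded = string(urldecode(char(href)));

% Split on '/' and keep the last segment as the file name
parts = split(decoded, "/");
fileName = parts(end);

% Check that the name has the prefix expected in the question
isWanted = startsWith(fileName, "SW_OPER_MAGA_HR");

disp(fileName)
disp(isWanted)
```

Since each file is about 300 MB, a check like this before calling websave can save a lot of time if the page contains other download links as well.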

For more details, you can refer to the MATLAB documentation for the functions used above: htmlTree, findElement, getAttribute, urldecode, and websave.

I hope this guidance clarifies how to retrieve files from the desired website. Additionally, I've attached the website's HTML source code as a text file for your reference.

Regards

Vinayak Luha
