Pull webpage from MATLAB site using MATLAB (but with login)

Hello there,
I have recently been working on code that pulls information from a webpage and stores it in a file.
webread() isn't very hard to use.
However, I have gotten to the point where I want to pull pages that can only be seen when logged in.
I am using a MATLAB webpage (only visible when logged in) to work on my solution, but I can't quite figure it out.
For example:
pageLink = 'https://www.mathworks.com/matlabcentral/cody/groups/345/problems/15-find-the-longest-sequence-of-1-s-in-a-binary-sequence/solutions/new';
options = weboptions;
options.Username = 'myEmail@email.com';
options.Password = 'myPassw0rd';
pageRead = webread(pageLink, options);
(obviously with real information)
This does not work; it always returns the 'You must log in' page.
I have also tried sending my credentials with webwrite, as well as renaming the parameters to match what the login page calls them, such as:
userPage = 'https://www.mathworks.com/login?uri=https%3A%2F%2Fwww.mathworks.com%2Fproducts%2Fmatlab.html';
userId = 'myEmail@email.com';
password = 'myPassw0rd';
webwrite(userPage, 'userId', userId, 'password', password)
and every combination of webwrite, webread, weboptions, and named parameters I could think of,
but none of them returns the page as if I were logged in.
Could someone point me in the right direction? Is it just the MATLAB site, so that I should have tried a different website, or can this be done at all?
Thanks,
H
  1 comment
Highphi 2020-7-22
Update: I tried calling wget through system():
system(['wget --auth-no-challenge --user=', userId, ' --password=', password, ' ', pageLink])
which started to feel like a step in the right direction, but I get:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\Gow/etc/wgetrc
--2020-07-22 13:00:05-- https://www.mathworks.com/matlabcentral/cody/groups/345/problems/15-find-the-longest-sequence-of-1-s-in-a-binary-sequence/solutions/new
Resolving www.mathworks.com... 00.00.00.000
Connecting to www.mathworks.com|00.00.00.00|:443... connected.
ERROR: cannot verify www.mathworks.com's certificate, issued by `/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA':
Unable to locally verify the issuer's authority.
To connect to www.mathworks.com insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.
where 00.00.00.000 stands in for what is (potentially) an IP address; I censored it since I'm not sure what its significance is.


Accepted Answer

Highphi 2020-7-22
Figured it out...
By myself ...............
No worries. Here's how I did it for future reference:
1. Fix your default web browser preferences
Option 1: MANUALLY
A. Under the 'Home' tab, click 'Preferences'
Option 2: From the COMMAND WINDOW
A. CODE:
preferences Web
B. In the 'Preferences' window, go to the 'Web' subsection and make sure the box next to "Use system browser when opening links to external sites (recommended)." is unchecked, so that pages open in MATLAB's own browser and web() can return a handle. Then click Apply.
(Please forgive my handwriting, as I wrote it in Snipping Tool with my mouse lol)
2. THE REST IS HISTORY
A. Use the following code to open your window:
[a,h] = web(pageLink);
This will pop up a browser window showing the link you gave it.
B. IF prompted to log in to the desired page, do so, and tick 'Remember Me' if it is an option.
Otherwise, do this step at the beginning of every script and leave one browser window open. I will explain in a second.
C. Use the following code to pull your HTML and then close the browser:
[a, h2] = web(pageLink);
pageHTML = get(h2, 'HtmlText');
close(h2);
Notice I used a separate handle, 'h2', in the second part. This is so that 'h', the window you logged in with, can stay open if you need it to. Closing h2 will ONLY close h2, allowing you to remain logged in.
D. Rinse and repeat.
  3 comments
Highphi 2021-1-5
You will have to parse it.
I use this set of functions sooo much now, so here's an updated solution with some hints & tips:
1) You don't need to close(h2); reopening it takes significantly longer if you're doing multiple pages. One thing you can do is throw a while loop in there that waits until the page is loaded and then breaks, e.g.:
[~, h2] = web(pageLink);
pause(3)  % give the page a head start
while true
    pageHTML = get(h2, 'HtmlText');
    % the page counts as loaded once 'footer' shows up in its HTML
    if ~isempty(strfind(pageHTML, 'footer'))
        break
    end
    pause(1)  % not there yet, wait a second and look again
end
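One caveat with the polling loop above: if the marker never shows up (wrong page, login expired), it spins forever. Here is a minimal sketch of a helper that gives up after a fixed number of tries. The name waitForMarker and its arguments are made up for illustration and are not part of MATLAB; it takes a function handle, so the same loop works with whatever way you use to fetch the HTML.

```matlab
function [html, found] = waitForMarker(getHtml, marker, maxTries, delaySec)
% waitForMarker  Poll getHtml() until its result contains marker.
%   getHtml   function handle that returns the current page HTML as text
%   marker    text that only appears once the page has loaded
%   maxTries  maximum number of polls before giving up
%   delaySec  pause between polls, in seconds
% Returns the last HTML seen and whether the marker was found.
    html = '';
    found = false;
    for k = 1:maxTries
        html = getHtml();
        if ~isempty(strfind(html, marker))
            found = true;   % marker is present, the page is ready
            return
        end
        pause(delaySec);    % not ready yet, wait before the next poll
    end
end
```

With the browser handle from above, a call could look like [pageHTML, found] = waitForMarker(@() get(h2, 'HtmlText'), 'footer', 30, 1); and you can check found before trying to parse anything.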
2) To parse the page, open it in a browser (such as Chrome) and hit F12 to open the developer tools. If, say, you want the text within a certain area, find the specific HTML that surrounds it.
Then cut the page down to that span:
f1 = strfind(pageHTML, '<div class="comment "');
pageHTML = pageHTML(f1(1):end);
f2 = strfind(pageHTML, '<class="add-comment');
pageHTML = pageHTML(1:f2(1)-1);
% this will give you the code within the desired div, apply this however you need
Hopefully that helps
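The strfind slicing above errors out with an index-out-of-bounds message when a marker is missing, since f1(1) does not exist. A small sketch of the same idea with that case handled, assuming you want an empty result when the start marker is absent and everything up to the end of the page when the end marker is absent. sliceBetween is a hypothetical helper name, not a built-in.

```matlab
function snippet = sliceBetween(html, startMarker, endMarker)
% sliceBetween  Text from the first startMarker up to, but not
% including, the first endMarker that follows it.
%   Returns '' if startMarker is not found.
%   Returns everything from startMarker onward if endMarker is not found.
    snippet = '';
    iStart = strfind(html, startMarker);
    if isempty(iStart)
        return              % start marker missing, nothing to extract
    end
    rest = html(iStart(1):end);
    iEnd = strfind(rest, endMarker);
    if isempty(iEnd)
        snippet = rest;     % no end marker, keep the remainder
    else
        snippet = rest(1:iEnd(1)-1);
    end
end
```

For the example above, that would be something like commentHTML = sliceBetween(pageHTML, '<div class="comment "', endMarker); with endMarker set to whatever HTML follows the area you are after.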


More Answers (1)

Pascal Geschwill 2021-4-30
Hi,
While this approach seems to work for now, it looks like this is deprecated functionality. At least with R2020a, I am getting a warning:
Warning: [STAT,H] = WEB(___) does not return a handle for pages that open in the system browser. Use STAT = WEB(___) instead.
> In web>displayWarningMessage (line 432)
In web (line 96)
In my case, the solution described in this thread worked just as well. I am pulling build histories from our CI server via its REST API and then parsing them in MATLAB.
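For a REST API like this, webread with weboptions may be all you need, because the Username and Password options are meant for HTTP basic (or digest) authentication rather than for form-based logins such as the one the question ran into. A minimal sketch, assuming the server accepts basic authentication and returns JSON. restOptions, the URL in the usage line, and the 15-second timeout are placeholders for illustration, not details of any particular CI server.

```matlab
function opts = restOptions(user, secret)
% restOptions  weboptions for a JSON REST endpoint using HTTP basic auth.
%   user    account name (assumption: the server uses basic authentication)
%   secret  password or API token for that account
    opts = weboptions( ...
        'Username',    user, ...
        'Password',    secret, ...
        'ContentType', 'json', ...  % ask webread to decode the reply as JSON
        'Timeout',     15);         % seconds to wait before giving up
end
```

A call would then look like builds = webread('https://ci.example.com/api/builds', restOptions('me', 'myToken')); where the URL is hypothetical. With ContentType set to 'json', webread returns the decoded data rather than raw text.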

Release

R2020a
