retrieve data from a website with multiple pages
4 次查看(过去 30 天)
显示 更早的评论
Hi all,
I want to pull the data from this website into a table.
it has 185 pages so I wrote a for loop so it will pass the entire table.
the problam is that I'm using webread, which is seems to read everything into char array.
what I want is that in each itteration of the for loop the data from this table will be read, how can it be done?
thanks
4 个评论
Rik
2022-2-19
You can search the html for the text in the table and guess the structure from what you see.
采纳的回答
Ive J
2022-2-19
编辑:Ive J
2022-2-20
My answer doesn't totally solve your problem, but addresses your main questions (hopefully!). Before parsing the HTML itself, webread doesn't read the content of the URL because the website uses some measures against bot attacks (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript), so that needs to be fixed first.
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
Challenge = extractBetween(top, "Challenge=", ";");
challenge_id = extractBetween(top, "ChallengeId=", ";");
arr = char(Challenge);
last_digit = str2double(arr(end));
arr = sort(arr);
min_digit = str2double(arr(1));
subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
x = double(Challenge) * 3 + subvar1;
y = cos(pi * subvar1);
answer = x * y;
answer = answer - power;
answer = answer + (min_digit - last_digit);
answer = string(floor(answer)) + subvar2;
hdrs = {'X-AA-Challenge' char(Challenge); ...
'X-AA-Challenge-ID' char(challenge_id); ...
'X-AA-Challenge-Result' char(answer)};
% now read the website contents ===========================================
htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% by manually looking at the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
wildcardPattern + "</td><td>", ...
'</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table
% append the header so column 6 can have descirptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
10 个评论
Ive J
2022-2-22
Yes, that's for 3 pages.
Feel free to use the function above! also be aware that sometimes when you send so many requests to a website, they may block your IP (temporarily).
To track possible parsing bugs, you can also save each table as a mat file. In this way, if you expect let's say 120 rows and you get only 100, you can inspect each table individually. You can do this by adding these lines:
for i = 1:n
fprintf('reading page %d of %d\n', i, n)
tab = readEachPage(i);
save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
unitab{i} = tab;
end
更多回答(0 个)
另请参阅
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!