retrieve data from a website with multiple pages

My answer doesn't fully solve your problem, but it should address your main questions. Before the HTML can be parsed, there is a more basic issue: webread does not return the actual page content, because the website uses an anti-bot challenge (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript). That has to be handled first.

url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% the delimiters below were found by manually inspecting the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table 
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
ans = 8×7 table
    מספר רישיון                שם יצרן                         כתובת                ישוב           מחוז                                                                  פרטים_סוג מזון (מהות היצור):                                                                        פרטים_קבוצת מזון:         
    ___________    ________________________________    _____________________    _____________    _________    __________________________________________________________________________________________________________________________________________________    ___________________________________

       55678       "א. הקר 2009 גלאט למהדרין בע"מ"     "מרכז ספיר 3 ירושלים"    "ירושלים"        "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"                            
       68795       "א. כ. התעשיינים בע"מ"              "שד הסנהדרין 3 יבנה"     "יבנה"           "מרכז"       "בשר ומוצריו, לרבות עופות וצייד"                                                                                                                      "הסעדה (קיטרינג)"                  
       52319       "א.א בורקס ליאון"                   "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"       "אחסנה בקירור"                                                                                                                                        "אחסון מזון בקירור"                
       69047       "א.א בליסימו בע"מ"                  "איתן 3 ראשון לציון"     "ראשון לציון"    "מרכז"       "קרחונים אכילים, כולל שרבט וסורבט"                                                                                                                    "מחסן קרור/מחסן בטמ' מבוקרת"       
       67457       "א.א מטעמים הכי טעים בע"מ"          "מודיעין 8 פתח תקווה"    "פתח תקווה"      "מרכז"       "ייצור בצקים ממולאים, ייצור עוגיות יבשות"                                                                                                             "לחם, לחמניות, עוגות שמרים ומאפים" 
       52312       "א.א. בליסימו בע"מ"                 "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"       "מוצרי מאפה, תערובות להכנתם ובצקים"                                                                                                                   "לחמים ולחמניות מאודים"            
       50780       "א.א. דרך האוכל (חיפה) בע"מ"        "שנקר אריה 47 חיפה"      "חיפה"           "חיפה"       "אחסנת בצקים קפואים"                                                                                                                                  "יצור מוצרי בשר בקר וצאן טחון בלבד"
       52587       "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"      "באר שבע"        "דרום"       "מחסן קרור/מחסן בטמ' מבוקרת"                                                                                                                          "בשר ומוצריו, לרבות עופות וצייד"   
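
The challenge arithmetic in the if-block above can be checked offline on a made-up value. The following is only a sketch that mirrors those steps with numeric digits instead of repeated str2double calls; the challenge value '604213' is hypothetical (it does not come from the website), and names such as digitsSorted and powTerm are introduced here for illustration.

```matlab
% Hypothetical challenge value, for illustration only (not from the site)
challenge = '604213';

digitsSorted = sort(challenge - '0');   % numeric digits in ascending order
lastDigit = challenge(end) - '0';       % last digit of the unsorted value
minDigit  = digitsSorted(1);            % smallest digit

% subvar1 is a numeric sum; subvar2 is the same two terms joined as text
subvar1 = 2*digitsSorted(3) + digitsSorted(2);
subvar2 = sprintf('%d%d', 2*digitsSorted(3), digitsSorted(2));
powTerm = (minDigit + 2)^digitsSorted(2);

answer = (str2double(challenge)*3 + subvar1) * cos(pi*subvar1);
answer = answer - powTerm + (minDigit - lastDigit);

% integer part followed by subvar2, as in the code above
result = [sprintf('%d', floor(answer)), subvar2];
disp(result)   % prints -181264941
```

Working it by hand: the sorted digits are 0 1 2 3 4 6, so subvar1 = 2*2 + 1 = 5, subvar2 = '41', powTerm = 2^1 = 2, and cos(5*pi) = -1, giving (1812639 + 5)*(-1) - 2 + (0 - 3) = -1812649, to which '41' is appended.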

Ive J 2022-2-21

Edited: Ive J 2022-2-21

I'm not sure if I get it right; do you mean you tried something like this?

function unitab = parseFoodAndNutrition(n)
% returns the combined table so that it can be inspected, e.g. size(unitab)
if nargin < 1
    n = 3; % read only 3 pages
end
unitab = cell(n, 1);
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    unitab{i} = readEachPage(i);
end
unitab = vertcat(unitab{:});
unitab = convertvars(unitab, 1:width(unitab), @string);
unitab.(1) = double(unitab.(1));
end % END
%% subfunctions ===========================================================
function tab = readEachPage(n)
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=" + n;
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);
% webread cannot read the contents as the website requests cookies =========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2 * str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    
    % now read the website contents ===========================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end
% the delimiters below were found by manually inspecting the HTML code
data = htmlTree(htmlraw); % creating an HTML tree from raw content
hdr = findElement(data ,"th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);
% reorder as a table 
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
% can be done at once in the end
% tab = convertvars(tab, 1:width(tab), @string);
% tab.(1) = double(tab.(1));
end

When I run the above function I get this:

size(unitab)
ans =
    36     7

sani 2022-2-22

I actually put your entire script in a for loop and changed the URL as i increased. Then, in each iteration, I appended the result of your script to another table using vertcat. If I understand correctly, size(unitab) = (36,7) is for pages 1-3? If so, this is the dimension I'm expecting to receive.

Ive J 2022-2-22

Yes, that's for 3 pages.

Feel free to use the function above! Also be aware that if you send too many requests to a website in a short time, it may (temporarily) block your IP.
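
One way to lower the risk of being blocked is to wait a little between requests. The following is a minimal sketch, not something tested against this website: fetchFcn is a hypothetical stand-in for readEachPage so that the example runs offline, and the 0.2 s delay is an arbitrary assumption rather than a value the site is known to tolerate.

```matlab
% Stand-in for readEachPage so the sketch runs offline (hypothetical);
% in practice use: fetchFcn = @readEachPage;
fetchFcn = @(i) table(i, "page " + i, 'VariableNames', {'page', 'label'});

n = 3;            % number of pages to read
delaySec = 0.2;   % assumed delay between requests, in seconds

pages = cell(n, 1);
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    pages{i} = fetchFcn(i);
    if i < n
        pause(delaySec)   % wait before sending the next request
    end
end
unitab = vertcat(pages{:});   % one row per page in this offline demo
```

A longer delay makes the whole run slower, so it is a trade-off between speed and the chance of being blocked.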

To track possible parsing bugs, you can also save each page's table as a MAT-file. This way, if you expect, say, 120 rows and get only 100, you can inspect each table individually. You can do this by changing the loop in parseFoodAndNutrition to:

for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains table for page 10
    unitab{i} = tab;
end
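
Once the per-page MAT-files exist, the row counts can be compared in a second pass. The sketch below assumes the tab.page.<i>.mat naming from the loop above and 12 rows per page (36 rows over 3 pages, as reported earlier in this thread). The demo tables and the scratch folder are made up so that the example runs without the website.

```matlab
% Create demo files in a scratch folder (made-up data, for illustration)
outDir = tempname;
mkdir(outDir);
n = 3;
demoRows = [12 10 12];   % page 2 is deliberately short
for i = 1:n
    tab = table((1:demoRows(i))', 'VariableNames', {'id'});
    save(fullfile(outDir, "tab.page." + i + ".mat"), "tab")
end

% Load each saved table and record its number of rows
expectedRows = 12;       % assumed number of rows per page
rowCounts = zeros(n, 1);
for i = 1:n
    s = load(fullfile(outDir, "tab.page." + i + ".mat"), "tab");
    rowCounts(i) = height(s.tab);
end

% Report the pages that returned fewer rows than expected
shortPages = find(rowCounts < expectedRows);
for p = shortPages'
    fprintf('page %d has only %d rows\n', p, rowCounts(p));
end
```

With real data, point outDir at the folder where the tables were saved and skip the demo-file part.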
