How do I extract the contents of an HTML table on a web page into a MATLAB table?

I'd like to plot and analyze the TSA traveler data from this website: https://www.tsa.gov/coronavirus/passenger-throughput
The data is embedded on the page as an HTML table element.
How do I extract the table content into a MATLAB table?

Accepted Answer

Pat Canny 2020-6-23
The <table> content is all stored in a set of <td> tags, so you can extract it as a string array and go from there.
First, use findElement and extractHTMLText on an htmlTree object (these functions require Text Analytics Toolbox).
Then use reshape to arrange the data and array2table to convert it to a table.
Here is one approach:
travel_data = webread('https://www.tsa.gov/coronavirus/passenger-throughput');
travel_data_tree = htmlTree(travel_data);
selector = "td";
subtrees = findElement(travel_data_tree,selector);
str = extractHTMLText(subtrees);
table_data = str(4:end); % first three elements are just the column names
reshape_ncols = 3;
reshape_nrows = length(table_data)/reshape_ncols;
table_data_reshaped = reshape(table_data,reshape_ncols,reshape_nrows)';
% Convert to table
traveler_data_table = array2table(table_data_reshaped,'VariableNames',["Date" "Travelers_Today" "Travelers_Last_Year"]); % I got lazy with VariableNames, I know.
% Convert data types from strings to appropriate types
traveler_data_table.Date = datetime(traveler_data_table.Date);
traveler_data_table.Travelers_Today = str2double(traveler_data_table.Travelers_Today);
traveler_data_table.Travelers_Last_Year = str2double(traveler_data_table.Travelers_Last_Year);
traveler_data_table.Traveler_Ratio = traveler_data_table.Travelers_Today ./ traveler_data_table.Travelers_Last_Year;
% Plot the results
figure
plot(traveler_data_table.Date,traveler_data_table.Traveler_Ratio)
title("TSA Traveler Ratio by Date (2020 vs. 2019)")
grid on
% Some more fun analysis
% When did it bottom out?
[min_ratio,idx] = min(traveler_data_table.Traveler_Ratio);
min_ratio_pct = 100*min_ratio;
min_date = traveler_data_table.Date(idx);
disp("The minimum traveler ratio of " + min_ratio_pct + "% occurred on " + string(min_date))
% Row 1 is used as the latest entry, which assumes the page lists the newest date first
latest_pct = 100*traveler_data_table.Traveler_Ratio(1);
disp("The current ratio is " + latest_pct + "%")
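The reshape step is the least obvious part of the approach above, so here is a minimal offline sketch of it. It assumes a flat string array like the one extractHTMLText returns for a three-column table; the values and the variable names flat, cells, and T are made up for illustration and are not TSA data.

```matlab
% Hypothetical flat list of cell texts, in page order (row by row).
% All values are invented for illustration.
flat = ["1/2/2020"; "1,000"; "4,000"; ...
        "1/1/2020"; "500";   "2,000"];
ncols = 3;

% reshape fills column by column, so build an ncols-by-nrows array first
% and transpose it to get one row per HTML table row.
cells = reshape(flat, ncols, []).';

% Convert to a table, then convert each column from string to a proper type.
T = array2table(cells, 'VariableNames', ["Date" "Today" "LastYear"]);
T.Date = datetime(T.Date, 'InputFormat', 'M/d/yyyy');
T.Today = str2double(T.Today);        % str2double accepts comma thousands separators
T.LastYear = str2double(T.LastYear);
T.Ratio = T.Today ./ T.LastYear;
```

Passing [] as the last argument to reshape lets MATLAB work out the number of rows, and it errors if the element count is not a multiple of ncols, which is a useful check that the page layout still matches what the code expects.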

More Answers (1)

Christopher Creutzig
Starting in R2021b, you can directly use readtable for HTML tables:
readtable("https://www.tsa.gov/coronavirus/passenger-throughput",...
FileType="html",ReadVariableNames=true,ThousandsSeparator=",")
ans = 364×5 table
       Date          2022          2021          2020          2019   
    __________    __________    __________    __________    __________
    06/05/2022    2.3872e+06    1.9847e+06    4.4126e+05    2.6699e+06
    06/04/2022    1.9814e+06    1.6812e+06    3.5302e+05     2.226e+06
    06/03/2022    2.3326e+06    1.8799e+06    4.1968e+05    2.6498e+06
    06/02/2022    2.2132e+06    1.8159e+06    3.9188e+05    2.6239e+06
    06/01/2022    1.9991e+06    1.5879e+06    3.0444e+05    2.3702e+06
    05/31/2022    2.1081e+06    1.6828e+06    2.6774e+05    2.2474e+06
    05/30/2022    2.3122e+06    1.9002e+06    3.5326e+05     2.499e+06
    05/29/2022    2.0965e+06    1.6505e+06    3.5295e+05    2.5556e+06
    05/28/2022    1.9942e+06    1.6058e+06    2.6887e+05    2.1172e+06
    05/27/2022    2.3847e+06    1.9596e+06    3.2713e+05    2.5706e+06
    05/26/2022    2.3799e+06    1.8545e+06    3.2178e+05    2.4858e+06
    05/25/2022    2.1477e+06    1.6182e+06    2.6117e+05     2.269e+06
    05/24/2022    2.0207e+06    1.4708e+06    2.6484e+05    2.4536e+06
    05/23/2022     2.329e+06    1.7474e+06    3.4077e+05    2.5122e+06
    05/22/2022    2.3509e+06    1.8637e+06    2.6745e+05    2.0707e+06
    05/21/2022    1.9888e+06      1.55e+06    2.5319e+05    2.1248e+06
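To try the readtable approach without depending on the live page, a small local HTML file can stand in for it. This is only a sketch under assumptions: the file name demo_table.html, the column headers, and the cell values are all hypothetical, and it relies on readtable accepting a local file with FileType="html" in the same way it accepts the URL above (R2021b or later).

```matlab
% Hypothetical two-row HTML table; headers and values are invented.
html = "<table>" + ...
       "<tr><th>Date</th><th>Count</th></tr>" + ...
       "<tr><td>01/02/2022</td><td>1,000</td></tr>" + ...
       "<tr><td>01/01/2022</td><td>2,500</td></tr>" + ...
       "</table>";

% Write the markup to a temporary file.
fname = fullfile(tempdir, "demo_table.html");
fid = fopen(fname, "w");
fprintf(fid, "%s", html);
fclose(fid);

% Read the first <table> in the file, using the same options as the URL call.
T = readtable(fname, FileType="html", ReadVariableNames=true, ThousandsSeparator=",");
disp(T)
```

Checking the result on a small file like this makes it easier to see how the header row and the thousands separators are interpreted before pointing the same call at a real page.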

Release

R2020a
