readtable(html file) producing extra empty columns

Question

Simon 2023-9-10

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2019136-readtable-html-file-producing-extra-empty-columns

Commented: Simon 2023-9-17

Original question: In another thread, a similar question was asked about readtable on a CSV file. The answer there was to set {'delimiter', ','}. Because htmlImportOptions does not have a 'delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't be combined with an htmlImportOptions object, which I use to set DataRows.

Update: the name-value pair syntax does have a 'DataRows' option.

Update: with opt.ExtraColumnsRule = 'ignore', readtable reads only the first column.

% either 
opt = htmlImportOptions;
opt.DataRows = 4; 
% opt.EmptyColumnRule = 'skip' % error, html opt doesn't have this property.
% update
opt.ExtraColumnsRule = 'ignore'; 
readtable(htmlfile, opt) % read in only the first column. The other non-extra columns are ignored.
% or
% original post: readtable(htmlfile, 'EmptyColumnRule', 'skip') % {'DataRows', 4} is an error
% update. this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4) 
% but not both
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip') % error

I suppose I can read in the ExtraVar columns first and then delete the empty columns, but I would rather have readtable() handle it.

Thanks for any solutions!

6 Comments

Simon 2023-9-11

@dpb I will see if I can create some sample data.

dpb 2023-9-11

It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, in some rare instance it might be possible for industrial sabotage to occur with only a handful of numbers, or it may be company policy regardless of whether there's any real danger. Or it could be a case such as in my former employment, where the data is part of a classified document which, by those rules, makes anything in the document classified whether the specific pieces of data are sensitive or not. Then nothing can be released (despite our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier, who likely won't declassify it for you just on general principle.

IOW, I'm just suggesting you really consider the actual content and whether there's really a need not to just use the data as is... Of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size, and write back out...


Answer 1

dpb 2023-9-10

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2019136-readtable-html-file-producing-extra-empty-columns#answer_1306551


Use 'SelectedVariableNames' with the variable(s) desired

I can't tell specifically what you want; there's a comment about reading only the first column? If that is so, then

opt = htmlImportOptions;
opt.DataRows = 4; 
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariableNames = opt.VariableNames(1);  % read only the first column
tData=readtable(htmlfile, opt);

5 Comments

Simon 2023-9-11

Edited: Simon 2023-9-11

@dpb "Use 'SelectedVariableNames' with the variable(s) desired"

I think your answer is in the right direction. I suspect the header rows, which I select to be used as variable names, cause the problem. The headers of the first two columns are empty, and the other headers are, say, X, Y, Z. So the extra variables are created by MATLAB as Z_1, Z_2, etc.

But the problem is that I have many tables, whose headers may be more or less the same, but not always the same. For example, one table may have headers (empty), (empty), X, Y, Z, and another table (empty), (empty), X, Y, W, Z. MATLAB treats the first empty header as Var1, and the rest as 'extra' columns.

I will give 'SelectedVariableNames' a shot. Thank you.

@dpb I have just now tried it. It's not working. MATLAB insists on treating the first column, whose header is empty, as the only variable. Even SelectedVariableNames = 1:2 produces an out-of-range error. It's critical for me to have the headers of the other columns as variables, but I don't know beforehand what they might be. I have to unaccept your answer. Sorry!

dpb 2023-9-11

Edited: dpb 2023-9-11

You can't expect a single solution to be able to handle different file formats.

If you can't standardize the file format, then you'll have to have a catalog of what are possible formats you expect to be able to handle and then build a tool which can discern which is which from some characteristic within the file or have a way to know heuristically which is which a priori.

But, if the header line or data content is such that the import options detection heuristics don't return the expected number of variables/columns, then you'll have to use some other technique to identify what are columns/variables. But again, if we can't see examples to work with, we're also shooting blindly at an ill-defined problem.

The mind-(file)reading toolbox has yet to be released...

ADDENDUM:

If, as the original question seems to suggest, the "dead ahead" readtable call returns the expected variables plus some that may occasionally be extra/empty, the most expedient solution may well be to simply read the file and then delete the extraneous columns, instead of going through a lot of gyrations parsing the changeable input file to handle the various cases explicitly. Again, that's if you can't standardize the input form and can't know the specifics of the particular file a priori from some other source.

In a particular case with some more-or-less free format Excel files, I simply read the variable names row first and parse it for the specific variables of interest. In that case, the variables will always exist; just sometimes there are some others that user may have thrown in ad hoc and sometimes there are some empty columns to parse out, but detectImportOptions will return a correct number of variables. If, in the case of an .html file that can't be assured, then it's going to be a lot more trouble, but I don't have to mess with such and without specific example(s) from which to work, as noted before, we're pretty-much hamstrung.
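For illustration only, the idea of reading the variable names first and then selecting the ones of interest could be sketched as follows. This is not the code referred to above: the temporary file, its contents, and the names X, Extra, and Z are all made up, a small delimited text file stands in for the Excel files, and it assumes detectImportOptions detects the header row correctly.

```matlab
% Write a small stand-in file so the sketch is self-contained
% (hypothetical contents; "Extra" plays the ad hoc column).
fname = [tempname, '.csv'];
writetable(table([1; 2], [3; 4], [5; 6], ...
    'VariableNames', {'X', 'Extra', 'Z'}), fname);

% Let MATLAB detect the header row and variable names first.
opt = detectImportOptions(fname);

% Keep only the variables of interest, in file order.
wanted = {'X', 'Z'};                                     % hypothetical names
keep = intersect(opt.VariableNames, wanted, 'stable');
opt.SelectedVariableNames = keep;

% Import just those columns, then clean up the temporary file.
T = readtable(fname, opt);
delete(fname);
```

Using intersect rather than assigning wanted directly avoids an error when one of the names happens to be absent from a given file.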

dpb 2023-9-12

As above I've never had to really mess with parsing HTML much, but it's not set up as a format for scanning by tools such as readtable so it's not at all surprising to me to find you're having difficulties.

While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code just as idea generator.

If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.

Simon 2023-9-17

@dpb Thanks for offering to help. I couldn't find a similar sample file to upload here. HTML files have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedback.
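The exact code wasn't posted, but a combination of all and ismissing along those lines might look roughly like this minimal sketch. The table T is built in memory as a stand-in for the output of readtable, and its variable names and values are invented for illustration.

```matlab
% Stand-in for T = readtable(htmlfile, ...); names and values are made up.
T = table([1; 2; 3], [NaN; NaN; NaN], {'a'; 'b'; 'c'}, {''; ''; ''}, ...
    'VariableNames', {'X', 'Z_1', 'Y', 'Z_2'});

% ismissing(T) returns a logical array the same size as T;
% all(..., 1) flags the columns in which every entry is missing.
emptyCols = all(ismissing(T), 1);

% Delete the flagged columns; X and Y remain.
T(:, emptyCols) = [];
```

This relies on the standard missing indicators (NaN for numeric columns, '' for cell arrays of character vectors). If text is imported as string arrays, empty strings "" are not counted as missing by default and would need to be passed to ismissing as an explicit indicator.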
