detectImportOptions locates the wrong header line

Question

Dave 2022-7-12

https://ww2.mathworks.cn/matlabcentral/answers/1757655-detectimportoptions-locates-the-wrong-header-line

Edited: dpb 2022-7-12

I am handling fixed column text files with optional comment lines at the beginning. Therefore the line with column headers is not known in advance. When the last datum on the first data line is missing, detectImportOptions identifies the wrong header line:

detectImportOptions ('data4.txt', ...
    'FileType', 'fixedwidth', ...
    'ReadVariableNames', true)

This file works okay:

Name   City      Lat     Lon       Elev
den    Denver    39.74   -104.99   5280
chi    Chicago   41.88   -87.71    597
atl    Atlanta   33.75   -84.36    820

Remove the last item on line 2, and detectImportOptions locates the wrong header line:

Name   City      Lat     Lon       Elev
den    Denver    39.74   -104.99
chi    Chicago   41.88   -87.71    597
atl    Atlanta   33.75   -84.36    820

Selected differences between the results, from diff:

<     VariableNames: {'Name', 'City', 'Lat' ... and 2 more}
>     VariableNames: {'chi', 'Chicago', 'x41_88' ... and 2 more}
<         DataLines: [2 Inf]
>         DataLines: [4 Inf]
< VariableNamesLine: 1
> VariableNamesLine: 3

My local workaround will be to always prevent a missing value at the end of the first data line. However, it would be nice if the heuristics in detectImportOptions would find the correct line number in scenarios like this.

Is this a legal scenario? Is this a bug in Matlab?

1 Comment

dpb 2022-7-12

Badly formed data files are not indicative of bugs -- the problem is in the data file as you've already acknowledged.

I believe your expectations are unrealistic -- there's really no way programmatically to determine that record is malformed and not something else.

You don't show a file with a comment line -- does it consist of a comment character in first column? Using that information MIGHT be able to help.
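If the comment lines do start with a known character, one way to use that information is to find the header line directly and hand it to detectImportOptions instead of relying on detection. The following is a minimal sketch, assuming '#' as the comment character (the thread does not say which one is used) and a release that has readlines (R2020b or later). Whether the remaining column detection then succeeds on the problem file was not verified here.

```matlab
% Sketch: locate the header line ourselves, then tell detectImportOptions
% where it is. ASSUMPTION: comment lines start with '#'.
filename    = 'data4.txt';   % file name taken from the question
commentChar = "#";           % assumed comment marker, not stated in the thread

lines     = readlines(filename);                     % one string per line
trimmed   = strtrim(lines);
isComment = startsWith(trimmed, commentChar);
isBlank   = strlength(trimmed) == 0;

% First line that is neither a comment nor blank is taken to be the header.
headerLine = find(~isComment & ~isBlank, 1, 'first');

opts = detectImportOptions(filename, ...
    'FileType', 'fixedwidth', ...
    'VariableNamesLine', headerLine, ...
    'DataLines', [headerLine + 1, Inf]);

T = readtable(filename, opts);
```

Checking opts.VariableNames and opts.DataLines before calling readtable shows whether the explicit line numbers were honored.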

Probably the most foolproof way would be to create the files as delimited and ensure the correct number of delimiters in every real record when the file is created. You may need a preprocessing step to add that robustness, or maybe the original process can be modified.
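A preprocessing step of that kind might look like the sketch below, which cuts each line at fixed column positions and writes a comma-delimited copy. The column start positions are read off the sample file in the question and are an assumption, as is the output name data4.csv. The sketch also assumes there are no comment lines and no commas inside fields, and it needs writelines (R2022a or later).

```matlab
% Sketch: convert the fixed-width file to a delimited file so that every
% record carries the same number of delimiters.
% ASSUMPTION: column start positions taken from the sample in the question.
starts = [1 8 18 26 36];

lines = readlines('data4.txt');
lines = lines(strlength(strtrim(lines)) > 0);   % drop blank lines

% Pad short records with spaces so every line has the full width.
width = max(strlength(lines));
lines = pad(lines, width);

% Last character position of each column.
ends = [starts(2:end) - 1, width];

% Cut every line into fields; an absent trailing value becomes "".
fields = strings(numel(lines), numel(starts));
for k = 1:numel(starts)
    fields(:, k) = strtrim(extractBetween(lines, starts(k), ends(k)));
end

% One comma-joined record per line.
writelines(join(fields, ",", 2), 'data4.csv');

T = readtable('data4.csv');
```

With the short record padded, the missing elevation turns into an empty last field rather than a shorter line, which is the property the delimited format is meant to guarantee.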


Answer 1

Dave 2022-7-12


https://ww2.mathworks.cn/matlabcentral/answers/1757655-detectimportoptions-locates-the-wrong-header-line#answer_1005465

Edited: Dave 2022-7-12

@dpb, thank you for your insights. I left out comment lines to make the reproducer as small as possible. The header problem occurs the same way, with or without the comment lines.

MATLAB claims some ability to handle messy text files. I agree my scenario is pushing the limit of reasonable expectation. I am asking whether the heuristics inside detectImportOptions are working as intended, or whether they could be made smarter.

1 Comment

dpb 2022-7-12

Edited: dpb 2022-7-12

The heuristics can always be made "smarter" at the cost of performance; there is a conscious decision(*) to also not be excessively time consuming for the normal case.

And, of course, "smarter" comes at another cost of the likelihood of introducing what I'll call Type II error -- wrongfully classifying non-data as data.

In this case, how's it supposed to know the difference between whether there is simply a missing value or the line should be ignored because it doesn't match the bulk of the rest of the file? Only your expectation that you know...

If you stuff that record down somewhere in the middle, then the odds are it'll detect and let you set a missing value or not import the record, your choice.
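The choice between filling a missing value and skipping the record is exposed through the MissingRule property of the import options. A minimal sketch, assuming the header line has been detected correctly (for example because the short record is not the first data line):

```matlab
% Sketch: choose what happens to records with missing fields.
opts = detectImportOptions('data4.txt', 'FileType', 'fixedwidth');

% 'fill' keeps the record and substitutes each variable's fill value.
opts.MissingRule = 'fill';

% Alternative: drop any record that has a missing field.
% opts.MissingRule = 'omitrow';

T = readtable('data4.txt', opts);
```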

You can always use the example and submit it as an enhancement request and see what TMW thinks of the possibilities of catching it...

(*) It's not in the formal documentation, but I recall the comment having been made by a TMW employee in discussion here on some of the changes in the way the file probing has evolved. That discussion was mostly centered on the behind-the-scenes probing done by the readXXX routines, which don't do the whole in-depth analysis that detectImportOptions does, but which is still an overhead in every use of the routines. One deliberately gives up some performance when doing the full probe; that is then expected to take some time, but one still doesn't want it to become an inordinate delay.
