detectImportOptions locates the wrong header line
I am handling fixed-width text files with optional comment lines at the beginning, so the line number of the column headers is not known in advance. When the last datum on the first data line is missing, detectImportOptions identifies the wrong header line:
detectImportOptions ('data4.txt', ...
'FileType', 'fixedwidth', ...
'ReadVariableNames', true)
This file works okay:
Name City Lat Lon Elev
den Denver 39.74 -104.99 5280
chi Chicago 41.88 -87.71 597
atl Atlanta 33.75 -84.36 820
Remove the last item on line 2, and detectImportOptions locates the wrong header line:
Name City Lat Lon Elev
den Denver 39.74 -104.99
chi Chicago 41.88 -87.71 597
atl Atlanta 33.75 -84.36 820
Selected differences between the results, from diff:
< VariableNames: {'Name', 'City', 'Lat' ... and 2 more}
> VariableNames: {'chi', 'Chicago', 'x41_88' ... and 2 more}
< DataLines: [2 Inf]
> DataLines: [4 Inf]
< VariableNamesLine: 1
> VariableNamesLine: 3
My local workaround will be to always prevent a missing value at the end of the first data line. However, it would be nice if the heuristics in detectImportOptions would find the correct line number in scenarios like this.
Is this a legal scenario? Is this a bug in MATLAB?
1 Comment
dpb
2022-7-12
Badly formed data files are not indicative of bugs -- the problem is in the data file as you've already acknowledged.
I believe your expectations are unrealistic -- there's really no way programmatically to determine that record is malformed and not something else.
You don't show a file with a comment line -- does it consist of a comment character in first column? Using that information MIGHT be able to help.
Probably the most foolproof way would be to create the files as delimited instead, and to ensure on creation of the file that every real record has the correct number of delimiters. You may need a preprocessing step to add that robustness, or maybe the original process can be modified.
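If the comment lines do start with a known character, one way to use that information is to find the header line yourself and override whatever the heuristics picked. The following is only a sketch under assumptions: the '#' comment character is hypothetical (the thread does not say which one is used), the first non-comment, non-blank line is taken to be the header, and it has not been run against the file in question.

```matlab
% Sketch only. Assumes comment lines begin with '#' (hypothetical) and that
% the first non-comment, non-blank line holds the column headers.
filename = 'data4.txt';

% Read the file as a string array, one element per line (needs R2020b or later).
lines   = readlines(filename);
trimmed = strtrim(lines);

% Lines that cannot be the header: comments and blank lines.
isSkippable = startsWith(trimmed, "#") | strlength(trimmed) == 0;
headerLine  = find(~isSkippable, 1, 'first');

% Let detectImportOptions work out widths and types, then override the
% header and data line numbers it chose.
opts = detectImportOptions(filename, 'FileType', 'fixedwidth');
opts.VariableNamesLine = headerLine;
opts.DataLines         = [headerLine + 1, Inf];

% Re-read the variable names from the corrected header line.
T = readtable(filename, opts, 'ReadVariableNames', true);
```

The column widths and variable types are still whatever detectImportOptions inferred, so it is worth inspecting opts before trusting the result.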
Answers (1)
Dave
2022-7-12
Edited: Dave
2022-7-12
1 Comment
dpb
2022-7-12
Edited: dpb
2022-7-12
The heuristics can always be made "smarter" at the cost of performance; there is a conscious decision(*) to also not be excessively time consuming for the normal case.
And, of course, "smarter" comes at another cost of the likelihood of introducing what I'll call Type II error -- wrongfully classifying non-data as data.
In this case, how's it supposed to know the difference between whether there is simply a missing value or the line should be ignored because it doesn't match the bulk of the rest of the file? Only your expectation that you know...
If you stuff that record down somewhere in the middle, then the odds are it'll detect and let you set a missing value or not import the record, your choice.
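As a rough illustration of that choice, assuming the header was detected correctly so that a variable named Elev exists (as in the sample file from the question), the import options let you either fill the missing value or drop the record. This is a sketch, not something tested on the poster's files.

```matlab
% Sketch only. Assumes the short record is NOT the first data line, so that
% detection found the right header and a variable named 'Elev' exists.
opts = detectImportOptions('data4.txt', 'FileType', 'fixedwidth');

% Choice 1: keep the record and fill the missing field.
opts.MissingRule = 'fill';
opts = setvaropts(opts, 'Elev', 'FillValue', NaN);

% Choice 2 (alternative): do not import records with missing fields.
% opts.MissingRule = 'omitrow';

T = readtable('data4.txt', opts);
```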
You can always use the example and submit it as an enhancement request and see what TMW thinks of the possibilities of catching it...
(*) It's not in the formal documentation, but I recall the comment having been made by a TMW employee in a discussion here about how the file probing has evolved. That discussion mostly centered on the behind-the-scenes probing done by the readXXX routines, which don't do the whole in-depth analysis that detectImportOptions does, but which adds overhead to every use of those routines. One deliberately gives up some performance when doing the full probe; that is expected to take some time, but one still doesn't want it to become an inordinate delay.