Why are xlsread (and readtable) doing such a poor job at loading mixed data files?

4 次查看(过去 30 天)
So, this issue has popped up regularly when loading from mixed data files. For example, when importing data using [data, text, raw] = xlsread('file.xlsx'), interpretation of the data appears strangely random. Though the 'data' and 'text' contain solely numeric and text data respectively, oftentimes the 'text' data still contains numeric data stored as text; on some occasions because a column contains both numeric and text data, but regularly also with columns that contain only numeric data (albeit with a header on the first row). In some cases xlsread even returns columns with only numeric data as text but with one or more cells as numeric. The 'raw' data in that regard shows the same distinction between numeric and text data. Similar (though different) results are obtained when using readtable by the way.
As an example, see attached data which contains a section of a full table ('raw' data from xlsread). Apart from the header, columns 3 to 6, 8 and 9 contain only numeric data (verified this in original file), while columns 59 to 62 contain mixed data. Somehow only row 3 of the mentioned first 6 colums is interpretated fully as numeric. More interestingly, column 4, row 7537 is interpreted as numeric whereas the rest of the column is text. Later columns (59 to 62) containing mixed data are again loaded as numeric data.
Now I'm sure there are reasons why this happens, and even alternatives for loading the data... but why return an ambiguous result like this in the first place? Either do it correctly or return everything as text so I can interpret the data myself, because this only adds additional processing time; first from the function interpretating the data and next from having to re-intepret the data. Or is there someting I'm missing here?
  3 个评论
Sjouke Rinsma
Sjouke Rinsma 2022-7-18
I was going to comment that this would be an alternative, though setting each colunm to the correct type would require prior investigation of every file being loaded if I were to set all columns to the correct type,(since using this approach rows 4, 5, 6, 8 and 9 are still returned as 'TextVariableImportOptions'). Because I use these functions in a program for other users to import their data, this unfortunately is not really an option.
However, setting all columns to 'char' I can just do the interpretation myself :) so thanks for the suggestion, I'll see if I can speed up the process this way.
Also, upon further investigation it appears the interpretation Matlab does is simply related to how Excel interprets each cell. Even though all cells in the example were formatted as 'General', some numeric cells still appear as numbers, while others appear as text with a warning that 'the cell is formatted as text'.
dpb
dpb 2022-7-18
It's only trying to do what is asked -- import the Excel sheet. If it's malformed/mistyped, that's not MATLAB's fault; it can only presume that's what was intended.

请先登录,再进行评论。

采纳的回答

Jan
Jan 2022-7-18
编辑:Jan 2022-7-19
The automatic detection of string types is known to be a source of severe bugs. The genes "Dec1", "SEPT1" and "MARCH1" have been renamed, because Excel converted them to dates automagically. 24 other genes are affected also, but a complete list of names is not found yet, because opening such data will convert the string according to the local language. The probleme is severe, because a survey showed, that 20% of 3600 publications used Excel tables containing the failing names (See: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7 ).
I do not trust Matlab's automatic detection of data types for scientific projects. Any kind of intelligent guessing can and will fail for some specific input. Using Excel tables as input is a marker for using a bad practice. I've seen too many projects fail for such reasons. Blaming xlsread and readtable does not hit the point. A trustworthy design of a study must exclude such sources of errors, because they can be extremely hart to detect.
  4 个评论
Jan
Jan 2022-7-19
@Sjouke Rinsma: "a clear distinction between fully numeric and not can be made without ambiguity" - This sound like a trustworthy statement. But some details are hard to control:
  • dots or commas appear as decimal points
  • Unicode characters for the d or e in scientific notation
  • Somewhere a hypen appears instead of a minus
  • Let Excel convert a NaN to "-1.#IND", "#SNAN" or "1.#QNAN"
  • ... This list cannot be complete: according to Goedel there is always an input, which let the interpreter explode.
For a long term scientific project for clinical decision making and international collaborations we decided for using ASCII files with a well defined subset of representations of numbers. This is extremely simple and several stages below JSON and XML, but bullet proof. We deliver Matlab and C code together with a complete definition of the format.
In an earlier stage of the project we used an MS Office Access data base and collided with the 2 GB limit of the data file. The export suffered from several ambiguities and therefore we switched to the stone age ASCII files. Well, it works reliably.
Sjouke Rinsma
Sjouke Rinsma 2022-7-19
编辑:Sjouke Rinsma 2022-7-19
Good point Jan, those exceptions in numeric inputs you raise will eventually, according Murphy, result in problems too (surely Excel auto-formatting has annoyed me in that regard on several occasions as well). Perhaps I should have phrased my statement as "inputs containing only digits can be distinguished ambiguously". In any case, I guess in the end I'm just not really charmed by the fact that these Matlab functions apparently import data according the format applied by another program, rather than do what I was expecting based on the function description. However, I think my question has been answered, so thanks for that :)

请先登录,再进行评论。

更多回答(0 个)

产品


版本

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by