Bug in readtable()? - if the first values in a CSV's column are missing, the whole column is misinterpreted
5 次查看(过去 30 天)
显示 更早的评论
I am reading a big CSV file (500K lines) with readtable. In the CSV, some columns have the first 250+ lines empty (e.g. ",,,," in the CSV), while the non-missing values below (pretty rare) are either text strings or dates (in the DD-MM-YYYY format). Readtable() somehow interprets these columns as numeric, and so converts all the strings and dates into NaNs - thus, I end up with 100% NaN-filled columns instead of rarely-populated data (among empty strings and NaTs).
Furthermore, if I move the "with-data" lines up - even a few dozens positions up - readtable() starts to read everything normally!
So, it looks like readtable() checks only ~250 first values to determine the type of the column, which, in my opinion, is a bug! (Although I understand that it was likely made to improve speed.)
Is there a way to fix it systematically? I have lots of such CSVs with thousands of columns in them - so, a manual check and manual fix is not an option...
UPD: a test-file (truncated to 510 lines) is attached - the behaviour is still the same. The problem columns are the 2nd and the 3rd (p190, p191). The first non-empty value is on data-line 270.
10 个评论
Jeremy Hughes
2025-5-2
编辑:Jeremy Hughes
2025-5-2
detectImportOptions doesn't actually check more lines, it just defaults to treating the variable as text instead of double when the first 250 rows are all empty. The reason for the difference is to preserve the behavior from before detectImportOptions was introduced. Before that, it only considered the first row of data for the variable types and empty fields were treated as double by default.
回答(1 个)
Cris LaPierre
2025-4-3
I wouldn't call it a bug. It's an artifact of trying to autodetect the format, which will naturally not get it right every time. I would recommend specifying the import options it is not detecting correctly.
opts = detectImportOptions('TEST_ongoing.csv');
opts = setvartype(opts,["p190","p191"],["string","datetime"]);
data = readtable('TEST_ongoing.csv',opts)
data([270,508],:)
12 个评论
dpb
2025-4-4
编辑:dpb
2025-4-4
The datetime conversion routines are pretty picky, failing if valid date/time strings aren't all the same format in a single call. This is clearly an implementation choice for performance reasons, but can become annoying.
If this can really happen and isn't just a test case, for the immediate fix seems like it may be more expeditious to read the date column as cellstr and then test for the empty case/convert the existing or use the try...catch construction to deal with the (presumed) aberration. I agree that NaT should be the result...although this seems to be an artifact of readtable and not part of datetime's repetoire...
s="";s=standardizeMissing(s,"")
datetime(s,'InputFormat',"yyyy-MMM-dd")
Another possibility I haven't tested might be able to use the 'MissingRule' and/or 'ImportErrorRule' although they are global rather than having granularity of setting a replacement value by column/data type--another enchancement I think I've previously suggested.
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Characters and Strings 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!