Bug in readtable()? - if the first values in a CSV's column are missing, the whole column is misinterpreted

Question

Konstantin 2025-4-3

1
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/2175955-bug-in-readtable-if-the-first-values-in-a-csv-s-column-are-missing-the-whole-column-is-misinte

编辑： Jeremy Hughes 2025-5-2

TEST_ongoing.csv

I am reading a big CSV file (500K lines) with readtable. In the CSV, some columns have the first 250+ lines empty (e.g. ",,,," in the CSV), while the non-missing values below (pretty rare) are either text strings or dates (in the DD-MM-YYYY format). Readtable() somehow interprets these columns as numeric, and so converts all the strings and dates into NaNs - thus, I end up with 100% NaN-filled columns instead of rarely-populated data (among empty strings and NaTs).

Furthermore, if I move the "with-data" lines up - even a few dozens positions up - readtable() starts to read everything normally!

So, it looks like readtable() checks only ~250 first values to determine the type of the column, which, in my opinion, is a bug! (Although I understand that it was likely made to improve speed.)

Is there a way to fix it systematically? I have lots of such CSVs with thousands of columns in them - so, a manual check and manual fix is not an option...

UPD: a test-file (truncated to 510 lines) is attached - the behaviour is still the same. The problem columns are the 2nd and the 3rd (p190, p191). The first non-empty value is on data-line 270.

10 个评论
显示 8更早的评论隐藏 8更早的评论

Konstantin 2025-4-4

@Steven Lord Yes, the case when the first non-empty value is in the bottom of the table is exactly the "worst case scenario". But my argument is that this scenario is rare, and in the wast majority real cases it won't happen. So, the "net gain" will be positive, although at the cost of some rare negative outliers.

Jeremy Hughes 2025-5-2

编辑：Jeremy Hughes 2025-5-2

detectImportOptions doesn't actually check more lines, it just defaults to treating the variable as text instead of double when the first 250 rows are all empty. The reason for the difference is to preserve the behavior from before detectImportOptions was introduced. Before that, it only considered the first row of data for the variable types and empty fields were treated as double by default.

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Cris LaPierre 2025-4-3

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/2175955-bug-in-readtable-if-the-first-values-in-a-csv-s-column-are-missing-the-whole-column-is-misinte#answer_1563101

在 MATLAB Online 中打开

TEST_ongoing.csv

I wouldn't call it a bug. It's an artifact of trying to autodetect the format, which will naturally not get it right every time. I would recommend specifying the import options it is not detecting correctly.

opts = detectImportOptions('TEST_ongoing.csv');
opts = setvartype(opts,["p190","p191"],["string","datetime"]);
data = readtable('TEST_ongoing.csv',opts)
data = 510x7 table
       eid          p190       p191                        p20005                           p20143                p20144               p20145
    __________    _________    ____    _______________________________________________    __________    ___________________________    ______

    4.3685e+06    <missing>    NaT     {'Current, email address assumed working'     }    2019-07-15    {'Online questionnaire'   }       2  
    3.8845e+06    <missing>    NaT     {0x0 char                                     }    2006-03-21    {'Assessment centre visit'}       1  
    5.8641e+06    <missing>    NaT     {0x0 char                                     }    2021-11-08    {'Online questionnaire'   }      14  
    5.5967e+06    <missing>    NaT     {0x0 char                                     }    2010-06-22    {'Assessment centre visit'}       1  
    5.3448e+06    <missing>    NaT     {'Current, email address assumed working'     }    2021-10-27    {'Online questionnaire'   }      16  
    4.7725e+06    <missing>    NaT     {'Unusable, email address failed or withdrawn'}    2022-07-19    {'Online questionnaire'   }      11  
    5.0307e+06    <missing>    NaT     {'Current, email address assumed working'     }    2022-11-08    {'Online questionnaire'   }      27  
    1.9987e+06    <missing>    NaT     {'Current, email address assumed working'     }    2022-11-07    {'Online questionnaire'   }      32  
    3.9414e+06    <missing>    NaT     {'Current, email address assumed working'     }    2007-12-06    {'Assessment centre visit'}       1  
    3.9442e+06    <missing>    NaT     {'Current, email address assumed working'     }    2019-06-15    {'Online questionnaire'   }      11  
     5.113e+06    <missing>    NaT     {'Current, email address assumed working'     }    2016-08-23    {'Online questionnaire'   }       4  
    3.2642e+06    <missing>    NaT     {'Current, email address assumed working'     }    2022-02-22    {'Online questionnaire'   }       7  
    5.4625e+06    <missing>    NaT     {'Current, email address assumed working'     }    2023-03-04    {'Online questionnaire'   }      20  
    4.3812e+06    <missing>    NaT     {'Current, email address assumed working'     }    2022-11-11    {'Online questionnaire'   }      30  
    3.4906e+06    <missing>    NaT     {0x0 char                                     }    2008-04-03    {'Assessment centre visit'}       1  
    3.6481e+06    <missing>    NaT     {'Current, email address assumed working'     }    2022-11-03    {'Online questionnaire'   }      38  
data([270,508],:)
ans = 2x7 table
       eid                              p190                              p191                         p20005                        p20143                p20144               p20145
    __________    _________________________________________________    __________    __________________________________________    __________    ___________________________    ______

    3.5307e+06    "UK Biobank sources report they have left the UK"    2013-05-09    {0x0 char                                }    2008-07-22    {'Assessment centre visit'}       1  
     1.717e+06    "UK Biobank sources report they have left the UK"    2012-07-19    {'Current, email address assumed working'}    2022-11-03    {'Online questionnaire'   }      21  

12 个评论
显示 10更早的评论隐藏 10更早的评论

dpb 2025-4-3

编辑：dpb 2025-4-3

Is there no definition of the(se) database(s)? How does anybody/any application know how to populate/read them otherwise? Unless they are not controlled and are just text fields that allow for mistakes which would seem to be the kind of thing @Konstantin is running into if he has columns that contain both date strings and other text (or date strings in inconsistent formats?). It seems the fundamental problem is in the database itself, not in MATLAB.

Lacking controls on the database or at least information about its column types that can be used, it would seem the only real solution would be to read the whole file as a cell array and then try to parse it.

Perhaps a workable solution would be to put the import into a try...catch...end block and if the auto-import fails then fall back to the cell import and try to deal with it programmatically and only if that fails get the human involved.

But clearly, it isn't practical for the auto-detect functionality to have to parse the whole file first as a generic implementation.

Cris LaPierre 2025-4-4

Sounds like another enhancement suggestion to me.

dpb 2025-4-4

编辑：dpb 2025-4-4

在 MATLAB Online 中打开

The datetime conversion routines are pretty picky, failing if valid date/time strings aren't all the same format in a single call. This is clearly an implementation choice for performance reasons, but can become annoying.

If this can really happen and isn't just a test case, for the immediate fix seems like it may be more expeditious to read the date column as cellstr and then test for the empty case/convert the existing or use the try...catch construction to deal with the (presumed) aberration. I agree that NaT should be the result...although this seems to be an artifact of readtable and not part of datetime's repetoire...

s="";s=standardizeMissing(s,"")
s = <missing>
datetime(s,'InputFormat',"yyyy-MMM-dd")
ans = datetime
   NaT

Another possibility I haven't tested might be able to use the 'MissingRule' and/or 'ImportErrorRule' although they are global rather than having granularity of setting a replacement value by column/data type--another enchancement I think I've previously suggested.

请先登录，再进行评论。

Bug in readtable()? - if the first values in a CSV's column are missing, the whole column is misinterpreted

10 个评论
显示 8更早的评论隐藏 8更早的评论

回答（1 个）

12 个评论
显示 10更早的评论隐藏 10更早的评论

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

Bug in readtable()? - if the first values in a CSV's column are missing, the whole column is misinterpreted

10 个评论 显示 8更早的评论隐藏 8更早的评论

回答（1 个）

12 个评论 显示 10更早的评论隐藏 10更早的评论

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

10 个评论
显示 8更早的评论隐藏 8更早的评论

12 个评论
显示 10更早的评论隐藏 10更早的评论