Finding problem row of data with TabularTextDatastore

Question

Kent Schonert 2018-10-4

3
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/422221-finding-problem-row-of-data-with-tabulartextdatastore

评论： Gunes ERTUNC 2023-2-5

I'm using a tall array / datastore to process large csv file. Sometimes the input csv files have the header line repeated in the middle of the file. Using the following code:

clear;
ds = datastore('C:\temp\data.csv');
tt = tall(ds);
x = tt.Hours(2);
gather(x)

gives

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: 42% complete
Evaluation 20% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 81)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 2530, field number 2) ==> Hours;MID;ID1
Description;ID2;ID3;PN;ID4;Group;Model;Description;Level Number;Type;Occurrence Count
(Incremental);Timestamp;Location;City;State;Country;Software\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
            [t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}] = iGather(varargin{:});
Error in tallArray (line 5)
gather(x)
Caused by:
    Reading the variable name 'Hours' using format '%f' from file: 'C:\temp\data.csv' starting at offset 363378899.

The errant line was on row (including header) 1568509 of 3734377. I was only able to find out errant row number out by hunting and pecking, i.e. deleting/re-adding rows and trying again.

My question is: How can I find the errant row more easily? Is there a way to use the "row number" given (2530), the given offset (363378899) to figure out where in the large csv the error is (1568509) ?

additional info:

 ds = 
    TabularTextDatastore with properties:
                          Files: {
                                 'C:\temp\data.csv'
                                 }
                   FileEncoding: 'UTF-8'
       AlternateFileSystemRoots: {}
              ReadVariableNames: true
                  VariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
      Text Format Properties:
                 NumHeaderLines: 0
                      Delimiter: ';'
                   RowDelimiter: '\r\n'
                 TreatAsMissing: ''
                   MissingValue: NaN
      Advanced Text Format Properties:
                TextscanFormats: {'%q', '%f', '%f' ... and 17 more}
                       TextType: 'char'
             ExponentCharacters: 'eEdD'
                   CommentStyle: ''
                     Whitespace: ' \b\t'
        MultipleDelimitersAsOne: false
      Properties that control the table returned by preview, read, readall:
          SelectedVariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
                SelectedFormats: {'%q', '%f', '%f' ... and 17 more}
                       ReadSize: 20000 rows