Finding problem row of data with TabularTextDatastore

16 次查看(过去 30 天)
I'm using a tall array / datastore to process large csv file. Sometimes the input csv files have the header line repeated in the middle of the file. Using the following code:
clear;
ds = datastore('C:\temp\data.csv');
tt = tall(ds);
x = tt.Hours(2);
gather(x)
gives
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: 42% complete
Evaluation 20% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 81)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 2530, field number 2) ==> Hours;MID;ID1
Description;ID2;ID3;PN;ID4;Group;Model;Description;Level Number;Type;Occurrence Count
(Incremental);Timestamp;Location;City;State;Country;Software\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}] = iGather(varargin{:});
Error in tallArray (line 5)
gather(x)
Caused by:
Reading the variable name 'Hours' using format '%f' from file: 'C:\temp\data.csv' starting at offset 363378899.
The errant line was on row (including header) 1568509 of 3734377. I was only able to find out errant row number out by hunting and pecking, i.e. deleting/re-adding rows and trying again.
My question is: How can I find the errant row more easily? Is there a way to use the "row number" given (2530), the given offset (363378899) to figure out where in the large csv the error is (1568509) ?
additional info:
ds =
TabularTextDatastore with properties:
Files: {
'C:\temp\data.csv'
}
FileEncoding: 'UTF-8'
AlternateFileSystemRoots: {}
ReadVariableNames: true
VariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
Text Format Properties:
NumHeaderLines: 0
Delimiter: ';'
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN
Advanced Text Format Properties:
TextscanFormats: {'%q', '%f', '%f' ... and 17 more}
TextType: 'char'
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
SelectedFormats: {'%q', '%f', '%f' ... and 17 more}
ReadSize: 20000 rows
  1 个评论
Gunes ERTUNC
Gunes ERTUNC 2023-2-5
Catching possible missing entries via 'TreatAsMissing' helped for me. You can try as follows.
ds = datastore('C:\temp\data.csv','TreatAsMissing',{'-','NA','');

请先登录,再进行评论。

回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Large Files and Big Data 的更多信息

产品


版本

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by