Importing large .csv by READTABLE gives wrong results

Question

Weronika Huss 2023-7-17

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1997198-importing-large-csv-by-readtable-gives-wrong-results

回答： Abhas 2024-10-9

Comb_16_20230513 - Kopia.csv

Hello everyone !

I have a lot of .csv files (example attached). I need to extract column F and G from them. In col F some cells are empty and in col G it's reverse.

My script is as follows:

day = input('Give day ','s');
month = input('Give month ','s');
year = '2023';
filename = ['L:\XT4Comb_16_' [year month day] '.csv']    % I have a lot of these files with subsequent dates in their names
tablica = readtable(filename,'Delimiter',',','ReadVariableNames',true,'HeaderLines',0,'Format','%*s%*s%*s%*s%*s%s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s');   %% column F and G only

Of course it's just the most important part of the whole script

The problem is:

If a file is small, like that one I attached, there is no problem to get the same values (in F column) as in .csv.

But if a file is larger (like 35MB or larger) values are mixed and they don't match the original .csv file.

Why is that ? Can you tell me what am I doing wrong ?

Many thanks in advance !!!

7 个评论
显示 5更早的评论隐藏 5更早的评论

DGM 2023-7-18

编辑：DGM 2023-7-18

在 MATLAB Online 中打开

Error using readtable (line 223)

Reading failed at line 17693. All lines of a text file must have the same number of delimiters. Line 17693 has 20 delimiters, while preceding lines have 16.

There are multiple lines which have extra delimiters or different content

T = fileread(filename); % read the whole file
T = split(T);
D = cellfun(@(x) nnz(x == ','),T) % find number of delimiters
badidx = find(D~=16) % find lines with nonstandard number of delimiters
badlines = T(badidx) % print those lines
numdelim = D(badidx) % number of delimiters on those lines

badidx =

7256

11005

17693

46822

badlines =

4×1 cell array

{'1683916462929,53.53142,10.00707,0,356.7,,"TRANSPORT",,,,"[(4,114),(8,11.5),(12,15.5),(16,16.5),(20,14.5),(24,4.5),(28,3.5),(32,5),(36,2),(40,2),(44,0.5),(48,2),(52,0.5),(56,4),(60,0),(64,2.5),(68,0),(72,0),(76,0),(80,0.5),(84,1),(88,0),(92,0),(96,1),(100,0.5),(104,1),(108,0),(112,0),(116,0.5),(120,0),(124,0),(128,0),(132,0.5)]","[(4,36154.5),(8,9495),(12,3450.5),(16,2616),(20,1320),(24,1005.5),(28,552),(32,470),(36,249),(40,219),(44,136),(48,108),(52,75.5),(56,54.5),(60,30),(64,37.5),(68,18.5),(72,17),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,3.5),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000754496,2.13207e-05'}

{'1683916597744,53.53086,10.00883,0,52.7,,"TRANSPORT",,,,"[(4,30),(8,8),(12,6),(16,3.5),(20,1),(24,0.5),(28,3.5),(32,0),(36,1.5),(40,0),(44,1),(48,1.5),(52,0.5),(56,0),(60,0),(64,0.5),(68,0),(72,1),(76,0),(80,0),(84,0),(88,0),(92,0),(96,0),(100,0),(104,0),(108,0.5)]","[(4,36184.5),(8,9503),(12,3456.5),(16,2619.5),(20,1321),(24,1006),(28,555.5),(32,470),(36,250.5),(40,219),(44,137),(48,109.5),(52,76),(56,54.5),(60,30),(64,38),(68,18.5),(72,18),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,4),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000759143,4.54643e-06' }

{'1683916837559,53.53370,10.00302,0,451683996200710,,,0,0,149.255,,,,,,,,,,,' }

{0×0 char }

numdelim =

152

140

20

0

I would say that line 17693 is where logging got interrupted and a new entry was appended in the middle of a partial line.

Walter Roberson 2023-7-18

%q format should handle the commas but %s format would not.

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Abhas 2024-10-9

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1997198-importing-large-csv-by-readtable-gives-wrong-results#answer_1529440

在 MATLAB Online 中打开

Hi @Weronika Huss,

One way to fix this issue is to adjust the formatting of the CSV file. Some lines have an inconsistent number of comma delimited fields. Ensuring that each line has the same number of comma delimited fields will help the import process progress successfully.

Alternatively, you can apply the import options from the version of the CSV file that works correctly to the version of the file with the additional lines. This would involve code that looks like this:

opts = detectImportOptions('<Working CSV Filename>');
originalTable = readtable('<Working CSV Filename>', opts); % This reads the original table
newTable = readtable('<Longer CSV Filename>', opts); % This reads the modified table with additional lines with the original formatting options.

You may refer to the MathWorks documentation link below to know more about "detectImportOptions" :

https://www.mathworks.com/help/matlab/ref/detectimportoptions.html

If the files aren't the same format exactly, they can pass a detection hint to detectImportOptions. In this case, disable header line detection by providing a value.

opts = detectImportOptions('<Longer CSV Filename>', 'NumHeaderLines',0)

More information on the "delimitedTextImportOptions" created by "detectImportOptions" on a CSV can be found by referencing the link below:

https://www.mathworks.com/help/matlab/ref/matlab.io.text.delimitedtextimportoptions.html

I hope this helps!

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Importing large .csv by READTABLE gives wrong results

7 个评论
显示 5更早的评论隐藏 5更早的评论

回答（1 个）

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

另请参阅

类别

标签

Community Treasure Hunt

Importing large .csv by READTABLE gives wrong results

7 个评论 显示 5更早的评论隐藏 5更早的评论

回答（1 个）

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

另请参阅

类别

标签

Community Treasure Hunt

7 个评论
显示 5更早的评论隐藏 5更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论