Importing large .csv by READTABLE gives wrong results

2 次查看(过去 30 天)
Hello everyone !
I have a lot of .csv files (example attached). I need to extract column F and G from them. In col F some cells are empty and in col G it's reverse.
My script is as follows:
day = input('Give day ','s');
month = input('Give month ','s');
year = '2023';
filename = ['L:\XT4Comb_16_' [year month day] '.csv'] % I have a lot of these files with subsequent dates in their names
tablica = readtable(filename,'Delimiter',',','ReadVariableNames',true,'HeaderLines',0,'Format','%*s%*s%*s%*s%*s%s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s'); %% column F and G only
Of course it's just the most important part of the whole script
The problem is:
If a file is small, like that one I attached, there is no problem to get the same values (in F column) as in .csv.
But if a file is larger (like 35MB or larger) values are mixed and they don't match the original .csv file.
Why is that ? Can you tell me what am I doing wrong ?
Many thanks in advance !!!
  7 个评论
DGM
DGM 2023-7-18
编辑:DGM 2023-7-18
Error using readtable (line 223)
Reading failed at line 17693. All lines of a text file must have the same number of delimiters. Line 17693 has 20 delimiters, while preceding lines have 16.
There are multiple lines which have extra delimiters or different content
T = fileread(filename); % read the whole file
T = split(T);
D = cellfun(@(x) nnz(x == ','),T) % find number of delimiters
badidx = find(D~=16) % find lines with nonstandard number of delimiters
badlines = T(badidx) % print those lines
numdelim = D(badidx) % number of delimiters on those lines
badidx =
7256
11005
17693
46822
badlines =
4×1 cell array
{'1683916462929,53.53142,10.00707,0,356.7,,"TRANSPORT",,,,"[(4,114),(8,11.5),(12,15.5),(16,16.5),(20,14.5),(24,4.5),(28,3.5),(32,5),(36,2),(40,2),(44,0.5),(48,2),(52,0.5),(56,4),(60,0),(64,2.5),(68,0),(72,0),(76,0),(80,0.5),(84,1),(88,0),(92,0),(96,1),(100,0.5),(104,1),(108,0),(112,0),(116,0.5),(120,0),(124,0),(128,0),(132,0.5)]","[(4,36154.5),(8,9495),(12,3450.5),(16,2616),(20,1320),(24,1005.5),(28,552),(32,470),(36,249),(40,219),(44,136),(48,108),(52,75.5),(56,54.5),(60,30),(64,37.5),(68,18.5),(72,17),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,3.5),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000754496,2.13207e-05'}
{'1683916597744,53.53086,10.00883,0,52.7,,"TRANSPORT",,,,"[(4,30),(8,8),(12,6),(16,3.5),(20,1),(24,0.5),(28,3.5),(32,0),(36,1.5),(40,0),(44,1),(48,1.5),(52,0.5),(56,0),(60,0),(64,0.5),(68,0),(72,1),(76,0),(80,0),(84,0),(88,0),(92,0),(96,0),(100,0),(104,0),(108,0.5)]","[(4,36184.5),(8,9503),(12,3456.5),(16,2619.5),(20,1321),(24,1006),(28,555.5),(32,470),(36,250.5),(40,219),(44,137),(48,109.5),(52,76),(56,54.5),(60,30),(64,38),(68,18.5),(72,18),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,4),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000759143,4.54643e-06' }
{'1683916837559,53.53370,10.00302,0,451683996200710,,,0,0,149.255,,,,,,,,,,,' }
{0×0 char }
numdelim =
152
140
20
0
I would say that line 17693 is where logging got interrupted and a new entry was appended in the middle of a partial line.

请先登录,再进行评论。

回答(1 个)

Abhas
Abhas 2024-10-9
One way to fix this issue is to adjust the formatting of the CSV file. Some lines have an inconsistent number of comma delimited fields. Ensuring that each line has the same number of comma delimited fields will help the import process progress successfully.
Alternatively, you can apply the import options from the version of the CSV file that works correctly to the version of the file with the additional lines. This would involve code that looks like this:
opts = detectImportOptions('<Working CSV Filename>');
originalTable = readtable('<Working CSV Filename>', opts); % This reads the original table
newTable = readtable('<Longer CSV Filename>', opts); % This reads the modified table with additional lines with the original formatting options.
You may refer to the MathWorks documentation link below to know more about "detectImportOptions" :
If the files aren't the same format exactly, they can pass a detection hint to detectImportOptions. In this case, disable header line detection by providing a value.
opts = detectImportOptions('<Longer CSV Filename>', 'NumHeaderLines',0)
More information on the "delimitedTextImportOptions" created by "detectImportOptions" on a CSV can be found by referencing the link below:
I hope this helps!

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by