Importing large .csv by READTABLE gives wrong results
2 次查看(过去 30 天)
显示 更早的评论
Hello everyone !
I have a lot of .csv files (example attached). I need to extract column F and G from them. In col F some cells are empty and in col G it's reverse.
My script is as follows:
day = input('Give day ','s');
month = input('Give month ','s');
year = '2023';
filename = ['L:\XT4Comb_16_' [year month day] '.csv'] % I have a lot of these files with subsequent dates in their names
tablica = readtable(filename,'Delimiter',',','ReadVariableNames',true,'HeaderLines',0,'Format','%*s%*s%*s%*s%*s%s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s'); %% column F and G only
Of course it's just the most important part of the whole script
The problem is:
If a file is small, like that one I attached, there is no problem to get the same values (in F column) as in .csv.
But if a file is larger (like 35MB or larger) values are mixed and they don't match the original .csv file.
Why is that ? Can you tell me what am I doing wrong ?
Many thanks in advance !!!
7 个评论
DGM
2023-7-18
编辑:DGM
2023-7-18
Error using readtable (line 223)
Reading failed at line 17693. All lines of a text file must have the same number of delimiters. Line 17693 has 20 delimiters, while preceding lines have 16.
There are multiple lines which have extra delimiters or different content
T = fileread(filename); % read the whole file
T = split(T);
D = cellfun(@(x) nnz(x == ','),T) % find number of delimiters
badidx = find(D~=16) % find lines with nonstandard number of delimiters
badlines = T(badidx) % print those lines
numdelim = D(badidx) % number of delimiters on those lines
badidx =
7256
11005
17693
46822
badlines =
4×1 cell array
{'1683916462929,53.53142,10.00707,0,356.7,,"TRANSPORT",,,,"[(4,114),(8,11.5),(12,15.5),(16,16.5),(20,14.5),(24,4.5),(28,3.5),(32,5),(36,2),(40,2),(44,0.5),(48,2),(52,0.5),(56,4),(60,0),(64,2.5),(68,0),(72,0),(76,0),(80,0.5),(84,1),(88,0),(92,0),(96,1),(100,0.5),(104,1),(108,0),(112,0),(116,0.5),(120,0),(124,0),(128,0),(132,0.5)]","[(4,36154.5),(8,9495),(12,3450.5),(16,2616),(20,1320),(24,1005.5),(28,552),(32,470),(36,249),(40,219),(44,136),(48,108),(52,75.5),(56,54.5),(60,30),(64,37.5),(68,18.5),(72,17),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,3.5),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000754496,2.13207e-05'}
{'1683916597744,53.53086,10.00883,0,52.7,,"TRANSPORT",,,,"[(4,30),(8,8),(12,6),(16,3.5),(20,1),(24,0.5),(28,3.5),(32,0),(36,1.5),(40,0),(44,1),(48,1.5),(52,0.5),(56,0),(60,0),(64,0.5),(68,0),(72,1),(76,0),(80,0),(84,0),(88,0),(92,0),(96,0),(100,0),(104,0),(108,0.5)]","[(4,36184.5),(8,9503),(12,3456.5),(16,2619.5),(20,1321),(24,1006),(28,555.5),(32,470),(36,250.5),(40,219),(44,137),(48,109.5),(52,76),(56,54.5),(60,30),(64,38),(68,18.5),(72,18),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,4),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000759143,4.54643e-06' }
{'1683916837559,53.53370,10.00302,0,451683996200710,,,0,0,149.255,,,,,,,,,,,' }
{0×0 char }
numdelim =
152
140
20
0
I would say that line 17693 is where logging got interrupted and a new entry was appended in the middle of a partial line.
回答(1 个)
Abhas
2024-10-9
One way to fix this issue is to adjust the formatting of the CSV file. Some lines have an inconsistent number of comma delimited fields. Ensuring that each line has the same number of comma delimited fields will help the import process progress successfully.
Alternatively, you can apply the import options from the version of the CSV file that works correctly to the version of the file with the additional lines. This would involve code that looks like this:
opts = detectImportOptions('<Working CSV Filename>');
originalTable = readtable('<Working CSV Filename>', opts); % This reads the original table
newTable = readtable('<Longer CSV Filename>', opts); % This reads the modified table with additional lines with the original formatting options.
You may refer to the MathWorks documentation link below to know more about "detectImportOptions" :
If the files aren't the same format exactly, they can pass a detection hint to detectImportOptions. In this case, disable header line detection by providing a value.
opts = detectImportOptions('<Longer CSV Filename>', 'NumHeaderLines',0)
More information on the "delimitedTextImportOptions" created by "detectImportOptions" on a CSV can be found by referencing the link below:
I hope this helps!
0 个评论
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Text Files 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!