Cannot Load CSV file
显示 更早的评论
I am trying to load a csv file using the import tool.
It takes forever (like a weekend was not enough...).
I've included the screenshot of what I am doing.
The file has numbers from H2 to AEQ639774. From A1 to AEQ1 I have headers. From A2 to G639774 I have identifiers.
I was trying to first load the numbers into a numeric matrix, and then repeat the process for headers and identifiers separately. But not even this works.
The file is 1.28 GB.. so big but not that big.
My machine has 16gb ram so that should be enough.
I am probably doing something wrong!
Thanks in advance!
采纳的回答
Adam Danz
2018-7-31
That sounds fishy. What version of matlab are you using? I assume the problem persists after exiting and rebooting Matlab.
You could use an alternative method of importing the data such as xlsread() which bypasses some of the processing done by the import tool.
15 个评论
MATLAB R2017b
oh and I am able to import small parts of the data quickly if I choose only 20 to 30 cells. so I suppose the problem is that the file is too large? is there a more efficient way to do this?
xlsread and reshaping are not really helping.
I was also trying to follow this instructions https://nl.mathworks.com/matlabcentral/answers/231857-loading-large-csv-files
but it's not working either..
What was the problem with xlsread()?
What was the problem with reading the file in chunks (as described in your link)?
If xlsread() causes problems when reading the entire numerical matrix within your csv file, first I'd like to know what the problem is, secondly, you could try extracting the data in chunks. You'd make a loop that pulls out 100 columns at a time, for example.
romulo alves
2018-7-31
编辑:romulo alves
2018-7-31
So, if I do xlsread('DOT.csv','H7:T20'), trying to extract only a little bit of numeric part, I get the message
Unable to read XLS file "path" File is not in recognized format.
If I do:
chunk_nRows = 2e4 ;
% - Open file.
fId = fopen( 'DOT.csv' ) ;
% - Read first line, convert to double, determine #columns.
line = fgetl( fId ) ;
row = sscanf( line, '%f,' )' ;
nCols = numel( row ) ;
% - Prealloc data, copy first row, init loop counter.
data = zeros( chunk_nRows, nCols ) ;
data(1,:) = row ;
rowCnt = 1 ;
% - Loop over rest of the file.
while ~feof( fId )
rowCnt = rowCnt + 1 ;
% - Realloc + a chunk if rowCnt larger than data array.
if rowCnt > size( data, 1 )
fprintf( 'Realloc ..\n' ) ;
data(size(data, 1)+chunk_nRows, nCols) = 0 ;
end
% - Read line, convert and store.
line = fgetl( fId ) ;
data(rowCnt,:) = sscanf( line, '%f,' )' ;
end
% - Truncate data to last row (truncate last chunk).
data = data(1:rowCnt,:) ;
% - Close file.
fclose( fId ) ;
I get the message
Subscript indices must either be real positive integers or logicals.
I checked and the code stops when
rowCnt = 20001
One thing that might influence this is the fact that some cells, instead of blanks or numbers, have an "e" or an "r". Could this be it?
romulo alves
2018-7-31
编辑:Walter Roberson
2018-7-31
the dataset is here in case that helps
That suggests that the file has inconsistent format. Perhaps it repeats headers after 20000 rows, for example.
I suggest reading it with readtable(). If you have a new enough MATLAB, use detectImportOptions first.
Unfortunately registration is required to download that dataset.
romulo alves
2018-7-31
编辑:romulo alves
2018-7-31
Thanks. So with readtable() and detectImportOptions I am able to load specific columns. This takes over 1 min per column, though. Since I have over 800 columns, this takes over 13 hours. Furthermore, I have other csv files I will need to load, so I am looking at a long time of loading data. Is there a quicker alternative? Thank you!
I'm not sure this would speed things up or not but have you tried importing the data as a cell array, then filtering out non-numerical elements, then filling all empties with NaN?
romulo alves
2018-7-31
编辑:Walter Roberson
2018-7-31
Thanks. I'll try. Meanwhile, if it is of any usefulness, I put the data on dropbox. Can be accessed via:
ok. so the readtable thing worked. Apparently the time is not proportional to the number of columns as I was assuming. the problem now is that even numbers appear as
'3000'
instead of
3000
Is it possible to go around this?
Weird, the quotes around the numbers are in the csv file itself.
I will see if I can work around it.
By default, the variables created by readtable() are double when the entire column is numeric, or cell arrays of character vectors when any element in a column is not numeric. This is where detectImportOptions() comes in handy. Have you tried that recommendation already?
Off the top of my head, I see two options. The first is to import cleaner data using detectImportOption(). The second is import the dirty data you've got and then clean it. UPDATE: If the quotes are part of the actual data as Walter discovered, the 2nd option may be the better of the two.
If your data is all in a cell array with mixed strings and doubles, you could use str2double() as in this example.
dirty = {'3000'; 3000; 'r'; 2000; '1000'};
clean = str2double(dirty);
isDbl = cellfun(@isnumeric, dirty);
clean(isDbl) = [dirty{isDbl}];
clean =
3000
3000
NaN
2000
1000
The 'e' and 'r' are probably the reason that most numbers are coded as if they are strings.
What do you want done with the 'e' and 'r' ? Is it okay to treat both of them the same way as empty cells, by changing all three of them into NaN ?
The file turns out to be UTF8 encoded, because it contains accented characters at various points. That leads to some problems.
I started working with reading in the entire file at one time to process as a single string (there can be a lot of advantages to working that way), but I encountered a Mathworks bug with native2unicode at the point of 1 gigabyte of decoded characters.
更多回答(0 个)
类别
在 帮助中心 和 File Exchange 中查找有关 Text Data Preparation 的更多信息
另请参阅
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!选择网站
选择网站以获取翻译的可用内容,以及查看当地活动和优惠。根据您的位置,我们建议您选择:。
您也可以从以下列表中选择网站:
如何获得最佳网站性能
选择中国网站(中文或英文)以获得最佳网站性能。其他 MathWorks 国家/地区网站并未针对您所在位置的访问进行优化。
美洲
- América Latina (Español)
- Canada (English)
- United States (English)
欧洲
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
