How to read rows of data in .csv file and turn them into arrays?

98 次查看(过去 30 天)
I have some data in a csv file that consists of some "delays" and for each delay there is a row of "wavelengths" and a row of corresponsing "intensities". It looks like this in excel:
I want the code to take all the delays and make them into one vector. And take the wavelengths (all the wavelength rows are identical) and make them into one vector. And take the intensities and make an array of intensities.
I tried textscan but that seems column oriented. I need matlab to read line by line (row by row), extract the data and put it into relevant array and then repeat procedure for every delay.
  3 个评论
the cyclist
the cyclist 2020-1-8
It's better to upload the file (or a small, representative sample of the file). The image helps, but we cannot test code on it.

请先登录,再进行评论。

采纳的回答

Stephen23
Stephen23 2020-1-9
编辑:Stephen23 2020-1-9
Notes to consider when importing a large file:
  • Importing as character and then doing any kind of text manipulation before converting to numeric will be very inefficient for a large file.
  • Importing the file in blocks is only efficient for a relatively small number of very large data blocks. This is not the case with your data: your file has many small blocks, forcing you to access the file repeatedly and expand the output array many times.
  • Regardless of what method you use to import it, the entire imported file data must fit into MATLAB memory. If this is not the case then you will need to use tall arrays or some other approach to processing your data.
  • Regardless of what method you use to import it, importing data will take some time.
  • If you want to import a large file efficiently then you need to minimize any handling of the imported file data, the optimal solution is to get all of the importing and conversion to numeric performed in one call to one reasonably low-level MATLAB command (i.e. avoid readtable , readmatrix, et al, which are just complex wrappers around textscan).
Method one: textscan. With suitable options to read that file format (note how ':' is defined as the EOL character, so each data block become one line of the imported matrix), you only need to call textscan once and it will import the entire file:
[fid,msg] = fopen('20200107LaAlO3.csv','rt');
assert(fid>=3,msg)
%% Detect the number of columns:
fgetl(fid); % ignore first line.
nmc = 1+nnz(fgetl(fid)==','); % number of columns.
fmt = ['%f%f',repmat('%f',1,2*nmc),'%*s'];
%% Import the file data:
opt = {'HeaderLines',1, 'EndOfLine',':', 'MultipleDelimsAsOne',true,...
'Whitespace',', \n\r\t', 'CollectOutput',true}; % COLLECTOUTPUT is optional
frewind(fid)
C = textscan(fid,fmt,opt{:});
fclose(fid);
%% Optional: split into separate arrays:
D = C{1}(:,1:2); % Delays
W = C{1}(:,3:2+nmc); % Wavelengths
I = C{1}(:,3+nmc:end); % Intensities
And checking the output:
>> D % Delays
D =
-20 -19.999
-19.985 -19.982
-19.97 -19.967
-19.955 -19.952
-19.94 -19.937
-19.925 -19.922
-19.91 -19.907
-19.895 -19.892
-19.88 -19.877
-19.865 -19.862
>> W(:,[1:5,end-5:end]) % Wavelengths (first and last few columns only)
ans =
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
189.25 189.71 190.18 190.65 191.12 1098.3 1098.7 1099.1 1099.5 1099.9 1100.4
>> I(:,[1:5,end-5:end]) % Intensities (first and last few columns only)
ans =
3802 2566 2801 2749 2918 2938 2931 2756 2760 2758 2756
3805 2570 2806 2748 2922 2944 2917 2754 2755 2755 2754
3806 2569 2800 2751 2925 2960 2918 2757 2760 2758 2756
3805 2564 2802 2747 2924 2946 2915 2754 2757 2753 2757
3804 2568 2801 2744 2928 2937 2914 2754 2759 2758 2760
3805 2568 2807 2748 2935 2939 2928 2752 2762 2756 2756
3802 2569 2802 2752 2933 2936 2910 2755 2758 2755 2757
3803 2568 2804 2747 2927 2953 2920 2754 2761 2752 2758
3803 2571 2802 2746 2923 2936 2927 2752 2759 2757 2762
3810 2567 2807 2749 2931 2945 2918 2757 2755 2758 2760
>>
It might be more efficient to not use the 'CollectOutput' option, and concatenate the vectors afterwards, I leave that up to you to experiment with (hint: use comma-separated lists or cell2mat).
Method two: sscanf. This is by far the fastest method that I found to import that file data, helped by the fact that sscanf is as low-level as it gets. Note that the imported character data is not manipulated, split, or altered in any way before being converted by sscanf:
str = fileread('20200107LaAlO3.csv');
nmb = nnz(str==':'); % number of blocks
nmc = 1+(nnz(str==',')/nmb-1)/2; % number of columns
fmt = repmat(',%f',1,nmc);
fmt = ['%*[\n\r d]elay:%f,%f',fmt(2:end),fmt(2:end)];
mat = sscanf(str,fmt,[2+2*nmc,nmb]).';
The output matrix is exactly identical to the output of Method one:
>> isequal(mat,C{1})
ans =
1
But it is faster:
Elapsed time is 0.599085 seconds. % Method one
Elapsed time is 0.111845 seconds. % Method two
Caveat: As per isakson noted at the end of their answer, importing this file data will lose information as the file contains more digits precision than can be stored in a Double floating point number (but as this is only at the 18th/19th digit this is very unlikely to be a problem).
  5 个评论
Stephen23
Stephen23 2020-1-9
编辑:Stephen23 2020-1-9
"how large is a large file?"
That is the crux of it: my own personal definition would be "a file that is big enough that naive code written without consideration of its size will fail when importing or processing", or something like that. I do not consider 0.2 GB large, nor is 2 GB particularly large these days (clearly the definition is entirely system dependent). It is much the same as any other MATLAB processing: when the data gets large enough that care must be taken to prevent data duplication, large intermediate arrays, etc., in order to prevent "out of memory" woes and the ilk, then that is a "large" file.
I have no idea if the OP's actual file meets this criteria.
Your tests probably mostly measure disc access times, which will not change significantly.
per isakson
per isakson 2020-1-9
编辑:per isakson 2020-1-10
"Your tests probably mostly measure disc access times" No, not accourding to the profiler. fileread() takes less than 20% of the total time.

请先登录,再进行评论。

更多回答(2 个)

per isakson
per isakson 2020-1-8
编辑:per isakson 2020-1-9
You write "very very long time" in a comment without mentioning the elapsed time nor the size of the text file.
Assumptions
  • Speed is important -"very very long time"
  • The text file fits in memory - should be two copies of the text file.
Approach
  • Read the entire file into a character array.
  • Split the text into a cell array of characters, with one block in each cell
  • Pre-allocate an output variables based on the size of the cell array
  • Loop over all blocks and parse one block at a time
Try
>> tic, sas = cssm( 'd:\m\cssm\20200107LaAlO3.csv' ); toc
Elapsed time is 3.898923 seconds.
>> tic,delay=cat(1,sas.delays);toc
Elapsed time is 0.001138 seconds.
>> tic,wavelength=cat(1,sas.wavelengths);toc
Elapsed time is 0.027321 seconds.
>> tic,intensity=cat(1,sas.intensities);toc
Elapsed time is 0.027349 seconds.
>> whos delay wavelength intensity
Name Size Bytes Class Attributes
delay 1940x2 31040 double
intensity 1940x2068 32095360 double
wavelength 1940x2068 32095360 double
where
function sas = cssm( ffs )
chr = fileread( ffs );
blocks = strsplit( chr, 'delay:' );
blocks(1) = [];
len = length( blocks );
sas(1,len) = struct( 'delays',[], 'wavelengths',[], 'intensities',[] );
for jj = 1 : len % loop over all blocks
cac = strsplit( blocks{jj}, newline );
sas(jj).delays = sscanf( cac{1},'%f,', [1,inf] );
sas(jj).wavelengths = sscanf( cac{2},'%f,', [1,inf] );
sas(jj).intensities = sscanf( cac{3},'%f,', [1,inf] );
end
end
and where d:\m\cssm\20200107LaAlO3.csv contains 194 copies of your uploaded file. It's size is 0.2GB (200649932/1e9).
>> d = dir('d:\m\cssm\20200107LaAlO3.csv')
struct with fields:
name: '20200107LaAlO3.csv'
folder: 'd:\m\cssm'
bytes: 200649932
...
Caveat: double isn't enough to exactly represent the numbers in the text-file. A rounding error is introduced.
1.892451199999999858e+02,1.897140509254856511e+02, ...
>> format long
>> wavelength(1,1:2)
ans =
1.0e+02 *
1.892451200000000 1.897140509254857
/ Dell OptiPlex 7050 (a new desktop), R2018b, Win10

dpb
dpb 2020-1-8
textscan is record oriented. Your file has sections; you have to read a section, rinse and repeat.
Something like
fmt1='delay: %f %f'; % delay record
fmt2=repmat('%f',1,N); % data records
fid=fopen('yourfile.csv');
delay=[];
inten=[];
rec=1;
while ~feof(fid)
delay=[delay; cell2mat(textscan(fid,fmt1,'collectoutput',1));]
if rec==1
wlen=cell2mat(textscan(fid,fmt2,'collectoutput',1));
rec=rec+1;
else
tmp=cell2mat(textscan(fid,fmt2,'collectoutput',1)); % just trash the duplicated records
end
inten=[inten; cell2mat(textscan(fid,fmt2,'collectoutput',1));]
end
fid=fclose(fid);
delay=delay(:);
wlen=wlen(:);
inten=inten(:);
NB: Air code untested since no data file and at least temporarily I have no access so couldn't do so myself anyways...somebody else bound to be along but should get started.
  2 个评论
dpb
dpb 2020-1-9
That's probably owing to the dynamic reallocation in the above...since it's an uncounted file, I made the assumption the file size would not be particularly large and took the easy route.
Preallocate for some large N of records and populate the array; only reallocating if that large number is exceeded.

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Text Data Preparation 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by