Processing Data in a Tall Array

Question

0 个投票

I am working on a text file with 22 GB consisting of around 180 million rows and 17 columns. It's transportation data consisting of "Trip" and "Speed". I want to calculate the average speed of each trip.

clear;clc;
data='D:\Umich\eBlueBus_2019 Summer\UMTRI data\SQLDataExport.txt';
% data='D:\Umich\eBlueBus_2019 Summer\Simulation\Trip_11to12_0702.txt';
ds = tabularTextDatastore(data);
ds.ReadSize=200000;
ds.SelectedVariableNames={'Trip','GpsSpeed'};
ds.SelectedFormats={'%f','%f'};
t_array=tall(ds);
A=matlab.tall.transform(@reduce_fcn,t_array,'OutputsLike',{(table(1,1,'VariableNames',{'Trip','Spd'}))});

Where the @reduce_fcn is the function below:

function TT=reduce_fcn(t_array)
      [groups,Y]=findgroups(t_array.Trip);
      D=splitapply(@mean,t_array.GpsSpeed,groups);
      TT= table(Y,D,'VariableNames',{'Trip','Spd'});
end

After I ran and commanded B=gather(A), I got an error:

>> B=gather(A);

Evaluating tall expression using the Local MATLAB Session:

- Pass 1 of 1: 12% complete

Evaluation 12% complete

Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)

Mismatch between file and format character vector.

Trouble reading 'Numeric' field from file (row number 58461, field number 2) ==>

Trip,Time,GpsTime,GpsWeek,GpsHeading,GpsSpeed,Latitude,Longitude,Altitude,NumberOfSats,Differential,FixMode,Pdop,GpsBytes,UtcTime,UtcWeek\n

Learn more about errors encountered during GATHER.

Error in matlab.io.datastore.TabularDatastore/read (line 120)

[t, info] = readData(ds);

Error in tall/gather (line 50)

[varargout{:}, readFailureSummary] = iGather(varargin{:});

Caused by:

Reading the variable name 'Trip' using format '%f' from file: 'D:\Umich\eBlueBus_2019

Summer\UMTRI data\SQLDataExport.txt' starting at offset 2852126773.

I did check the row 58461 of the data,

the Trip and GpsSpeed are the second column (i.e. 10) and the seventh column (i.e. 1.4299999). They don't seem like a problem to me. I am confused why reading t_array.Trip by %f would be a problem indicated in the error.

Besides, when I ran the test data file which is only a portion of the original data consisting only Trip 11 and 12 (i.e. the comment part of the code), I did not get any problem and have the answer I'd expected: a 2 by 2 table where the first column is Trip 11 & 12 and the second is average speed of the Trip 11 & 12.

Appreciate any suggestions. Thank you

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Juan-Jie Sun 2019-7-4

在 MATLAB Online 中打开

I also tried to extract the first 800 Trips (about 1800 Trips in total).

clear;clc;
data='D:\Umich\eBlueBus_2019 Summer\UMTRI data\SQLDataExport.txt';
% data='D:\Umich\eBlueBus_2019 Summer\Simulation\Trip_11to12_0702.txt';
ds = tabularTextDatastore(data);
ds.ReadSize=5000000;
ds.SelectedVariableNames={'Trip','GpsSpeed'};
ds.SelectedFormats={'%f','%f'};
t_array=tall(ds);
A=matlab.tall.transform(@reduce_fcn,t_array,'OutputsLike',{(table(1,1,'VariableNames',{'Trip','Spd'}))});
B=gather(head(A,800));

And, I could successfully run the code and get a 800 by 2 answer in table. However, as I tried to extract the first 1000 Trips, i.e.

B=gather(head(A,1000));

I got a similar error again.

Error using matlab.io.datastore.TabularTextDatastore/readData (line 77)

Mismatch between file and format character vector.

Trouble reading 'Numeric' field from file (row number 258461, field number 2) ==>

Trip,Time,GpsTime,GpsWeek,GpsHeading,GpsSpeed,Latitude,Longitude,Altitude,NumberOfSats,Differential,FixMode,Pdop,GpsBytes,UtcTime,UtcWeek\n

Learn more about errors encountered during GATHER.

Error in matlab.io.datastore.TabularDatastore/read (line 120)

[t, info] = readData(ds);

Error in tall/gather (line 50)

[varargout{:}, readFailureSummary] = iGather(varargin{:});

Error in UMTRI_data (line 21)

B=gather(head(A,1000));

Caused by:

Reading the variable name 'Trip' using format '%f' from file: 'D:\Umich\eBlueBus_2019

Summer\UMTRI data\SQLDataExport.txt' starting at offset 2852126774.

请先登录，再进行评论。

请先登录，再回答此问题。

请先登录再关注

Answer 1

Juan-Jie Sun 2019-7-7

0 个投票

I finally found the problem in my data. It's actually that the row 22357841 has an issue. But it is row 58461 as indicated in the error message in the chunk it has loaded into the memory. (the followe picture read the header as row 1 so the row 22357841 becomes 22357841+1)

However, another interesting thing happens: although I specified the ReadSize=100,000, it may sometime change as I excute read(ds) command. This problem took me more time to find the format error. It really confused me because I have 16 GB RAM and can accomodate 100,000 by 1 array without problems. I recorded the ReadSize as the following picture. I may just post another question regarding this later on.

3 个评论
显示 1更早的评论隐藏 1更早的评论

Juan-Jie Sun 2019-7-8

I only have one file (22.1 GB .txt). My understanding is that the ReadSize is kind of the specified size for each chunk when executing read(ds). So, why would it have file boundaries in the middle of a file as you mentioned? Shouldn't it just load 100,000 rows each time until the end of the file?

Josh Meyer 2019-7-8

I thought you had multiple files so mentioned file boundaries, but really read(ds) can return fewer than ReadSize number of rows for a variety of reasons:

https://www.mathworks.com/matlabcentral/answers/422446-why-doesn-t-the-read-tabulartextdatastore-variable-function-return-the-correct-number-of-rows-set-b

请先登录，再进行评论。

Answer 2

Josh Meyer 2019-7-3

在 MATLAB Online 中打开

0 个投票

findgroups and splitapply work with tall arrays, so don't worry about using tall.transform or tall.reduce for this simple task.

You can tell which functions work with tall arrays by checking the bottom of the reference page in the doc for an Extended Capabilities section. If there are any limitations of the support it will be detailed there as well.

What I noticed is that

mean_spd = gather(splitapply(@mean,spd,trip));

should be

mean_spd = gather(splitapply(@mean,spd,G));

If this doesn't work (and I expect it will, with the issue being that "trip" isn't a well-defined grouping variable on its own), it makes me start to wonder if the data needs to be cleaned before the operation. You can use ismissing, rmmissing, fillmissing, etc... to do that if necessary.

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Juan-Jie Sun 2019-7-4

Hello Josh, thank you for help. I realized that I ned to update my 2018a release. After I installed 2019a, I made some progress and did get the answer I'd expected from running a test data file. However, I am confused by the format error when I run the original data file. I've updated my question. Appreciate your suggestion.

请先登录，再进行评论。

Processing Data in a Tall Array

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

采纳的回答

3 个评论
显示 1更早的评论隐藏 1更早的评论

更多回答（1 个）

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

类别

产品

版本

标签

Community Treasure Hunt

Processing Data in a Tall Array

1 个评论 显示 -1更早的评论 隐藏 -1更早的评论

采纳的回答

3 个评论 显示 1更早的评论 隐藏 1更早的评论

更多回答（1 个）

1 个评论 显示 -1更早的评论 隐藏 -1更早的评论

类别

产品

版本

标签

另请参阅

Community Treasure Hunt

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

3 个评论
显示 1更早的评论隐藏 1更早的评论

1 个评论
显示 -1更早的评论隐藏 -1更早的评论