Building tall table from tall arrays generates error

clear
dataFile = 'data.csv';
ds = tabularTextDatastore(dataFile, FileExtensions='.csv');
ds.ReadVariableNames = true;
ds.Delimiter = ',';
ds.SelectedVariableNames = ["hash", "count"];
ds.SelectedFormats = {'%s', '%f'};
data = tall(ds);
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to the parallel pool (number of workers: 2).
[g, THash] = findgroups(data.hash);
TCount = splitapply(@(x) {x}, data.count, g);
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 1: Completed in 1.9 sec
Evaluation completed in 2.8 sec
count = gather(TCount);
Evaluating tall expression using the Parallel Pool 'Processes':
- Pass 1 of 3: Completed in 0.54 sec
- Pass 2 of 3: Completed in 0.46 sec
- Pass 3 of 3: Completed in 0.58 sec
Evaluation completed in 2.3 sec
T1 = table(hash, count);
%% This is the intended code but doesn't work
TT = table(THash,TCount);
Error using tall/table
Incompatible non-scalar tall array arguments. Each of the tall arrays must be the same size in the first dimension, must be derived from a single tall array, and must not have been indexed
differently in the first dimension (indexing operations include functions such as VERTCAT, SPLITAPPLY, SORT, CELL2MAT, SYNCHRONIZE, RETIME and so on).
write(fullfile(pwd,'data'),TT,FileType="parquet");

Answers (1)

Oguz Kaan Hancioglu
Your code didn't work because gather(TCount) returns a cell array with one cell per group, so you are trying to store a whole double array in a single cell. Instead, you can compute the length of each array in the cell array. I hope this solves your problem.
%% This works but cannot use it because actual data file is far larger than memory
hash = gather(THash);
count = gather(TCount);
cellsz = cellfun(@size, count, 'UniformOutput', false);        % size vector of the array in each cell
newCount = cellfun(@(x) x(1), cellsz, 'UniformOutput', false); % number of rows of each array
T1 = table(hash, newCount);
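To make the grouping step concrete, here is a minimal in-memory sketch on toy data. The sample values and the names uniqueHash, groupedCount, and numPerGroup are made up for illustration and are not from the original post. It only shows what the grouped cell array and the per-group lengths look like for ordinary (non-tall) arrays; it does not address the tall-array error.

```matlab
% Toy data standing in for the "hash" and "count" columns (hypothetical values)
hash  = ["a"; "b"; "a"; "c"; "a"];
count = [1; 2; 3; 4; 5];

% Assign a group number to each row and list the unique hashes
[g, uniqueHash] = findgroups(hash);

% Collect the counts of each group into one cell per group
groupedCount = splitapply(@(x) {x}, count, g);

% Number of elements in each group's array
numPerGroup = cellfun(@numel, groupedCount);

% All three variables have one row per group, so they fit in one table
T = table(uniqueHash, groupedCount, numPerGroup);
```

Each cell of groupedCount can hold an array of a different length, which is why the in-memory table accepts it as a cell column.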
1 Comment
Harry Cho 2023-3-15
Thank you for the reply. Unfortunately, I need to keep the cell array, in which each cell holds a double array of a different length. My question is why this works for the in-memory table T1 but not for the tall table TT.

Release

R2022b
