Vectorizing or otherwise accelerating nested loops with large multidimensional arrays

I have a dataset consisting of large files of neural recordings. Each file contains two 3-dimensional arrays. The first array, A, contains raw numerical data and is organized as follows: 1st dimension: N rows (corresponding to time-varying voltage; typically hundreds of thousands to millions of rows); 2nd dimension: 32 columns (corresponding to electrode channels); and 3rd dimension: 50 elements ("pages," as it were, corresponding to different frequency bands).
The second array, B, has the same dimensions as A. It is organized as follows: 1st dimension: numerically categorized data points (integer values in the range 1:20); 2nd dimension: 32 elements (corresponding to electrode channels); 3rd dimension: 50 elements (corresponding to frequency bands).
I need to index and operate over both the 2nd and the 3rd dimensions of the array A. Specifically, for each "page" of array B (i.e., each of the 50 pages, or frequencies, in the 3rd dimension), I need to do the following:
(i) locate the indices of each unique value of the first dimension (1-20),
(ii) apply these indices (separately) to each page (3rd dim) of array A, and
(iii) compute the column-wise means (along dimension 1) of array A corresponding to the indices generated from B.
In other words, I need to find the indices of the first numerical category in dimension 1 of array B (i.e., all number 1's), apply those indices to each page of array A's third dimension, and then take the columnwise mean of the index values per page. This operation will result in a 1x32x50 array. Subsequently, the operation is to be repeated for categories 2-20 using the first page of array B's third dimension. The process is then repeated in the same manner for all 50 pages in B. At the end of the operation, the result will have the dimensions of 20x32x50x50.
An inelegant solution to this problem (example code below) involves nested 20- and 50-element loops. But this approach is extremely slow given the large datasets. Is there a way to vectorize or otherwise accelerate these operations with parallelization or cluster operation? Thanks!
A = rawDat;          % for example, 600000x32x50
B = categoricalDat;  % for example, 600000x32x50

for idx = 1:50       % corresponds to dimension 3 of A
    for idx2 = 1:20  % corresponds to 'categories' in dimension 1 of array B

        % create a logical index array, C, marking the locations of a given
        % categorical value on page idx of array B:
        C = B(:,:,idx) == idx2;

        % expand C to dimensions consistent with A:
        C = repmat(C, [1,1,50]);

        % set unwanted entries = NaN (rather than 0) in a copy of A, so as
        % to preserve dimensions and enable appropriate averaging:
        Atmp = A;
        Atmp(~C) = NaN;

        % compute columnwise means
        meanAmp(idx2,:,:,idx) = nanmean(Atmp);
    end
end
  9 comments
Jan 2022-8-19
Edited: Jan 2022-8-19
@Jacob McPherson: I thought the accumulation of the NaNs happens intentionally. If not, setting values of A to NaN only to apply a nanmean() is not efficient:
X = rand(1e3, 1e3);
M = rand(size(X)) < 0.3; % A logical mask
tic;
for k = 1:1e2
    Y = X;
    Y(M) = NaN;
    Z1 = nanmean(Y);
end
toc
Elapsed time is 0.996901 seconds.
tic;
for k = 1:1e2
    Y = X;
    Y(M) = 0;
    Z2 = sum(Y, 1) ./ (size(M, 1) - sum(M, 1));
end
toc
Elapsed time is 0.559732 seconds.
isequal(Z1, Z2)
ans = logical
1
(size(M, 1)-sum(M, 1)) is slightly faster than sum(~M, 1) for large M.
nanmean is deprecated, but the replacement mean(X, 'omitnan') is slower.
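The relative speed of nanmean versus mean(..., 'omitnan') varies by release, so it is worth checking on your own installation. A minimal sketch, reusing the same kind of data as the example above (absolute timings will differ by machine and MATLAB version):

```matlab
% Sketch: timing nanmean against mean(..., 'omitnan') on identical data.
X = rand(1e3, 1e3);
M = rand(size(X)) < 0.3;   % logical mask of entries to exclude
Y = X;
Y(M) = NaN;
tic; for k = 1:1e2, Z1 = nanmean(Y);            end; toc
tic; for k = 1:1e2, Z2 = mean(Y, 1, 'omitnan'); end; toc
isequal(Z1, Z2)   % both forms should produce identical results
```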


Answer (1)

Bruno Luong 2022-8-18
% Generate dummy data
B = randi(4,10,3,5);
A = rand(10,3,5);

% Loop version (corrected form of the question's code), for reference
meanAmp = nan([max(B,[],'all'), size(A,2), size(A,3), size(A,3)]);
for idx = 1:size(A,3)
    for idx2 = 1:max(B,[],'all')
        C = B(:,:,idx) == idx2;
        C = repmat(C, [1,1,size(A,3)]);
        Atmp = A;
        Atmp(~C) = NaN;
        meanAmp(idx2,:,:,idx) = mean(Atmp, 'omitnan');
    end
end

% Fully vectorized version using accumarray
[m,n,p] = size(B);
I = repmat(reshape(B,[m,n,1,p]), [1,1,p,1]);    % category (value from B)
J = repmat(1:n, [m,1,p,p]);                     % channel
K = repmat(reshape(1:p,[1,1,p,1]), [m,n,1,p]);  % page of A
L = repmat(reshape(1:p,[1,1,1,p]), [m,n,p,1]);  % page of B
AA = repmat(A, [1 1 1 p]);
IDX = [I(:), J(:), K(:), L(:)];
S = accumarray(IDX, AA(:));   % per-group sums
N = accumarray(IDX, 1);       % per-group counts
mA = S ./ N;                  % per-group means

% Check correctness against the loop version
b = isfinite(meanAmp);
norm(meanAmp(b) - mA(b), 'Inf')
ans = 0
  2 comments
Jacob McPherson 2022-8-18
Hi again, Bruno. I like this general idea, but the challenge is that it is not tractable with the size variables I need to process. The repmat commands attempt to generate arrays on the order of 1E6x32x50x50, or ~350GB, which results in an error.
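This kind of blow-up can be estimated in advance. A rough back-of-envelope sketch for the sizes quoted in the question (the exact total depends on the element types chosen for the index arrays, which is why reported figures vary):

```matlab
% Sketch: footprint of one m-by-n-by-p-by-p array in the fully
% vectorized approach, assuming double precision (8 bytes/element).
m = 1e6; n = 32; p = 50;
elems = m*n*p*p;                 % elements per 4-D array
gib = elems * 8 / 2^30;          % GiB for one double array
fprintf('%.0f GiB per double array (I, J, K, L, and AA each)\n', gib)
% Casting the index arrays to uint16 (2 bytes) cuts each of them to a
% quarter of that, but the total still far exceeds typical RAM.
```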
Bruno Luong 2022-8-18
Edited: Bruno Luong 2022-8-18
You can always change the method slightly and loop over the 4th dimension.
[m,n,p] = size(B);
q = max(B,[],'all');
mA = zeros([q,n,p,p]);
[J,K] = ndgrid(uint16(1:n), uint16(1:p));
JK = repelem([J(:) K(:)], m, 1);   % channel/page subscripts, reused every iteration
for l = 1:p
    I = repmat(uint16(B(:,:,l)), [1,1,p]);
    IDX = [I(:), JK];
    S = accumarray(IDX, A(:));     % per-group sums
    N = accumarray(IDX, 1);        % per-group counts
    mA(:,:,:,l) = S ./ N;
end
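If even the per-page index arrays are too large, the same idea extends to processing the rows in chunks, since accumarray sums and counts are additive across chunks. A hypothetical sketch (chunkRows is an assumed tuning parameter, not part of the answer above; the explicit size argument to accumarray keeps chunk results conformable):

```matlab
% Sketch: row-chunked variant, exploiting that accumarray results are
% additive across chunks. chunkRows is a hypothetical tuning knob.
[m,n,p] = size(B);
q = max(B,[],'all');
mA = zeros([q,n,p,p]);
chunkRows = 1e5;
[J,K] = ndgrid(1:n, 1:p);
for l = 1:p
    S = zeros([q,n,p]);
    N = zeros([q,n,p]);
    for r0 = 1:chunkRows:m
        r  = r0:min(r0+chunkRows-1, m);   % rows in this chunk
        c  = numel(r);
        JK = repelem([J(:) K(:)], c, 1);  % channel/page subscripts
        I  = repmat(B(r,:,l), [1,1,p]);   % categories for this chunk
        Ar = A(r,:,:);
        IDX = [double(I(:)), JK];
        S = S + accumarray(IDX, Ar(:), [q,n,p]);  % accumulate sums
        N = N + accumarray(IDX, 1,     [q,n,p]);  % accumulate counts
    end
    mA(:,:,:,l) = S ./ N;
end
```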

