I have the following function in Matlab:
``` lang-matlab
function [RawDataKK]=DataCompressKKTest(Data,RXangle,s)
tSize=size(Data,1);
xSize=size(Data,2);
TXSize=size(Data,3);
RXSize=numel(RXangle);
RawDataKK=zeros(tSize,TXSize,RXSize);
DataTemp=zeros(tSize,xSize);
slope = s*sin(RXangle(:))/2;
nShift = round(slope.*((0:xSize-1) - (xSize-1)*(1-sign(RXangle(:)))/2));
for nR=1:RXSize
    for nT=1:TXSize
        for nx=1:xSize
            DataTemp(:,nx)=circshift(Data(:,nx,nT),nShift(nR,nx));
        end
        RawDataKK(:,nT,nR)=sum(DataTemp,2);
    end
end
end
```
I tried implementing it as a C++ MEX function using the Eigen library and the Matlab Data API for C++. The basic logic of the C++ implementation uses a reindexing approach to implement circshift: each sample is scattered to a precomputed 0-based linear destination index (see extractData below) instead of being shifted column by column.
``` lang-cpp
#include <iostream>
#include "mex.hpp"
#include "mexAdapter.hpp"
#include <Eigen/Dense>
#include <MatlabDataArray.hpp>
#include <cmath>
#include <math.h>
#include <omp.h>
class MexFunction : public matlab::mex::Function {
public:
// Factory to create MATLAB data arrays
matlab::data::ArrayFactory factory;
void operator()(matlab::mex::ArgumentList outputs, matlab::mex::ArgumentList inputs) {
checkArguments(outputs, inputs);
// Initialize input parameters
int tSize = inputs[0][0];
int xSize = inputs[0][1];
int TXSize = inputs[0][2];
int RXSize = inputs[0][3];
int RFlen = inputs[0][4];
auto ptr = getDataPtr<std::complex<double>>(inputs[1]);
Eigen::Map< const Eigen::MatrixXcd > Data( ptr, RFlen, xSize );
double s = inputs[2][0];
auto ptr2 = getDataPtr<double>(inputs[3]);
Eigen::Map< const Eigen::VectorXd > RXangle( ptr2, RXSize );
auto ptr3 = getDataPtr<int32_t>(inputs[4]);
Eigen::Map< const Eigen::MatrixXi > indices( ptr3, tSize, xSize*RXSize );
outputs[0] = factory.createArray<std::complex<double>>({static_cast<size_t>(tSize),static_cast<size_t>(TXSize*RXSize)});
auto ptrRecon = getOutDataPtr<std::complex<double>>(outputs[0]);
Eigen::Map<Eigen::MatrixXcd> RFDataKK(ptrRecon,tSize,TXSize*RXSize);
// Get num threads
int numThreads = omp_get_max_threads();
int nProc = omp_get_num_procs();
omp_set_num_threads(nProc*2);
#pragma omp parallel for
for (int nR = 0; nR < RXSize; nR++) {
Eigen::MatrixXcd shiftedData = Eigen::MatrixXcd::Zero(tSize*TXSize,xSize);
for (int nT = 0; nT < TXSize; nT++){
extractData(shiftedData(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
Data(Eigen::seq(nT*tSize,(nT+1)*tSize-1),Eigen::all),
indices(Eigen::all,Eigen::seq(nR*xSize, (nR+1)*xSize-1)), xSize, tSize);
}
RFDataKK(Eigen::all,Eigen::seq(nR*TXSize,(nR+1)*TXSize-1) ) = shiftedData.rowwise().sum().reshaped(tSize,TXSize);
}
}
void extractData(Eigen::Ref<Eigen::MatrixXcd> shiftedData, const Eigen::Ref<const Eigen::MatrixXcd>& Data,
const Eigen::Ref<const Eigen::MatrixXi>& indices, const long int& xSize, const long int& tSize) {
for (int i = 0; i < tSize*xSize; i++) {
// recover the source (row, column) of this sample from the linear loop index
int nt = i % tSize;
int nx = i / tSize;
// the precomputed 0-based linear index encodes the destination after the circular shift
int row = indices(nt,nx) % tSize;
int col = indices(nt,nx) / tSize;
shiftedData(row,col) = Data(nt,nx);
}
}
void checkArguments(matlab::mex::ArgumentList outputs, matlab::mex::ArgumentList inputs) {
std::shared_ptr<matlab::engine::MATLABEngine> matlabPtr = getEngine();
matlab::data::ArrayFactory factory;
// TODO: Implement remaining checks
}
template <typename T>
const T* getDataPtr(matlab::data::Array arr) {
const matlab::data::TypedArray<T> arr_t = arr;
matlab::data::TypedIterator<const T> it(arr_t.begin());
return it.operator->();
}
template <typename T>
T* getOutDataPtr(matlab::data::Array& arr) {
auto range = matlab::data::getWritableElements<T>(arr);
return range.begin().operator->();
}
};
```
I precompute the values of indices using the following Matlab function:
``` lang-matlab
function linIndices=DataCompressKKIndices(RXangle,s,tSize,xSize,TXSize,RXSize)
slope = s*sin(RXangle(:))/2;
nShift = round(slope.*((0:xSize-1) - (xSize-1)*(1-sign(RXangle(:)))/2));
% 0-based destination row of each sample after the circular shift
indices = mod((0:tSize-1).' + reshape(nShift.',[],1).',tSize);
% add per-column offsets to turn the rows into 0-based linear indices
linIndices = indices + repmat((0:xSize-1)*tSize,1,RXSize);
end
```
When I perform a speed test in Matlab after compiling with the parameters below (MinGW GCC 8.3), I find that I have not gained a meaningful speedup for the array sizes I am working with, even though I would expect OpenMP parallelization to give at least an order-of-magnitude speedup (I have 24 threads on my machine). This raises two questions:
1. Why is a reindexing approach slower than doing circshift? A reindexing approach in Matlab (not shown here) is almost 2x slower than using nested for loops and circshift.
2. What exactly could circshift be doing under the hood that is so efficient? Is there some funky pointer arithmetic that could accomplish the functionality of circshift? (My guess at what that might look like is sketched just after this list.)
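To make question 2 concrete, the kind of pointer arithmetic I am imagining is shifting each column with two contiguous block copies instead of the per-element scatter in extractData. This is only a guess at what circshift might be doing internally, not its actual implementation; circshiftColumn below is a hypothetical helper of mine:

``` lang-cpp
#include <complex>
#include <cstddef>
#include <cstring>

// Sketch/guess only: circularly shift one column of length n (n > 0) down by k
// using two contiguous block copies instead of a per-element gather/scatter.
void circshiftColumn(const std::complex<double>* src,
                     std::complex<double>* dst,
                     std::ptrdiff_t n, std::ptrdiff_t k) {
    k = ((k % n) + n) % n;  // normalize the shift into [0, n)
    // src[0 .. n-k-1] lands at dst[k .. n-1]
    std::memcpy(dst + k, src, (n - k) * sizeof(*src));
    // src[n-k .. n-1] wraps around to dst[0 .. k-1]
    std::memcpy(dst, src + (n - k), k * sizeof(*src));
}
```

If circshift does something along these lines, every column is read and written as two sequential runs, whereas my index-based scatter touches each element individually; I would like to understand whether that is what makes the difference.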
Compilation and testing code:
``` lang-matlab
mingwFlags = {'CXXFLAGS="$CXXFLAGS -march=native -std=c++14 -fno-math-errno -ffast-math -fopenmp -DNDEBUG -w -Wno-error"',...
    'LDFLAGS="$LDFLAGS -fopenmp"','CXXOPTIMFLAGS="-O3"'};
tic; mex(ipath,mingwFlags{1},mingwFlags{2},mingwFlags{3},'CompressKK.cpp'); toc
tic; mex(ipath,mingwFlags{1},mingwFlags{2},mingwFlags{3},'CompressKKIndices.cpp'); toc
...
nTrials = 100;
T = zeros(nTrials,4);
for i = 1:nTrials
    % C++ MEX version (index precomputation + compiled reindexing)
    tic; indices = DataCompressKKIndices(p.RXangle,s,double(p.szRFframe+1),double(p.numEl),p.na,p.nRX);
    RawDataKK3 = CompressKKIndices(param,cRF(1:param(5),p.ConnMap),s,p.RXangle,int32(indices));
    RawDataKK3 = reshape(RawDataKK3,[p.szRFframe+1,p.na,p.nRX]); T(i,1) = toc;
    % original Matlab version
    tic; RawDataKK=DataCompressKKTest(Data,p.RXangle,s); T(i,2) = toc;
end
chkT = mean(T,1)
```
Timing results (mean of 100 trials, seconds):
chkT =
    0.4867    0.8299
The first entry is the C++ MEX version and the second is the original Matlab version, so the compiled reindexing version is only about 1.7x faster.