Parallel processing of data stream

Question

0 个投票

I am trying to process big data in parallel using parallel toolbox. The idea is quite simple. First, I would have a function running in main process (client) that reads the data batch by batch and put them into a queue (let's call it input queue). Next, I would create bunch of workers that load the data from this queue, process it and send the results to different, output queue. Finally, I would have a final worker that reads the data from this output queue serialize it and most probably save it (or whatewer).

The problem with this approach I am having is following. Ideally I would have a single input queue. However in order for the workers to be able to pull the queue it has to be created within such worker. Ok, let's make separate queues for individual workers. However now, the length of such queues is invisible to the main thread which I need for two reasons. First, I would send the data to the most empty queue to equalize load on all workers. Second, I would like to monitor the length of the queues to not overfill them (like having a fix-length buffer).

Can you reccommend me a pattern to deal with such problems. I believe it is somehow typical problem that many people actualy solve. Thanks.

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Follow Question

Answer 1

Radek 2024-4-10

0 个投票

Just for the record...

The solution is (obviously) to use spmd block. To communicate between workers one can use functions spmdSend() and spmdReceive(). The problem is, that theese functions were introduced in R2022b, while my version is R2021b. However Matlab very succesfully hides that theese functions actually replace former ones labSend() and labRecieve() (and others), that are therefore available in my version.

So I would probably recommend Matlab to state in documentation (there is even Version History section) if some function was introduced as a replacement for some other. Or, to allow users to switch to documentation version relevant for ther Matlab version (and I know you can look in to your local documentation).

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Answer 2

Thomas Falch 2025-5-15

在 MATLAB Online 中打开

1 个投票

Starting in R2025a you can use the new "any-destination" PollableDataQueue to achieve this more easily. Unlike the old PollableDataQueue the any-destination PollableDataQueue can be created on any worker/client, and be used to send to any client/worker. You can create one like this:

queue = parallel.pool.PollableDataQueue(Destination="any");

You can find an example of a workflow very similar to yours here: Perform Data Acquisition and Processing on Pool Workers - MATLAB & Simulink

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Answer 3

Strider 2024-3-28

在 MATLAB Online 中打开

0 个投票

I do not see a need for multiple ques or having the client load everything in serial. I would let the workers do the loading and saving. If you have to send back to client to save, as in this case, then the below should work for you.

I have never done anything with concern over the que length. I suppose you could have some kind of check in the mySavefunc, or just control the size by limiting the number of pool workers if out of memory is an issue.

saveQue = parallel.pool.DataQueue; 
files = dir('*.csv'); % assuming some pattern you know
fn = fullfile({files.folder},{files.name});
afterEach(@mySavefunc, saveQue);
parfor k = 1:numel(fn)
    
    % load it in a worker
    rawData = myLoadfun(fn{k});
    
    % do the work
    processedData = myProcessfun(rawData);
    
    % send back to client
    send(saveQue,processedData);
end

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

Radek 2024-4-2

编辑：Radek 2024-4-2

The reasons why I need (want?) to load the data in client rather than in workers are: First, the 'read_functions' use a lot of file related function calls (fopen, fseek, fread, ...) which (to my limitted knowledge) are not supported why doing parallel processing (actually, I have couple of 'read_functions' as an ordinary functions, they seem to work, but others as class methods and they complain that my fid is not actually fid; obviously everything works just fine in single thread.) The second (and more important) reason is that the data are serialized in the stream. To let workers to read the data in random order would have serious performance implications.

To be able to monitor the queue length is to save memory, obviously. The processing is much slower than loading the data and data won't fit within memory either. Therefore monitoring the queue length seems like elegant solution. I am not sure how limiting number of workes on any check within a worker thread can stop clinet overfilling the queue.

Also I think it will be beneficial to see the queue length (if I have to have individual queue for each worker) to balance the load as some batches may take much longer to process than others.

Finally, sending processed data back to the clinet works just fine. Distributing data from client to workers is where I am facing some issues.

请先登录，再进行评论。

Parallel processing of data stream

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

更多回答（2 个）

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

1 个评论
显示 -1更早的评论隐藏 -1更早的评论

类别

产品

版本

标签

Community Treasure Hunt

Parallel processing of data stream

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

采纳的回答

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

更多回答（2 个）

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

1 个评论 显示 -1更早的评论 隐藏 -1更早的评论

类别

产品

版本

标签

另请参阅

Community Treasure Hunt

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

1 个评论
显示 -1更早的评论隐藏 -1更早的评论