Main Content

Further Notes on Communicating Jobs

Number of Tasks in a Communicating Job

Although you create only one task for a communicating job, the system copies this task for each worker that runs the job. For example, if a communicating job runs on four workers, the Tasks property of the job contains four task objects. The first task in the job's Tasks property corresponds to the task run by the worker whose spmdIndex is 1, and so on, so that the ID property for the task object and spmdIndex for the worker that ran that task have the same value. Therefore, the sequence of results returned by the fetchOutputs function corresponds to the value of spmdIndex and to the order of tasks in the job's Tasks property.

Avoid Deadlock and Other Dependency Errors

Because code running in one worker for a communicating job can block execution until some corresponding code executes on another worker, the potential for deadlock exists in communicating jobs. This is most likely to occur when transferring data between workers or when making code dependent upon the spmdIndex in an if statement. Some examples illustrate common pitfalls.

Suppose you have a codistributed array D, and you want to use the gather function to assemble the entire array in the workspace of a single worker.

if spmdIndex == 1
    assembled = gather(D);

The reason this fails is because the gather function requires communication between all the workers across which the array is distributed. When the if statement limits execution to a single worker, the other workers required for execution of the function are not executing the statement. As an alternative, you can use gather itself to collect the data into the workspace of a single worker: assembled = gather(D, 1).

In another example, suppose you want to transfer data from every worker to the next worker on the right (defined as the next higher spmdIndex). First you define for each worker what the workers on the left and right are.

from_worker_left = mod(spmdIndex - 2, spmdSize) + 1;
to_worker_right  = mod(spmdIndex, spmdSize) + 1;

Then try to pass data around the ring.

spmdSend (outdata, to_worker_right);
indata = spmdReceive(from_worker_left);

The reason this code might fail is because, depending on the size of the data being transferred, the spmdSend function can block execution in a worker until the corresponding receiving worker executes its spmdReceive function. In this case, all the workers are attempting to send at the same time, and none are attempting to receive while spmdSend has them blocked. In other words, none of the workers get to their spmdReceive statements because they are all blocked at the spmdSend statement. To avoid this particular problem, you can use the spmdSendReceive function.