I currently have an analysis that loops through a list of measurements, where each measurement is itself a vector (so it loops over the rows of a matrix).
To each measurement (a 1-D vector) I apply a custom function (too involved to share in full, but conceptually straightforward), which returns a result of the same size as the input. The function essentially loops through each element of the vector and does something, but the result for each element also depends on the previous results (there is a clear dependency ordering).
I often have tens to hundreds of millions of measurements per experiment, and it currently takes about 6 hours to run about 10 million, but we'd really like to get this down to under 15 minutes... so we're hoping there is a way to run it on a GPU. The key point is that each measurement is only 15-50 elements long, so I'd like to parallelise across the measurements rather than within the function that operates on each measurement (the former seems like the straightforward route, since the measurements are independent of each other).
As an example, an outline sketch of my code might look like this:
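(The function and variable names here are just placeholders to illustrate the structure - the real function is more involved, but the shape of the loops is the same.)

results = zeros(num_measurements, measurement_length);
for i = 1:num_measurements              % 1e7-1e8 of these, all independent of each other
    x = measurements(i, :);             % one measurement, only 15-50 elements long
    y = zeros(size(x));
    y(1) = first_step(x(1));            % placeholder for however the first element is handled
    for k = 2:numel(x)
        % each element's result depends on the previous element's result,
        % so this inner loop is inherently sequential - only the outer
        % loop over measurements can be parallelised
        y(k) = next_step(x(k), y(k-1));
    end
    results(i, :) = y;
end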
Update (edited again after vectorising - see the most recent update below) - I've separated out the specific part of my code and created a minimum working example, which is now attached to this post. The code runs without issue, but running it with gpuArrays is currently much slower:
(Outdated timing info - see the updated results below.)
For larger N, the gap only widens. After more thorough research while separating this example out from the rest of the code, I've realised that what I'm hoping for might not be possible after all... If anyone is able to help, it would be hugely appreciated.
Update 2 - I've reworked my code again to make it as clear as possible where the dependency lies. It should be more readable now, as the function calls are much more concise and minimal.
In words: the wc at each step depends on the position at that step, but the position at each step in turn depends (through a chain of intermediate quantities) on the wc at the previous step. In particular, this happens where interp1 is called with arg_position and arg_density to evaluate the wc at the newly-determined position.
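Schematically, the per-element loop looks something like the sketch below. The helper name compute_position and the exact variables are placeholders (the real chain of calculations is longer), but the interp1 call is where the dependency on the previous step enters:

position = zeros(1, num_steps);
wc = zeros(1, num_steps);
wc(1) = wc_initial;                 % placeholder for the starting value
for k = 2:num_steps
    % the new position depends (via several intermediate quantities)
    % on the wc result from the previous step...
    position(k) = compute_position(invertible_group_delays(k), wc(k-1));
    % ...and the new wc is the profile evaluated at that position
    wc(k) = interp1(arg_position, arg_density, position(k));
end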
Update 3 - OK, so it finally clicked what everyone was suggesting, and indeed it works extremely well now after vectorising across the measurements. The attached code (updated again) runs a lot faster, but under one condition:
The arg_position and arg_wc used in the interp1 line must be ordinary 1-D vectors, shared by all measurements. This is not a dealbreaker, since in practice they only change about 2000 times rather than 100 million times (indeed, every argument except the first, invertible_group_delays, changes only ~2000 times versus 100 million)...
Does anyone have a suggestion for how I can keep all the measurements running in parallel, instead of splitting them into 2000 chunks? The interp1 line seems to be the only remaining issue (line 45 replacing line 47 in the attached code).
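To make the chunking concrete, the fallback structure (splitting into 2000 chunks) would look roughly like this. It is simplified: invert_measurements, eq_of_measurement and the per-equilibrium cell arrays are placeholders for however the data is actually organised in the attached file.

results_per_eq = cell(num_equilibria, 1);
for e = 1:num_equilibria                         % ~2000 equilibria
    idx = (eq_of_measurement == e);              % ~50000 measurements share this equilibrium
    results_per_eq{e} = invert_measurements( ...
        invertible_group_delays(idx, :), ...     % changes per measurement
        arg_position_per_eq{e}, ...              % interp1 grid: fixed within a chunk
        arg_wc_per_eq{e});                       % interp1 values: fixed within a chunk
end

What I'd ideally like is a single vectorised call over all measurements at once, with the interp1 grid allowed to vary per measurement (or at least per group), rather than looping over the 2000 equilibria.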
Here are the results as they stand now:
gpu_run_test
real scenarios have N = ~1e8
with only ~2000 equilibriums ==> N = ~ 50000 per equilibrium
all arguments except the first will only change each equilibrium
N = 50000
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 0.567713 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 0.502490 seconds.
Maximum difference between expected and achieved result: 5.5202e-06
Done.
gpu_run_test
N = 10000000 ( = 1e7)
Now doing normal inversion.
Total time for normal inversion:
Elapsed time is 93.092429 seconds.
Maximum difference between expected and achieved normal result: 5.5202e-09 mm
Now doing GPU.
Total time for GPU inversions:
Elapsed time is 101.491855 seconds.
Maximum difference between expected and achieved gpu result: 5.5202e-09 mm
Done.
For a typical experiment, I'd like to get through N = 1e8 in under 5 minutes. Does anyone have advice on the best way to get there from where things stand? Extrapolating from the N = 1e7 timings above, it currently takes roughly 15 minutes. Perhaps I just have to buy more CPUs?
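If "more CPUs" really is the answer, I assume the chunked loop above would distribute over cores with parfor along these lines (again just a sketch with placeholder names, using the Parallel Computing Toolbox):

results_per_eq = cell(num_equilibria, 1);
parfor e = 1:num_equilibria
    idx = (eq_of_measurement == e);
    results_per_eq{e} = invert_measurements( ...
        invertible_group_delays(idx, :), ...
        arg_position_per_eq{e}, ...
        arg_wc_per_eq{e});
end
results = vertcat(results_per_eq{:});   % collect the chunks (restoring the original measurement order omitted)

but it's not obvious to me whether that scales well enough, hence the question.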