So, I prepared some small self-contained examples of what I'm talking about. You can find them in the zip archive I'm attaching.
The function "test_CPU.m" is a bare-bone version of the function I use in my code. The two figures output at every "snapshot" iterations are checks on conserved quantities (they should be zero, in principle).
The function "test_GPU.m" is, in my intention, the GPU version of the previous function. As you can see i create a gpuArray for the state of the system (Gz0) and for the first auxiliary matrix (GeD). The remaining auxiliary matrices in the Runge-Kutta loop (K, K1, K2, K3, K4) should be gpuArrays themselves.
I did not create gpuArrays for the (real or complex) scalars because I read somewhere that it is not necessary (can someone confirm?).
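What I mean is that I leave plain scalars on the host and rely on them mixing with gpuArrays, e.g. (a trivial check, not from my code):

A = gpuArray.rand(4);      % gpuArray
c = 2 + 3i;                % ordinary host-side complex scalar
B = c * A;                 % B is a gpuArray; no explicit gpuArray(c) needed
disp(class(B))             % 'gpuArray'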
The function test_both.m runs both of the above functions and compares the results, just to make sure that they behave identically. They do.
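The comparison itself is nothing fancy; roughly this (a sketch, the actual signatures of test_CPU/test_GPU may differ):

zCPU = test_CPU();                 % result on the host
zGPU = gather(test_GPU());         % bring the GPU result back to the host
fprintf('max abs difference: %g\n', max(abs(zCPU(:) - zGPU(:))));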
The functions test_GPU_driver.m and test_CPU_driver.m are simply drivers that I use to run the codes in the background and to measure execution time. For L=128, as in the current settings, the GPU function is about 2/3 faster than the CPU function.
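The timing in the drivers is essentially this (a sketch; the wait call is there so that asynchronous GPU work has finished before toc):

tic;
test_CPU();
tCPU = toc;

tic;
test_GPU();
wait(gpuDevice);                   % make sure all queued GPU work is done
tGPU = toc;

fprintf('CPU: %.2f s, GPU: %.2f s\n', tCPU, tGPU);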
For L=64 the GPU version is slower, but I guess that for larger sizes it becomes more and more advantageous (assuming there is enough memory).
But memory is exactly the point here. I would expect the two versions to use roughly the same amount of memory. Instead, here is what I get from top:
         VIRT      RES      SHR
CPU   2403588   176768    74492
GPU   66,961g   459856   127368
Can someone comment on this? Is this expected?
As I mentioned above, when launched from the queuing system the memory requirements of the GPU version are 20 times larger than those of the CPU version. This refers to my full function, not to the minimal examples above; however, the differences between the CPU and GPU versions of my full function are exactly the same as in the examples.
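For reference, the device memory can also be queried from within MATLAB (a minimal check, assuming the Parallel Computing Toolbox), in case that helps interpret the numbers from top:

g = gpuDevice;                              % handle to the current GPU
fprintf('GPU memory: %.2f GB total, %.2f GB free\n', ...
        g.TotalMemory/2^30, g.AvailableMemory/2^30);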
Thanks a lot for any insight.