Analysis with NVIDIA Profiler

Not Enough Parallelism

Condition

If the kernel does very little work, the overhead of memcpy operations and kernel launches can offset any performance gains. Consider working on a larger sample set (and thus increasing the loop size). To detect this condition, look at the nvvp report.

Action

Do more work in the loop, or increase the sample set size.

Too Many Local per-Thread Registers

Condition

If the loop body uses too many local or temporary variables, it puts high pressure on the per-thread register file. You can detect this condition by running in GPU safe-build mode, or by checking the nvvp report.

Action

Consider using different block sizes in the coder.gpu.kernel pragma.
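
As a minimal sketch of this action, the following entry-point function sets the launch dimensions explicitly through coder.gpu.kernel(B,T), where B gives the number of blocks in the grid and T gives the number of threads per block. The function name scaleAndAdd, the fixed input size of 1-by-4096, and the specific block sizes are assumptions chosen for illustration. They are not taken from this page, and the best values depend on the target device.

```matlab
function out = scaleAndAdd(a, b) %#codegen
% scaleAndAdd is a hypothetical example function, not part of GPU Coder.
% Inputs a and b are assumed to be 1-by-4096 double vectors.

out = zeros(1, 4096);

% Request 16 blocks of 256 threads (16 * 256 = 4096 threads in total).
% To try a smaller block size, keep the total thread count the same,
% for example: coder.gpu.kernel([32,1,1],[128,1,1]);
coder.gpu.kernel([16,1,1], [256,1,1]);
for i = 1:4096
    out(i) = 2 * a(i) + b(i);
end
end
```

After changing the block size, regenerate the code and profile again to see whether register pressure has improved. When the function runs directly in MATLAB, the pragma has no effect, so the numerical result can be checked before generating code.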
