I am trying to diagnose a parallel computing bottleneck that I've encountered on two different computers. For the computation, each worker within the ‘parfor’ loop is assigned one sparse array out of 101 total (each array’s ‘full’ size is approximately 50,000x250). Each worker: 1) converts the sparse array into a ‘full’ array, 2) convolves the array with a small Gaussian kernel (which is also passed to the worker), and 3) performs ICA using the ‘fast_ica’ function; the output is the independent component weight matrix.

Recently, I started working on a new computer with a substantially higher core count, but the code seems to hit a bottleneck and I am not seeing any further performance increase. The old system has an Intel i7-8700K CPU (6 physical/12 logical cores); the new system has an AMD Ryzen 9 5950X (16 physical/32 logical cores). Both systems have 64 GB RAM, run Windows 10, and have hyper-threading/SMT enabled (the old system is running MATLAB R2018b and the new one R2020b). To compare parallel performance across systems, I ran the same code on both computers using different numbers of workers:
The top row shows the results for the older Intel system and the bottom row for the newer AMD system. In the left column, the left axis shows the total execution time of each parfor run (measured with tic/toc before and after the loop), and the right axis shows the difference in execution time between N and N+1 workers (i.e. points near 0 mean no improvement from N+1 compared to N workers). Vertical dotted lines mark the number of physical cores. The right column shows the system resource utilization during each of these runs.

What I noticed is that in both cases using more than about 8 or 9 workers does not improve performance. This is despite the fact that a) more RAM and CPU resources are being used, and b) 9 workers represent 150% of the physical cores (75% of the logical cores) in the Intel system but only 56% of the physical cores (28% of the logical cores) in the AMD system. Diminishing returns from hyper-threading/SMT past the physical core count can’t be the explanation, since the plateau occurs well below the physical core count on the AMD system (16) and well above it on the Intel system (6).

To me the most interesting ‘clue’ is that the worker count at which no further improvement occurs is about 8 or 9 on both systems. However, it's not impossible that this is a coincidence, and I don’t quite know how to interpret it. So my questions are:
- Given the difference in CPU memory architecture, are there known differences in parallel computing performance between Intel and AMD Ryzen CPUs?
- Given that the few physical limitations that I looked at (RAM, CPU utilization, physical core count) do not seem to be the problem, what else is likely to be bottlenecking me here?
- How can I further diagnose the source of the bottleneck (in terms of potential answers to question 2, or more generally)?
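
For reference, the structure of the benchmark described above can be sketched roughly as follows. This is not the exact code: the variable names, the shrunken placeholder data, the kernel width, and `icaStandIn` (a stand-in so the sketch runs without the ‘fast_ica’ function) are all assumptions made for illustration. It also records per-stage timings inside the loop, which is one possible way to see which of the three steps stops scaling.

```matlab
% Sketch of the benchmark; sizes are shrunk so it runs quickly.
% Assumptions (not the original code): variable names, placeholder data,
% kernel width, and icaStandIn, which only stands in for 'fast_ica'.
nArrays = 8;                          % 101 in the real run
nRows   = 2000;  nCols = 50;          % roughly 50,000 x 250 in the real run
data    = cell(nArrays, 1);
for k = 1:nArrays
    data{k} = sprand(nRows, nCols, 0.05);   % placeholder sparse input
end

g      = exp(-(-3:3).^2 / 2);         % small 1-D Gaussian profile
kernel = (g' * g) / sum(g)^2;         % separable 2-D kernel, sums to 1

icaStandIn = @(X) svd(X, 'econ');     % stand-in only; swap in the real fast_ica call

workerCounts = [1 2];                 % the real sweep goes up to the logical core count
elapsed      = zeros(size(workerCounts));

for w = 1:numel(workerCounts)
    delete(gcp('nocreate'));          % start each run from a fresh pool
    parpool(workerCounts(w));
    out        = cell(nArrays, 1);
    stageTimes = zeros(nArrays, 3);   % [full, conv2, ICA] seconds per array
    tStart = tic;
    parfor k = 1:nArrays
        t  = zeros(1, 3);
        t0 = tic; X = full(data{k});             t(1) = toc(t0);
        t0 = tic; X = conv2(X, kernel, 'same');  t(2) = toc(t0);
        t0 = tic; out{k} = icaStandIn(X);        t(3) = toc(t0);
        stageTimes(k, :) = t;
    end
    elapsed(w) = toc(tStart);
    fprintf('%2d workers: %.2f s total, mean stage times [%s] s\n', ...
        workerCounts(w), elapsed(w), num2str(mean(stageTimes, 1), '%.3f '));
end
delete(gcp('nocreate'));

gain = -diff(elapsed);                % time saved going from N to N+1 workers
```

If the per-array time of one stage grows as workers are added while the others stay flat, that stage is the likely place where the workers are competing for a shared resource.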