flaky GPU memory issues
We have a GTX 580 with 3 GB of RAM running in a Linux machine (Ubuntu Lucid with a backported Natty kernel) with R2011b, and I find myself fighting seemingly random crashes due to memory allocation on the GPU. The first thing I noticed is that overwriting a variable defined on the GPU does not always give me back all the RAM the old variable held minus the size of the new data, so I have to clear the variable instead of overwriting it. Is there some collection of best practices for avoiding wasted memory in ways like this?
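For illustration, a minimal sketch of checking this behaviour (the array sizes are chosen arbitrarily): compare the free memory gpuDevice reports after overwriting a GPU variable with what it reports after explicitly clearing it.

% Allocate, overwrite, then clear, checking FreeMemory at each step.
A = gpuArray(rand(4000));        % roughly 128 MB of doubles on the device
d = gpuDevice; fprintf('after allocation: %.0f MB free\n', d.FreeMemory/2^20);
A = gpuArray(rand(2000));        % overwrite: the old block may not be freed right away
d = gpuDevice; fprintf('after overwrite:  %.0f MB free\n', d.FreeMemory/2^20);
clear A                          % explicitly clearing the variable releases its memory
d = gpuDevice; fprintf('after clear:      %.0f MB free\n', d.FreeMemory/2^20);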
I also find that a calculation that has been running for hours, and that has successfully completed before, will sometimes crash with an "unexpected error", which seems to correlate with running close to maximum memory capacity. Since the program had completed before, I am left assuming that some other program interfered with memory allocation on the GPU and killed my task. Is there a way to prevent this from happening? Maybe running the server headless, or putting in another, smaller video card to drive the display?
Thanks
Accepted Answer
Edric Ellis
2012-2-9
In your first observation about overwriting variables on the GPU, I presume you're using the output of "gpuDevice" to check the amount of free memory on the GPU. You're quite right that overwriting an array may not necessarily cause the old memory to be freed immediately; however, it will be freed automatically if necessary to prevent running out of memory.
It's not clear what the 'unexpected error' might be; this is not something I've seen here at The MathWorks on our test machines. Do these errors show up in similar places each time? That is, does there seem to be a gpuArray operation that particularly causes this?
One final thing to note: like CPU memory, GPU memory can become fragmented over time, and it's possible that this might cause you to run out of GPU memory earlier than you might otherwise anticipate. However, I would not normally expect this to result in 'unexpected errors' - rather, I'd expect to see failed allocations.
More Answers (3)
Walter Roberson
2012-2-9
It is not safe to assume that some other program interfered with the memory allocation. Instead, you have to take into account that your program might have corrupted memory in a way that does not always cause a crash but does sometimes. For example, if the corrupted memory block does not happen to be needed again until a lot of memory is in use...
2 Comments
Walter Roberson
2012-2-9
If you _do_ have a memory corruption problem from your code (or from something in the MathWorks implementation), then releasing all memory _or_ using all memory would trigger the problem. However, releasing the GPU from operations could, depending on the implementation, have the effect of just throwing away all of the memory without bothering to put all the fragments together.
It would not be impossible for a memory allocator to offer an "ignore everything known about the current state of memory and just re-initialize back to the starting state" operation. I do not recall ever encountering a memory allocation library that offered that as a user call, however.
I have not examined the memory allocation system used for the GPU routines; I am reflecting back to my past experiences [redacted] years ago, using [redacted] on [redacted] (redactions to protect my delusions that I am not _that_ old...)
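For what it is worth, the GPU support in Parallel Computing Toolbox does expose something close to that "throw everything away" call: resetting the selected device discards all gpuArray data on it and reinitializes it. A short sketch follows; exact availability depends on your release.

% Reset the selected GPU: all existing gpuArray variables on it become
% invalid, and the device goes back to a clean state.
d = gpuDevice;    % currently selected device
reset(d);         % re-create any arrays you still need afterwards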
Ruby Fu
2012-2-10
Hi Rodrigo, Edric and Walter, it is great that I found this post just when I needed it! I have the exact same problem as Rodrigo. My experience has been this:
1. My program runs perfectly fine with a smaller-resolution problem, meaning smaller matrices and less memory allocation.
2. When I try to run the program at a higher resolution, it yells at me for not having enough memory.
3. So naturally I clear several intermediate matrices at each iteration once they are done being useful (see the sketch after this comment); they get updated at the next iteration anyway.
4. Now I test-run the new program (with memory cleared at each iteration) on the _small_ resolution problem, just to make sure I did not accidentally clear some useful variables.
5. I get the following error:
Error using parallel.gpu.GPUArray/fft
MATLAB encountered an unexpected error in evaluation on the GPU.
Coincidentally, this error occurred at an fft operation. However, it is also the first function call in the program.
Do you think having a bigger GPU will solve the problem? I have a GTX 580 as well, and it only comes with 1.5 GB. Would a 6 GB Tesla solve this problem, or is there something else we are missing here?
Edric, I have the latest CUDA driver, so that should not be an issue.
Thank you! Ruby
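A rough sketch of the clear-as-you-go pattern referred to in step 3 (the sizes and the update step below are made up for illustration, not Ruby's actual code): release each intermediate as soon as it is no longer needed, and watch how close to the limit every iteration runs.

field = gpuArray(rand(1024, 'single'));      % placeholder working array
for k = 1:10
    spec  = fft2(field);                     % intermediate gpuArray result
    field = real(ifft2(spec .* 0.5));        % dummy update that consumes it
    clear spec                               % free the intermediate before the next pass
    d = gpuDevice;                           % re-query so FreeMemory is current
    fprintf('iteration %d: %.0f MB free on GPU\n', k, d.FreeMemory/2^20);
end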
1 Comment
Edric Ellis
2012-2-13
The error message you are getting is due to CUFFT (NVIDIA's FFT library) running out of memory. Unfortunately, it sometimes reports this out-of-memory condition back to us as an "unexpected error", which we then report to you. This sort of unpredictable behaviour can sometimes be helped by the "feature" command I suggested to Rodrigo, but if you're that close to running out of memory, you may still have problems. A card with more memory would almost certainly help you.
Max W.K. Law
2013-5-9
I got the same error while trying to ifftn (complex to complex) a 256*256*516 complex-single 3D array. It is a 258 MB chunk of data, and it fails on my 4 GB GTX 680 card. Yes, if it is about running short of memory, that means a 4 GB card couldn't take a 258 MB data chunk, giving the error "MATLAB encountered an unexpected error in evaluation on the GPU."
There is some other data on the GPU that may cause fragmentation. The code that produces this error is just "temp=ifftn(temp);". Please, is there any way to enforce an in-place transform?
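As far as I know there is no user-visible switch to make the gpuArray ifftn run in place. One thing that might be worth trying (a sketch only, not verified against this exact failure): apply the inverse transform one dimension at a time, which gives the same result as ifftn but may keep the CUFFT plan and workspace for each call smaller than a single 3-D transform.

% Mathematically equivalent to temp = ifftn(temp); note that each call
% still allocates an output array, so it is not truly in-place either.
temp = ifft(temp, [], 1);
temp = ifft(temp, [], 2);
temp = ifft(temp, [], 3);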
Here is the output of the gpuDevice() command:
Name: 'GeForce GTX 680'
Index: 1
ComputeCapability: '3.0'
SupportsDouble: 1
DriverVersion: 5
ToolkitVersion: 5
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.147483647000000e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 4.294639616000000e+09
FreeMemory: 1.676050432000000e+09
MultiprocessorCount: 8
ClockRateKHz: 1163000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
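For reference, a back-of-the-envelope check of the numbers reported above (the 3x workspace factor is purely an assumption on my part, not a documented CUFFT requirement):

% 256*256*516 complex-single elements at 8 bytes each, compared with the
% FreeMemory value above; failing despite this apparent headroom is what
% makes fragmentation the likely suspect.
dataBytes = 256 * 256 * 516 * 8;        % ~270.5e6 bytes, i.e. ~258 MB
roughNeed = 3 * dataBytes;              % input + output + assumed workspace, ~0.8 GB
freeBytes = 1.676050432e9;              % FreeMemory reported by gpuDevice above
fprintf('need roughly %.0f MB, have %.0f MB free\n', roughNeed/2^20, freeBytes/2^20);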