What is the cause of CUDA_ERROR_LAUNCH_FAILED?

Question

Brian Lee 2018-10-31

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed

Edited: cui,xingxing 2024-4-27

I was working on multi-GPU training of a neural network and occasionally received the error "CUDA_ERROR_LAUNCH_FAILED" (full error and code below). What might be the cause? I ran the code to completion successfully once, changed some hyperparameters, and then received this message. Reverting the hyperparameter changes did not fix the problem. Thanks in advance.

The code I ran:

%{
Test out transfer learning with pretrained model
See example 'Transfer Learning Using AlexNet'
%}
imds = imageDatastore('PetImages', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');
[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
net = alexnet;
inputSize = net.Layers(1).InputSize;
layersTransfer = net.Layers(1:end-3);
numClasses = numel(categories(imdsTrain.Labels));
layers = [
    layersTransfer
    fullyConnectedLayer(100,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    fullyConnectedLayer(100,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    softmaxLayer
    classificationLayer];
pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);
augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);
options = trainingOptions('sgdm', ...
    'MiniBatchSize',1000, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'ExecutionEnvironment','multi-gpu');
netTransfer = trainNetwork(augimdsTrain,layers,options);

And the full error text:

Error using trainNetwork (line 150)
An unexpected error occurred during CUDA execution. The CUDA error was:
CUDA_ERROR_LAUNCH_FAILED
Error in transferLearning (line 50)
netTransfer = trainNetwork(augimdsTrain,layers,options);
Caused by:
    Error using nnet.internal.cnn.DistributedDispatcher/computeInParallel
    (line 190)
    Error detected on worker 1.
        Error using nnet.internal.cnn.TrainerGPUStrategy/computeAccumImage
        (line 23)
        An unexpected error occurred during CUDA execution. The CUDA error
        was:
        CUDA_ERROR_LAUNCH_FAILED

1 Comment

Joss Knight 2018-11-5

This doesn't look great, sorry about that. Does the problem stop recurring if you reduce the MiniBatchSize?


Answer 1

cui,xingxing 2019-7-20

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed#answer_384042

Edited: cui,xingxing 2024-4-27

I am hitting the same error. Have you resolved it? @Joss Knight, @Brian Lee, thanks.

1 Comment

Joss Knight 2019-7-22

This error occurs in all sorts of circumstances, usually because your card does not have enough memory. Try posting a new question, provide reproduction code, and give us the output of gpuDevice.
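For anyone collecting the information requested here, a minimal sketch follows. It assumes Parallel Computing Toolbox is installed; the variable names are illustrative. Be aware that gpuDevice(idx) selects and resets that device, clearing any gpuArray data on it, so this is best run in a fresh session rather than in the middle of training.

```matlab
% List every GPU that MATLAB can see, with free and total memory.
% Caution: gpuDevice(idx) selects AND resets device idx, which clears
% any gpuArray data stored on it. Run this before training, not during.
numGpus = gpuDeviceCount;
for idx = 1:numGpus
    g = gpuDevice(idx);
    fprintf('GPU %d: %s, %.2f GB free of %.2f GB\n', ...
        idx, g.Name, g.AvailableMemory/2^30, g.TotalMemory/2^30);
end
```

Pasting the printed lines, or the full display of gpuDevice, alongside reproduction code makes it much easier to tell whether memory is the likely culprit.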


Answer 2

Brian Lee 2019-7-22

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/427234-what-is-the-cause-of-cuda_error_launch_failed#answer_384215

Sorry for the lack of follow-up, but the issue did seem to be a lack of memory. I haven't seen the error since switching to much smaller mini-batch sizes.
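As a rough illustration of why the mini-batch size matters, the sketch below estimates the memory taken by the input batch alone and then builds training options with a smaller batch. The value 64 is an assumed starting point, not a figure from this thread, and the 227-by-227-by-3 input size is the one AlexNet uses in MATLAB. Activations and gradients need far more memory than the input itself, so treat the estimate as a lower bound only.

```matlab
% Memory needed just to hold one mini-batch of single-precision inputs.
inputSize = [227 227 3];      % AlexNet input size
bytesPerValue = 4;            % single precision
for batchSize = [1000 64]
    inputBytes = batchSize * prod(inputSize) * bytesPerValue;
    fprintf('MiniBatchSize %4d: %.3f GB for inputs alone\n', ...
        batchSize, inputBytes/2^30);
end

% Training options from the question with a smaller mini-batch.
% 64 is an illustrative value; tune it for your own GPUs.
miniBatchSize = 64;
options = trainingOptions('sgdm', ...
    'MiniBatchSize',miniBatchSize, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'Verbose',false, ...
    'ExecutionEnvironment','multi-gpu');

% Add the 'ValidationData' and 'Plots' options back as in the question,
% then train with the variables defined in the original script:
% netTransfer = trainNetwork(augimdsTrain,layers,options);
```

If the error persists at a given size, halving the value again is a simple way to search for a batch that fits.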

1 Comment

Jacques 2019-12-16

Edited: Jacques 2019-12-16

I had the same problem.

You have two options. The first is to run on the CPU. The second is to stay on the GPU but work with smaller matrices (a computation-memory trade-off).
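The CPU route mentioned above can be sketched as follows, assuming the trainNetwork workflow from the original question. Only the 'ExecutionEnvironment' value changes; the mini-batch size of 64 is an illustrative assumption. Expect training to be considerably slower than on a GPU.

```matlab
% Train on the CPU instead of the GPU. This avoids GPU memory limits
% at the cost of speed. The batch size here is an illustrative value.
options = trainingOptions('sgdm', ...
    'MiniBatchSize',64, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'Verbose',false, ...
    'ExecutionEnvironment','cpu');

% Then train with the variables defined in the original script:
% netTransfer = trainNetwork(augimdsTrain,layers,options);
```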
