To get outputs of a consistent size to feed into the Convolutional Neural Network (CNN), you can apply "zero-padding": zeros are appended to the shorter signals so that all signals have the same length. The number of zeros to add to each signal is the difference between its length and the length of the longest signal. Padding can be done with the "padarray" function in MATLAB (part of the Image Processing Toolbox). Note that "padarray" pads both ends by default, so specify the 'post' direction if the zeros should only be added at the end.
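A minimal sketch of the padding step, assuming the signals are stored as column vectors in a cell array. The variable names and the random example data are hypothetical and only stand in for your audio signals:

```matlab
% Hypothetical example data: three signals of different lengths (column vectors).
signals = {randn(800,1), randn(1000,1), randn(650,1)};

% The longest signal determines the target length.
maxLen = max(cellfun(@numel, signals));

% Append zeros to the end ('post') of each signal until it has maxLen samples.
% padsize is [rows cols], so only the first (row) dimension is padded.
padded = cellfun(@(s) padarray(s, [maxLen - numel(s), 0], 0, 'post'), ...
                 signals, 'UniformOutput', false);

% Every padded signal should now have the same length.
lengths = cellfun(@numel, padded);
```

If your signals are row vectors instead, the pad size would be `[0, maxLen - numel(s)]`.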
After this, the CWT (continuous wavelet transform) can be applied to the padded audio signals. Because the padded signals all have the same length, the CWT outputs will all be the same size and can be fed to the CNN.
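As a rough illustration of this step, the sketch below applies "cwt" (Wavelet Toolbox) to two equal-length signals and stacks the results. The sampling rate `fs`, the example data, and the choice of using the magnitude of the coefficients are assumptions, so adapt them to your data:

```matlab
fs = 16000;  % assumed sampling rate in Hz (hypothetical value)

% Hypothetical padded signals of equal length (the first one is zero-padded).
padded = {[randn(800,1); zeros(200,1)], randn(1000,1)};

% Magnitude of the CWT coefficients (scalogram) for each signal.
% Each result is numScales-by-numSamples.
scalograms = cellfun(@(s) abs(cwt(s, fs)), padded, 'UniformOutput', false);

% Since the input lengths match, the scalograms have identical sizes
% and can be stacked along the 4th dimension (height x width x 1 x N).
X = cat(4, scalograms{:});
```

The 4-D array `X` follows the height-by-width-by-channels-by-observations layout that MATLAB image input layers commonly expect, but check this against the input layer of your own network.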
For details on the usage of the "padarray" function, see the documentation: https://in.mathworks.com/help/images/ref/padarray.html