Audio is sort of like greyscale images.
You can start by describing a particular greyscale pixel as either black (0) or white (1). Then you can refine that by splitting each half of the range in two: black (00), dark grey (01), light grey (10), white (11). Split each range in half again and you get 000, 001, 010, 011, 100, 101, 110, 111. Each split adds one bit and doubles the number of levels, so the representation gets more and more accurate.
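The halving idea can be sketched in a few lines of Python. This is only an illustration: the function name `grey_code` and the convention that 0.0 means black and 1.0 means white are assumptions made for the example, not anything standard.

```python
def grey_code(level: float, bits: int) -> str:
    """Return the `bits`-bit code for a grey level in [0.0, 1.0].

    Hypothetical helper: 0.0 is taken to be black and 1.0 white.
    The range is cut into 2**bits equal bins and the bin index is
    written out in binary.
    """
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0.0 and 1.0")
    bins = 1 << bits                          # 2**bits equal sub-ranges
    index = min(int(level * bins), bins - 1)  # 1.0 falls into the top bin
    return format(index, f"0{bits}b")


# The same dark-ish grey described with 1, 2 and 3 bits.
for n in (1, 2, 3):
    print(n, grey_code(0.3, n))
```

Each extra bit keeps the bits you already had and appends one more, which is the "split the range in half again" step.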
Likewise with audio: each sample is an intensity, and each additional bit lets you pin down that intensity more precisely. One bit gives you on/off; two bits give you off, not very loud, louder but not full, and full; and so on.
If you have a high enough audio resolution, then changing the last bit of the intensity might not be very noticeable to humans, at least not to those hearing the sound once in a typical environment rather than studying the signal with equipment. So you can hide information in audio the same way you can in video: by making small changes at a resolution fine enough that humans probably will not notice without close examination.
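As a rough sketch of the "change the last bit" idea, assuming the samples are plain signed integers (16-bit PCM values, say) held in a Python list: the names `embed_bits` and `extract_bits` are made up for this example, and real tools would also have to deal with file formats, compression, and telling the receiver how many bits were hidden.

```python
def embed_bits(samples: list[int], bits: list[int]) -> list[int]:
    """Hide one bit in the least significant bit of each sample.

    Hypothetical helper: `samples` are assumed to be integer PCM values
    and `bits` a list of 0s and 1s no longer than `samples`.
    """
    if len(bits) > len(samples):
        raise ValueError("not enough samples to hold the message")
    out = list(samples)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the message bit.
        out[i] = (out[i] & ~1) | (bit & 1)
    return out


def extract_bits(samples: list[int], count: int) -> list[int]:
    """Read back the lowest bit of the first `count` samples."""
    return [s & 1 for s in samples[:count]]


original = [1000, -2001, 32767, -32768, 5]
message = [1, 0, 1, 1]
stego = embed_bits(original, message)
recovered = extract_bits(stego, len(message))
```

Every sample changes by at most one step out of 65,536 possible 16-bit values, which is why the difference is hard to hear without examining the signal closely.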