- This corresponds to the amount of information remembered between time steps (the hidden state). The hidden state can contain information from all previous time steps, regardless of the sequence length.
- If the number of hidden units is too large, then the layer might overfit to the training data.
- In the case of sequence-to-sequence regression, it refers to the length of the output sequence.
- In the case of sequence classification, it refers to the number of classes.
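
The first two statements can be illustrated with a small sketch. It assumes a textbook LSTM cell with one bias vector per gate, written in plain Python with random weights. The function names (`lstm_param_count`, `init_lstm`, `lstm_final_hidden_state`) and the gate ordering are hypothetical choices made for this illustration and do not come from any particular library. The sketch shows that the hidden state has one entry per hidden unit however long the input sequence is, and that the number of learnable parameters grows roughly quadratically with the number of hidden units, which is one reason a very large layer can overfit.

```python
import math
import random


def lstm_param_count(input_size, num_hidden_units):
    # Four gates, each with input weights, recurrent weights, and one bias
    # vector (assumption: a single bias per gate).
    return 4 * num_hidden_units * (input_size + num_hidden_units + 1)


def init_lstm(input_size, num_hidden_units, seed=0):
    # Random illustrative weights; a real layer would learn these.
    rng = random.Random(seed)
    rows = 4 * num_hidden_units
    cols = input_size + num_hidden_units
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    b = [0.0] * rows
    return W, b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_final_hidden_state(sequence, W, b, num_hidden_units):
    n = num_hidden_units
    h = [0.0] * n  # hidden state carried between time steps
    c = [0.0] * n  # cell state carried between time steps
    for x in sequence:
        xh = list(x) + h  # concatenate current input and previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + bias
             for row, bias in zip(W, b)]
        i = [sigmoid(v) for v in z[0:n]]            # input gate
        f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
        g = [math.tanh(v) for v in z[2 * n:3 * n]]  # candidate values
        o = [sigmoid(v) for v in z[3 * n:4 * n]]    # output gate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


input_size, num_hidden_units = 3, 8
W, b = init_lstm(input_size, num_hidden_units)
short_seq = [[0.1, 0.2, 0.3]] * 5
long_seq = [[0.1, 0.2, 0.3]] * 500

# The hidden state size depends only on the number of hidden units.
print(len(lstm_final_hidden_state(short_seq, W, b, num_hidden_units)))  # 8
print(len(lstm_final_hidden_state(long_seq, W, b, num_hidden_units)))   # 8

# Parameter count for 8 versus 64 hidden units with the same input size.
print(lstm_param_count(3, 8), lstm_param_count(3, 64))  # 384 17408
```

Under these assumptions, going from 8 to 64 hidden units multiplies the parameter count by about 45, while the length of the sequence fed to the layer leaves the size of the hidden state unchanged.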