Basic Formula to Estimate GPU Memory Requirement
Memory Required = Model Size + Batch Size × (Forward Pass Memory + Backward Pass Memory)
- Model Size: Memory to store the model weights. If the model has N parameters and each parameter is of size S bytes (usually 4 bytes for float32), then the model size is N×S.
- Forward Pass Memory: Memory to store the intermediate activations for a single sample during the forward pass. This depends on the model architecture and input size.
- Backward Pass Memory: Memory to store the gradients of those activations during backpropagation. This is roughly equal to the forward pass memory.
- Batch Size: Number of samples processed in parallel.
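Expressed as a small Python helper, the formula looks as follows (a minimal sketch; the function name is made up, and it deliberately ignores optimizer state, cuDNN workspaces, and framework overhead):

```python
def estimate_training_memory_bytes(num_params: int,
                                    per_sample_activation_bytes: int,
                                    batch_size: int,
                                    bytes_per_param: int = 4) -> int:
    """Crude estimate: weights + batch_size * (forward + backward activations).

    Assumes backward-pass memory roughly equals forward-pass memory.
    """
    model_size = num_params * bytes_per_param
    activations = batch_size * (2 * per_sample_activation_bytes)  # forward + backward
    return model_size + activations
```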
Let's take an example with hypothetical values for an Inception V3 model to illustrate:
- Model Parameters (Inception V3): Approx 21.8M
- Data Type: float32 (4 bytes)
- Input Size: 540x960x3
- Batch Size: 32
For simplicity, let's assume that the forward and backward pass each require memory roughly equal to the input size times the total number of feature maps summed across all layers.
Model Size = 21.8M parameters × 4 bytes/parameter = 87.2 MB
Forward Pass Memory (per sample) = Input Size × Feature Maps × 4 bytes
Backward Pass Memory ≈ Forward Pass Memory
Assuming that the feature maps are roughly the same size as the input image (another gross simplification), and that there are about 1000 feature maps (across all layers):
Forward Pass Memory (per sample) = 540 × 960 × 3 × 1000 × 4 bytes ≈ 6.2 GB
Total Memory = Model Size + Batch Size × (Forward Pass Memory + Backward Pass Memory) = 87.2 MB + 32 × 2 × 6.2 GB ≈ 398 GB
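The same arithmetic, reproduced in a few lines of Python under the same simplifying assumptions (the 1,000 same-size feature maps figure is a made-up stand-in, not the real Inception V3 layout):

```python
params = 21.8e6                      # Inception V3 parameter count (approx.)
bytes_per_value = 4                  # float32
batch_size = 32
input_elems = 540 * 960 * 3          # one input image
feature_maps = 1000                  # crude stand-in for all intermediate activations

model_size = params * bytes_per_value                          # ~87.2 MB
per_sample_fwd = input_elems * feature_maps * bytes_per_value  # ~6.2 GB per sample
total = model_size + batch_size * 2 * per_sample_fwd           # forward + backward

print(f"model size:        {model_size / 1e6:.1f} MB")
print(f"forward, 1 sample: {per_sample_fwd / 1e9:.2f} GB")
print(f"total estimate:    {total / 1e9:.1f} GB")              # ~398 GB
```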
This is a very crude estimate and the actual memory requirement will likely be different due to various optimizations that deep learning frameworks employ.
So, to directly answer your question: with the current setup you would need a GPU with roughly 400 GB of memory, and in practice even more once optimizer state and other overheads are counted, which is not feasible on any single GPU available today. You would have to adjust your model, data, or training regime to fit it onto a GPU that you can realistically acquire.
Solutions for Memory Errors
- Reduce Batch Size: The easiest way to reduce memory usage.
- Use Gradient Accumulation: Accumulate gradients over several smaller batches and step the optimizer once, emulating a larger effective batch size (see the sketch after this list).
- Use Mixed Precision Training: Use float16 for most operations and float32 where needed, which roughly halves activation memory (also shown in the sketch below).
- Use a Simpler Model: Smaller architectures require less memory.
- Distributed Training: Split the model and data across multiple GPUs.
- Check for Memory Leaks: Make sure that you're not unintentionally holding onto tensors that you no longer need.
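To illustrate the first three points together, here is a sketch of a PyTorch training loop that combines a smaller per-step batch, gradient accumulation, and mixed precision via torch.cuda.amp; the tiny stand-in model and synthetic data are placeholders for your own network and DataLoader:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
accumulation_steps = 4      # 4 micro-batches of 8 ≈ effective batch size of 32

# Stand-in model and data; replace with your own network and DataLoader.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
data = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=8)           # small per-step (micro) batch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()              # loss scaling for float16

optimizer.zero_grad(set_to_none=True)
for step, (images, labels) in enumerate(loader):
    images, labels = images.to(device), labels.to(device)

    with torch.cuda.amp.autocast():               # forward pass in mixed precision
        loss = criterion(model(images), labels) / accumulation_steps

    scaler.scale(loss).backward()                 # gradients accumulate across micro-batches

    if (step + 1) % accumulation_steps == 0:      # one optimizer update per effective batch
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

Dividing the loss by accumulation_steps keeps the accumulated gradient comparable to what a single large batch would produce, so the learning rate does not need to change.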