DeepSpeed API ZeRO Offload

The biggest obstacle in deep learning, and in training large language models (LLMs) in particular, is the limit of GPU memory.

As I found out the other day, even the “small” LLMs like Meta’s LLaMA 2 still use a lot of memory.

As a rule of thumb, every 1 billion parameters require about 4 GB of memory just to load the model, assuming you load it in 32-bit full precision. Training the model, usually with the Adam optimizer, takes about 4 GB of memory for each GB of model loaded, because the gradients and Adam’s two moment estimates are stored alongside the weights.

So, do the math: a small LLaMA model with ~7 billion parameters needs about 7 * 4 = 28 GB to load and about 28 * 4 = 112 GB to train, before counting activations.

Half precision training

You can use 16-bit half precision training, which takes about 112 GB / 2 = 56 GB if the weights, gradients, and optimizer states are all kept in 16 bits. That is still more than twice the 24 GB of an RTX 3090 graphics card.
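
A quick way to sanity-check memory estimates like these is to count bytes per parameter. The sketch below is only an illustration: the function name and its parameters are made up for this post, it assumes the weights, gradients, and Adam's two moment buffers are all stored at the same precision, and it ignores activations and framework overhead.

```python
def training_memory_gb(params_billion, bytes_per_value=4, optimizer_states=2):
    """Estimate memory (in GB, 1 GB = 1e9 bytes) for loading and training a model."""
    weights = params_billion * bytes_per_value   # 1e9 parameters * bytes each = GB
    gradients = weights                          # one gradient value per weight
    optimizer = optimizer_states * weights       # Adam keeps two moment buffers
    return {"load": weights, "train": weights + gradients + optimizer}

fp32 = training_memory_gb(7)                     # 32-bit full precision
fp16 = training_memory_gb(7, bytes_per_value=2)  # everything kept in 16 bits

print(fp32)  # {'load': 28, 'train': 112}
print(fp16)  # {'load': 14, 'train': 56}
```

Real numbers depend on batch size, sequence length, and whether the framework keeps a 32-bit copy of the weights, so treat the output as a lower bound.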

Why not use CPU to train?

Training LLMs on CPUs rather than GPUs can be considerably slower due to the latter’s architecture, which is optimized for parallel processing. GPUs have thousands of smaller cores designed to handle multiple operations simultaneously, ideal for the matrix and vector computations prevalent in deep learning.

In my personal experience, training a model on a CPU is at least 10000x slower. It is not worth the time.

Traditional way of training an LLM

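import torch
from torch.utils.data import DataLoader
from transformers import LlamaForCausalLM, LlamaTokenizer
from datasets import load_dataset

# Load tokenizer and model ('Llama-pretrained' is a placeholder checkpoint name)
tokenizer = LlamaTokenizer.from_pretrained('Llama-pretrained')
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a padding token
model = LlamaForCausalLM.from_pretrained('Llama-pretrained')

# Load and preprocess dataset ('your_dataset' is a placeholder with a 'text' column)
dataset = load_dataset('your_dataset')
def encode(examples):
    # max_length=512 is an assumed value; pick one that fits your data
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

encoded_dataset = dataset.map(encode, batched=True)
# Return PyTorch tensors for the columns the model needs
encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Convert to PyTorch DataLoader
train_loader = DataLoader(encoded_dataset['train'], batch_size=8, shuffle=True)

# Define training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_epochs = 3  # assumed value

for epoch in range(num_epochs):
    for batch in train_loader:
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        # For causal language modelling the labels are the input ids;
        # the model shifts them internally. -100 excludes padding from the loss.
        labels = inputs.clone()
        labels[attention_mask == 0] = -100

        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Back-propagate, update the weights, then reset the gradients
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        print(f"Epoch {epoch} Loss {loss.item()}")

# Save the trained model (the output directory name is an assumption)
model.save_pretrained('llama-finetuned')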

What is Offloading?

The 7-billion-parameter LLaMA model is a Transformer with 32 layers, each built around an attention block. For more about attention layers, please read the Transformer paper:

Attention Is All You Need

When training a model, we load all of the model’s weights, their gradients, and the optimizer’s states into GPU memory. Offloading a model means loading only part of it into the GPU at a time. Once that part has been computed, its data is copied back out to the CPU’s memory.

In this case, we can keep most of the model weights in the CPU’s memory and load only 1 attention layer at a time.

Keep in mind that each layer has to be copied into the GPU twice in every training step: once for the forward pass and once for back-propagation.
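
To make the copy pattern concrete, here is a minimal pure-Python simulation of one training step with a single layer resident on the GPU at a time. It moves no real tensors and is not how DeepSpeed schedules its transfers; `simulate_step` and its bookkeeping are hypothetical names used only to count copies and peak residency.

```python
def simulate_step(num_layers):
    """Walk one training step, keeping one layer on the (simulated) GPU at a time."""
    on_gpu = set()      # indices of layers currently in simulated GPU memory
    copies_in = 0       # number of CPU -> GPU copies performed
    peak_resident = 0   # largest number of layers on the GPU at once

    def visit(layer):
        nonlocal copies_in, peak_resident
        on_gpu.add(layer)                     # copy the layer's weights CPU -> GPU
        copies_in += 1
        peak_resident = max(peak_resident, len(on_gpu))
        # ... the layer's forward or backward computation would run here ...
        on_gpu.remove(layer)                  # copy results out and free the GPU memory

    for layer in range(num_layers):             # forward pass: first layer to last
        visit(layer)
    for layer in reversed(range(num_layers)):   # back-propagation: last layer to first
        visit(layer)
    return copies_in, peak_resident

copies, peak = simulate_step(32)
print(copies, peak)  # 64 1
```

For the 32 layers of the 7-billion-parameter model that is 64 copies into the GPU per step, in exchange for never holding more than one layer there.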

ZeRO offers further methods for faster parallel training if you have multiple GPUs. For more, check out the paper:

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Code to Offload using DeepSpeed ZeRO


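Save the configuration below as a JSON file (for example ds_config.json; the name is an assumption). It uses ZeRO stage 2 and keeps the optimizer state in the CPU’s memory; offloading the model parameters themselves needs stage 3 with an "offload_param" section. "train_batch_size" is left out so that DeepSpeed derives it from the micro-batch size, because the "auto" value is only filled in by the Hugging Face Trainer integration.

{
  "train_micro_batch_size_per_gpu": 4,
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 5e-5
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "fp16": {
    "enabled": true
  },
  "gradient_clipping": 1.0
}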
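
The training script passes the JSON configuration above to deepspeed.initialize (the file name ds_config.json is an assumption):

import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Setup tokenizer and model (replace the name with the checkpoint you want to fine-tune)
tokenizer = AutoTokenizer.from_pretrained("allenai/llama")
model = AutoModelForCausalLM.from_pretrained("allenai/llama")

# Prepare dataset
# Assuming 'dataset' is already tokenized and in DataLoader format

num_epochs = 3  # assumed value

# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    optimizer=None,  # DeepSpeed will handle the optimizer internally
    config="ds_config.json",  # assumed file name for the JSON configuration
)

# Training loop
for epoch in range(num_epochs):
  for batch in dataset:
    input_ids = batch["input_ids"].to(model_engine.device)
    attention_mask = batch["attention_mask"].to(model_engine.device)

    # Forward pass (passing labels makes the model return a loss)
    outputs = model_engine(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)

    # Compute the loss
    loss = outputs.loss
    # Backward pass
    model_engine.backward(loss)

    # Step the optimizer
    model_engine.step()

# Save a DeepSpeed checkpoint: the model weights plus the optimizer state held by this process
# (the directory name is an assumption)
model_engine.save_checkpoint("checkpoints")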
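
DeepSpeed expects `train_batch_size` to equal the micro-batch size per GPU times the gradient accumulation steps times the number of GPUs. One way to keep those numbers consistent is to generate the configuration from Python instead of editing the JSON by hand. The helper below is a hypothetical sketch (`make_ds_config` is not part of DeepSpeed); it only builds and prints the dictionary, which `deepspeed.initialize` also accepts directly through its `config` argument.

```python
import json

def make_ds_config(micro_batch_size=4, grad_accum_steps=1, world_size=1):
    """Build a ZeRO stage 2 config with the optimizer state offloaded to the CPU."""
    return {
        # Must equal micro batch size * accumulation steps * number of GPUs
        "train_batch_size": micro_batch_size * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "optimizer": {"type": "AdamW", "params": {"lr": 5e-5}},
        "zero_optimization": {
            "stage": 2,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "allgather_partitions": True,
            "allgather_bucket_size": 2e8,
            "reduce_scatter": True,
            "reduce_bucket_size": 2e8,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
        "fp16": {"enabled": True},
        "gradient_clipping": 1.0,
    }

config = make_ds_config()
print(json.dumps(config, indent=2))  # paste into a JSON file or pass the dict as config
```

For a single GPU with the defaults above, `train_batch_size` comes out as 4.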

Performance of ZeRO

In my experience, getting DeepSpeed ZeRO to work on a single GPU requires a lot of debugging. Also, the memory saved is less than the theory predicts. The reason, I believe, is that the GPU needs time to release its memory (even when you purge the GPU memory after copying the weights out to the CPU’s memory).

Also, not only does copying weights in and out of the GPU take time, but you also lose the parallelism between layers that you get when the whole model is loaded. In my experience, offloading on a single GPU makes training at least 20x to 100x slower, depending on how large your model is.
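
A rough back-of-the-envelope estimate shows why the copies alone are expensive. The sketch below assumes 16-bit weights, two copies of every weight into the GPU per step, and a transfer rate of 16 GB/s, which is roughly the theoretical peak of a PCIe 3.0 x16 link; real throughput is lower. The function name and all default values are assumptions for illustration, and gradients copied back to the CPU are not counted.

```python
def transfer_seconds_per_step(params_billion, bytes_per_value=2,
                              copies_per_step=2, bandwidth_gb_s=16.0):
    """Estimate the time spent copying weights into the GPU during one training step."""
    model_gb = params_billion * bytes_per_value   # 1e9 parameters * bytes each = GB
    return model_gb * copies_per_step / bandwidth_gb_s

print(transfer_seconds_per_step(7))  # 1.75
```

Under these assumptions a 7-billion-parameter model spends about 1.75 seconds per step moving weights before any computation happens.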

But this is still a much better way to train than using the CPU.