Asynchronous Saving with Distributed Checkpoint (DCP)#
Created On: Jul 22, 2024 | Last Updated: Sep 29, 2025 | Last Verified: Nov 05, 2024
Authors: Lucas Pasqualin, Iris Zhang, Rodrigo Kumpera, Chien-Chin Huang
Checkpointing is often a bottleneck in the critical path for distributed training workloads, incurring larger and larger costs as both model and world sizes grow. One excellent strategy for offsetting this cost is to checkpoint in parallel, asynchronously. Below, we expand the save example from the Getting Started with Distributed Checkpoint tutorial to show how this can be integrated quite easily with torch.distributed.checkpoint.async_save.
What you will learn:
- How to use DCP to generate checkpoints in parallel
- Effective strategies to optimize performance

Prerequisites:
- PyTorch v2.4.0 or later
Asynchronous Checkpointing Overview#
Before getting started with asynchronous checkpointing, it's important to understand how it differs from synchronous checkpointing and what its limitations are. Specifically:
- Memory requirements - Asynchronous checkpointing works by first copying models into internal CPU buffers. This is helpful since it ensures model and optimizer weights are not changing while the model is still checkpointing, but does raise CPU memory pressure by a factor of checkpoint_size_per_rank X number_of_ranks. Additionally, users should take care to understand the memory constraints of their systems. Specifically, pinned memory implies the use of page-locked memory, which can be scarce compared to pageable memory.
- Checkpoint management - Since checkpointing is asynchronous, it is up to the user to manage concurrently running checkpoints. In general, users can employ their own management strategies by handling the Future object returned from async_save. For most users, we recommend limiting checkpoints to one asynchronous request at a time, avoiding additional memory pressure per request; a minimal sketch of this pattern follows this list.
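As a rough illustration of that recommendation, the following minimal single-process sketch issues several asynchronous saves while keeping at most one request in flight. The single-rank gloo setup, the TCP port, and the checkpoint_sketch_step* directory names are illustrative assumptions for this sketch only; the full FSDP example below shows the same pattern inside a real training loop.

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# single-rank process group so that async_save can coordinate staging and writing (illustrative setup)
dist.init_process_group("gloo", init_method="tcp://localhost:29500", rank=0, world_size=1)

# toy state dict; each rank's staged CPU copy costs roughly sum(numel * element_size) bytes
state_dict = {"weights": torch.randn(16, 16), "bias": torch.randn(16)}

prev_future = None
for step in range(3):
    # wait for the previous request before queuing a new one, bounding CPU memory pressure
    if prev_future is not None:
        prev_future.result()
    prev_future = dcp.async_save(state_dict, checkpoint_id=f"checkpoint_sketch_step{step}")

prev_future.result()  # drain the last request before tearing down the process group
dist.destroy_process_group()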
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.multiprocessing as mp
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful
CHECKPOINT_DIR = "checkpoint"
class AppState(Stateful):
    """This is a useful wrapper for checkpointing the Application State. Since this object is compliant
    with the Stateful protocol, DCP will automatically call state_dict/load_state_dict as needed in the
    dcp.save/load APIs.

    Note: We take advantage of this wrapper to handle calling distributed state dict methods on the model
    and optimizer.
    """

    def __init__(self, model, optimizer=None):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
        model_state_dict, optimizer_state_dict = get_state_dict(self.model, self.optimizer)
        return {
            "model": model_state_dict,
            "optim": optimizer_state_dict
        }

    def load_state_dict(self, state_dict):
        # sets our state dicts on the model and optimizer, now that we've loaded
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"]
        )
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()
def run_fsdp_checkpoint_save_example(rank, world_size):
    print(f"Running basic FSDP checkpoint saving example on rank {rank}.")
    setup(rank, world_size)

    # create a model and move it to GPU with id rank
    model = ToyModel().to(rank)
    model = fully_shard(model)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    checkpoint_future = None
    for step in range(10):
        optimizer.zero_grad()
        model(torch.rand(8, 16, device="cuda")).sum().backward()
        optimizer.step()

        # waits for checkpointing to finish if one exists, avoiding queuing more than one checkpoint request at a time
        if checkpoint_future is not None:
            checkpoint_future.result()

        state_dict = { "app": AppState(model, optimizer) }
        checkpoint_future = dcp.async_save(state_dict, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")

    cleanup()
if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    print(f"Running async checkpoint example on {world_size} devices.")
    mp.spawn(
        run_fsdp_checkpoint_save_example,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )
Even Faster Performance with Pinned Memory#
If the above optimization is still not performant enough, you can take advantage of an additional optimization for GPU models which utilizes a pinned memory buffer for checkpoint staging. Specifically, this optimization attacks the main overhead of asynchronous checkpointing, which is the in-memory copy to checkpointing buffers. By maintaining a pinned memory buffer between checkpoint requests, users can take advantage of direct memory access (DMA) to speed up this copy.
Note
The main drawback of this optimization is the persistence of the buffer in between checkpointing steps. Without the pinned memory optimization (as demonstrated above), any checkpointing buffers are released as soon as checkpointing is finished. With the pinned memory implementation, this buffer is maintained between steps, leading to the same peak memory pressure being sustained throughout the application's lifetime.
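Relative to the previous example, the only functional change is constructing the FileSystemWriter with cache_staged_state_dict=True and reusing that writer across async_save calls. The single-rank sketch below isolates just that change; the gloo setup, port, and directory names are illustrative assumptions, and it assumes a CUDA-capable machine since the cached staging buffer is allocated in pinned memory.

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

dist.init_process_group("gloo", init_method="tcp://localhost:29501", rank=0, world_size=1)

# cache_staged_state_dict=True keeps the pinned-memory staging buffer alive across requests,
# so the same writer instance must be reused for every checkpoint
writer = FileSystemWriter("checkpoint_sketch", cache_staged_state_dict=True)

state_dict = {"weights": torch.randn(16, 16, device="cuda")}
prev_future = None
for step in range(3):
    if prev_future is not None:
        prev_future.result()
    prev_future = dcp.async_save(state_dict, storage_writer=writer, checkpoint_id=f"checkpoint_sketch_step{step}")

prev_future.result()
dist.destroy_process_group()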
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.multiprocessing as mp
import torch.nn as nn
from torch.distributed.fsdp import fully_shard
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict
from torch.distributed.checkpoint.stateful import Stateful
from torch.distributed.checkpoint import FileSystemWriter as StorageWriter
CHECKPOINT_DIR = "checkpoint"
class AppState(Stateful):
    """This is a useful wrapper for checkpointing the Application State. Since this object is compliant
    with the Stateful protocol, DCP will automatically call state_dict/load_state_dict as needed in the
    dcp.save/load APIs.

    Note: We take advantage of this wrapper to handle calling distributed state dict methods on the model
    and optimizer.
    """

    def __init__(self, model, optimizer=None):
        self.model = model
        self.optimizer = optimizer

    def state_dict(self):
        # this line automatically manages FSDP FQN's, as well as sets the default state dict type to FSDP.SHARDED_STATE_DICT
        model_state_dict, optimizer_state_dict = get_state_dict(self.model, self.optimizer)
        return {
            "model": model_state_dict,
            "optim": optimizer_state_dict
        }

    def load_state_dict(self, state_dict):
        # sets our state dicts on the model and optimizer, now that we've loaded
        set_state_dict(
            self.model,
            self.optimizer,
            model_state_dict=state_dict["model"],
            optim_state_dict=state_dict["optim"]
        )
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(16, 16)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(16, 8)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()
def run_fsdp_checkpoint_save_example(rank, world_size):
    print(f"Running basic FSDP checkpoint saving example on rank {rank}.")
    setup(rank, world_size)

    # create a model and move it to GPU with id rank
    model = ToyModel().to(rank)
    model = fully_shard(model)

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

    # The storage writer defines our 'staging' strategy, where staging is considered the process of copying
    # checkpoints to in-memory buffers. By setting `cache_staged_state_dict=True`, we enable efficient memory copying
    # into a persistent buffer with pinned memory enabled.
    # Note: It's important that the writer persists in between checkpointing requests, since it maintains the
    # pinned memory buffer.
    writer = StorageWriter(cache_staged_state_dict=True, path=CHECKPOINT_DIR)
    checkpoint_future = None
    for step in range(10):
        optimizer.zero_grad()
        model(torch.rand(8, 16, device="cuda")).sum().backward()
        optimizer.step()

        state_dict = { "app": AppState(model, optimizer) }
        if checkpoint_future is not None:
            # waits for checkpointing to finish, avoiding queuing more than one checkpoint request at a time
            checkpoint_future.result()
        checkpoint_future = dcp.async_save(state_dict, storage_writer=writer, checkpoint_id=f"{CHECKPOINT_DIR}_step{step}")

    cleanup()
if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    print(f"Running fsdp checkpoint example on {world_size} devices.")
    mp.spawn(
        run_fsdp_checkpoint_save_example,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )
Conclusion#
In conclusion, we have learned how to use DCP's async_save() API to generate checkpoints off the critical training path. We've also learned about the additional memory and concurrency overhead introduced by using this API, as well as the additional optimization of utilizing pinned memory to speed things up even further.