torch.compile End-to-End Tutorial
Author: William Wen
torch.compile is the new way to speed up your PyTorch code! torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes.
This tutorial covers an end-to-end example of training and evaluating a real model with torch.compile. For an introduction to torch.compile, see the torch.compile introduction tutorial.
Required pip dependencies:
- torch >= 2.0
- torchvision
In this tutorial, you will learn:
- How to apply torch.compile to a real model
- The speedups torch.compile can achieve on a real model
- That torch.compile's first few iterations are expected to be slower due to compilation overhead
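As a quick refresher before we start: torch.compile can wrap either a plain Python function or a torch.nn.Module. Below is a minimal sketch of both forms (see the introduction tutorial for details).

import torch

# Compiling a plain function: torch.compile returns an optimized callable with
# the same signature; the first call triggers compilation.
def fn(x, y):
    return torch.sin(x) + torch.cos(y)

fn_opt = torch.compile(fn)

# Compiling a module in place via its .compile() method.
mod = torch.nn.Linear(8, 8)
mod.compile()

x, y = torch.randn(8, 8), torch.randn(8, 8)
print(fn_opt(x, y).shape)
print(mod(x).shape)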
# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.

import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning:
GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.
Let's demonstrate how using torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.
Before we begin, we need to define some utility functions.
# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000
# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )
N_ITERS = 10

from torchvision.models import densenet121

def init_model():
    return densenet121().cuda()
First, let's compare inference.
Note that when we compile the model we pass an additional mode argument, which we will discuss below.
model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3604090576171875
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning:
Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.com.tw/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning:
TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
compile: 51.42688671875
Notice that torch.compile takes much longer to complete than eager mode. This is because torch.compile compiles the model into optimized kernels as it executes. In our example, the structure of the model doesn't change, so recompilation is not needed. If we run our optimized model several more times, we should see a significant improvement compared to eager mode.
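The compile-once, reuse-afterwards behavior is easy to see in isolation on a toy function. Below is a minimal sketch, separate from the model above (the exact recompilation behavior for shape changes depends on your PyTorch version's dynamic-shape settings).

import time
import torch

def f(x):
    return torch.nn.functional.relu(x).sum()

f_opt = torch.compile(f)
x = torch.randn(16, 1024)

t0 = time.perf_counter()
f_opt(x)   # first call: kernels are compiled here, so it is slow
t1 = time.perf_counter()
f_opt(x)   # same structure and shapes: the compiled code is reused, so it is fast
t2 = time.perf_counter()
print(f"first call: {t1 - t0:.3f}s, second call: {t2 - t1:.3f}s")

# A change in input shape may trigger a recompile on the next call, depending on
# dynamic-shape settings; running the script with TORCH_LOGS=recompiles prints
# the reason for each recompilation.
f_opt(torch.randn(8, 1024))

With that in mind, let's time several more iterations of the real model.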
eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")
print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.01820876884460449
eager eval time 1: 0.016675840377807616
eager eval time 2: 0.016416767120361327
eager eval time 3: 0.01638400077819824
eager eval time 4: 0.016457696914672852
eager eval time 5: 0.016348159790039063
eager eval time 6: 0.016328704833984374
eager eval time 7: 0.016314367294311523
eager eval time 8: 0.01641472053527832
eager eval time 9: 0.01641164779663086
~~~~~~~~~~
compile eval time 0: 0.061233150482177735
compile eval time 1: 0.007819263935089112
compile eval time 2: 0.008339455604553223
compile eval time 3: 0.007483391761779785
compile eval time 4: 0.007483359813690186
compile eval time 5: 0.007465983867645264
compile eval time 6: 0.0074670081138610836
compile eval time 7: 0.0074670081138610836
compile eval time 8: 0.007468031883239746
compile eval time 9: 0.0074700798988342285
~~~~~~~~~~
(eval) eager median: 0.016413184165954588, compile median: 0.007476719856262207, speedup: 2.1952386182033488x
~~~~~~~~~~
And indeed, we can see that running our model with torch.compile results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU reads/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, the bottleneck will be GPU compute and the observed speedup may be less significant.
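To see how the observed speedup depends on batch size on your own hardware, you can reuse the init_model, generate_data, and timed helpers defined above. A rough sketch (the batch sizes and warm-up counts are arbitrary choices):

# Rough sketch: compare eager vs. compiled inference for a few batch sizes,
# reusing the helpers defined above. As the batch grows, GPU compute dominates
# and the relative speedup from torch.compile typically shrinks.
model_eager = init_model()
model_comp = init_model()
model_comp.compile()

with torch.no_grad():
    for b in (4, 16, 64):
        inp = generate_data(b)[0]
        for _ in range(3):                 # untimed warm-up (includes compilation)
            model_comp(inp)
        _, t_eager = timed(lambda: model_eager(inp))
        _, t_comp = timed(lambda: model_comp(inp))
        print(f"batch {b}: eager {t_eager:.4f}s, compile {t_comp:.4f}s")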
You may also see different speedup results depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize the speedup. You can read more about modes here.
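For reference, the documented mode values are "default", "reduce-overhead", and "max-autotune". A minimal sketch of selecting one (the comments summarize the documented trade-offs):

# The mode argument trades compile time for runtime performance:
#   "default"         - balanced compile time and performance
#   "reduce-overhead" - uses CUDA graphs to cut per-call overhead (uses some extra memory)
#   "max-autotune"    - longest compile time; autotunes kernels for peak performance
m = init_model()
m.compile(mode="max-autotune")  # or "default" / "reduce-overhead"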
You may also notice that the second time we run our model with torch.compile is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs.
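A practical consequence is that, when benchmarking or serving a model compiled with "reduce-overhead", it is worth running a few untimed warm-up calls first. A minimal sketch, reusing model_opt and the helpers defined above:

inp = generate_data(16)[0]
with torch.no_grad():
    for _ in range(3):
        model_opt(inp)  # warm-up: compilation and CUDA graph recording, not timed
    _, steady_state = timed(lambda: model_opt(inp))
print(f"steady-state eval time: {steady_state}")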
Now, let's compare training.
model = init_model()
opt = torch.optim.Adam(model.parameters())

def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.2882539367675781
eager train time 1: 0.05161676788330078
eager train time 2: 0.049276927947998046
eager train time 3: 0.05065420913696289
eager train time 4: 0.8006707153320313
eager train time 5: 0.05070438385009766
eager train time 6: 0.05034195327758789
eager train time 7: 0.05022825622558594
eager train time 8: 0.050223102569580076
eager train time 9: 0.05043302536010742
~~~~~~~~~~
compile train time 0: 151.00690625
compile train time 1: 2.915029052734375
compile train time 2: 0.02395030403137207
compile train time 3: 0.021402624130249022
compile train time 4: 0.020746240615844725
compile train time 5: 0.02069811248779297
compile train time 6: 0.020706304550170897
compile train time 7: 0.020715520858764647
compile train time 8: 0.02070425605773926
compile train time 9: 0.020745216369628908
~~~~~~~~~~
(train) eager median: 0.05054361724853516, compile median: 0.020745728492736815, speedup: 2.436338510177203x
~~~~~~~~~~
Again, we can see that torch.compile takes longer in the first iteration because it must compile the model, but in subsequent iterations we see significant speedups compared to eager mode.
Note that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.
Conclusion
In this tutorial, we applied torch.compile to training and inference on a real model, demonstrating the speedups it provides.
Importantly, we noted that the first few iterations of a compiled model are slower than eager mode due to compilation overhead, but subsequent iterations are expected to see speedups.
For an introduction to torch.compile, see the torch.compile introduction tutorial.
To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, check out the torch.compile programming model.
We hope that you will give torch.compile a try!
Total running time of the script: (3 minutes 29.786 seconds)