
torch.compile End-to-End Tutorial

Author: William Wen

torch.compile is the new way to speed up your PyTorch code! torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels, all while requiring minimal code changes.
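
As a minimal sketch of what that looks like (not part of this tutorial's benchmark), you can compile a plain Python function; foo below is a hypothetical toy example:

# Minimal sketch: torch.compile works on plain functions as well as nn.Modules.
# `foo` is a toy function used only for illustration.
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

compiled_foo = torch.compile(foo)
print(compiled_foo(torch.randn(10), torch.randn(10)))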

This tutorial covers an end-to-end example of training and evaluating a real model with torch.compile. For a gentle introduction to torch.compile, please see the torch.compile introduction tutorial.

Required pip dependencies

  • torch >= 2.0

  • torchvision

What you will learn
  • How to apply torch.compile to a real model

  • torch.compile speedups on a real model

  • The first few iterations of torch.compile are expected to be slower due to compilation overhead

Prerequisites
# NOTE: a modern NVIDIA GPU (H100, A100, or V100) is recommended for this tutorial in
# order to reproduce the speedup numbers shown below and documented elsewhere.

import torch
import warnings

gpu_ok = False
if torch.cuda.is_available():
    device_cap = torch.cuda.get_device_capability()
    if device_cap in ((7, 0), (8, 0), (9, 0)):
        gpu_ok = True

if not gpu_ok:
    warnings.warn(
        "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
        "than expected."
    )
/var/lib/workspace/intermediate_source/torch_compile_full_example.py:51: UserWarning:

GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower than expected.

Let's demonstrate how using torch.compile can speed up a real model. We will compare standard eager mode and torch.compile by evaluating and training a torchvision model on random data.

Before we begin, we need to define some utility functions.

# Returns the result of running `fn()` and the time it took for `fn()` to run,
# in seconds. We use CUDA events and synchronization for the most accurate
# measurements.
def timed(fn):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000


# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )


N_ITERS = 10

from torchvision.models import densenet121


def init_model():
    return densenet121().cuda()

First, let's compare inference.

Note that when calling torch.compile, we pass an additional mode argument, which we will discuss below.

model = init_model()

# Note that we generally recommend directly compiling a torch.nn.Module by calling
# its .compile() method.
model_opt = init_model()
model_opt.compile(mode="reduce-overhead")

inp = generate_data(16)[0]
with torch.no_grad():
    print("eager:", timed(lambda: model(inp))[1])
    print("compile:", timed(lambda: model_opt(inp))[1])
eager: 0.3604090576171875
/usr/local/lib/python3.10/dist-packages/torch/backends/cuda/__init__.py:131: UserWarning:

Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see https://pytorch.com.tw/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:80.)

/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:312: UserWarning:

TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.

compile: 51.42688671875
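
The warnings above also suggest enabling TensorFloat-32 matrix multiplications for better float32 performance on supported GPUs. A minimal, optional sketch (not required for the rest of this tutorial; it trades a small amount of float32 precision for speed):

# Optional: enable TF32 matmuls, as suggested by the warning above.
# This trades a small amount of float32 precision for speed on supported GPUs.
torch.set_float32_matmul_precision("high")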

Notice that torch.compile takes a lot longer to complete compared to eager. This is because torch.compile compiles the model into optimized kernels as it executes. In our example, the structure of the model does not change, so recompilation is not needed. If we run our optimized model several more times, we should see a significant improvement compared to eager.

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, eager_time = timed(lambda: model(inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")

print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    with torch.no_grad():
        _, compile_time = timed(lambda: model_opt(inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager eval time 0: 0.01820876884460449
eager eval time 1: 0.016675840377807616
eager eval time 2: 0.016416767120361327
eager eval time 3: 0.01638400077819824
eager eval time 4: 0.016457696914672852
eager eval time 5: 0.016348159790039063
eager eval time 6: 0.016328704833984374
eager eval time 7: 0.016314367294311523
eager eval time 8: 0.01641472053527832
eager eval time 9: 0.01641164779663086
~~~~~~~~~~
compile eval time 0: 0.061233150482177735
compile eval time 1: 0.007819263935089112
compile eval time 2: 0.008339455604553223
compile eval time 3: 0.007483391761779785
compile eval time 4: 0.007483359813690186
compile eval time 5: 0.007465983867645264
compile eval time 6: 0.0074670081138610836
compile eval time 7: 0.0074670081138610836
compile eval time 8: 0.007468031883239746
compile eval time 9: 0.0074700798988342285
~~~~~~~~~~
(eval) eager median: 0.016413184165954588, compile median: 0.007476719856262207, speedup: 2.1952386182033488x
~~~~~~~~~~

And indeed, we can see that running our model with torch.compile results in a significant speedup. The speedup mainly comes from reducing Python overhead and GPU read/writes, so the observed speedup may vary with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, then the bottleneck will be GPU compute and the observed speedup may be less significant.
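
As a quick, optional experiment, you could repeat the inference timing at a larger batch size to see how the gap changes when GPU compute dominates. The sketch below reuses the objects defined above; the batch size of 128 is illustrative, the new input shape may trigger recompilation, and the exact numbers will depend on your GPU:

# Optional sketch: repeat the inference timing with a larger batch.
# Warm up both models on the new shape first, since a new shape may
# trigger recompilation for the compiled model.
big_inp = generate_data(128)[0]
with torch.no_grad():
    model(big_inp)
    model_opt(big_inp)
    print("eager (batch 128):", timed(lambda: model(big_inp))[1])
    print("compile (batch 128):", timed(lambda: model_opt(big_inp))[1])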

You may also see different speedup results depending on the chosen mode argument. The "reduce-overhead" mode uses CUDA graphs to further reduce Python overhead. For your own models, you may need to experiment with different modes to maximize speedup. You can read more about modes here.
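
For example, a minimal sketch of compiling with another documented mode (compile times and speedups will differ; this is illustrative and not part of the benchmark below):

# Sketch: another documented compile mode you could experiment with.
# "max-autotune" spends more time compiling in order to search for faster
# kernels, while the default mode compiles faster but may optimize less.
model_autotune = init_model()
model_autotune.compile(mode="max-autotune")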

You may also notice that the second time we run our model with torch.compile is significantly slower than the other runs, although it is much faster than the first run. This is because the "reduce-overhead" mode runs a few warm-up iterations for CUDA graphs.
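
If you want your own benchmarks to exclude this warm-up effect, one option (a sketch, not part of the tutorial's timing code) is to run a few untimed iterations before measuring:

# Sketch: run untimed warm-up iterations so that compilation and CUDA graph
# warm-up do not skew the timed measurements.
N_WARMUP = 3
with torch.no_grad():
    for _ in range(N_WARMUP):
        model_opt(generate_data(16)[0])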

Now, let's compare training.

model = init_model()
opt = torch.optim.Adam(model.parameters())


def train(mod, data):
    opt.zero_grad(True)
    pred = mod(data[0])
    loss = torch.nn.CrossEntropyLoss()(pred, data[1])
    loss.backward()
    opt.step()


eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, eager_time = timed(lambda: train(model, inp))
    eager_times.append(eager_time)
    print(f"eager train time {i}: {eager_time}")
print("~" * 10)

model = init_model()
opt = torch.optim.Adam(model.parameters())

# Note that because we are compiling a regular Python function, we do not
# call any .compile() method.
train_opt = torch.compile(train, mode="reduce-overhead")

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)
    _, compile_time = timed(lambda: train_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile train time {i}: {compile_time}")
print("~" * 10)

eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
assert speedup > 1
print(
    f"(train) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x"
)
print("~" * 10)
eager train time 0: 0.2882539367675781
eager train time 1: 0.05161676788330078
eager train time 2: 0.049276927947998046
eager train time 3: 0.05065420913696289
eager train time 4: 0.8006707153320313
eager train time 5: 0.05070438385009766
eager train time 6: 0.05034195327758789
eager train time 7: 0.05022825622558594
eager train time 8: 0.050223102569580076
eager train time 9: 0.05043302536010742
~~~~~~~~~~
compile train time 0: 151.00690625
compile train time 1: 2.915029052734375
compile train time 2: 0.02395030403137207
compile train time 3: 0.021402624130249022
compile train time 4: 0.020746240615844725
compile train time 5: 0.02069811248779297
compile train time 6: 0.020706304550170897
compile train time 7: 0.020715520858764647
compile train time 8: 0.02070425605773926
compile train time 9: 0.020745216369628908
~~~~~~~~~~
(train) eager median: 0.05054361724853516, compile median: 0.020745728492736815, speedup: 2.436338510177203x
~~~~~~~~~~

Again, we can see that torch.compile takes longer in the first iteration, as it must compile the model, but in subsequent iterations we see significant speedups compared to eager.

We remark that the speedup numbers presented in this tutorial are for demonstration purposes only. Official speedup values can be seen at the TorchInductor performance dashboard.

Conclusion

In this tutorial, we applied torch.compile to training and inference on a real model, demonstrating the speedups.

Importantly, we note that the first few iterations of a compiled model are slower than eager mode due to compilation overhead, but subsequent iterations are expected to show speedups.

For a gentle introduction to torch.compile, please see the torch.compile introduction tutorial.

To troubleshoot issues and to gain a deeper understanding of how to apply torch.compile to your code, check out the torch.compile programming model.

We hope that you will give torch.compile a try!

Total running time of the script: (3 minutes 29.786 seconds)