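CUDA semantics

Created On: Jan 16, 2017 | Last Updated On: Sep 04, 2025

torch.cuda is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate are created on that device by default. The selected device can be changed with the torch.cuda.device context manager.

However, once a tensor is allocated, you can operate on it irrespective of the selected device, and the results are always placed on the same device as the tensor.

Cross-GPU operations are not allowed by default, with the exception of copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempt to launch an operation on tensors spread across different devices raises an error.

The following small example showcases this: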

cuda = torch.device('cuda')     # Default CUDA device
cuda0 = torch.device('cuda:0')
cuda2 = torch.device('cuda:2')  # GPU 2 (these are 0-indexed)

x = torch.tensor([1., 2.], device=cuda0)
# x.device is device(type='cuda', index=0)
y = torch.tensor([1., 2.]).cuda()
# y.device is device(type='cuda', index=0)

with torch.cuda.device(1):
    # allocates a tensor on GPU 1
    a = torch.tensor([1., 2.], device=cuda)

    # transfers a tensor from CPU to GPU 1
    b = torch.tensor([1., 2.]).cuda()
    # a.device and b.device are device(type='cuda', index=1)

    # You can also use ``Tensor.to`` to transfer a tensor:
    b2 = torch.tensor([1., 2.]).to(device=cuda)
    # b.device and b2.device are device(type='cuda', index=1)

    c = a + b
    # c.device is device(type='cuda', index=1)

    z = x + y
    # z.device is device(type='cuda', index=0)

    # even within a context, you can specify the device
    # (or give a GPU index to the .cuda call)
    d = torch.randn(2, device=cuda2)
    e = torch.randn(2).to(cuda2)
    f = torch.randn(2).cuda(cuda2)
    # d.device, e.device, and f.device are all device(type='cuda', index=2)

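TensorFloat-32 (TF32) on Ampere (and later) devices

Starting in PyTorch 2.9, a new set of APIs controls TF32 behavior in a more fine-grained way, and we suggest using the new APIs for better control. float32 precision can be set per backend and per operator. The global setting can also be overridden for a specific operator.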

torch.backends.fp32_precision = "ieee"
torch.backends.cuda.matmul.fp32_precision = "ieee"
torch.backends.cudnn.fp32_precision = "ieee"
torch.backends.cudnn.conv.fp32_precision = "tf32"
torch.backends.cudnn.rnn.fp32_precision = "tf32"

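For cuda/cudnn, fp32_precision can be set to ieee or tf32. ieee means that FP32 is used as the internal computation precision. tf32 means that TF32 is allowed as the internal computation precision.

A generic setting can be overridden for a specific operator by setting that operator's fp32_precision to ieee.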

torch.backends.cudnn.fp32_precision = "tf32"
torch.backends.cudnn.conv.fp32_precision = "ieee"
torch.backends.cudnn.rnn.fp32_precision = "ieee"

A generic setting can likewise be overridden for a specific backend by setting that backend's fp32_precision to ieee.

torch.backends.fp32_precision = "tf32"
torch.backends.cudnn.fp32_precision = "ieee"
torch.backends.cudnn.conv.fp32_precision = "ieee"
torch.backends.cudnn.rnn.fp32_precision = "ieee"

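In both cases above, torch.backends.cudnn.conv.fp32_precision and torch.backends.cudnn.rnn.fp32_precision are overridden to ieee.

We suggest using the new settings for better control. Mixing the new and old settings is not supported.

Warning

The old settings based on allow_tf32 are going to be deprecated. We suggest using the new settings above for better control. Mixing the new and old settings is not supported.

Starting in PyTorch 1.7, there is a flag called allow_tf32. For matmul, this flag defaults to True in PyTorch 1.7 to PyTorch 1.11, and to False in PyTorch 1.12 and later; the cuDNN flag defaults to True. The flag controls whether PyTorch is allowed to use the TensorFloat32 (TF32) tensor cores, available on NVIDIA GPUs since Ampere, internally to compute matmul (matrix multiplies and batched matrix multiplies) and convolutions.

TF32 tensor cores are designed to achieve better performance on matmul and convolutions on torch.float32 tensors by rounding input data to have 10 bits of mantissa and accumulating results with FP32 precision, maintaining the FP32 dynamic range.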
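To make the 10-bit mantissa concrete, the following pure-Python sketch emulates the input rounding on a single float32 value. The helper name round_to_tf32 is hypothetical, and round-to-nearest-even is an assumption made for illustration; the rounding mode the hardware actually applies is not stated here, and NaN and infinity inputs are not handled.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Emulate keeping 10 of float32's 23 mantissa bits (hypothetical helper).

    Assumption: round-to-nearest-even on the float32 bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - 10                      # 13 low mantissa bits are discarded
    half = 1 << (drop - 1)              # half of the last kept bit
    lsb = (bits >> drop) & 1            # last kept bit, used for ties-to-even
    bits = (bits + half - 1 + lsb) & ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 1 + 2**-10 needs exactly 10 mantissa bits, so it survives unchanged.
kept = round_to_tf32(1.0 + 2**-10)
# 1 + 2**-12 needs 12 mantissa bits, so the extra bits are rounded away.
lost = round_to_tf32(1.0 + 2**-12)
print(kept, lost)
```

The sign and the 8 exponent bits are untouched, which is why the FP32 dynamic range is preserved while the precision of each input drops.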

Matmuls and convolutions are controlled separately, and their corresponding flags can be accessed at:

# The flag below controls whether to allow TF32 on matmul. This flag defaults to False
# in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True

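The precision of matmuls can also be set more broadly (not limited to CUDA) via set_float32_matmul_precision(). Note that besides matmuls and convolutions themselves, functions and nn modules that internally use matmuls or convolutions are also affected. These include nn.Linear, nn.Conv*, cdist, tensordot, affine grid and grid sample, adaptive log softmax, GRU and LSTM.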
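As a minimal sketch of the broader setting, the snippet below switches the float32 matmul precision, runs a matmul, and restores the previous value. The tensor sizes are arbitrary, and the snippet runs on the CPU, where the setting may have no visible effect; it only illustrates the call pattern.

```python
import torch

# Remember the current setting so it can be restored afterwards.
previous = torch.get_float32_matmul_precision()

# "high" permits a reduced-precision internal format (such as TF32) when the
# hardware offers a faster path; "highest" keeps full float32 internally.
torch.set_float32_matmul_precision("high")
current = torch.get_float32_matmul_precision()

a = torch.randn(64, 64)
b = torch.randn(64, 64)
c = a @ b  # affected only where a faster reduced-precision path exists

torch.set_float32_matmul_precision(previous)  # restore the original setting
```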

To get an idea of the precision and speed, see the example code and benchmark data (on A100) below:

a_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.double, device='cuda')
ab_full = a_full @ b_full
mean = ab_full.abs().mean()  # 80.7277

a = a_full.float()
b = b_full.float()

# Do matmul at TF32 mode.
torch.backends.cuda.matmul.allow_tf32 = True
ab_tf32 = a @ b  # takes 0.016s on GA100
error = (ab_tf32 - ab_full).abs().max()  # 0.1747
relative_error = error / mean  # 0.0022

# Do matmul with TF32 disabled.
torch.backends.cuda.matmul.allow_tf32 = False
ab_fp32 = a @ b  # takes 0.11s on GA100
error = (ab_fp32 - ab_full).abs().max()  # 0.0031
relative_error = error / mean  # 0.000039

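From the above example, we can see that with TF32 enabled, the speed is about 7x faster on A100, and that the relative error compared to double precision is approximately 2 orders of magnitude larger. Note that the exact ratio of TF32 to single precision speed depends on the hardware generation, as properties such as the ratio of memory bandwidth to compute, as well as the ratio of TF32 to FP32 matmul throughput, may vary from generation to generation or model to model. If full FP32 precision is needed, users can disable TF32 by: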

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

To toggle the TF32 flags off in C++, you can do:

at::globalContext().setAllowTF32CuBLAS(false);
at::globalContext().setAllowTF32CuDNN(false);


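Reduced precision reduction in FP16 GEMMs

(Distinct from full FP16 accumulation, which is intended for hardware that has higher throughput with FP16 accumulation than with FP32 accumulation; see Full FP16 accumulation in FP16 GEMMs.)

fp16 GEMMs are potentially done with some intermediate reduced precision reductions (e.g., in fp16 rather than fp32). These selective reductions in precision can allow for higher performance on certain workloads (particularly those with a large k dimension) and GPU architectures, at the cost of numerical precision and potential for overflow.

Some example benchmark data on V100: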

[--------------------------- bench_gemm_transformer --------------------------]
      [  m ,  k  ,  n  ]    |  allow_fp16_reduc=True  |  allow_fp16_reduc=False
1 threads: --------------------------------------------------------------------
      [4096, 4048, 4096]    |           1634.6        |           1639.8
      [4096, 4056, 4096]    |           1670.8        |           1661.9
      [4096, 4080, 4096]    |           1664.2        |           1658.3
      [4096, 4096, 4096]    |           1639.4        |           1651.0
      [4096, 4104, 4096]    |           1677.4        |           1674.9
      [4096, 4128, 4096]    |           1655.7        |           1646.0
      [4096, 4144, 4096]    |           1796.8        |           2519.6
      [4096, 5096, 4096]    |           2094.6        |           3190.0
      [4096, 5104, 4096]    |           2144.0        |           2663.5
      [4096, 5112, 4096]    |           2149.1        |           2766.9
      [4096, 5120, 4096]    |           2142.8        |           2631.0
      [4096, 9728, 4096]    |           3875.1        |           5779.8
      [4096, 16384, 4096]   |           6182.9        |           9656.5
(times in microseconds).

If full precision reductions are needed, users can disable reduced precision reductions in fp16 GEMMs with:

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag off in C++, you can do:

at::globalContext().setAllowFP16ReductionCuBLAS(false);

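Reduced precision reduction in BF16 GEMMs

A similar flag (as above) exists for BF16 GEMMs. Note that this switch is set to True by default for BF16. If you observe numerical instability in your workload, you may wish to set it to False.

If reduced precision reductions are not desired, users can disable them in BF16 GEMMs with: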

torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = False

To toggle the reduced precision reduction flag off in C++, you can do:

at::globalContext().setAllowBF16ReductionCuBLAS(false);

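Full FP16 accumulation in FP16 GEMMs

Certain GPUs have increased performance when doing _all_ FP16 GEMM accumulation in FP16, at the cost of numerical precision and a greater likelihood of overflow. Note that this setting only has an effect on GPUs of compute capability 7.0 (Volta) or newer.

This behavior can be enabled via: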

torch.backends.cuda.matmul.allow_fp16_accumulation = True

To enable full FP16 accumulation in C++, you can do:

at::globalContext().setAllowFP16AccumulationCuBLAS(true);

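Asynchronous execution

By default, GPU operations are asynchronous. When you call a function that uses the GPU, the operations are *enqueued* to the particular device, but not necessarily executed until later. This allows more computations to run in parallel, including operations on the CPU or other GPUs.

In general, the effect of asynchronous computation is invisible to the caller, because (1) each device executes operations in the order they are queued, and (2) PyTorch automatically performs the necessary synchronization when copying data between CPU and GPU or between two GPUs. Hence, computation proceeds as if every operation were executed synchronously.

You can force synchronous computation by setting the environment variable CUDA_LAUNCH_BLOCKING=1. This can be handy when an error occurs on the GPU. (With asynchronous execution, such an error isn't reported until after the operation is actually executed, so the stack trace does not show where it was requested.)

A consequence of asynchronous computation is that time measurements without synchronization are not accurate. To get precise measurements, either call torch.cuda.synchronize() before measuring, or use torch.cuda.Event to record times as follows: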

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

# Run some things here

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)

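As an exception, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Another exception is CUDA streams, explained below.

CUDA streams

A CUDA stream is a linear sequence of execution that belongs to a specific device. You normally do not need to create one explicitly: by default, each device uses its own "default" stream.

Operations inside each stream are serialized in the order they are created, but operations from different streams can execute concurrently in any relative order, unless explicit synchronization functions (such as synchronize() or wait_stream()) are used. For example, the following code is incorrect: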

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start execution before normal_() finishes!
    B = torch.sum(A)

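When the "current stream" is the default stream, PyTorch automatically performs the necessary synchronization when data is moved between CPU and GPU or between two GPUs, as explained above. However, when using non-default streams, it is the user's responsibility to ensure proper synchronization. The fixed version of this example is: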

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # NEW!
with torch.cuda.stream(s):
    B = torch.sum(A)
A.record_stream(s)  # NEW!

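There are two new additions. The torch.cuda.Stream.wait_stream() call ensures that the normal_() execution has finished before sum(A) starts running on the side stream. The torch.Tensor.record_stream() call ensures that A is not deallocated before sum(A) has completed. You can also manually wait on the stream at some later point in time with torch.cuda.default_stream(cuda).wait_stream(s) (note that it is pointless to wait immediately, since that prevents the stream's execution from running in parallel with other work on the default stream). See the documentation for torch.Tensor.record_stream() for more details on when to use one or the other.

Note that this synchronization is necessary even when there is no read dependency, e.g., as seen in this example: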

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda)
s.wait_stream(torch.cuda.default_stream(cuda))  # STILL REQUIRED!
with torch.cuda.stream(s):
    A.normal_(0.0, 1.0)
    A.record_stream(s)

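Despite the computation on s not reading the contents of A and no other uses of A, it is still necessary to synchronize, because A may correspond to memory reallocated by the CUDA caching allocator, with pending operations from the old (deallocated) memory.

Stream semantics of backward passes

Each backward CUDA op runs on the same stream that was used for its corresponding forward op. If your forward pass runs independent ops in parallel on different streams, this helps the backward pass exploit that same parallelism.

The stream semantics of a backward call with respect to surrounding ops are the same as for any other call. The backward pass inserts internal syncs to ensure this even when backward ops run on multiple streams as described in the previous paragraph. More concretely, when calling autograd.backward, autograd.grad, or tensor.backward, and optionally supplying CUDA tensor(s) as the initial gradient(s) (e.g., autograd.backward(..., grad_tensors=initial_grads), autograd.grad(..., grad_outputs=initial_grads), or tensor.backward(..., gradient=initial_grad)), the acts of

optionally populating initial gradient(s),
invoking the backward pass, and
using the gradients

have the same stream-semantics relationship as any group of ops: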

s = torch.cuda.Stream()

# Safe, grads are used in the same stream context as backward()
with torch.cuda.stream(s):
    loss.backward()
    use grads

# Unsafe
with torch.cuda.stream(s):
    loss.backward()
use grads

# Safe, with synchronization
with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

# Safe, populating initial grad and invoking backward are in the same stream context
with torch.cuda.stream(s):
    loss.backward(gradient=torch.ones_like(loss))

# Unsafe, populating initial_grad and invoking backward are in different stream contexts,
# without synchronization
initial_grad = torch.ones_like(loss)
with torch.cuda.stream(s):
    loss.backward(gradient=initial_grad)

# Safe, with synchronization
initial_grad = torch.ones_like(loss)
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    initial_grad.record_stream(s)
    loss.backward(gradient=initial_grad)

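BC note: Using grads on the default stream

In prior versions of PyTorch (1.9 and earlier), the autograd engine always synced the default stream with all backward ops, so the following pattern: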

with torch.cuda.stream(s):
    loss.backward()
use grads

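was safe as long as use grads happened on the default stream. In present PyTorch, that pattern is no longer safe. If backward() and use grads are in different stream contexts, you must sync the streams: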

with torch.cuda.stream(s):
    loss.backward()
torch.cuda.current_stream().wait_stream(s)
use grads

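even if use grads is on the default stream.

Memory management

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all *unused* cached memory from PyTorch so that it can be used by other GPU applications. However, GPU memory occupied by tensors is not freed, so it cannot increase the amount of GPU memory available to PyTorch.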
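The counters above can be read together to see the difference between memory held by tensors and memory held by the caching allocator. The helper report_cuda_memory below is a hypothetical convenience wrapper, not part of PyTorch; it returns None on machines without CUDA so that the sketch stays runnable everywhere.

```python
import torch


def report_cuda_memory(device=None):
    """Return allocator counters in bytes, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device),
        "max_allocated": torch.cuda.max_memory_allocated(device),
        "reserved": torch.cuda.memory_reserved(device),
        "max_reserved": torch.cuda.max_memory_reserved(device),
    }


before = report_cuda_memory()
if before is not None:
    x = torch.empty(1024, 1024, device="cuda")  # 4 MiB of float32
    during = report_cuda_memory()
    del x                       # the block goes back to the cache, not the driver
    torch.cuda.empty_cache()    # releases unused cached blocks
    after = report_cuda_memory()
    print(before, during, after, sep="\n")
```

Comparing "allocated" with "reserved" before and after empty_cache() shows how much memory the cache was holding on behalf of already-freed tensors.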

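To better understand how CUDA memory is being used over time, see Understanding CUDA Memory Usage, which describes tools for capturing and visualizing traces of memory use.

For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.

Optimizing memory usage with `PYTORCH_CUDA_ALLOC_CONF`

Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options: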

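backend allows selecting the underlying allocator implementation. Currently, valid options are native, which uses PyTorch's native implementation, and cudaMallocAsync, which uses CUDA's built-in asynchronous allocator. cudaMallocAsync requires CUDA 11.4 or newer. The default is native. backend applies to all devices used by the process, and can't be specified on a per-device basis.
max_split_size_mb prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. The performance cost can range from "zero" to "substantial" depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to "out of memory" and showing a large amount of inactive split blocks. max_split_size_mb is only meaningful with backend:native. With backend:cudaMallocAsync, max_split_size_mb is ignored.
roundup_power2_divisions helps with rounding the requested allocation size to the nearest power-2 division and making better use of the blocks. In the native CUDACachingAllocator, sizes are rounded up in multiples of the block size of 512, so this works fine for smaller sizes. However, this can be inefficient for large near-by allocations, as each will go to a different size of block and reuse of those blocks is minimized. This might create lots of unused blocks and waste GPU memory capacity. This option enables rounding the allocation size to the nearest power-2 division. For example, if we need to round up a size of 1200 and the number of divisions is 4, the size 1200 lies between 1024 and 2048, and 4 divisions between them give the values 1024, 1280, 1536, and 1792. So an allocation size of 1200 is rounded up to 1280, the nearest ceiling of a power-2 division. Specify a single value to apply to all allocation sizes, or specify an array of key-value pairs to set the power-2 division individually for each power-of-two interval. For example, to set 1 division for all allocations under 256MB, 2 divisions for allocations between 256MB and 512MB, 4 divisions for allocations between 512MB and 1GB, and 8 divisions for any larger allocations, set the knob value to [256:1,512:2,1024:4,>:8]. roundup_power2_divisions is only meaningful with backend:native. With backend:cudaMallocAsync, roundup_power2_divisions is ignored.
max_non_split_rounding_mb allows non-split blocks to be reused more readily; e.g., a 1024MB cached block can be reused for a 512MB allocation request. By default, only up to 20MB of rounding is allowed for non-split blocks, so a 512MB request can only be served by a block of 512-532 MB. If this option is set to 1024, blocks of 512-1536 MB can be used for a 512MB request, which increases reuse of larger blocks. This also helps reduce stalls by avoiding expensive cudaMalloc calls.
garbage_collection_threshold helps actively reclaim unused GPU memory to avoid triggering the expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator starts reclaiming GPU memory blocks if the GPU memory capacity usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0. The default value is 1.0. garbage_collection_threshold is only meaningful with backend:native. With backend:cudaMallocAsync, garbage_collection_threshold is ignored.
expandable_segments (experimental, default: False) If set to True, this setting instructs the allocator to create CUDA allocations that can later be expanded, to better handle cases where a job changes allocation sizes frequently, such as having a changing batch size. Normally, for large (>2MB) allocations, the allocator calls cudaMalloc to get allocations that are the same size as what the user requests. In the future, parts of these allocations can be reused for other requests if they are free. This works well when the program makes many requests of exactly the same size or of sizes that are even multiples of that size. Many deep learning models follow this behavior. However, one common exception is when the batch size changes slightly from one iteration to the next, e.g. in batched inference. When the program runs initially with batch size N, it makes allocations appropriate for that size. If in the future it runs at size N - 1, the existing allocations are still big enough. However, if it runs at size N + 1, it has to make new allocations that are slightly larger. Not all the tensors are the same size. Some might be (N + 1)*A and others (N + 1)*A*B, where A and B are some non-batch dimensions in the model. Because the allocator reuses existing allocations when they are big enough, some number of (N + 1)*A allocations will actually fit in the already existing N*B*A segments, though not perfectly. As the model runs, it partially fills up all of these segments, leaving unusable free slices of memory at the end of them. At some point the allocator needs to cudaMalloc a new (N + 1)*A*B segment. If there is not enough memory, there is now no way to recover the slices of memory that are free at the end of existing segments. With models 50+ layers deep, this pattern might repeat 50+ times, creating many slivers.
expandable_segments allows the allocator to create a segment initially and then expand its size later when more memory is needed. Instead of making one segment per allocation, it tries to make one segment (per stream) that grows as necessary. Now when the N + 1 case runs, the allocations tile nicely into the one large segment until it fills up. Then more memory is requested and appended to the end of the segment. This process does not create as many slivers of unusable memory, so it is more likely to succeed at finding this memory.
pinned_use_cuda_host_register is a boolean flag that determines whether to use the CUDA API's cudaHostRegister function for allocating pinned memory instead of the default cudaHostAlloc. When set to True, the memory is allocated using regular malloc and then pages are mapped to the memory before calling cudaHostRegister. This pre-mapping of pages helps reduce the lock time during the execution of cudaHostRegister.
pinned_num_register_threads is only valid when pinned_use_cuda_host_register is set to True. By default, one thread is used to map the pages. This option allows using more threads to parallelize the page mapping operations and reduce the overall allocation time of pinned memory. Based on benchmarking results, a good value for this option is 8.
pinned_use_background_threads is a boolean flag that enables a background thread for processing events. This avoids any slow path associated with querying/processing of events in the fast allocation path. This feature is disabled by default.
graph_capture_record_stream_reuse (experimental, default: False) If set to True, the CUDA caching allocator attempts to reclaim device memory during CUDA Graph capture by using the graph topology (instead of CUDA events) to determine when a freed block is safe to reuse. This can reduce peak memory during long captures that free and reallocate buffers across multiple streams, especially when the capture DAG frequently reaches joined frontiers. Note: enabling this option can significantly increase the time spent capturing the graph.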
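The rounding rule described for roundup_power2_divisions can be reproduced with a few lines of arithmetic. The function below is a hypothetical illustration of the worked example in the text (1200 with 4 divisions rounds up to 1280); it is not the allocator's actual code and ignores the allocator's other rounding rules, such as the 512-byte block size.

```python
def round_up_power2_division(size: int, divisions: int) -> int:
    """Round size up to the next of `divisions` evenly spaced steps
    between the enclosing powers of two (illustrative sketch only)."""
    if size <= 0 or divisions <= 0:
        raise ValueError("size and divisions must be positive")
    lower = 1 << (size.bit_length() - 1)    # largest power of two <= size
    if size == lower:
        return size                          # already on a boundary
    step = max(lower // divisions, 1)        # spacing between candidate sizes
    intervals = -(-(size - lower) // step)   # ceiling division
    return lower + intervals * step


# Candidates between 1024 and 2048 with 4 divisions: 1024, 1280, 1536, 1792.
print(round_up_power2_division(1200, 4))
```

With a single division, every size between two powers of two rounds up to the upper power of two, which is the coarsest setting.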

Note

Some stats reported by the CUDA memory management API are specific to backend:native, and are not meaningful with backend:cudaMallocAsync. See each function's docstring for details.
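As a small sketch of the <option>:<value>,<option2>:<value2> format for PYTORCH_CUDA_ALLOC_CONF, the snippet below assembles a configuration string from a dictionary. The helper format_alloc_conf is hypothetical, and the option values (512 and 0.8) are arbitrary illustrations rather than recommendations. It assumes the variable is read when the allocator is initialized, so it is set before any CUDA work happens.

```python
import os


def format_alloc_conf(options: dict) -> str:
    """Join options into the <option>:<value>,<option2>:<value2> format."""
    return ",".join(f"{name}:{value}" for name, value in options.items())


conf = format_alloc_conf({
    "backend": "native",
    "max_split_size_mb": 512,              # illustrative value
    "garbage_collection_threshold": 0.8,   # illustrative value
})

# Assumption: the setting must be in the environment before the allocator
# starts, so the safest place is before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
print(conf)
```

The same string can equally be exported in the shell that launches the program.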

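Using custom memory allocators for CUDA

It is possible to define allocators as simple functions in C/C++ and compile them as a shared library. The code below shows a basic allocator that just traces all the memory operations.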

#include <sys/types.h>
#include <cuda_runtime_api.h>
#include <iostream>
// Compile with g++ alloc.cc -o alloc.so -I/usr/local/cuda/include -shared -fPIC
extern "C" {
void* my_malloc(ssize_t size, int device, cudaStream_t stream) {
   void *ptr;
   cudaMalloc(&ptr, size);
   std::cout<<"alloc "<<ptr<<" "<<size<<std::endl;
   return ptr;
}

void my_free(void* ptr, ssize_t size, int device, cudaStream_t stream) {
   std::cout<<"free "<<ptr<< " "<<stream<<std::endl;
   cudaFree(ptr);
}
}

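This can be used in Python through torch.cuda.memory.CUDAPluggableAllocator. The user is responsible for supplying the path to the .so file and the names of the alloc/free functions that match the signatures specified above.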

import torch

# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# Swap the current allocator
torch.cuda.memory.change_current_allocator(new_alloc)
# This will allocate memory in the device using the new allocator
b = torch.zeros(10, device='cuda')

import torch

# Do an initial memory allocator
b = torch.zeros(10, device='cuda')
# Load the allocator
new_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    'alloc.so', 'my_malloc', 'my_free')
# This will error since the current allocator was already instantiated
torch.cuda.memory.change_current_allocator(new_alloc)

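Mixing different CUDA system allocators in the same program

Depending on your use case, change_current_allocator() may not be what you want, since it swaps the CUDA allocator for the entire program (similar to PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync). For instance, if the swapped allocator doesn't have a caching mechanism, you will lose all the benefits of PyTorch's CUDACachingAllocator. Instead, you can selectively mark a region of PyTorch code to use a custom allocator with torch.cuda.MemPool. This lets you use multiple CUDA system allocators in the same PyTorch program, along with most of the benefits of the CUDACachingAllocator (e.g. caching). With torch.cuda.MemPool, you can take advantage of custom allocators that enable several features, such as:

Allocating output buffers for an all-reduce using the ncclMemAlloc allocator can enable NVLink Switch Reductions (NVLS). This can reduce contention between overlapping compute and communication kernels on GPU resources (SMs and Copy Engines), especially on tensor-parallel workloads.
For Grace CPU based systems, allocating host output buffers for an all-gather using cuMemCreate with CU_MEM_LOCATION_TYPE_HOST_NUMA can enable Extended GPU Memory (EGM) based memory transfers from source GPUs to the destination CPU. This accelerates the all-gather since the transfer happens over NVLinks, which otherwise would have happened over bandwidth-limited Network Interface Card (NIC) links. Such an accelerated all-gather can in turn speed up model checkpointing.
If you are crafting a model and don't want to think about the optimal memory placement of a memory-intensive module at first (e.g. an embedding table), or perhaps you have a module which is not performance sensitive and doesn't fit in the GPU, then you could just allocate that module with cudaMallocManaged with preferred CPU location and get your model working first.

Note

While cudaMallocManaged offers convenient automatic memory management using CUDA Unified Virtual Memory (UVM), it is not recommended for DL workloads. For DL workloads that fit in GPU memory, explicit placement consistently outperforms UVM, since there are no page faults and access patterns remain predictable. When GPU memory gets saturated, UVM has to perform costly double transfers, evicting pages to the CPU before bringing in new ones.

The code below shows ncclMemAlloc wrapped in a torch.cuda.memory.CUDAPluggableAllocator.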

import os

import torch
import torch.distributed as dist
from torch.cuda.memory import CUDAPluggableAllocator
from torch.distributed.distributed_c10d import _get_default_group
from torch.utils import cpp_extension


# create allocator
nccl_allocator_source = """
#include <nccl.h>
#include <iostream>
extern "C" {

void* nccl_alloc_plug(size_t size, int device, void* stream) {
  std::cout << "Using ncclMemAlloc" << std::endl;
  void* ptr;
  ncclResult_t err = ncclMemAlloc(&ptr, size);
  return ptr;

}

void nccl_free_plug(void* ptr, size_t size, int device, void* stream) {
  std::cout << "Using ncclMemFree" << std::endl;
  ncclResult_t err = ncclMemFree(ptr);
}

}
"""
nccl_allocator_libname = "nccl_allocator"
nccl_allocator = torch.utils.cpp_extension.load_inline(
    name=nccl_allocator_libname,
    cpp_sources=nccl_allocator_source,
    with_cuda=True,
    extra_ldflags=["-lnccl"],
    verbose=True,
    is_python_module=False,
    build_directory="./",
)

allocator = CUDAPluggableAllocator(
    f"./{nccl_allocator_libname}.so", "nccl_alloc_plug", "nccl_free_plug"
).allocator()

# setup distributed
rank = int(os.getenv("RANK"))
local_rank = int(os.getenv("LOCAL_RANK"))
world_size = int(os.getenv("WORLD_SIZE"))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")
default_pg = _get_default_group()
backend = default_pg._get_backend(device)

# Note: for convenience, ProcessGroupNCCL backend provides
# the ncclMemAlloc allocator as backend.mem_allocator
allocator = backend.mem_allocator

You can now define a new memory pool by passing this allocator to torch.cuda.MemPool:

pool = torch.cuda.MemPool(allocator)

The pool can then be used with the torch.cuda.use_mem_pool context manager to allocate tensors:

with torch.cuda.use_mem_pool(pool):
    # tensor gets allocated with ncclMemAlloc passed in the pool
    tensor = torch.arange(1024 * 1024 * 2, device=device)
    print(f"tensor ptr on rank {rank} is {hex(tensor.data_ptr())}")

# register user buffers using ncclCommRegister (called under the hood)
backend.register_mem_pool(pool)

# Collective uses Zero Copy NVLS
dist.all_reduce(tensor[0:4])
torch.cuda.synchronize()
print(tensor[0:4])

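Note the usage of register_mem_pool in the above example. This is an extra step for NVLS reductions, where the user buffers need to be registered with NCCL. A user can de-register the buffers with a similar deregister_mem_pool call.

To reclaim memory, users first need to ensure nothing is using the pool. When none of the tensors are holding a reference to the pool, empty_cache() is called internally on deletion of the pool, hence returning all the memory to the system.

del tensor, pool

Users can optionally specify a use_on_oom bool (which is False by default) during MemPool creation. If True, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.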

pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    # torch.empty is used because torch.randn does not support torch.uint8
    a = torch.empty(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a

# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.empty(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")

The following torch.cuda.MemPool.use_count() and torch.cuda.MemPool.snapshot() APIs can be used for debugging purposes:

pool = torch.cuda.MemPool(allocator)

# pool's use count should be 1 at this point as MemPool object
# holds a reference
assert pool.use_count() == 1

nelem_1mb = 1024 * 1024 // 4

with torch.cuda.use_mem_pool(pool):
    out_0 = torch.randn(nelem_1mb, device="cuda")

    # pool's use count should be 2 at this point as use_mem_pool
    # holds a reference
    assert pool.use_count() == 2

# pool's use count should be back to 1 at this point as use_mem_pool
# released its reference
assert pool.use_count() == 1

with torch.cuda.use_mem_pool(pool):
    # pool should have 1 segment since we made a small allocation (1 MB)
    # above and so the CUDACachingAllocator packed it into a 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_1 = torch.randn(nelem_1mb, device="cuda")

    # pool should still have 1 segment since we made another small allocation
    # (1 MB) that got packed into the existing 2 MB buffer
    assert len(pool.snapshot()) == 1

    out_2 = torch.randn(nelem_1mb, device="cuda")

    # pool now should have 2 segments since the CUDACachingAllocator had
    # to make a new 2 MB buffer to accommodate out_2
    assert len(pool.snapshot()) == 2

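Note

torch.cuda.MemPool holds a reference to the pool. When you use the torch.cuda.use_mem_pool context manager, it also acquires another reference to the pool. On exit of the context manager, it releases its reference. After that, ideally only tensors should hold references to the pool. Once the tensors release their references, the use count of the pool will be 1, reflecting that only the torch.cuda.MemPool object is holding a reference. Only at that point can the memory held by the pool be returned to the system, when the pool's destructor is called using del.
torch.cuda.MemPool doesn't currently support the expandable_segments mode of the CUDACachingAllocator.
NCCL has specific requirements for a buffer to be compatible with NVLS reductions. These requirements can be broken in a dynamic workload; for instance, the buffer being sent to NCCL by the CUDACachingAllocator might be split and hence not correctly aligned. In those cases, NCCL can use a fallback algorithm instead of NVLS.
Allocators like ncclMemAlloc can use more memory than requested, due to alignment requirements (CU_MULTICAST_GRANULARITY_RECOMMENDED, CU_MULTICAST_GRANULARITY_MINIMUM), and can cause your workload to run out of memory.

Tuning NVLink performance with a custom memory allocator on H100/H200 GPUs

In rare cases, the performance of NVLink on H100/H200 GPUs can be influenced by the physical memory layout of data, creating an opportunity for developers to tune their applications for optimal throughput.

An example of how the physical memory layout of data affects performance is when communication kernels issue unbalanced NVLink read/write operations. In the following figure, each warp accesses memory addresses with a consistent strided pattern in each single wave. We can get a more balanced load by tuning the stride size in the workload, or we can implement a custom CUDA allocator.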

_______________________________  _______________________________      _______________________________
| Warp 0 Reading | No-reading |  | Warp 1 Reading | No-reading |  ...  Warp N Reading | No-reading |
_______________________________  _______________________________      _______________________________
<----------------------------->
        Stride size

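Such an allocator can maintain contiguous virtual memory addresses for the kernels while strategically arranging the mapping to physical memory addresses (e.g., through shuffling). This technique allows developers to explore different physical access patterns to find the most efficient one, unlocking higher performance without modifying the kernel logic. A practical implementation of such an allocator can be achieved with PyTorch's custom allocator support as mentioned before, where the malloc and free functions are: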

// assuming a system with 8 GPUs
struct CustomAllocInfo {
  void** devPtr;  // This will be the usable virtual memory address
  CUdeviceptr dptr;
  size_t totalSize;  // Total size of the allocated memory
  size_t padded_size;
  int device_id;
  std::vector<CUmemGenericAllocationHandle> handles;  // Handles to physical memory allocations
};

// loop over pages
cudaError_t customCudaMalloc(CustomAllocInfo* info) {
    if (!info) return cudaErrorInvalidValue;

    CUdeviceptr dptr;

    // Handles to redundant physical memory allocations which help truncate stride pattern in physical memory
    std::vector<CUmemGenericAllocationHandle> handles_redundant;

    size_t granularity = 0;
    CUmemAllocationProp prop = {};

    int currentDev = info->device_id;
    size_t totalSize = info->totalSize;

    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDev;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded_size = ROUND_UP(totalSize, granularity);

    info->padded_size = padded_size;

    // loop over pages
    size_t iter_granularity = granularity * 64; // 64 * granularity with shift_size = 2 works
    uint32_t iteration_count = (totalSize + iter_granularity - 1) / iter_granularity;

    cuMemAddressReserve(&dptr, padded_size, 0ULL, 0ULL, 0ULL);

    const int shift_size = 2;
    for (size_t i = 0; i < iteration_count; i+=shift_size) {

        CUmemGenericAllocationHandle allocHandle[shift_size];
        for (int shift = 0; (shift < shift_size)&&(i+shift < iteration_count); shift++){
            CHECK_CUDA(cuMemCreate(&allocHandle[shift], iter_granularity, &prop, 0));
            info->handles.push_back(allocHandle[shift]);
        }

        for (int shift = 0; (shift < shift_size)&&(i+shift < iteration_count); shift++){

            // mapping makes the shift (shift -> (shift+1)%shift_size  )
            CHECK_CUDA(cuMemMap(dptr + (i+shift) * iter_granularity, iter_granularity, 0, allocHandle[(shift+1)%shift_size], 0));

            setupMultiGPUAccess(dptr + (i+shift) * iter_granularity, iter_granularity, {0, 1, 2, 3, 4, 5, 6, 7}); // Enable access for all 8 GPUs
        }

        // std::cout << "Here we allocate one redundant page (2MB)..." << std::endl;
        // this is an extra optimization on top of the swizzling. It helps "break"
        // the physical access pattern even more. It can be left out if workload is already
        // performing at SOL with just swizzling.
        CUmemGenericAllocationHandle allocHandle_redundant;
        CHECK_CUDA(cuMemCreate(&allocHandle_redundant, granularity, &prop, 0));
        handles_redundant.push_back(allocHandle_redundant);
    }

    *info->devPtr = (void*)dptr;
    info->dptr = dptr;

    // Release each redundant allocation
    for (auto handle : handles_redundant) {
        // std::cout << "Here we release one redundant page (2MB)..." << std::endl;
        CHECK_CUDA(cuMemRelease(handle));
    }

    return cudaSuccess;
}

void customCudaFree(CustomAllocInfo* info) {
    if (!info) return;

    // CHECK_CUDA(cudaSetDevice(info->device_id));

    CHECK_CUDA(cuMemUnmap(info->dptr, info->padded_size));

    // Unmap and release each allocation
    for (auto handle : info->handles) {
        CHECK_CUDA(cuMemRelease(handle));
    }

    // Unreserve the virtual address space
    // CHECK_CUDA(cuMemAddressFree((CUdeviceptr)*info->devPtr, info->padded_size));
    CHECK_CUDA(cuMemAddressFree(info->dptr, info->padded_size));
}

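cuBLAS workspaces

For each combination of cuBLAS handle and CUDA stream, a cuBLAS workspace is allocated if that handle and stream combination executes a cuBLAS kernel that requires a workspace. To avoid repeatedly allocating workspaces, these workspaces are not deallocated unless torch._C._cuda_clearCublasWorkspaces() is called. The workspace size per allocation can be specified via the environment variable CUBLAS_WORKSPACE_CONFIG with the format :[SIZE]:[COUNT]. As an example, the default workspace size per allocation is CUBLAS_WORKSPACE_CONFIG=:4096:2:16:8, which specifies a total size of 2 * 4096 + 8 * 16 KiB. To force cuBLAS to avoid using workspaces, set CUBLAS_WORKSPACE_CONFIG=:0:0.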
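The total-size arithmetic for a CUBLAS_WORKSPACE_CONFIG value can be checked with a short sketch. The parser below is a hypothetical helper written only to mirror the :[SIZE]:[COUNT] format described above (sizes in KiB); it is not how cuBLAS or PyTorch reads the variable.

```python
def cublas_workspace_total_kib(config: str) -> int:
    """Sum SIZE * COUNT over every :[SIZE]:[COUNT] pair, in KiB."""
    fields = [int(field) for field in config.split(":") if field]
    if len(fields) % 2 != 0:
        raise ValueError("expected :[SIZE]:[COUNT] pairs")
    sizes, counts = fields[0::2], fields[1::2]
    return sum(size * count for size, count in zip(sizes, counts))


# Default quoted in the text: 2 * 4096 + 8 * 16 KiB
print(cublas_workspace_total_kib(":4096:2:16:8"))
```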

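cuFFT plan cache

For each CUDA device, an LRU cache of cuFFT plans is used to speed up repeatedly running FFT methods (e.g., torch.fft.fft()) on CUDA tensors of the same geometry with the same configuration. Because some cuFFT plans may allocate GPU memory, these caches have a maximum capacity.

You may control and query the properties of the cache of the current device with the following APIs:

torch.backends.cuda.cufft_plan_cache.max_size gives the capacity of the cache (default is 4096 on CUDA 10 and newer, and 1023 on older CUDA versions). Setting this value directly modifies the capacity.
torch.backends.cuda.cufft_plan_cache.size gives the number of plans currently residing in the cache.
torch.backends.cuda.cufft_plan_cache.clear() clears the cache.

To control and query plan caches of a non-default device, you can index the torch.backends.cuda.cufft_plan_cache object with either a torch.device object or a device index, and access one of the above attributes. E.g., to set the capacity of the cache for device 1, you can write torch.backends.cuda.cufft_plan_cache[1].max_size = 10.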

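Just-in-Time compilation

PyTorch just-in-time compiles some operations, like torch.special.zeta, when performed on CUDA tensors. This compilation can be time consuming (up to a few seconds depending on your hardware and software) and may occur multiple times for a single operator, since many PyTorch operators actually select from a variety of kernels, each of which must be compiled once, depending on their input. This compilation occurs once per process, or just once if a kernel cache is used.

By default, PyTorch creates a kernel cache in $XDG_CACHE_HOME/torch/kernels if XDG_CACHE_HOME is defined and $HOME/.cache/torch/kernels if it's not (except on Windows, where the kernel cache is not yet supported). The caching behavior can be directly controlled with two environment variables. If USE_PYTORCH_KERNEL_CACHE is set to 0, then no cache is used, and if PYTORCH_KERNEL_CACHE_PATH is set, then that path is used as the kernel cache instead of the default location.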
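The lookup order described above can be written out as a small function, which is handy when checking where a given environment would place the cache. expected_kernel_cache_dir is a hypothetical helper that only restates the rules from the text; it does not call into PyTorch and may differ from the actual implementation in edge cases. It also does not model the Windows exception.

```python
import posixpath
from typing import Mapping, Optional


def expected_kernel_cache_dir(env: Mapping[str, str]) -> Optional[str]:
    """Return the cache directory implied by the rules above, or None."""
    if env.get("USE_PYTORCH_KERNEL_CACHE") == "0":
        return None                                  # caching disabled
    if env.get("PYTORCH_KERNEL_CACHE_PATH"):
        return env["PYTORCH_KERNEL_CACHE_PATH"]      # explicit override
    if env.get("XDG_CACHE_HOME"):
        return posixpath.join(env["XDG_CACHE_HOME"], "torch", "kernels")
    if env.get("HOME"):
        return posixpath.join(env["HOME"], ".cache", "torch", "kernels")
    return None


# Example environments (paths are made up for illustration).
print(expected_kernel_cache_dir({"HOME": "/home/user"}))
print(expected_kernel_cache_dir({"HOME": "/home/user", "XDG_CACHE_HOME": "/tmp/xdg"}))
```

Passing os.environ instead of a literal dictionary applies the same rules to the current process.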

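Best practices

Device-agnostic code

Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code; an example may be creating a new tensor as the initial hidden state of a recurrent neural network.

The first step is to determine whether the GPU should be used or not. A common pattern is to use Python's argparse module to read in user arguments, and have a flag that can be used to disable CUDA, in combination with is_available(). In the following, args.device results in a torch.device object that can be used to move tensors to CPU or CUDA.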

import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

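Note

When assessing the availability of CUDA in a given environment (is_available()), PyTorch's default behavior is to call the CUDA Runtime API method cudaGetDeviceCount. Because this call in turn initializes the CUDA Driver API (via cuInit) if it is not already initialized, subsequent forks of a process that has run is_available() will fail with a CUDA initialization error.

You can set PYTORCH_NVML_BASED_CUDA_CHECK=1 in your environment before importing PyTorch modules that execute is_available() (or before executing it directly) to direct is_available() to attempt an NVML-based assessment (nvmlDeviceGetCount_v2). If the NVML-based assessment is successful (i.e. NVML discovery/initialization does not fail), is_available() calls will not poison subsequent forks.

If NVML discovery/initialization fails, is_available() falls back to the standard CUDA Runtime API assessment and the aforementioned fork constraint applies.

Note that the above NVML-based CUDA availability assessment provides a weaker guarantee than the default CUDA Runtime API approach (which requires CUDA initialization to succeed). In some circumstances, the NVML-based check may succeed while later CUDA initialization fails.

Now that we have args.device, we can use it to create a tensor on the desired device.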

x = torch.empty((8, 42), device=args.device)
net = Network().to(device=args.device)

This can be used in a number of cases to produce device-agnostic code. Below is an example when using a dataloader:

cuda0 = torch.device('cuda:0')  # CUDA GPU 0
for i, x in enumerate(train_loader):
    x = x.to(cuda0)

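When working with multiple GPUs on a system, you can use the CUDA_VISIBLE_DEVICES environment flag to manage which GPUs are available to PyTorch. As mentioned above, to manually control which GPU a tensor is created on, the best practice is to use a torch.cuda.device context manager.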

print("Outside device is 0")  # On device 0 (default in most scenarios)
with torch.cuda.device(1):
    print("Inside device is 1")  # On device 1
print("Outside device is still 0")  # On device 0

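If you have a tensor and would like to create a new tensor of the same type on the same device, then you can use a torch.Tensor.new_* method (see torch.Tensor). Whilst the previously mentioned torch.* factory functions (Creation Ops) depend on the current GPU context and the attributes arguments you pass in, torch.Tensor.new_* methods preserve the device and other attributes of the tensor.

This is the recommended practice when creating modules in which new tensors need to be created internally during the forward pass.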

cuda = torch.device('cuda')
x_cpu = torch.empty(2)
x_gpu = torch.empty(2, device=cuda)
x_cpu_long = torch.empty(2, dtype=torch.int64)

y_cpu = x_cpu.new_full([3, 2], fill_value=0.3)
print(y_cpu)

    tensor([[ 0.3000,  0.3000],
            [ 0.3000,  0.3000],
            [ 0.3000,  0.3000]])

y_gpu = x_gpu.new_full([3, 2], fill_value=-5)
print(y_gpu)

    tensor([[-5.0000, -5.0000],
            [-5.0000, -5.0000],
            [-5.0000, -5.0000]], device='cuda:0')

y_cpu_long = x_cpu_long.new_tensor([[1, 2, 3]])
print(y_cpu_long)

    tensor([[ 1,  2,  3]])

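If you want to create a tensor of the same type and size as another tensor, and fill it with either ones or zeros, ones_like() or zeros_like() are provided as convenient helper functions (which also preserve the torch.device and torch.dtype of a Tensor).

x_cpu = torch.empty(2, 3)
x_gpu = torch.empty(2, 3, device='cuda')  # on the GPU, so y_gpu below is too

y_cpu = torch.ones_like(x_cpu)
y_gpu = torch.zeros_like(x_gpu)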

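Use pinned memory buffers

Warning

This is an advanced tip. If you overuse pinned memory, it can cause serious problems when running low on RAM, and you should be aware that pinning is often an expensive operation.

Host to GPU copies are much faster when they originate from pinned (page-locked) memory. CPU tensors and storages expose a pin_memory() method that returns a copy of the object, with data put in a pinned region.

Also, once you pin a tensor or storage, you can use asynchronous GPU copies. Just pass an additional non_blocking=True argument to a to() or a cuda() call. This can be used to overlap data transfers with computation.

You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.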
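Putting the two pieces together, a minimal sketch of a loading loop might look as follows. The dataset is random placeholder data, and pinning is enabled only when CUDA is available so that the snippet also runs on CPU-only machines (where non_blocking=True is accepted and the tensors simply stay on the CPU).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Placeholder data: 8 samples with 3 features each, plus integer labels.
dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))

# pin_memory=True makes the loader return batches in page-locked memory.
loader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

seen = 0
for features, labels in loader:
    # With pinned source memory, these copies can overlap with GPU compute.
    features = features.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    seen += features.shape[0]
```

GPU work queued after the copies on the same stream sees the copied data; reading the result back on the CPU still requires the usual synchronization.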

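Use nn.parallel.DistributedDataParallel instead of multiprocessing or nn.DataParallel

Most use cases involving batched inputs and multiple GPUs should default to using DistributedDataParallel to utilize more than one GPU.

There are significant caveats to using CUDA models with multiprocessing; unless care is taken to meet the data handling requirements exactly, it is likely that your program will have incorrect or undefined behavior.

It is recommended to use DistributedDataParallel instead of DataParallel for multi-GPU training, even if there is only a single node.

The difference between DistributedDataParallel and DataParallel is that DistributedDataParallel uses multiprocessing, where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, which avoids the performance overhead caused by the GIL of the Python interpreter.

If you use DistributedDataParallel, you can use the torch.distributed.launch utility to launch your program; see Launch utility.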

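CUDA Graphs

A CUDA graph is a record of the work (mostly kernels and their arguments) that a CUDA stream and its dependent streams perform. For general principles and details on the underlying CUDA API, see Getting Started with CUDA Graphs and the Graphs section of the CUDA C Programming Guide.

PyTorch supports the construction of CUDA graphs using *stream capture*, which puts a CUDA stream in *capture mode*. CUDA work issued to a capturing stream doesn't actually run on the GPU. Instead, the work is recorded in a graph.

After capture, the graph can be *replayed* to run the GPU work as many times as needed. Each replay runs the same kernels with the same arguments. For pointer arguments this means the same memory addresses are used. By filling input memory with new data (e.g., from a new batch) before each replay, you can rerun the same work on new data.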

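Why CUDA Graphs?

Replaying a graph sacrifices the dynamic flexibility of typical eager execution in exchange for **greatly reduced CPU overhead**. A graph's arguments and kernels are fixed, so a graph replay skips all layers of argument setup and kernel dispatch, including Python, C++, and CUDA driver overheads. Under the hood, a replay submits the entire graph's work to the GPU with a single call to cudaGraphLaunch. Kernels in a replay also execute slightly faster on the GPU, but eliding CPU overhead is the main benefit.

You should try CUDA graphs if all or part of your network is graph-safe (usually this means static shapes and static control flow, but see the other constraints) and you suspect its runtime is at least somewhat CPU-limited.

PyTorch API

Warning

This API is in beta and may change in future releases.

PyTorch exposes graphs via a raw torch.cuda.CUDAGraph class and two convenience wrappers, torch.cuda.graph and torch.cuda.make_graphed_callables.

torch.cuda.graph is a simple, versatile context manager that captures CUDA work in its context. Before capture, warm up the workload to be captured by running a few eager iterations. Warmup must occur on a side stream. Because the graph reads from and writes to the same memory addresses in every replay, you must maintain long-lived references to tensors that hold input and output data during capture. To run the graph on new input data, copy the new data to the capture's input tensor(s), replay the graph, then read the new output from the capture's output tensor(s). Example: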

g = torch.cuda.CUDAGraph()

# Placeholder input used for capture
static_input = torch.empty((5,), device="cuda")

# Warmup before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = static_input * 2
torch.cuda.current_stream().wait_stream(s)

# Captures the graph
# To allow capture, automatically sets a side stream as the current stream in the context
with torch.cuda.graph(g):
    static_output = static_input * 2

# Fills the graph's input memory with new data to compute on
static_input.copy_(torch.full((5,), 3, device="cuda"))
g.replay()
# static_output holds the results
print(static_output)  # full of 3 * 2 = 6

# Fills the graph's input memory with more data to compute on
static_input.copy_(torch.full((5,), 4, device="cuda"))
g.replay()
print(static_output)  # full of 4 * 2 = 8

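For real-world and advanced patterns, see Whole-network capture, Usage with torch.cuda.amp, and Usage with multiple streams.

make_graphed_callables is more sophisticated. It accepts Python functions and torch.nn.Modules. For each function or Module passed in, it creates separate graphs of the forward-pass and backward-pass work. See Partial-network capture.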

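Constraints#

A set of ops is *capturable* if it does not violate any of the following constraints.

Constraints apply to all work in a torch.cuda.graph context and to all work in the forward and backward passes of any callable you pass to torch.cuda.make_graphed_callables().

Violating any of the following will likely cause a runtime error:

Capture must occur on a non-default stream. (This is only a concern if you use the raw CUDAGraph.capture_begin and CUDAGraph.capture_end calls. graph and make_graphed_callables() set a side stream for you.)
Ops that synchronize the CPU with the GPU (e.g., .item() calls) are prohibited.
CUDA RNG operations are permitted. When using multiple torch.Generator instances within a graph, they must be registered using CUDAGraph.register_generator_state before graph capture. Avoid using Generator.get_state and Generator.set_state during capture; instead, use Generator.graphsafe_set_state and Generator.graphsafe_get_state to manage generator states safely within the graph context. This ensures proper RNG operation and generator management within CUDA graphs.

Violating any of the following will likely cause silent numerical errors or undefined behavior:

Within a process, only one capture may be underway at a time.
No non-captured CUDA work may run in this process (on any thread) while capture is underway.
CPU work is not captured. If the captured ops include CPU work, that work will be elided during replay.
Every replay reads from and writes to the same (virtual) memory addresses.
Dynamic control flow (based on CPU or GPU data) is prohibited.
Dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.
Using multiple streams in a capture is allowed, but there are restrictions.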
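One common way to stay within the synchronization and control-flow constraints above is to keep a data-dependent choice on the device instead of branching in Python. The sketch below contrasts the two styles. The function names are hypothetical, and the example runs on CPU tensors so that it can be tried without a GPU. Computing both branches costs extra work, so this is a trade-off rather than a universal fix.

```python
import torch


def branch_with_sync(x, w_pos, w_neg):
    # .item() copies a value to the host, which forces a CPU-GPU sync for
    # CUDA tensors, and the Python `if` is dynamic control flow.
    # Neither is allowed during capture.
    if x.sum().item() > 0:
        return x @ w_pos
    return x @ w_neg


def branch_on_device(x, w_pos, w_neg):
    # Compute both candidates and select the result on-device. The sequence
    # of ops is identical for every input, at the cost of extra compute.
    return torch.where(x.sum() > 0, x @ w_pos, x @ w_neg)


torch.manual_seed(0)
x = torch.randn(4, 3)
w_pos = torch.randn(3, 2)
w_neg = torch.randn(3, 2)

# Both versions agree, whichever branch the data selects.
agree_pos = torch.equal(branch_with_sync(x, w_pos, w_neg),
                        branch_on_device(x, w_pos, w_neg))
agree_neg = torch.equal(branch_with_sync(-x, w_pos, w_neg),
                        branch_on_device(-x, w_pos, w_neg))
print(agree_pos, agree_neg)
```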

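Non-constraints#

Once captured, the graph may be replayed on any stream.

Whole-network capture#

If your entire network is capturable, you can capture and replay an entire iteration: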

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

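Partial-network capture#

If some of your network is unsafe to capture (e.g., due to dynamic control flow, dynamic shapes, CPU syncs, or essential CPU-side logic), you can run the unsafe part(s) eagerly and use torch.cuda.make_graphed_callables() to graph only the capture-safe part(s).

By default, callables returned by make_graphed_callables() are autograd-aware and can be used in the training loop as direct replacements for the functions or nn.Modules you passed in.

make_graphed_callables() internally creates CUDAGraph objects, runs warmup iterations, and maintains static inputs and outputs as needed. Therefore (unlike with torch.cuda.graph) you do not need to handle those manually.

In the following example, data-dependent dynamic control flow means the network is not capturable end-to-end, but make_graphed_callables() lets us capture and run the graph-safe sections as graphs regardless: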

from itertools import chain

N, D_in, H, D_out = 640, 4096, 2048, 1024

module1 = torch.nn.Linear(D_in, H).cuda()
module2 = torch.nn.Linear(H, D_out).cuda()
module3 = torch.nn.Linear(H, D_out).cuda()

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(chain(module1.parameters(),
                                  module2.parameters(),
                                  module3.parameters()),
                            lr=0.1)

# Sample inputs used for capture
# requires_grad state of sample inputs must match
# requires_grad state of real inputs each callable will see.
x = torch.randn(N, D_in, device='cuda')
h = torch.randn(N, H, device='cuda', requires_grad=True)

module1 = torch.cuda.make_graphed_callables(module1, (x,))
module2 = torch.cuda.make_graphed_callables(module2, (h,))
module3 = torch.cuda.make_graphed_callables(module3, (h,))

real_inputs = [torch.rand_like(x) for _ in range(10)]
real_targets = [torch.randn(N, D_out, device="cuda") for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    optimizer.zero_grad(set_to_none=True)

    tmp = module1(data)  # forward ops run as a graph

    if tmp.sum().item() > 0:
        tmp = module2(tmp)  # forward ops run as a graph
    else:
        tmp = module3(tmp)  # forward ops run as a graph

    loss = loss_fn(tmp, target)
    # module2's or module3's (whichever was chosen) backward ops,
    # as well as module1's backward ops, run as graphs
    loss.backward()
    optimizer.step()

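Usage with torch.cuda.amp#

For typical optimizers, GradScaler.step syncs the CPU with the GPU, which is prohibited during capture. To avoid errors, either use partial-network capture, or (if forward, loss, and backward are capture-safe) capture forward, loss, and backward but not the optimizer step: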

# warmup
# In a real setting, use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            y_pred = model(static_input)
            loss = loss_fn(y_pred, static_target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    with torch.cuda.amp.autocast():
        static_y_pred = model(static_input)
        static_loss = loss_fn(static_y_pred, static_target)
    scaler.scale(static_loss).backward()
    # don't capture scaler.step(optimizer) or scaler.update()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
    # Runs scaler.step and scaler.update eagerly
    scaler.step(optimizer)
    scaler.update()

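Usage with multiple streams#

Capture mode automatically propagates to any stream that syncs with a capturing stream. Within capture, you may expose parallelism by issuing calls to different streams, but the overall stream dependency DAG must branch out from the initial capturing stream and rejoin the initial stream before capture ends: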

with torch.cuda.graph(g):
    # at context manager entrance, torch.cuda.current_stream()
    # is the initial capturing stream

    # INCORRECT (does not branch out from or rejoin initial stream)
    with torch.cuda.stream(s):
        cuda_work()

    # CORRECT:
    # branches out from initial stream
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        cuda_work()
    # rejoins initial stream before capture ends
    torch.cuda.current_stream().wait_stream(s)

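Note

To avoid confusion for power users looking at replays in nsight systems or nvprof: unlike eager execution, the graph interprets a nontrivial stream DAG in capture as a hint, not a command. During replay, the graph may reorganize independent ops onto different streams or enqueue them in a different order (while respecting your original DAG's overall dependencies).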

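Usage with DistributedDataParallel#

NCCL < 2.9.6#

NCCL versions earlier than 2.9.6 do not allow collectives to be captured. You must use partial-network capture, which defers allreduces to happen outside the graphed sections of backward.

Call make_graphed_callables() on the graphable network sections before wrapping the network with DDP.

NCCL >= 2.9.6#

NCCL versions 2.9.6 or later allow collectives in the graph. Approaches that capture an *entire backward pass* are a viable option, but need three setup steps.

Disable DDP's internal async error handling: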

os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "0"
torch.distributed.init_process_group(...)

Before full-backward capture, DDP must be constructed in a side-stream context:

with torch.cuda.stream(s):
    model = DistributedDataParallel(model)

Your warmup must run at least 11 DDP-enabled eager iterations before capture.

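Graph memory management#

A captured graph acts on the same virtual addresses every time it replays. If PyTorch frees the memory, a later replay can hit an illegal memory access. If PyTorch reassigns the memory to new tensors, the replay can corrupt the values seen by those tensors. Therefore, the virtual addresses used by the graph must be reserved for the graph across replays. The PyTorch caching allocator achieves this by detecting when capture is underway and satisfying the capture's allocations from a graph-private memory pool. The private pool stays alive until its CUDAGraph object and all tensors created during capture go out of scope.

Private pools are maintained automatically. By default, the allocator creates a separate private pool for each capture. If you capture multiple graphs, this conservative approach ensures that graph replays never corrupt each other's values, but it sometimes wastes memory needlessly.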

Optimizing memory usage with `PYTORCH_CUDA_ALLOC_CONF`#