torch.profiler#

Created On: Dec 18, 2020 | Last Updated On: Jun 13, 2025

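Overview#

PyTorch Profiler is a tool that collects performance metrics during training and inference. Its context manager API can be used to better understand which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace.

Note

An earlier version of the API in the torch.autograd module is considered legacy and will be deprecated.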

API Reference#

class torch.profiler._KinetoProfile(*, activities=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, custom_trace_id_callback=None)[source]#

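Low-level profiler that wraps the autograd profile

Parameters

activities (iterable) – list of activity groups (CPU, CUDA, XPU) to use in profiling, supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
record_shapes (bool) – save information about operator input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation (see export_memory_timeline for more details).
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use a formula to estimate the FLOPs of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the call stack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used by profiler libraries such as Kineto. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles

Note

This API is experimental and subject to change in the future.

Enabling shape and stack tracing results in additional overhead. When record_shapes=True is specified, the profiler temporarily holds references to the tensors; this may further prevent certain optimizations that depend on the reference count and may introduce extra tensor copies.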

add_metadata(key, value)[source]#

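Adds user-defined metadata with a string key and a string value to the trace file

add_metadata_json(key, value)[source]#

Adds user-defined metadata with a string key and a valid JSON value to the trace file

events()[source]#

Returns the list of unaggregated profiler events, to be used in the trace callback or after the profiling is finished

export_chrome_trace(path)[source]#

Exports the collected trace in Chrome JSON format. If kineto is enabled, only the last cycle in the schedule is exported.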
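As a minimal sketch of export_chrome_trace, the snippet below profiles a small CPU-only matrix multiplication and writes the trace to a temporary file. The helper name export_cpu_trace, the temporary path, and the tensor sizes are illustrative assumptions; the top-level "traceEvents" key is what Kineto builds typically write, so the code falls back to treating the file as a plain list otherwise.

```python
import json
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile


def export_cpu_trace(path):
    """Profile a small CPU matmul and write a Chrome-format trace to `path`.

    Hypothetical helper for illustration; not part of torch.profiler.
    """
    x = torch.randn(64, 64)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        torch.mm(x, x)
    prof.export_chrome_trace(path)


# Assumption: a throwaway temporary directory is an acceptable output location.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
export_cpu_trace(trace_path)

with open(trace_path) as f:
    trace = json.load(f)

# Kineto builds usually wrap events in a dict; other builds may write a bare list.
events = trace["traceEvents"] if isinstance(trace, dict) else trace
print(len(events), "events written to", trace_path)
```

The resulting file can be opened in a Chrome-trace-compatible viewer.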

export_memory_timeline(path, device=None)[source]#

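Exports memory event information from the profiler's collected tree for a given device, and exports a timeline plot. export_memory_timeline can write 3 kinds of files, each selected by the suffix of path.

For an HTML-compatible plot, use the suffix .html; the memory timeline plot is embedded as a PNG file in the HTML file.
For plot points consisting of [times, [sizes by category]], where times are timestamps and sizes are the memory usage for each category, use the suffix .json or .json.gz; the memory timeline is saved as JSON (.json) or gzipped JSON (.json.gz) depending on the suffix.
For raw memory points, use the suffix .raw.json.gz. Each raw memory event consists of (timestamp, action, numbytes, category), where action is one of [PREEXISTING, CREATE, INCREMENT_VERSION, DESTROY] and category is one of the enums from torch.profiler._memory_profiler.Category.

Output: the memory timeline written as gzipped JSON, JSON, or HTML.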
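The suffix rules above can be summarized in a small helper. This is only an illustrative sketch: memory_timeline_format is a hypothetical function, not part of torch.profiler, and it mirrors the documented suffixes rather than the library's actual implementation.

```python
def memory_timeline_format(path: str) -> str:
    """Name the output kind that the documented suffix rules select for `path`.

    Hypothetical helper for illustration only.
    """
    if path.endswith(".html"):
        return "html plot"
    # Check the raw suffix first: ".raw.json.gz" also ends with ".json.gz".
    if path.endswith(".raw.json.gz"):
        return "raw events (gzipped JSON)"
    if path.endswith(".json.gz"):
        return "plot points (gzipped JSON)"
    if path.endswith(".json"):
        return "plot points (JSON)"
    raise ValueError(f"no documented memory timeline format for {path!r}")


for name in ("mem.html", "mem.json", "mem.json.gz", "mem.raw.json.gz"):
    print(name, "->", memory_timeline_format(name))
```

The order of the checks matters because the raw suffix is a special case of the gzipped JSON suffix.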

export_stacks(path, metric='self_cpu_time_total')[source]#

Saves stack traces to a file

Parameters

path (str) – location where the stacks file is saved;
metric (str) – metric to use: "self_cpu_time_total" or "self_cuda_time_total"

key_averages(group_by_input_shape=False, group_by_stack_n=0, group_by_overload_name=False)[source]#

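Averages events, grouping them by operator name and (optionally) by input shapes, stack, and overload name.

Note

To use the shape/stack functionality, make sure to set record_shapes/with_stack when creating the profiler context manager.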
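A minimal CPU-only sketch of grouping by input shape follows. The tensor sizes and variable names are illustrative assumptions; the point is that record_shapes=True must be set on the profiler for key_averages(group_by_input_shape=True) to separate calls to the same operator by shape.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Two matmuls with different input shapes (sizes chosen arbitrarily for the sketch).
a = torch.randn(32, 32)
b = torch.randn(8, 8)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(a, a)
    torch.mm(b, b)

# With shapes recorded, each (operator, input shapes) pair gets its own row.
by_shape = prof.key_averages(group_by_input_shape=True)
mm_rows = [evt for evt in by_shape if evt.key == "aten::mm"]
for row in mm_rows:
    print(row.key, row.input_shapes, row.count)

print(by_shape.table(sort_by="self_cpu_time_total", row_limit=10))
```

Without record_shapes=True, both calls would be expected to collapse into a single aten::mm row.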

preset_metadata_json(key, value)[source]#

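Presets user-defined metadata while the profiler is not yet started; the metadata is added to the trace file later. The metadata consists of a string key and a valid JSON value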

toggle_collection_dynamic(enable, activities)[source]#

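Toggles the collection of activities on or off at any point during collection. Currently supports toggling Torch Ops (CPU) and the CUDA activity supported in Kineto

Parameters: activities (iterable) – list of activity groups to use in profiling, supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA

Examples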

import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile_0()
    # turn off collection of all CUDA activity
    p.toggle_collection_dynamic(False, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_1()
    # turn on collection of all CUDA activity
    p.toggle_collection_dynamic(True, [torch.profiler.ProfilerActivity.CUDA])
    code_to_profile_2()
print(p.key_averages().table(
    sort_by="self_cuda_time_total", row_limit=-1))

class torch.profiler.profile(*, activities=None, schedule=None, on_trace_ready=None, record_shapes=False, profile_memory=False, with_stack=False, with_flops=False, with_modules=False, experimental_config=None, execution_trace_observer=None, acc_events=False, use_cuda=None, custom_trace_id_callback=None)[source]#

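Profiler context manager.

Parameters

activities (iterable) – list of activity groups (CPU, CUDA, XPU) to use in profiling, supported values: torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, torch.profiler.ProfilerActivity.XPU. Default value: ProfilerActivity.CPU and (when available) ProfilerActivity.CUDA or (when available) ProfilerActivity.XPU.
schedule (Callable) – callable that takes the step (int) as a single parameter and returns the ProfilerAction value that specifies the profiler action to perform at each step.
on_trace_ready (Callable) – callable that is called at each step when schedule returns ProfilerAction.RECORD_AND_SAVE during the profiling.
record_shapes (bool) – save information about operator input shapes.
profile_memory (bool) – track tensor memory allocation/deallocation.
with_stack (bool) – record source information (file and line number) for the ops.
with_flops (bool) – use a formula to estimate the FLOPs (floating point operations) of specific operators (matrix multiplication and 2D convolution).
with_modules (bool) – record the module hierarchy (including function names) corresponding to the call stack of the op. For example, if module A's forward calls module B's forward, which contains an aten::add op, then the module hierarchy of aten::add is A.B. Note that this is currently supported only for TorchScript models, not for eager mode models.
experimental_config (_ExperimentalConfig) – a set of experimental options used for Kineto library features. Note that backward compatibility is not guaranteed.
execution_trace_observer (ExecutionTraceObserver) – a PyTorch Execution Trace Observer object. PyTorch Execution Traces offer a graph-based representation of AI/ML workloads and enable replay benchmarks, simulators, and emulators. When this argument is included, the observer's start() and stop() are called for the same time window as the PyTorch profiler. See the examples section below for a code sample.
acc_events (bool) – enable the accumulation of FunctionEvents across multiple profiling cycles
use_cuda (bool) –

Deprecated since version 1.8.1: use activities instead.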

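Note

Use schedule() to generate the callable schedule. Non-default schedules are useful when profiling long training jobs and allow the user to obtain multiple traces at different iterations of the training process. The default schedule simply records all events continuously for the duration of the context manager.

Note

Use tensorboard_trace_handler() to generate result files for TensorBoard:

on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name)

After profiling, the result files can be found in the specified directory. Use the command:

tensorboard --logdir dir_name

to see the results in TensorBoard. For more information, see PyTorch Profiler TensorBoard Plugin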


Examples

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    code_to_profile()
print(p.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))

Using the profiler's schedule, on_trace_ready and step functions:

# Non-default profiler schedule allows user to turn profiler on and off
# on different iterations of the training loop;
# trace_handler is called every time a new trace becomes available
def trace_handler(prof):
    print(
        prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1)
    )
    # prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")


with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    # In this example with wait=1, warmup=1, active=2, repeat=1,
    # profiler will skip the first step/iteration,
    # start warming up on the second, record
    # the third and the fourth iterations,
    # after which the trace will become available
    # and on_trace_ready (when set) is called;
    # with repeat=1 the cycle runs once, while repeat=0
    # would repeat it starting with the next step
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=trace_handler,
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
) as p:
    for iter in range(N):
        code_iteration_to_profile(iter)
        # send a signal to the profiler that the next iteration has started
        p.step()

The following sample shows how to set up an Execution Trace Observer (execution_trace_observer)

with torch.profiler.profile(
    ...
    execution_trace_observer=(
        ExecutionTraceObserver().register_callback("./execution_trace.json")
    ),
) as p:
    for iter in range(N):
        code_iteration_to_profile(iter)
        p.step()

You can also refer to test_execution_trace_with_kineto() in tests/profiler/test_profiler.py. Note: any object that satisfies the _ITraceObserver interface can also be passed.

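get_trace_id()[source]#

Returns the current trace ID.

set_custom_trace_id_callback(callback)[source]#

Sets a callback to be called when a new trace ID is generated.

step()[source]#

Signals the profiler that the next profiling step has started.

class torch.profiler.ProfilerAction(value)[source]#

Profiler actions that can be taken at the specified intervals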

class torch.profiler.ProfilerActivity#

Members:

CPU

XPU

MTIA

CUDA

HPU

PrivateUse1

property name#

torch.profiler.schedule(*, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0)[source]#

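Returns a callable that can be used as the profiler's schedule argument. The profiler skips the first skip_first steps, then waits for wait steps, then warms up for the next warmup steps, then actively records for the next active steps, and then repeats the cycle starting with the wait steps. The optional number of cycles is specified with the repeat parameter; a value of zero means that the cycles continue until the profiling is finished.

The skip_first_wait parameter controls whether the first wait stage should be skipped. This is useful when a user wants to wait longer than skip_first between cycles, but not before the first profile. For example, if skip_first is 10 and wait is 20, the first cycle waits 10 + 20 = 30 steps before warmup if skip_first_wait is zero, but only 10 steps if skip_first_wait is non-zero. All subsequent cycles then wait 20 steps between the last active step and warmup.

Return type: Callable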
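The phase arithmetic described above can be sketched in plain Python. This is a re-implementation for illustration under stated assumptions, not the library's code: Action is a local stand-in for torch.profiler.ProfilerAction, make_schedule and first_warmup are hypothetical names, and the sketch assumes that the last active step of each cycle maps to RECORD_AND_SAVE.

```python
from enum import Enum


class Action(Enum):
    """Local stand-in for torch.profiler.ProfilerAction (assumption of this sketch)."""
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


def make_schedule(*, wait, warmup, active, repeat=0, skip_first=0, skip_first_wait=0):
    """Return a step -> Action callable that follows the documented phases."""
    cycle = wait + warmup + active

    def schedule_fn(step):
        if step < skip_first:
            return Action.NONE
        step -= skip_first
        if skip_first_wait != 0:
            # Treat the first wait phase as already elapsed.
            step += wait
        if repeat > 0 and step // cycle >= repeat:
            return Action.NONE  # all requested cycles are done
        pos = step % cycle
        if pos < wait:
            return Action.NONE
        if pos < wait + warmup:
            return Action.WARMUP
        # Assumed: the last active step of a cycle also saves the trace.
        return Action.RECORD_AND_SAVE if pos == cycle - 1 else Action.RECORD

    return schedule_fn


def first_warmup(fn, limit=100):
    """Index of the first step whose action is WARMUP."""
    return next(s for s in range(limit) if fn(s) is Action.WARMUP)


sched = make_schedule(wait=1, warmup=1, active=2, repeat=1)
actions = [sched(step).name for step in range(6)]
print(actions)
# ['NONE', 'WARMUP', 'RECORD', 'RECORD_AND_SAVE', 'NONE', 'NONE']

# The skip_first=10, wait=20 example from the text:
print(first_warmup(make_schedule(wait=20, warmup=1, active=1, skip_first=10)))  # 30
print(first_warmup(make_schedule(wait=20, warmup=1, active=1, skip_first=10,
                                 skip_first_wait=1)))  # 10
```

In real code the callable would come from torch.profiler.schedule(...) and be passed as the schedule argument of the profiler.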

torch.profiler.tensorboard_trace_handler(dir_name, worker_name=None, use_gzip=False)[source]#

Outputs tracing files to the dir_name directory, which can then be passed directly to tensorboard as logdir. worker_name should be unique for each worker in a distributed scenario; it defaults to '[hostname]_[pid]'.
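A minimal CPU-only sketch of the handler in use is shown below. The temporary log directory, the worker name "worker0", and the four-iteration loop are illustrative assumptions; the schedule is chosen so that exactly one trace becomes available during the loop.

```python
import os
import tempfile

import torch
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

# Assumption: a throwaway temporary directory stands in for a real log directory.
log_dir = tempfile.mkdtemp()

with profile(
    activities=[ProfilerActivity.CPU],
    # Step 0 waits, step 1 warms up, step 2 is recorded, then the single cycle ends.
    schedule=schedule(wait=1, warmup=1, active=1, repeat=1),
    on_trace_ready=tensorboard_trace_handler(log_dir, worker_name="worker0"),
) as prof:
    for _ in range(4):
        torch.mm(torch.randn(16, 16), torch.randn(16, 16))
        prof.step()  # tell the profiler that the next iteration starts

trace_files = sorted(os.listdir(log_dir))
print(trace_files)
```

Passing log_dir to tensorboard --logdir would then pick up the generated file, provided the profiler plugin is installed.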

Intel Instrumentation and Tracing Technology APIs#

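torch.profiler.itt.is_available()[source]#

Checks whether the ITT feature is available

torch.profiler.itt.mark(msg)[source]#

Describes an instantaneous event that occurred at some point.

Parameters: msg (str) – ASCII message to associate with the event.

torch.profiler.itt.range_push(msg)[source]#

Pushes a range onto a stack of nested range spans. Returns the zero-based depth of the range that is started.

Parameters: msg (str) – ASCII message to associate with the range

torch.profiler.itt.range_pop()[source]#

Pops a range off a stack of nested range spans. Returns the zero-based depth of the range that is ended.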
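The ITT calls above can be combined as in the following sketch. The function name annotated_matmul and the message strings are illustrative assumptions; the calls are guarded by is_available() so that the code also runs on builds without ITT support.

```python
import torch
from torch.profiler import itt


def annotated_matmul(x):
    """Multiply x by itself, wrapping the work in an ITT range when available.

    Hypothetical helper for illustration; not part of torch.profiler.
    """
    use_itt = itt.is_available()
    if use_itt:
        itt.mark("matmul requested")  # instantaneous event
        itt.range_push("matmul")      # open a nested range
    try:
        return torch.mm(x, x)
    finally:
        if use_itt:
            itt.range_pop()           # always close the range that was opened


result = annotated_matmul(torch.eye(4))
print(result.shape)
```

The try/finally keeps push and pop balanced even if the wrapped computation raises.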
