PyTorch 2 Export Quantization with X86 Backend through Inductor¶

Authors: Leslie Fang, Weiwen Xia, Jiong Gong, Jerry Zhang

Prerequisites¶

Introduction¶

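This tutorial walks through the steps for using the PyTorch 2 Export Quantization flow to generate a quantized model customized for the x86 Inductor backend, and explains how to lower the quantized model into Inductor.

The PyTorch 2 Export Quantization flow uses torch.export to capture the model into a graph and performs quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage, better programmability, and a simplified user experience. TorchInductor is the new compiler backend that compiles the FX graphs generated by TorchDynamo into optimized C++/Triton kernels.

This PyTorch 2 quantization flow with Inductor supports both static and dynamic quantization. Static quantization works best for CNN models such as ResNet-50, while dynamic quantization is better suited to NLP models such as RNN and BERT. For the difference between the two quantization types, please refer to the following page.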
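The difference between the two modes can be sketched in plain Python. This is only an illustration of the idea, not the code path the quantizer uses: the helper names `symmetric_scale` and `quantize` are hypothetical, and a simple symmetric int8 scheme is assumed. In the static case the activation scale is fixed from calibration data; in the dynamic case it is recomputed from each runtime input.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax (hypothetical helper)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmin=-128, qmax=127):
    """Round each value to the nearest step and clamp it to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Static quantization: the activation scale is fixed ahead of time from calibration data.
calibration_batch = [-1.0, 0.5, 2.0]
static_scale = symmetric_scale(calibration_batch)

# Dynamic quantization: the activation scale is recomputed from every runtime input.
runtime_input = [-4.0, 1.0, 3.0]
dynamic_scale = symmetric_scale(runtime_input)

static_q = quantize(runtime_input, static_scale)    # values outside the calibrated range clamp
dynamic_q = quantize(runtime_input, dynamic_scale)  # the range follows the actual input
print(static_q, dynamic_q)
```

With a static scale, a runtime input that exceeds the calibrated range saturates, which is why representative calibration data matters. The dynamic variant pays for computing the range at runtime.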

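The quantization flow mainly includes three steps:

Step 1: Capture the FX graph from the eager model, based on the torch export mechanism.
Step 2: Apply the quantization flow to the captured FX graph: define the backend-specific quantizer, generate the prepared model with observers, run calibration or quantization-aware training on the prepared model, and convert the prepared model into the quantized model.
Step 3: Lower the quantized model into Inductor with the torch.compile API.

The high-level architecture of this flow could look like this: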

float_model(Python)                          Example Input
    \                                              /
     \                                            /
---------------------------------------------------------
|                         export                        |
---------------------------------------------------------
                            |
                    FX Graph in ATen
                            |            X86InductorQuantizer
                            |                 /
---------------------------------------------------------
|                      prepare_pt2e                     |
|                           |                           |
|                     Calibrate/Train                   |
|                           |                           |
|                      convert_pt2e                     |
---------------------------------------------------------
                            |
                     Quantized Model
                            |
---------------------------------------------------------
|                    Lower into Inductor                |
---------------------------------------------------------
                            |
                         Inductor

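Combining quantization in PyTorch 2 Export with TorchInductor, we get flexibility and productivity from the new quantization frontend and strong out-of-the-box performance from the compiler backend. On 4th-generation Intel Xeon (SPR) processors in particular, model performance can be improved further by leveraging the Advanced Matrix Extensions feature.

Post Training Quantization¶

Now we will walk through a step-by-step tutorial showing how to use it with the torchvision resnet18 model for post-training quantization.

1. Capture FX Graph¶

We will start by performing the necessary imports and capturing the FX graph from the eager module.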

import torch
import torchvision.models as models
import copy
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer
from torch.export import export

# Create the Eager Model
model_name = "resnet18"
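# `pretrained=True` is deprecated in recent torchvision releases; `weights` selects the checkpoint
model = models.__dict__[model_name](weights="DEFAULT")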

# Set the model to eval mode
model = model.eval()

# Create the data, using the dummy data here as an example
traced_bs = 50
x = torch.randn(traced_bs, 3, 224, 224).contiguous(memory_format=torch.channels_last)
example_inputs = (x,)

# Capture the FX Graph to be quantized
with torch.no_grad():
    # Note: requires torch >= 2.6
    exported_model = export(
        model,
        example_inputs
    ).module()

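Next, we will quantize the FX Module.

2. Apply Quantization¶

After capturing the FX Module to be quantized, we will import the backend quantizer for X86 CPU and configure how to quantize the model.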

quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

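Note

The default quantization configuration in X86InductorQuantizer uses 8 bits for both activations and weights.

When Vector Neural Network Instructions (VNNI) are not available, the oneDNN backend silently chooses kernels that assume multiplications are 7-bit x 8-bit. In other words, numeric saturation and accuracy issues may occur when running on a CPU without VNNI.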
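To see why a 7-bit x 8-bit assumption matters, here is a small arithmetic sketch. It assumes the non-VNNI path sums pairs of unsigned-8-bit by signed-8-bit products into a saturating signed 16-bit intermediate, as the x86 `vpmaddubsw` instruction does. The tutorial does not spell out the kernel internals, so treat this as an illustration of the saturation risk rather than a description of oneDNN. The helper names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def pair_dot_int16(acts, weights):
    """Sum of two adjacent products, held in a saturating int16 (assumed behaviour)."""
    return saturate_int16(acts[0] * weights[0] + acts[1] * weights[1])


weights = (127, 127)          # signed 8-bit weights at their maximum
full_range_acts = (255, 255)  # unsigned 8-bit activations at their maximum
reduced_acts = (127, 127)     # activations restricted to 7 bits

exact_full = 255 * 127 + 255 * 127     # 64770, does not fit in int16
exact_reduced = 127 * 127 + 127 * 127  # 32258, fits in int16

saturated_full = pair_dot_int16(full_range_acts, weights)
safe_reduced = pair_dot_int16(reduced_acts, weights)
print(exact_full, saturated_full, exact_reduced, safe_reduced)
```

Under this assumption, full 8-bit activations can overflow the intermediate sum, while 7-bit activations always fit.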

The quantization config is for static quantization by default. To apply dynamic quantization, add the argument is_dynamic=True when getting the config.

quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_dynamic=True))

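After importing the backend-specific quantizer, we will prepare the model for post-training quantization. prepare_pt2e folds BatchNorm operators into the preceding Conv2d operators and inserts observers at appropriate places in the model.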

prepared_model = prepare_pt2e(exported_model, quantizer)
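The BatchNorm folding mentioned above can be checked with scalar arithmetic. The sketch below uses the standard inference-time folding formula on a single channel with a 1x1 "convolution" (a plain multiply-add). The function names and numbers are hypothetical, and this is not the code prepare_pt2e runs.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference: a 1x1 convolution on one channel followed by BatchNorm in eval mode."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BatchNorm statistics into the convolution weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return w * factor, (b - mean) * factor + beta


w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var)

x = 2.0
reference = conv_then_bn(x, w, b, gamma, beta, mean, var)
folded = w_folded * x + b_folded  # a single multiply-add replaces conv + BN
print(reference, folded)
```

After folding, only the convolution remains to be quantized, so a single pair of weight and bias tensors carries the combined effect.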

Now that the observers have been inserted, we will calibrate prepared_model. This step is needed only for static quantization.

# We use the dummy data as an example here
prepared_model(*example_inputs)

# Alternatively: user can define the dataset to calibrate
# def calibrate(model, data_loader):
#     model.eval()
#     with torch.no_grad():
#         for image, target in data_loader:
#             model(image)
# calibrate(prepared_model, data_loader_test)  # run calibration on sample data
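What calibration accomplishes can be sketched with a toy min/max observer. This assumes a simple running min/max statistic and an unsigned 8-bit affine scheme. The observers that X86InductorQuantizer actually inserts may use a different statistic (for example a histogram), and `ToyMinMaxObserver` is a hypothetical name.

```python
class ToyMinMaxObserver:
    """Tracks the running min/max of observed values (illustrative only)."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        """Update the running range with one calibration batch."""
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin=0, qmax=255):
        """Derive an affine scale and zero point from the observed range."""
        # Extend the range to include zero so that 0.0 is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = qmin - round(lo / scale)
        return scale, max(qmin, min(qmax, zero_point))


observer = ToyMinMaxObserver()
for batch in ([-1.0, 2.0], [0.5, 3.0]):  # two calibration batches
    observer.observe(batch)

scale, zero_point = observer.qparams()
print(scale, zero_point)
```

Each calibration batch can only widen the observed range, so a calibration set that covers the expected input distribution gives a scale that avoids clamping at inference time.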

Finally, we will convert the calibrated model to a quantized model. convert_pt2e takes a calibrated model and produces a quantized model.

converted_model = convert_pt2e(prepared_model)

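After these steps, we have finished running the quantization flow and obtained the quantized model.

3. Lower into Inductor¶

After obtaining the quantized model, we will further lower it to the Inductor backend. The default Inductor wrapper generates Python code to invoke both generated kernels and external kernels. Inductor also supports a C++ wrapper that generates pure C++ code. This allows seamless integration of the generated and external kernels and reduces Python overhead. In the future, the C++ wrapper could be extended to enable pure C++ deployment. For more details about the C++ wrapper, please refer to the dedicated Inductor C++ Wrapper tutorial.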

# Optional: using the C++ wrapper instead of default Python wrapper
import torch._inductor.config as config
config.cpp_wrapper = True

with torch.no_grad():
    optimized_model = torch.compile(converted_model)

    # Running some benchmark
    optimized_model(*example_inputs)
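The single call above is enough to trigger compilation, but it is not a meaningful timing, because the first invocation of a compiled model includes compile time. A minimal timing helper might look like the following. The `benchmark` function and the stand-in callable are hypothetical; in practice you would pass `optimized_model` and `*example_inputs` inside the `torch.no_grad()` block.

```python
import time


def benchmark(fn, *args, warmup=3, iters=10):
    """Return the average seconds per call, excluding warm-up calls."""
    for _ in range(warmup):  # warm-up absorbs one-time costs such as compilation
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters


def stand_in_model(values):
    """Placeholder workload so the sketch runs without a compiled model."""
    return sum(v * v for v in values)


avg_seconds = benchmark(stand_in_model, list(range(1000)))
print(f"average latency: {avg_seconds * 1e3:.3f} ms")
```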

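In a more advanced scenario, int8-mixed-bf16 quantization comes into play. In this case, a convolution or GEMM operator produces BFloat16 output instead of Float32, provided that no quantization node follows it. The BFloat16 tensor then propagates through the subsequent pointwise operators, which reduces memory usage and may improve performance. Using this feature mirrors regular BFloat16 Autocast: simply wrap the script in a BFloat16 Autocast context.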

with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True), torch.no_grad():
    # Turn on Autocast to use int8-mixed-bf16 quantization. After lowering into Inductor CPP Backend,
    # For operators such as QConvolution and QLinear:
    # * The input data type is consistently defined as int8, attributable to the presence of a pair
    #   of quantization and dequantization nodes inserted at the input.
    # * The computation precision remains at int8.
    # * The output data type may vary, being either int8 or BFloat16, contingent on the presence
    #   of a pair of quantization and dequantization nodes at the output.
    # For non-quantizable pointwise operators, the data type will be inherited from the previous node,
    # potentially resulting in a data type of BFloat16 in this scenario.
    # For quantizable pointwise operators such as QMaxpool2D, it continues to operate with the int8
    # data type for both input and output.
    optimized_model = torch.compile(converted_model)

    # Running some benchmark
    optimized_model(*example_inputs)
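The memory saving comes from BFloat16 keeping only the upper 16 bits of a Float32 value, which halves the storage per element at the cost of mantissa precision. The sketch below emulates that rounding in plain Python for finite values. `to_bfloat16` is a hypothetical helper with no NaN or infinity handling, not a PyTorch API.

```python
import struct


def to_bfloat16(value):
    """Round a finite float to the nearest bfloat16 value and return it as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest even on the 16 bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bfloat16(1.0))        # exactly representable
print(to_bfloat16(1.0078125))  # 1 + 2**-7, the next representable value above 1.0
print(to_bfloat16(1.001))      # too close to 1.0 to be distinguished
print(to_bfloat16(3.14159))
```

Near 1.0 the spacing between representable BFloat16 values is 2**-7, so small differences are lost. That is worth keeping in mind when comparing int8-mixed-bf16 accuracy against the Float32 output path.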

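Putting all this code together gives us the toy example code. Note that because Inductor's freezing feature is not yet turned on by default, you need to run your example code with TORCHINDUCTOR_FREEZING=1.

For example: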

TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_pytorch_2_1.py
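If changing the launch command is inconvenient, the same environment variable can be set from inside the script. This is a sketch that assumes the variable only needs to be present in the process environment before torch is imported; the command-line form above is what the tutorial itself uses.

```python
import os

# Set the flag before importing torch so that Inductor sees it when its
# configuration is first loaded (assumption: the value is read at import time).
os.environ["TORCHINDUCTOR_FREEZING"] = "1"

# import torch  # the export, prepare, convert, and torch.compile steps follow here
print(os.environ["TORCHINDUCTOR_FREEZING"])
```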

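With the PyTorch 2.1 release, all CNN models from the TorchBench test suite have been measured and proven effective compared with the Inductor FP32 inference path. Please refer to this document for detailed benchmark numbers.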

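Quantization Aware Training¶

PyTorch 2 Export Quantization-Aware Training (QAT) is now supported on X86 CPU using X86InductorQuantizer, followed by lowering the quantized model into Inductor. For a more in-depth understanding of PT2 Export Quantization-Aware Training, we recommend referring to the dedicated PyTorch 2 Export Quantization-Aware Training tutorial.

The PyTorch 2 Export QAT flow is largely similar to the PTQ flow: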

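import torch
import torchao.quantization.pt2e  # binds the name `torchao` used by move_exported_model_to_eval below
from torchao.quantization.pt2e.quantize_pt2e import (
    prepare_qat_pt2e,
    convert_pt2e,
)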
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1000)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 1024),)
m = M()

# Step 1. program capture
exported_model = torch.export.export(m, example_inputs).module()
# we get a model with aten ops

# Step 2. quantization-aware training
# Use Backend Quantizer for X86 CPU
# To apply dynamic quantization, add an argument ``is_dynamic=True`` when getting the config.
quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_qat=True))
prepared_model = prepare_qat_pt2e(exported_model, quantizer)

# train omitted

converted_model = convert_pt2e(prepared_model)
# we have a model with aten ops doing integer computations when possible

# move the quantized model to eval mode, equivalent to `m.eval()`
torchao.quantization.pt2e.move_exported_model_to_eval(converted_model)

# Lower the model into Inductor
with torch.no_grad():
    optimized_model = torch.compile(converted_model)
    _ = optimized_model(*example_inputs)
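The key difference from PTQ is that the prepared model simulates quantization during training. A common formulation, sketched here in plain Python, is fake quantization with a straight-through gradient estimator. The tutorial does not describe what prepare_qat_pt2e inserts internally, so the two helper functions below are hypothetical illustrations of the general technique.

```python
def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then immediately dequantize, so the result stays a float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))
    return (q - zero_point) * scale


def fake_quantize_grad(x, scale, zero_point, qmin=-128, qmax=127):
    """Straight-through estimator: pass the gradient only inside the representable range."""
    lo = (qmin - zero_point) * scale
    hi = (qmax - zero_point) * scale
    return 1.0 if lo <= x <= hi else 0.0


scale, zero_point = 0.1, 0

in_range = fake_quantize(0.26, scale, zero_point)    # snaps to the nearest step
clamped = fake_quantize(100.0, scale, zero_point)    # saturates at qmax * scale

grad_in_range = fake_quantize_grad(0.26, scale, zero_point)
grad_clamped = fake_quantize_grad(100.0, scale, zero_point)
print(in_range, clamped, grad_in_range, grad_clamped)
```

Because the forward pass already sees rounded and clamped values, the weights can adapt to quantization error during training, which is the motivation for QAT over PTQ when accuracy is sensitive.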

Note that Inductor's freezing feature is not enabled by default. To use it, you need to run the example code with TORCHINDUCTOR_FREEZING=1.

For example:

TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_qat.py

Conclusion¶

In this tutorial, we introduced how to use Inductor with X86 CPU in PyTorch 2 Quantization. Users can learn how to use X86InductorQuantizer to quantize a model and lower it into Inductor on X86 CPU devices.