Custom SYCL Operators

What you will learn

How to integrate custom operators written in SYCL with PyTorch

Prerequisites

PyTorch 2.8 or later
Basic understanding of SYCL programming

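Note

SYCL is the backend programming language for Intel GPUs (device label xpu). For configuration details, see Getting Started on Intel GPU. SYCL code is compiled by the Intel compiler, which ships with Intel Deep Learning Essentials. Install the compiler and activate its environment before running the code examples in this tutorial.

PyTorch offers a large library of operators that work on tensors (for example torch.add and torch.sum). You may nevertheless want to bring a new custom operator to PyTorch. This tutorial demonstrates the recommended path for authoring a custom operator written in SYCL. Tutorials for C++ and CUDA operators are available in Custom C++ and CUDA Operators.

Use the following directory structure to create a custom SYCL operator: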

sycl_example/
├── setup.py
├── sycl_extension
│   ├── __init__.py
│   ├── muladd.sycl
│   └── ops.py
└── test_sycl_extension.py

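Setting up the build system

To compile SYCL code (for example, .sycl files), use torch.utils.cpp_extension.SyclExtension. The setup process is very similar to the one for C++/CUDA extensions, except that the compilation arguments are adjusted for SYCL.

Building the sycl_extension package takes only the following setup.py: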

import os
import torch
import glob
from setuptools import find_packages, setup
from torch.utils.cpp_extension import SyclExtension, BuildExtension

library_name = "sycl_extension"
py_limited_api = True
extra_compile_args = {
    "cxx": ["-O3",
            "-fdiagnostics-color=always",
            "-DPy_LIMITED_API=0x03090000"],
    "sycl": ["-O3" ]
}

# Fail early if no Intel GPU is visible to PyTorch
assert torch.xpu.is_available(), "XPU is not available, please check your environment"
# Source files collection
this_dir = os.path.dirname(os.path.curdir)
extensions_dir = os.path.join(this_dir, library_name)
sources = list(glob.glob(os.path.join(extensions_dir, "*.sycl")))
# Construct extension
ext_modules = [
    SyclExtension(
        f"{library_name}._C",
        sources,
        extra_compile_args=extra_compile_args,
        py_limited_api=py_limited_api,
    )
]
setup(
    name=library_name,
    packages=find_packages(),
    ext_modules=ext_modules,
    install_requires=["torch"],
    description="Simple Example of PyTorch Sycl extensions",
    cmdclass={"build_ext": BuildExtension},
    options={"bdist_wheel": {"py_limited_api": "cp39"}} if py_limited_api else {},
)
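
The two Python-version markers in this setup.py have to agree: -DPy_LIMITED_API=0x03090000 and the cp39 wheel tag both denote CPython 3.9. As a minimal sketch, both values can be derived from a single version tuple so that they cannot drift apart. The helper name limited_api_settings is hypothetical and is not part of PyTorch or setuptools.

```python
def limited_api_settings(major: int, minor: int) -> tuple[str, str]:
    """Return (compiler define, wheel tag) for a minimum CPython version.

    Hypothetical helper for illustration; not part of PyTorch or setuptools.
    """
    # Py_LIMITED_API follows the PY_VERSION_HEX layout:
    # major version in bits 24-31, minor version in bits 16-23.
    version_hex = (major << 24) | (minor << 16)
    define = f"-DPy_LIMITED_API=0x{version_hex:08X}"
    tag = f"cp{major}{minor}"
    return define, tag


define, tag = limited_api_settings(3, 9)
print(define)  # -DPy_LIMITED_API=0x03090000
print(tag)     # cp39
```

In a setup.py like the one above, define would go into the "cxx" argument list and tag into the bdist_wheel options.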

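Defining the custom operator and adding backend implementations

First, write a SYCL function that computes mymuladd (a * b + c).

To use it from PyTorch's Python frontend, register it as a PyTorch operator with the TORCH_LIBRARY API. This automatically binds the operator to Python.

The SYCL implementation of mymuladd is registered for the XPU backend in a separate TORCH_LIBRARY_IMPL block. The complete muladd.sycl file looks like this: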

#include <c10/xpu/XPUStream.h>
#include <sycl/sycl.hpp>
#include <ATen/Operators.h>
#include <torch/all.h>
#include <torch/library.h>

namespace sycl_extension {
// MulAdd Kernel: result = a * b + c
static void muladd_kernel(
    int numel, const float* a, const float* b, float c, float* result,
    const sycl::nd_item<1>& item) {
    int idx = item.get_global_id(0);
    if (idx < numel) {
        result[idx] = a[idx] * b[idx] + c;
    }
}

class MulAddKernelFunctor {
public:
    MulAddKernelFunctor(int _numel, const float* _a, const float* _b, float _c, float* _result)
        : numel(_numel), a(_a), b(_b), c(_c), result(_result) {}
    void operator()(const sycl::nd_item<1>& item) const {
        muladd_kernel(numel, a, b, c, result, item);
    }

private:
    int numel;
    const float* a;
    const float* b;
    float c;
    float* result;
};

at::Tensor mymuladd_xpu(const at::Tensor& a, const at::Tensor& b, double c) {
    TORCH_CHECK(a.sizes() == b.sizes(), "a and b must have the same shape");
    TORCH_CHECK(a.dtype() == at::kFloat, "a must be a float tensor");
    TORCH_CHECK(b.dtype() == at::kFloat, "b must be a float tensor");
    TORCH_CHECK(a.device().is_xpu(), "a must be an XPU tensor");
    TORCH_CHECK(b.device().is_xpu(), "b must be an XPU tensor");

    at::Tensor a_contig = a.contiguous();
    at::Tensor b_contig = b.contiguous();
    at::Tensor result = at::empty_like(a_contig);

    const float* a_ptr = a_contig.data_ptr<float>();
    const float* b_ptr = b_contig.data_ptr<float>();
    float* res_ptr = result.data_ptr<float>();
    int numel = a_contig.numel();

    sycl::queue& queue = c10::xpu::getCurrentXPUStream().queue();
    constexpr int threads = 256;
    int blocks = (numel + threads - 1) / threads;

    queue.submit([&](sycl::handler& cgh) {
        cgh.parallel_for<MulAddKernelFunctor>(
            sycl::nd_range<1>(blocks * threads, threads),
            MulAddKernelFunctor(numel, a_ptr, b_ptr, static_cast<float>(c), res_ptr)
        );
    });

    return result;
}
// Defines the operators
TORCH_LIBRARY(sycl_extension, m) {
  m.def("mymuladd(Tensor a, Tensor b, float c) -> Tensor");
}

// ==================================================
// Register SYCL Implementations to Torch Library
// ==================================================
TORCH_LIBRARY_IMPL(sycl_extension, XPU, m) {
    m.impl("mymuladd", &mymuladd_xpu);
}

} // namespace sycl_extension
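
The launch configuration in mymuladd_xpu rounds the number of work-items up to a multiple of the work-group size, which is why muladd_kernel needs the idx < numel guard. The plain-Python sketch below mirrors that arithmetic. The function names launch_geometry and emulate_muladd are illustrative only, and the loop runs sequentially what the GPU executes in parallel.

```python
def launch_geometry(numel: int, threads: int = 256) -> tuple[int, int]:
    """Mirror the C++ arithmetic: ceil-divide, then pad to a multiple of threads."""
    blocks = (numel + threads - 1) // threads
    return blocks, blocks * threads


def emulate_muladd(a: list[float], b: list[float], c: float, threads: int = 256) -> list[float]:
    """Sequentially emulate the kernel over the padded global range."""
    numel = len(a)
    _, global_size = launch_geometry(numel, threads)
    result = [0.0] * numel
    for idx in range(global_size):  # one iteration per work-item
        if idx < numel:  # same bounds guard as muladd_kernel
            result[idx] = a[idx] * b[idx] + c
    return result


print(launch_geometry(1000))  # (4, 1024): 24 work-items fall outside the tensor
print(emulate_muladd([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 0.5))  # [4.5, 10.5, 18.5]
```

For 1000 elements the kernel is launched with 1024 work-items, so without the guard the last 24 would read and write past the end of the buffers.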

Creating a Python interface

Create a Python interface for the operator in the file sycl_extension/ops.py:

import torch
from torch import Tensor
__all__ = ["mymuladd"]

def mymuladd(a: Tensor, b: Tensor, c: float) -> Tensor:
    """Performs a * b + c in an efficient fused kernel"""
    return torch.ops.sycl_extension.mymuladd.default(a, b, c)

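Initializing the package

Create the file sycl_extension/__init__.py to make the package importable. It locates the compiled shared library under build/ and loads it so that the operator is registered: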

import ctypes
from pathlib import Path

import torch

current_dir = Path(__file__).parent.parent
build_dir = current_dir / "build"
so_files = list(build_dir.glob("**/*.so"))

assert len(so_files) == 1, f"Expected one _C*.so file, found {len(so_files)}"

with torch._ops.dl_open_guard():
    loaded_lib = ctypes.CDLL(so_files[0])

from . import ops

__all__ = [
    "loaded_lib",
    "ops",
]
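
The lookup above relies on assert, which Python skips when run with the -O flag. As a hedged alternative, the same check can be written as a small helper that raises an explicit error. The name find_extension_library is hypothetical, and the demonstration uses a temporary directory as a stand-in for the real build/ directory.

```python
import tempfile
from pathlib import Path


def find_extension_library(build_dir: Path) -> Path:
    """Return the single *.so under build_dir, or raise a descriptive error."""
    so_files = sorted(build_dir.glob("**/*.so"))
    if len(so_files) != 1:
        raise ImportError(
            f"Expected exactly one .so file under {build_dir}, found {len(so_files)}"
        )
    return so_files[0]


# Demonstration against a throwaway directory standing in for build/.
with tempfile.TemporaryDirectory() as tmp:
    fake_build = Path(tmp)
    try:
        find_extension_library(fake_build)
    except ImportError as exc:
        print("extension not built yet:", type(exc).__name__)

    # Simulate a finished build by creating an empty placeholder library file.
    lib = fake_build / "lib" / "_C.abi3.so"
    lib.parent.mkdir()
    lib.touch()
    print(find_extension_library(fake_build).name)
```

The returned path could then be passed to ctypes.CDLL inside the dl_open_guard block, exactly as so_files[0] is above.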

Testing the SYCL extension operator

Use a simple test, saved as test_sycl_extension.py, to verify that the operator works correctly.

import torch
from torch.testing._internal.common_utils import TestCase
import unittest
import sycl_extension

def reference_muladd(a, b, c):
    return a * b + c

class TestMyMulAdd(TestCase):
    def sample_inputs(self, device, *, requires_grad=False):
        def make_tensor(*size):
            return torch.randn(size, device=device, requires_grad=requires_grad)

        def make_nondiff_tensor(*size):
            return torch.randn(size, device=device, requires_grad=False)

        return [
            [make_tensor(3), make_tensor(3), 1],
            [make_tensor(20), make_tensor(20), 3.14],
            [make_tensor(20), make_nondiff_tensor(20), -123],
            [make_nondiff_tensor(2, 3), make_tensor(2, 3), -0.3],
        ]

    def _test_correctness(self, device):
        samples = self.sample_inputs(device)
        for args in samples:
            result = sycl_extension.ops.mymuladd(*args)
            expected = reference_muladd(*args)
            torch.testing.assert_close(result, expected)

    @unittest.skipIf(not torch.xpu.is_available(), "requires Intel GPU")
    def test_correctness_xpu(self):
        self._test_correctness("xpu")

if __name__ == "__main__":
    unittest.main()

This test checks the correctness of the custom operator by comparing its output against a reference implementation.
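
The comparison is tolerance-based rather than exact. The sketch below shows the element-wise criterion in plain Python, assuming the float32 defaults documented for torch.testing.assert_close (rtol=1.3e-6, atol=1e-5). The function is_close is an illustration, not PyTorch's implementation.

```python
def is_close(actual: float, expected: float,
             rtol: float = 1.3e-6, atol: float = 1e-5) -> bool:
    """Element-wise check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# A difference of about 1e-5 on a value near 10 is within tolerance.
print(is_close(10.5, 10.50001))
# A difference of 0.1 is not.
print(is_close(10.5, 10.6))
```

Small rounding differences between the SYCL kernel and the reference expression therefore do not fail the test, while a genuinely wrong result does.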

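Conclusion

This tutorial showed how to implement and compile a custom SYCL operator for PyTorch, using the forward-only (inference) operator mymuladd as the example. For adding backward support or enabling torch.compile compatibility, see Custom C++ and CUDA Operators.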