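Accelerating PyTorch Transformers by replacing nn.Transformer with Nested Tensors and torch.compile()#
Learn about the low-level building blocks PyTorch provides to build custom transformer layers (nested tensors, scaled_dot_product_attention, torch.compile(), and FlexAttention)
Discover how the above improve memory usage and performance, using MultiHeadAttention as an example
Explore advanced customizations using the aforementioned building blocks
Requires PyTorch v2.6.0 or later
Over the past few years, the PyTorch team has developed various lower-level features that, when composed, can create a variety of transformer variants. These include:
Nested Tensors with the torch.jagged layout (also known as NJTs)
scaled_dot_product_attention
torch.compile()
FlexAttention
This tutorial gives a brief overview of the above technologies and demonstrates how they can be composed to yield flexible and performant transformer layers with an improved user experience.
One may observe that the torch.nn module currently provides various Transformer-related layers. In particular, it includes TransformerEncoderLayer, TransformerEncoder, TransformerDecoderLayer, TransformerDecoder, Transformer and MultiheadAttention. This family of layers was initially implemented following the "Attention is All You Need" paper. The components discussed in this tutorial provide improved user experience, flexibility, and performance over the existing nn layers.
Is this tutorial for me?#
If you are wondering which building blocks the torch library provides for writing your own transformer layers, and what the best practices are, you are in the right place. Please keep reading!
If you are looking for an out-of-the-box implementation of a popular transformer architecture, note that many open-source libraries provide them.
If you are only interested in performant attention score modifications, please see the FlexAttention blog, which contains a gym of masks.
Introducing the building blocks#
First, we briefly introduce the four technologies mentioned in the introduction.
Nested Tensors generalize the shape of regular dense tensors, allowing ragged-sized data to be represented with the same tensor UX. In the context of transformers, we can think of nested tensors as a tool for representing variable sequence lengths. They eliminate the need for the bug-prone practices of explicit padding and masking (think key_padding_mask in nn.MultiheadAttention).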
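scaled_dot_product_attention is a primitive for softmax(QK^T / sqrt(E) + B)V that dispatches into either fused implementations of the operator or a fallback implementation. It works out of the box in eager mode (that is, PyTorch's default mode, where operations are executed on the fly as they are encountered) and also integrates seamlessly with torch.compile(). As of 2.6, it also offers grouped query attention natively.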
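The formula can be checked numerically against the primitive. The following sketch, with arbitrary small shapes chosen for illustration, writes out softmax(QK^T / sqrt(E) + B)V by hand and compares it with F.scaled_dot_product_attention, passing the additive B term as a floating-point attn_mask.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, H, L, S, E = 2, 4, 5, 7, 8  # batch, heads, query len, key/value len, head dim
q = torch.randn(N, H, L, E)
k = torch.randn(N, H, S, E)
v = torch.randn(N, H, S, E)
bias = torch.randn(N, H, L, S)  # the additive B term

# Reference: softmax(QK^T / sqrt(E) + B) V written out explicitly.
scores = q @ k.transpose(-2, -1) / math.sqrt(E) + bias
reference = torch.softmax(scores, dim=-1) @ v

# The same computation through the dispatching primitive.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

max_diff = (out - reference).abs().max().item()
print(out.shape, max_diff)
```

The two results should agree up to floating-point rounding; the exact difference depends on which backend the primitive dispatches to.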
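torch.compile() is a compiler introduced in version 2.0 that is able to capture a graph of PyTorch code and perform various optimizations on it, such as fusing together sequences of operations. Nested tensors with the torch.jagged layout and scaled_dot_product_attention work seamlessly with compile. In the context of transformers, the value added by using compile with nested tensors and SDPA is that compile can remove the framework overhead seen in eager mode and fuse sequences of operations in transformers together, such as projection and activation.
FlexAttention is a primitive that allows users to modify attention scores prior to the softmax operation. It generalizes the additive B term above for scaled_dot_product_attention, allowing for arbitrary calculation. It requires compile to achieve good performance.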
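To make the additive B term concrete, here is a sketch that expresses an ALiBi-style linear bias combined with a causal mask through the attn_mask argument of scaled_dot_product_attention. It does not use FlexAttention itself, and the slope schedule and tensor sizes are illustrative assumptions. FlexAttention generalizes this pattern by letting the modification be an arbitrary function of the score rather than a fixed additive tensor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, H, L, E = 2, 4, 6, 8
q, k, v = (torch.randn(N, H, L, E) for _ in range(3))

# Illustrative per-head slopes (a geometric sequence; an assumption of this sketch).
slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / H) for h in range(H)])

pos = torch.arange(L)
rel = pos[None, :] - pos[:, None]                 # rel[i, j] = j - i
bias = slopes[:, None, None] * rel                # (H, L, L): more negative with distance
bias = bias.masked_fill(rel > 0, float("-inf"))   # causal: no attention to future keys

# A leading dimension of size 1 lets the bias broadcast over the batch.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias[None])
print(out.shape)
```

Because the first query position can only attend to the first key, its output equals the first value vector, which is a quick sanity check for the mask.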
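The above building blocks are "All You Need" (as of October 2024)#
The main premise of this section is that most transformer variations are GPT-style, consisting of layers like Embedding, Positional Encoding, Attention Blocks and Feed Forward networks. If we were to try to classify the differences in this space, we might land on something like:
Layer type (activation functions such as SwiGLU and others, normalization functions such as RMSNorm and others, positional encodings such as Sinusoidal, Rotary.)
Layer ordering, such as where to apply norms and positional encoding.
Modifications to attention score, such as ALiBi, Relative Positional Bias and so on.
In a pre-compiler environment, you might write a custom transformer and notice that it functions correctly but is slow. To address this, you might develop a custom fused kernel for the specific series of operations. In a compiler environment, you can simply perform the initial step and then compile and benefit from improved performance.
MultiheadAttention#
Remember that MultiheadAttention takes in a query, key, and value, and consists of an input projection, a scaled_dot_product_attention operator and an output projection. The main takeaway we want to demonstrate here is the improvement yielded when we replace padded/masked inputs with nested tensors. The improvements are threefold:
User experience: Remember that nn.MultiheadAttention requires query, key, and value to be dense torch.Tensors. It also provides a key_padding_mask that is used to mask out padding tokens in the key that arise due to different sequence lengths within a batch. Since there is no query_padding_mask in nn.MHA, users have to take care to mask/slice the outputs appropriately to account for query sequence lengths. NestedTensor cleanly removes the need for this sort of error-prone padding mask.
Memory: Instead of materializing a dense [B, S, D] tensor with a [B, S] padding mask (where B is the batch size, S is the maximum sequence length in the batch, and D is the embedding size), nested tensors allow you to cleanly represent a batch of varying sequence lengths. As a result, the input and intermediate activations use less memory.
Performance: Since padding is not materialized and unnecessary computation on padding is skipped, performance and memory usage improve.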
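A back-of-the-envelope calculation shows how much of a padded batch can be padding. The sketch below uses the batch statistics reported by the benchmark run later in this tutorial (batch size 512, maximum sequence length 133, total sequence length 10869); it counts token slots only and is not a measurement of actual memory use.

```python
# Figures from this tutorial's benchmark run: batch size, longest sequence,
# and total number of real tokens across the batch.
N = 512
max_seq_len = 133
total_tokens = 10869

padded_tokens = N * max_seq_len                      # slots materialized with padding
padding_fraction = 1 - total_tokens / padded_tokens  # share of slots that are padding
ratio = padded_tokens / total_tokens

print(f"padded layout stores {padded_tokens} token slots")
print(f"{padding_fraction:.1%} of them are padding")
print(f"{ratio:.2f}x more slots than the nested layout")
```

With a long-tailed length distribution such as the Zipf-based one used here, most of the padded tensor holds no real data.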
We will demonstrate the above by building upon the MultiHeadAttention layer from the Nested Tensor tutorial and comparing it to the nn.MultiheadAttention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """
    Computes multi-head attention. Supports nested or padded tensors.

    Args:
        E_q (int): Size of embedding dim for query
        E_k (int): Size of embedding dim for key
        E_v (int): Size of embedding dim for value
        E_total (int): Total embedding dim of combined heads post input projection. Each head
            has dim E_total // nheads
        nheads (int): Number of heads
        dropout (float, optional): Dropout probability. Default: 0.0
        bias (bool, optional): Whether to add bias to input projection. Default: True
    """

    def __init__(
        self,
        E_q: int,
        E_k: int,
        E_v: int,
        E_total: int,
        nheads: int,
        dropout: float = 0.0,
        bias=True,
        device=None,
        dtype=None,
    ):
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.nheads = nheads
        self.dropout = dropout
        self._qkv_same_embed_dim = E_q == E_k and E_q == E_v
        if self._qkv_same_embed_dim:
            self.packed_proj = nn.Linear(E_q, E_total * 3, bias=bias, **factory_kwargs)
        else:
            self.q_proj = nn.Linear(E_q, E_total, bias=bias, **factory_kwargs)
            self.k_proj = nn.Linear(E_k, E_total, bias=bias, **factory_kwargs)
            self.v_proj = nn.Linear(E_v, E_total, bias=bias, **factory_kwargs)
        E_out = E_q
        self.out_proj = nn.Linear(E_total, E_out, bias=bias, **factory_kwargs)
        assert E_total % nheads == 0, "Embedding dim is not divisible by nheads"
        self.E_head = E_total // nheads
        self.bias = bias

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        attn_mask=None,
        is_causal=False,
    ) -> torch.Tensor:
        """
        Forward pass; runs the following process:
            1. Apply input projection
            2. Split heads and prepare for SDPA
            3. Run SDPA
            4. Apply output projection

        Args:
            query (torch.Tensor): query of shape (``N``, ``L_q``, ``E_qk``)
            key (torch.Tensor): key of shape (``N``, ``L_kv``, ``E_qk``)
            value (torch.Tensor): value of shape (``N``, ``L_kv``, ``E_v``)
            attn_mask (torch.Tensor, optional): attention mask of shape (``N``, ``L_q``, ``L_kv``).
                Accepted for API parity, but note that this implementation does not
                forward it to SDPA. Default: None
            is_causal (bool, optional): Whether to apply causal mask. Default: False

        Returns:
            attn_output (torch.Tensor): output of shape (``N``, ``L_q``, ``E_q``)
        """
        # Step 1. Apply input projection
        if self._qkv_same_embed_dim:
            if query is key and key is value:
                # Self-attention: one packed matmul, then split into q, k, v
                result = self.packed_proj(query)
                query, key, value = torch.chunk(result, 3, dim=-1)
            else:
                # Same embedding dims but different inputs: slice the packed weights
                q_weight, k_weight, v_weight = torch.chunk(
                    self.packed_proj.weight, 3, dim=0
                )
                if self.bias:
                    q_bias, k_bias, v_bias = torch.chunk(
                        self.packed_proj.bias, 3, dim=0
                    )
                else:
                    q_bias, k_bias, v_bias = None, None, None
                query, key, value = (
                    F.linear(query, q_weight, q_bias),
                    F.linear(key, k_weight, k_bias),
                    F.linear(value, v_weight, v_bias),
                )
        else:
            query = self.q_proj(query)
            key = self.k_proj(key)
            value = self.v_proj(value)

        # Step 2. Split heads and prepare for SDPA
        # reshape query, key, value to separate by head
        # (L_t is the query/target length, L_s is the key/value/source length)
        # (N, L_t, E_total) -> (N, L_t, nheads, E_head) -> (N, nheads, L_t, E_head)
        query = query.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)
        # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
        key = key.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)
        # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
        value = value.unflatten(-1, [self.nheads, self.E_head]).transpose(1, 2)

        # Step 3. Run SDPA
        # (N, nheads, L_t, E_head)
        attn_output = F.scaled_dot_product_attention(
            query, key, value, dropout_p=self.dropout, is_causal=is_causal
        )
        # (N, nheads, L_t, E_head) -> (N, L_t, nheads, E_head) -> (N, L_t, E_total)
        attn_output = attn_output.transpose(1, 2).flatten(-2)

        # Step 4. Apply output projection
        # (N, L_t, E_total) -> (N, L_t, E_out)
        attn_output = self.out_proj(attn_output)

        return attn_output
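Utilities#
In this section, we include a utility to generate semi-realistic data using a Zipf distribution for sentence lengths. This is used to generate the nested query, key, and value tensors. We also include a benchmark utility.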
import numpy as np


def zipf_sentence_lengths(alpha: float, batch_size: int) -> torch.Tensor:
    # generate fake corpus by unigram Zipf distribution
    # from wikitext-2 corpus, we get rank "." = 3, "!" = 386, "?" = 858
    sentence_lengths = np.empty(batch_size, dtype=int)
    for ibatch in range(batch_size):
        sentence_lengths[ibatch] = 1
        word = np.random.zipf(alpha)
        while word != 3 and word != 386 and word != 858:
            sentence_lengths[ibatch] += 1
            word = np.random.zipf(alpha)
    return torch.tensor(sentence_lengths)


# Generate a batch of semi-realistic data using Zipf distribution for sentence lengths
# in the form of nested tensors with the jagged layout.
def gen_batch(N, E_q, E_k, E_v, device, dtype=torch.float32, query_seq_len_1=False):
    # generate semi-realistic data using Zipf distribution for sentence lengths
    sentence_lengths = zipf_sentence_lengths(alpha=1.2, batch_size=N)

    # Note: the torch.jagged layout is a nested tensor layout that supports a single ragged
    # dimension and works with torch.compile. The resulting nested tensors have shape
    # (B, S*, D), where B = batch size, S* = ragged sequence length, and D = embedding
    # dimension; each batch item has shape (S_i, D).
    if query_seq_len_1:
        query = torch.nested.nested_tensor(
            [torch.randn(1, E_q, dtype=dtype, device=device) for l in sentence_lengths],
            layout=torch.jagged,
        )
    else:
        query = torch.nested.nested_tensor(
            [
                torch.randn(l.item(), E_q, dtype=dtype, device=device)
                for l in sentence_lengths
            ],
            layout=torch.jagged,
        )

    key = torch.nested.nested_tensor(
        [
            torch.randn(s.item(), E_k, dtype=dtype, device=device)
            for s in sentence_lengths
        ],
        layout=torch.jagged,
    )

    value = torch.nested.nested_tensor(
        [
            torch.randn(s.item(), E_v, dtype=dtype, device=device)
            for s in sentence_lengths
        ],
        layout=torch.jagged,
    )

    return query, key, value, sentence_lengths


import math
import timeit


def benchmark(func, *args, **kwargs):
    # Synchronize before and after so that asynchronous CUDA work is fully timed
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    begin = timeit.default_timer()
    output = func(*args, **kwargs)
    torch.cuda.synchronize()
    end = timeit.default_timer()
    return output, (end - begin), torch.cuda.max_memory_allocated()
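The benchmark utility above times a single call, so individual measurements can be noisy. As an optional variation (not part of the original tutorial), the sketch below repeats the call and reports the median. The name benchmark_repeated and its sync parameter are hypothetical; on a GPU one could pass torch.cuda.synchronize as sync.

```python
import statistics
import timeit


def benchmark_repeated(func, *args, repeats=5, sync=None, **kwargs):
    """Time `func` several times and return (last output, median seconds).

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked around each call so that
    asynchronous work is included in the measurement.
    """
    times = []
    output = None
    for _ in range(repeats):
        if sync is not None:
            sync()
        begin = timeit.default_timer()
        output = func(*args, **kwargs)
        if sync is not None:
            sync()
        times.append(timeit.default_timer() - begin)
    return output, statistics.median(times)


# Usage with a plain Python function (no GPU needed for this illustration).
result, seconds = benchmark_repeated(sum, range(100_000), repeats=3)
print(result, f"{seconds:.6f} s")
```

The median is less sensitive than the mean to a single slow iteration, such as one that triggers recompilation.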
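We will now demonstrate the performance improvements of using nested tensors in the MultiHeadAttention layer + compile for self-attention. We compare this against the traditional nn.MultiheadAttention + compile with padding and masking.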
N, E_q, E_k, E_v, E_total = 512, 512, 512, 512, 512
E_out = E_q
d_model = E_q
nheads = 8
dropout = 0.0
bias = True
device = "cuda"
torch.manual_seed(6)
query, key, value, sentence_lengths = gen_batch(N, E_q, E_k, E_v, device)
S = sentence_lengths.max().item()
print(
    f"Total sequence length in nested query {sentence_lengths.sum().item()}, max sequence length {S}"
)
padded_query, padded_key, padded_value = (
    t.to_padded_tensor(0.0) for t in (query, key, value)
)

torch.manual_seed(6)
mha_layer = MultiHeadAttention(
    E_q, E_k, E_v, E_total, nheads, dropout=dropout, bias=bias, device="cuda"
)
torch.manual_seed(6)
vanilla_mha_layer = nn.MultiheadAttention(
    E_q, nheads, dropout=dropout, batch_first=True, bias=bias, device="cuda"
)

# ``nn.MultiheadAttention`` uses a non conventional initialization for layers, so do this for exact parity :(
mha_layer.out_proj.weight = nn.Parameter(
    vanilla_mha_layer.out_proj.weight.clone().detach()
)
mha_layer.packed_proj.weight = nn.Parameter(
    vanilla_mha_layer.in_proj_weight.clone().detach()
)
mha_layer.out_proj.bias = nn.Parameter(vanilla_mha_layer.out_proj.bias.clone().detach())
mha_layer.packed_proj.bias = nn.Parameter(
    vanilla_mha_layer.in_proj_bias.clone().detach()
)

new_mha_layer = torch.compile(mha_layer)
# warmup compile
nested_result_warmup = new_mha_layer(query, query, query, is_causal=True)

# benchmark
nested_result, nested_time, nested_peak_memory = benchmark(
    new_mha_layer, query, query, query, is_causal=True
)
padded_nested_result = nested_result.to_padded_tensor(0.0)

# For the vanilla ``nn.MultiheadAttention``, we need to construct the ``key_padding_mask``
# Further, ``nn.MultiheadAttention`` forces one to materialize the ``attn_mask`` even if using ``is_causal``
src_key_padding_mask = torch.where(padded_query == 0.0, -math.inf, 0)[:, :, 0]
attn_mask = torch.empty((N, S, S), device=device).fill_(float("-inf"))
for i, s in enumerate(sentence_lengths):
    attn_mask[i, :s, :s] = nn.Transformer.generate_square_subsequent_mask(s)
attn_mask = attn_mask.unsqueeze(1).expand(N, nheads, S, S).reshape(N * nheads, S, S)

vanilla_mha_layer = torch.compile(vanilla_mha_layer)
# warmup compile
warmup_vanilla_result = vanilla_mha_layer(
    padded_query,
    padded_query,
    padded_query,
    attn_mask=attn_mask,
    key_padding_mask=src_key_padding_mask,
    need_weights=False,
    is_causal=True,
)

# benchmark
(padded_result, _), padded_time, padded_peak_memory = benchmark(
    vanilla_mha_layer,
    padded_query,
    padded_query,
    padded_query,
    key_padding_mask=src_key_padding_mask,
    need_weights=False,
    attn_mask=attn_mask,
    is_causal=True,
)

print(f"{padded_time=:.5f}, padded_peak_memory={padded_peak_memory/1e9:.2f} GB")
print(f"{nested_time=:.5f}, nested_peak_memory={nested_peak_memory/1e9:.2f} GB")
print(
    "Max difference between vanilla and nested result",
    (padded_result - padded_nested_result).abs().max().item(),
)
print(f"Nested speedup: {(padded_time/nested_time):.2f}")
print(
    f"Nested peak memory reduction {((padded_peak_memory - nested_peak_memory)/1e9):.2f} GB"
)
Total sequence length in nested query 10869, max sequence length 133
padded_time=0.01925, padded_peak_memory=3.87 GB
nested_time=0.00254, nested_peak_memory=0.74 GB
Max difference between vanilla and nested result 0.0
Nested speedup: 7.57
Nested peak memory reduction 3.13 GB
For reference, here are some sample outputs on an A100:
padded_time=0.03454, padded_peak_memory=4.14 GB
nested_time=0.00612, nested_peak_memory=0.76 GB
Max difference between vanilla and nested result 0.0
Nested speedup: 5.65
Nested peak memory reduction 3.39 GB
We can also see the same for the backward pass.
for i, entry_length in enumerate(sentence_lengths):
    # padding-specific step: remove output projection bias from padded entries for fair comparison
    padded_result[i, entry_length:, :] = 0.0

_, padded_bw_time, padded_bw_peak_mem = benchmark(
    lambda: padded_result.sum().backward()
)
_, nested_bw_time, nested_bw_peak_mem = benchmark(
    lambda: padded_nested_result.sum().backward()
)

print(f"{padded_bw_time=:.5f}, padded_bw_peak_mem={padded_bw_peak_mem/1e9:.2f} GB")
print(f"{nested_bw_time=:.5f}, nested_bw_peak_mem={nested_bw_peak_mem/1e9:.2f} GB")
print(f"Nested backward speedup: {(padded_bw_time/nested_bw_time):.2f}")
print(
    f"Nested backward peak memory reduction {((padded_bw_peak_mem - nested_bw_peak_mem)/1e9):.2f} GB"
)

print(
    "Difference in out_proj.weight.grad",
    (mha_layer.out_proj.weight.grad - vanilla_mha_layer.out_proj.weight.grad)
    .abs()
    .max()
    .item(),
)
print(
    "Difference in packed_proj.weight.grad",
    (mha_layer.packed_proj.weight.grad - vanilla_mha_layer.in_proj_weight.grad)
    .abs()
    .max()
    .item(),
)
print(
    "Difference in out_proj.bias.grad",
    (mha_layer.out_proj.bias.grad - vanilla_mha_layer.out_proj.bias.grad)
    .abs()
    .max()
    .item(),
)
print(
    "Difference in packed_proj.bias.grad",
    (mha_layer.packed_proj.bias.grad - vanilla_mha_layer.in_proj_bias.grad)
    .abs()
    .max()
    .item(),
)
padded_bw_time=1.84617, padded_bw_peak_mem=4.79 GB
nested_bw_time=0.06780, nested_bw_peak_mem=3.06 GB
Nested backward speedup: 27.23
Nested backward peak memory reduction 1.73 GB
Difference in out_proj.weight.grad 0.00042724609375
Difference in packed_proj.weight.grad 0.001983642578125
Difference in out_proj.bias.grad 0.0
Difference in packed_proj.bias.grad 0.001953125
Sample outputs on an A100:
padded_bw_time=2.09337, padded_bw_peak_mem=5.10 GB
nested_bw_time=0.01452, nested_bw_peak_mem=3.24 GB
Nested backward speedup: 144.13
Nested backward peak memory reduction 1.86 GB
Difference in out_proj.weight.grad 0.000244140625
Difference in packed_proj.weight.grad 0.001556396484375
Difference in out_proj.bias.grad 0.0
Difference in packed_proj.bias.grad 0.001953125
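GPT-style layer#
A basic GPT-style transformer layer consists of a causal self-attention layer followed by a feed-forward network (FFN) with skip connections. Implementing this with the MultiHeadAttention layer above is fairly straightforward and gives results equivalent to an nn.TransformerEncoderLayer with is_causal=True.
We demonstrate examples of implementing the rest of the nn layers elsewhere, but omit them from this tutorial for brevity.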
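As a rough illustration of the structure described above, the following sketch wires a causal self-attention sub-layer and an FFN together with skip connections. It is a simplified stand-in, not the tutorial's omitted implementation: the class name TinyGPTBlock, the post-norm ordering, the ReLU activation, and all sizes are assumptions made for this example, and it uses dense tensors on CPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGPTBlock(nn.Module):
    """Hypothetical minimal block: causal self-attention + FFN, each with a skip connection."""

    def __init__(self, d_model: int, nheads: int, d_ff: int):
        super().__init__()
        assert d_model % nheads == 0, "d_model must be divisible by nheads"
        self.nheads = nheads
        self.head_dim = d_model // nheads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # packed input projection
        self.out_proj = nn.Linear(d_model, d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def _attn(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (N, L, d_model) -> (N, nheads, L, head_dim)
        q, k, v = (
            t.unflatten(-1, [self.nheads, self.head_dim]).transpose(1, 2)
            for t in (q, k, v)
        )
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # (N, nheads, L, head_dim) -> (N, L, d_model)
        return self.out_proj(out.transpose(1, 2).flatten(-2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self._attn(x))  # attention sub-layer with skip connection
        return self.norm2(x + self.ff(x))  # feed-forward sub-layer with skip connection


torch.manual_seed(0)
block = TinyGPTBlock(d_model=16, nheads=4, d_ff=32).eval()
x = torch.randn(2, 5, 16)
y = block(x)

# Causality check: perturbing the last token must not change earlier outputs.
x2 = x.clone()
x2[:, -1] += 1.0
y2 = block(x2)
print(y.shape, torch.allclose(y[:, :-1], y2[:, :-1], atol=1e-6))
```

Swapping LayerNorm for another normalization, or moving the norms before each sub-layer, only touches the forward method, which is the kind of layer-type and layer-ordering change the classification above refers to.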
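Going one step further#
So far, we have demonstrated how to implement a performant MultiHeadAttention layer that follows the traditional nn.MultiheadAttention. Going back to our classification of modifications to the transformer architecture, remember that we classified the modifications into layer type, layer ordering, and modifications to the attention score. We trust that changing layer type and layer ordering (such as swapping LayerNorm for RMSNorm) is fairly straightforward.
In this section, we will discuss various functionalities using the aforementioned building blocks, including the following:
Cross Attention
Fully masked rows no longer cause NaNs
Packed Projection
Cross Attention#
Cross attention is a form of attention where the query and key/value tensors come from different sequences.
One example of this is in nn.TransformerDecoderLayer, where the query comes from the decoder and the key/value come from the encoder.
The above MultiHeadAttention layer nicely generalizes to this case, with nested tensors for both query and key/value.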
query, _, _, q_len = gen_batch(N, E_q, E_k, E_v, device)
_, key, value, kv_len = gen_batch(N, E_q, E_k, E_v, device)

print(
    f"Total sequence length in nested query {q_len.sum().item()}, max sequence length {q_len.max().item()}"
)
print(
    f"Total sequence length in nested key/value {kv_len.sum().item()}, max sequence length {kv_len.max().item()}"
)
out = new_mha_layer(query, key, value, is_causal=False)
Total sequence length in nested query 10723, max sequence length 142
Total sequence length in nested key/value 11068, max sequence length 122
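To illustrate the shapes involved in cross attention, here is a small dense-tensor sketch (the sizes are arbitrary and nested tensors are not used). The output has one row per query position, while each query attends over all key/value positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, H, E = 2, 4, 8
L_q, L_kv = 3, 7  # query and key/value sequences have different lengths

q = torch.randn(N, H, L_q, E)   # for example, derived from the decoder
k = torch.randn(N, H, L_kv, E)  # for example, derived from the encoder
v = torch.randn(N, H, L_kv, E)

out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

# Explicit attention weights, computed only to show their shape.
weights = torch.softmax(q @ k.transpose(-2, -1) / E**0.5, dim=-1)
print(out.shape)      # (N, H, L_q, E)
print(weights.shape)  # (N, H, L_q, L_kv)
```

With nested tensors, L_q and L_kv additionally vary per batch item, which is the case the layer above handles.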
As above, we can compare this against the vanilla compiled nn.MultiheadAttention.
torch.manual_seed(6)
query, _, _, q_len = gen_batch(N, E_q, E_k, E_v, device)
_, key, value, kv_len = gen_batch(N, E_q, E_k, E_v, device)
padded_query, padded_key, padded_value = (
    t.to_padded_tensor(0.0) for t in (query, key, value)
)

key_padding_mask = torch.where(padded_key == 0.0, -math.inf, 0)[:, :, 0]

# warmup compile
warmup_nested_result = new_mha_layer(query, key, value, is_causal=False)
warmup_vanilla_result = vanilla_mha_layer(
    padded_query,
    padded_key,
    padded_value,
    key_padding_mask=key_padding_mask,
    need_weights=False,
    is_causal=False,
)

nested_result, nested_time, nested_peak_memory = benchmark(
    new_mha_layer, query, key, value, is_causal=False
)
(padded_result, _), padded_time, padded_peak_memory = benchmark(
    vanilla_mha_layer,
    padded_query,
    padded_key,
    padded_value,
    key_padding_mask=key_padding_mask,
    need_weights=False,
    is_causal=False,
)
padded_nested_result = nested_result.to_padded_tensor(0.0)
for i, entry_length in enumerate(q_len):
    # padding-specific step: remove output projection bias from padded entries for fair comparison
    padded_result[i, entry_length:, :] = 0.0

print(
    "Max difference between vanilla and nested result",
    (padded_result - padded_nested_result).abs().max().item(),
)
print(f"Nested speedup: {(padded_time/nested_time):.2f}")
print(
    f"Nested peak memory reduction {((padded_peak_memory - nested_peak_memory)/1e9):.2f} GB"
)
Max difference between vanilla and nested result 0.0
Nested speedup: 5.72
Nested peak memory reduction 1.30 GB
Sample outputs on an A100:
Max difference between vanilla and nested result 0.0
Nested speedup: 4.01
Nested peak memory reduction 1.40 GB
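Fully masked rows no longer cause NaNs#
There has been a long-standing issue with nn.MultiheadAttention and scaled_dot_product_attention where, if a row was fully masked out, the output of the attention layer would be NaN. See the corresponding issue. This is because the softmax over an empty set is undefined.
Thanks to a fix that has since landed, this is no longer the case. Instead, the output corresponding to fully masked rows in scaled_dot_product_attention will be 0. This also applies to cases where nn.MHA does not employ the "fast-path".
Using a custom MHA layer with NJTs is strongly recommended over the existing "fast-path" in nn.MultiheadAttention, as NJT's ability to model raggedness appropriately makes it possible to properly express empty sequences.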
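The origin of the NaN can be reproduced with a plain softmax. The sketch below uses a made-up 2x3 score matrix and shows the zero-output convention applied by hand with torch.where; it illustrates the behavior described above and is not the implementation used inside scaled_dot_product_attention.

```python
import torch

scores = torch.tensor([[0.5, -1.0, 2.0],
                       [0.0, 0.0, 0.0]])
# Boolean mask: True means "may attend". The second row is fully masked.
allowed = torch.tensor([[True, True, False],
                        [False, False, False]])

masked = scores.masked_fill(~allowed, float("-inf"))
# A softmax over a row with no allowed entries is undefined and yields NaN.
naive = torch.softmax(masked, dim=-1)

# Convention described above: a fully masked row produces zeros instead of NaN.
row_has_any = allowed.any(dim=-1, keepdim=True)
safe = torch.where(row_has_any, naive, torch.zeros_like(naive))

print(naive)
print(safe)
```

Rows that have at least one allowed entry are unaffected by the correction.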
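Packed Projection#
Packed projection is a technique that makes use of the fact that when the input to the projections (matrix multiplications) is the same (self-attention), we can pack the projection weights and biases into single tensors. It is especially useful when the individual projections are memory bound rather than compute bound. We will demonstrate two examples here:
Input projection for MultiheadAttention
SwiGLU activation in the feed-forward network of a Transformer layer
Input projection for MultiheadAttention#
When doing self-attention, the query, key, and value are the same tensor. Each of these tensors is projected with a Linear(E_q, E_total) layer. Instead, we can pack this into one layer, which is what we do in the MultiHeadAttention layer above.
Let us compare the performance of the packed projection against the usual method.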
class InputProjection(nn.Module):
    def __init__(self, E_q, E_total, bias=False, device=None, dtype=None):
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.q_proj = nn.Linear(E_q, E_total, bias=bias, **factory_kwargs)
        self.k_proj = nn.Linear(E_q, E_total, bias=bias, **factory_kwargs)
        self.v_proj = nn.Linear(E_q, E_total, bias=bias, **factory_kwargs)

    def forward(self, x):
        return self.q_proj(x), self.k_proj(x), self.v_proj(x)


class PackedInputProjection(nn.Module):
    def __init__(self, E_q, E_total, bias=False, device=None, dtype=None):
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.packed_proj = nn.Linear(E_q, E_total * 3, bias=bias, **factory_kwargs)

    def forward(self, query):
        return torch.chunk(self.packed_proj(query), 3, dim=-1)


B, D, dtype = 256, 8192, torch.bfloat16

torch.set_float32_matmul_precision("high")
in_proj = torch.compile(InputProjection(D, D, device="cuda", dtype=torch.bfloat16))
packed_in_proj = torch.compile(
    PackedInputProjection(D, D, device="cuda", dtype=torch.bfloat16)
)

q, _, _, sequence_lengths = gen_batch(B, D, D, D, device="cuda", dtype=torch.bfloat16)

# warmup
in_proj(q)
packed_in_proj(q)

# benchmark
(q_out, k_out, v_out), time, _ = benchmark(in_proj, q)
(q_out, k_out, v_out), time_packed, _ = benchmark(packed_in_proj, q)
# On my A100 prints 1.05x speedup
print(
    f"InputProjection: {time:5f} s, PackedInputProjection: {time_packed:5f} s, speedup: {time/time_packed:.2f}x"
)
InputProjection: 0.030302 s, PackedInputProjection: 0.030020 s, speedup: 1.01x
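The equivalence that packed projection relies on can be verified numerically. In this sketch (small, arbitrary sizes on CPU), the weights and biases of three separate linear layers are concatenated along the output dimension into one packed layer, and the chunked output of the packed layer is compared with the three individual projections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
E_q, E_total = 8, 8
q_proj, k_proj, v_proj = (nn.Linear(E_q, E_total) for _ in range(3))

# Stack the three weight matrices (and biases) along the output dimension.
packed = nn.Linear(E_q, 3 * E_total)
with torch.no_grad():
    packed.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight], dim=0))
    packed.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias], dim=0))

x = torch.randn(4, 5, E_q)
q, k, v = torch.chunk(packed(x), 3, dim=-1)  # one matmul instead of three

print(torch.allclose(q, q_proj(x), atol=1e-6))
print(torch.allclose(k, k_proj(x), atol=1e-6))
print(torch.allclose(v, v_proj(x), atol=1e-6))
```

The packed layer therefore computes the same values while reading the input once, which is where the benefit for memory-bound projections comes from.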
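SwiGLU feed-forward network of a Transformer layer#
Swish-Gated Linear Unit (SwiGLU) is a non-linear activation function that is increasingly popular in the feed-forward network of the transformer layer (for example, in Llama). A feed-forward network with SwiGLU activation is defined as: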
class SwiGLUFFN(nn.Module):
    def __init__(
        self,
        dim,
        hidden_dim,
        multiple_of,
        ffn_dim_multiplier=None,
        device=None,
        dtype=None,
    ):
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = nn.Linear(dim, hidden_dim, bias=False, **factory_kwargs)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False, **factory_kwargs)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False, **factory_kwargs)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
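The hidden-size arithmetic in the constructor is easy to misread, so here it is as a standalone helper. The function name swiglu_hidden_dim is hypothetical; the computation mirrors the constructor above, and the example input matches the configuration benchmarked in this section (D = 128, hidden_dim = D * 4, multiple_of = 256).

```python
def swiglu_hidden_dim(hidden_dim, multiple_of, ffn_dim_multiplier=None):
    """Mirror the hidden-size computation used in the SwiGLUFFN constructor."""
    hidden_dim = int(2 * hidden_dim / 3)
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the nearest multiple of `multiple_of`.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


D = 128
# int(2 * 512 / 3) = 341, rounded up to the next multiple of 256
print(swiglu_hidden_dim(D * 4, 256))
```

The 2/3 scaling keeps the parameter count roughly comparable to a two-matrix FFN, since SwiGLU uses three weight matrices instead of two.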
An alternative way of implementing this is with a packed projection:
class PackedSwiGLUFFN(nn.Module):
    def __init__(
        self,
        dim,
        hidden_dim,
        multiple_of,
        ffn_dim_multiplier=None,
        device=None,
        dtype=None,
    ):
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w13 = nn.Linear(dim, 2 * hidden_dim, bias=False, **factory_kwargs)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False, **factory_kwargs)

    def forward(self, x):
        x1, x3 = torch.chunk(self.w13(x), 2, dim=-1)
        return self.w2(F.silu(x1) * x3)
We can compare the performance of the two implementations. Depending on your hardware, you might see different results. On an A100, I see a 1.12x speedup for D=128.
D = 128

swigluffn = torch.compile(SwiGLUFFN(D, D * 4, 256, device="cuda", dtype=torch.bfloat16))
packed_swigluffn = torch.compile(
    PackedSwiGLUFFN(D, D * 4, 256, device="cuda", dtype=torch.bfloat16)
)

q, _, _, sentence_lengths = gen_batch(D, D, D, D, device="cuda", dtype=torch.bfloat16)

# warmup
swigluffn(q)
packed_swigluffn(q)

# benchmark
_, time, _ = benchmark(swigluffn, q)
_, time_packed, _ = benchmark(packed_swigluffn, q)
# On my A100 prints 1.08x speedup
print(
    f"SwiGLUFFN: {time} s, PackedSwiGLUFFN: {time_packed} s, speedup: {time/time_packed:.2f}x"
)
SwiGLUFFN: 0.0009849209998264996 s, PackedSwiGLUFFN: 0.0008981599999060563 s, speedup: 1.10x
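Extended examples#
We intend to update this tutorial to demonstrate more examples of how to use the various performant building blocks, such as KV-Caching, Grouped Query Attention, and so on. Further, there are several good examples of using various performant building blocks to implement various transformer architectures.
Conclusion#
In this tutorial, we have introduced the low-level building blocks PyTorch provides for writing transformer layers and demonstrated examples of how to compose them. It is our hope that this tutorial has shown the reader how easily flexible and performant transformer layers can be implemented with PyTorch.
Total running time of the script: (0 minutes 20.649 seconds)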