Adafactor

class torch.optim.Adafactor(params, lr=0.01, beta2_decay=-0.8, eps=(None, 0.001), d=1.0, weight_decay=0.0, *, foreach=None, maximize=False)

Implements the Adafactor algorithm.

\begin{aligned} &\rule{110mm}{0.4pt} \\ &\textbf{input} : \gamma \text{(lr)}, \: \tau \text{(}\beta_2\text{ decay)}, \: \theta_0 \text{(params)}, \: f(\theta) \text{(objective)}, \\ &\hspace{15mm} \: \epsilon_1, \epsilon_2 \text{ (epsilons)}, \: d \text{(clipping threshold)}, \\ &\hspace{15mm} \: \lambda \text{(weight decay)}, \: \textit{maximize} \\ &\textbf{initialize} : \: R_0 \leftarrow 0 \text{ (second moment row factor)}, \\ &\hspace{23mm} \: C_0 \leftarrow 0 \text{ (second moment col factor)}, \\ &\hspace{23mm} \: \widehat{V}_0 \leftarrow 0 \text{ (second moment for vectors)} \\[-1.ex] &\rule{110mm}{0.4pt} \\ &\textbf{for} \: t=1 \: \textbf{to} \: \ldots \: \textbf{do} \\ &\hspace{5mm}\textbf{if} \: \textit{maximize}: \\ &\hspace{10mm}G_t \leftarrow -\nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}G_t \leftarrow \nabla_{\theta} f_t (\theta_{t-1}) \\ &\hspace{5mm}\widehat{\beta}_{2_t} \leftarrow 1 - t^{\tau} \\ &\hspace{5mm}\rho_t \leftarrow min(lr, \frac{1}{\sqrt{t}}) \\ &\hspace{5mm}\alpha_t \leftarrow max(\epsilon_2, \text{RMS}(\theta_{t-1}))\rho_t \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \gamma \lambda \theta_{t-1} \\ &\hspace{5mm}\textbf{if} \: \text{dim}(G_t) > 1: \\ &\hspace{10mm}R_t \leftarrow \widehat{\beta}_{2_t}R_{t-1}+ (1-\widehat{\beta}_{2_t})(G_t \odot G_t) \cdot 1_m \\ &\hspace{10mm}C_t \leftarrow \widehat{\beta}_{2_t}C_{t-1}+ (1-\widehat{\beta}_{2_t}) 1^\top_n \cdot (G_t \odot G_t) \\ &\hspace{10mm}\widehat{V}_t \leftarrow \frac{R_t \cdot C_t}{max(1^\top_n \cdot R_t, \epsilon_1)} \\ &\hspace{5mm}\textbf{else} \\ &\hspace{10mm}\widehat{V}_t \leftarrow \widehat{\beta}_{2_t}\widehat{V}_{t-1}+ (1-\widehat{\beta}_{2_t}) \cdot (G_t \odot G_t) \\ &\hspace{5mm}U_t \leftarrow \frac{G_t}{max(\sqrt{\widehat{V}_t}, \epsilon_1)} \\ &\hspace{5mm}\widehat{U}_t \leftarrow \frac{U_t}{max(1, \frac{\text{RMS}(U_t)}{d})} \\ &\hspace{5mm}\theta_t \leftarrow \theta_{t-1} - \alpha_t \widehat{U}_t \\ &\rule{110mm}{0.4pt} 
\\[-1.ex] &\bf{return} \: \theta_t \\[-1.ex] &\rule{110mm}{0.4pt} \\[-1.ex] \end{aligned}

For further details regarding the algorithm we refer to Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.
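The vector branch of the pseudocode above (the case where dim(G_t) is 1) can be traced by hand. The following is a minimal pure-Python sketch, not the library's implementation: the function name `adafactor_vector_step` and the tiny `eps1` default are illustrative assumptions (the real optimizer derives epsilon1 from the dtype when it is None), and the `maximize` flag is left out.

```python
import math


def rms(values):
    """Root mean square of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def adafactor_vector_step(theta, grad, v_hat, t, lr=1e-2, beta2_decay=-0.8,
                          eps1=1e-30, eps2=1e-3, d=1.0, weight_decay=0.0):
    """One update for a 1-D parameter, following the pseudocode line by line.

    Hypothetical helper for illustration only; eps1 here is an assumed value.
    """
    beta2_t = 1.0 - t ** beta2_decay              # beta2_t <- 1 - t^tau
    rho_t = min(lr, 1.0 / math.sqrt(t))           # relative step size
    alpha_t = max(eps2, rms(theta)) * rho_t       # parameter-scaled step size
    theta = [p - lr * weight_decay * p for p in theta]  # decoupled weight decay
    v_hat = [beta2_t * v + (1.0 - beta2_t) * g * g for v, g in zip(v_hat, grad)]
    u = [g / max(math.sqrt(v), eps1) for g, v in zip(grad, v_hat)]
    clip = max(1.0, rms(u) / d)                   # update clipping
    theta = [p - alpha_t * (ui / clip) for p, ui in zip(theta, u)]
    return theta, v_hat


theta, v_hat = adafactor_vector_step([1.0, -1.0], [0.5, 0.5], [0.0, 0.0], t=1)
print(theta, v_hat)
```

At t=1 the decay coefficient is 0, so the second-moment estimate equals the squared gradient and every entry of U_t has magnitude 1.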

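Parameters

params (iterable) – iterable of parameters or named_parameters to optimize, or iterable of dicts defining parameter groups. When using named_parameters, all parameters in all groups should be named.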
lr (float, Tensor, optional) – unlike other optimizers, Adafactor does not require a learning rate, and Noam Shazeer and Mitchell Stern do not use lr at all. Deviating from the paper, this implementation uses lr for applying weight decay and as the maximum value for relative step size rho_t. Note that in the paper, a constant of 0.01 is used as the maximum value for relative step size, and so we set 0.01 as the default value. (default: 1e-2)
beta2_decay (float, optional) – the decay rate of beta2. beta2 standardly refers to the coefficient used for computing the running average of the gradient squared. (default: -0.8)
eps (Tuple[float, float], optional) – epsilon1 is the term added to the denominator of the update calculation to improve numerical stability. This use of epsilon1 deviates from the algorithm written in the paper! See note below for more details. epsilon2 is the term used to avoid having too small a weight update when applying parameter scaling. (default: (None, 1e-3))
d (float, optional) – the clipping threshold, used to avoid larger-than-desired updates. (default: 1.0)
weight_decay (float, optional) – weight decay coefficient (default: 0.0)
foreach (bool, optional) – whether foreach implementation of optimizer is used. Note that the foreach implementation uses ~ sizeof(params) more peak memory than the for-loop version due to the intermediates being a tensorlist vs just one tensor. As Adafactor is commonly used when memory is prohibitive, Adafactor will default to the slower single tensor for-loop implementation unless this flag is explicitly True. This behavior is contrary to other optimizers, which will attempt defaulting to foreach on CUDA for faster runtime. (default: None)
maximize (bool, optional) – maximize the objective with respect to the params, instead of minimizing (default: False)

Note

The implementation of Adafactor subtly differs from Noam Shazeer and Mitchell Stern and implementations in some other frameworks with its use of learning rate and $\epsilon_1$ .

Regarding the learning rate hyperparameter: Noam Shazeer and Mitchell Stern do not use lr at all, as the stated algorithm uses $\rho_t$ and update clipping to affect the step size.

This implementation allows lr to influence the maximum value for $\rho_t$:

\begin{aligned} &\hspace{5mm}\rho_t \leftarrow min(lr, \frac{1}{\sqrt{t}}) \end{aligned}

This differs from Noam Shazeer and Mitchell Stern, who use a constant of 0.01 as the maximum value of $\rho_t$:

\begin{aligned} &\hspace{5mm}\rho_t \leftarrow min(0.01, \frac{1}{\sqrt{t}}) \end{aligned}
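The effect of the cap can be checked numerically. This is a small sketch of the formula only; `relative_step_size` and `max_rho` are hypothetical names, with `max_rho` standing in for either lr (this implementation) or the constant 0.01 (the paper).

```python
import math


def relative_step_size(t, max_rho):
    """rho_t <- min(max_rho, 1 / sqrt(t)); hypothetical helper name."""
    return min(max_rho, 1.0 / math.sqrt(t))


early = relative_step_size(4, max_rho=1.0)          # cap inactive: 1/sqrt(4)
capped = relative_step_size(4, max_rho=0.01)        # cap active
late = relative_step_size(1_000_000, max_rho=0.01)  # 1/sqrt(t) is below the cap
print(early, capped, late)
```

With a cap of 0.01, the 1/sqrt(t) term only takes over once t exceeds 1/0.01^2 = 10,000 steps.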

Noam Shazeer and Mitchell Stern do not enforce an opinion on how weight decay should be computed, and so we use the learning rate as a coefficient for decoupled weight decay, similar to what is suggested in Decoupled Weight Decay Regularization.

Regarding the use of $\epsilon_1$ : The implementation attempts to replicate the presumed intention of Noam Shazeer and Mitchell Stern to use $\epsilon_1$ as a stabilizing term when the squared gradient becomes small.

This stabilization can be written as:

\begin{aligned} &\hspace{5mm}R_t \leftarrow \widehat{\beta}_{2_t}R_{t-1}+ (1-\widehat{\beta}_{2_t})(G_t \odot G_t) \cdot 1_m \\ &\hspace{5mm}C_t \leftarrow \widehat{\beta}_{2_t}C_{t-1}+ (1-\widehat{\beta}_{2_t}) 1^\top_n \cdot (G_t \odot G_t) \\ &\hspace{5mm}\widehat{V}_t \leftarrow \frac{R_t \cdot C_t}{max(1^\top_n \cdot R_t, \epsilon_1)} \\ &\hspace{5mm}U_t \leftarrow \frac{G_t}{max(\sqrt{\widehat{V}_t}, \epsilon_1)} \\ \end{aligned}

where the row and column factors of gradient squared $R_t$ and $C_t$ are left alone, and we apply $\epsilon_1$ at the final calculation of the variance estimate $\widehat{V}_t$ and for the update $U_t$ .

This is in contrast to Noam Shazeer and Mitchell Stern and other frameworks, which apply $\epsilon_1$ to both row and column factors of the squared gradient, but not in the calculations after:

\begin{aligned} &\hspace{5mm}R_t \leftarrow \widehat{\beta}_{2_t}R_{t-1}+ (1-\widehat{\beta}_{2_t})(G_t \odot G_t + \epsilon_1 1_n \cdot 1^\top_m) \cdot 1_m \\ &\hspace{5mm}C_t \leftarrow \widehat{\beta}_{2_t}C_{t-1}+ (1-\widehat{\beta}_{2_t}) 1^\top_n \cdot (G_t \odot G_t + \epsilon_1 1_n \cdot 1^\top_m) \\ &\hspace{5mm}\widehat{V}_t \leftarrow \frac{R_t \cdot C_t}{1^\top_n \cdot R_t} \\ &\hspace{5mm}U_t \leftarrow \frac{G_t}{\sqrt{\widehat{V}_t}} \\ \end{aligned}

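You may note that Noam Shazeer and Mitchell Stern describe using the sum of squared gradients, while this implementation uses the mean instead. This choice is mathematically equivalent and allows for greater numerical stability for large sums.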
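The equivalence of sums and means in the factored estimate can be verified on a small matrix. The sketch below is pure Python and deliberately ignores the running average and $\epsilon_1$; `factored_estimate` is a hypothetical helper that builds $\widehat{V}$ from row and column reductions of a squared-gradient matrix.

```python
import math


def mean(values):
    return sum(values) / len(values)


def factored_estimate(sq_grad, reduce):
    """Rank-1 estimate rows[i] * cols[j] / reduce(rows) using `reduce` (sum or mean)."""
    n, m = len(sq_grad), len(sq_grad[0])
    rows = [reduce(row) for row in sq_grad]
    cols = [reduce([sq_grad[i][j] for i in range(n)]) for j in range(m)]
    norm = reduce(rows)
    return [[rows[i] * cols[j] / norm for j in range(m)] for i in range(n)]


sq_grad = [[1.0, 4.0, 9.0], [16.0, 25.0, 36.0]]  # arbitrary G * G values
from_sums = factored_estimate(sq_grad, sum)
from_means = factored_estimate(sq_grad, mean)

same = all(
    math.isclose(a, b)
    for row_a, row_b in zip(from_sums, from_means)
    for a, b in zip(row_a, row_b)
)
print(same)
```

The 1/m and 1/n factors introduced by the means cancel against the 1/(n*m) factor in the normalizer, which is why both variants agree.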

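add_param_group(param_group)

Add a param group to the Optimizer's param_groups.

This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters: param_group (dict) – specifies which Tensors should be optimized, along with group-specific optimization options.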

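load_state_dict(state_dict)

Load the optimizer state.

Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

Warning

Make sure this method is called after initializing torch.optim.lr_scheduler.LRScheduler, as calling it beforehand will overwrite the loaded learning rates.

Note

The names of the parameters (if they exist under the "param_names" key of each param group in state_dict()) will not affect the loading process. To use the parameters' names for custom cases (such as when the parameters in the loaded state dict differ from those initialized in the optimizer), a custom register_load_state_dict_pre_hook should be implemented to adapt the loaded dict accordingly. If param_names exist in the loaded state dict param_groups, they will be saved and will override the current names, if present, in the optimizer state. If they do not exist in the loaded state dict, the optimizer param_names will remain unchanged.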

Example

>>> import torch
>>> model = torch.nn.Linear(10, 10)
>>> optim = torch.optim.SGD(model.parameters(), lr=3e-4)
>>> scheduler1 = torch.optim.lr_scheduler.LinearLR(
...     optim,
...     start_factor=0.1,
...     end_factor=1,
...     total_iters=20,
... )
>>> scheduler2 = torch.optim.lr_scheduler.CosineAnnealingLR(
...     optim,
...     T_max=80,
...     eta_min=3e-5,
... )
>>> lr = torch.optim.lr_scheduler.SequentialLR(
...     optim,
...     schedulers=[scheduler1, scheduler2],
...     milestones=[20],
... )
>>> lr.load_state_dict(torch.load("./save_seq.pt"))
>>> # now load the optimizer checkpoint after loading the LRScheduler
>>> optim.load_state_dict(torch.load("./save_optim.pt"))

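register_load_state_dict_post_hook(hook, prepend=False)

Register a load_state_dict post-hook which will be called after load_state_dict() is called. It should have the following signature:

hook(optimizer) -> None

The optimizer argument is the optimizer instance being used.

The hook will be called with argument self after calling load_state_dict on self. The registered hook can be used to perform post-processing after load_state_dict has loaded the state_dict.

Parameters

hook (Callable) – the user-defined hook to be registered.
prepend (bool) – if True, the provided post-hook will be fired before all the already registered post-hooks on load_state_dict. Otherwise, the provided hook will be fired after all the already registered post-hooks. (default: False)

Returns

a handle that can be used to remove the added hook by calling handle.remove()

Return type

torch.utils.hooks.RemovableHandle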

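register_load_state_dict_pre_hook(hook, prepend=False)

Register a load_state_dict pre-hook which will be called before load_state_dict() is called. It should have the following signature:

hook(optimizer, state_dict) -> state_dict or None

The optimizer argument is the optimizer instance being used and the state_dict argument is a shallow copy of the state_dict the user passed in to load_state_dict. The hook may modify the state_dict in place or optionally return a new one. If a state_dict is returned, it will be the one loaded into the optimizer.

The hook will be called with arguments self and state_dict before calling load_state_dict on self. The registered hook can be used to perform pre-processing before the load_state_dict call is made.

Parameters

hook (Callable) – the user-defined hook to be registered.
prepend (bool) – if True, the provided pre-hook will be fired before all the already registered pre-hooks on load_state_dict. Otherwise, the provided hook will be fired after all the already registered pre-hooks. (default: False)

Returns

a handle that can be used to remove the added hook by calling handle.remove()

Return type

torch.utils.hooks.RemovableHandle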

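register_state_dict_post_hook(hook, prepend=False)

Register a state_dict post-hook which will be called after state_dict() is called.

It should have the following signature:

hook(optimizer, state_dict) -> state_dict or None

The hook will be called with arguments self and state_dict after generating a state_dict on self. The hook may modify the state_dict in place or optionally return a new one. The registered hook can be used to perform post-processing on the state_dict before it is returned.

Parameters

hook (Callable) – the user-defined hook to be registered.
prepend (bool) – if True, the provided post-hook will be fired before all the already registered post-hooks on state_dict. Otherwise, the provided hook will be fired after all the already registered post-hooks. (default: False)

Returns

a handle that can be used to remove the added hook by calling handle.remove()

Return type

torch.utils.hooks.RemovableHandle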

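register_state_dict_pre_hook(hook, prepend=False)

Register a state_dict pre-hook which will be called before state_dict() is called.

It should have the following signature:

hook(optimizer) -> None

The optimizer argument is the optimizer instance being used. The hook will be called with argument self before calling state_dict on self. The registered hook can be used to perform pre-processing before the state_dict call is made.

Parameters

hook (Callable) – the user-defined hook to be registered.
prepend (bool) – if True, the provided pre-hook will be fired before all the already registered pre-hooks on state_dict. Otherwise, the provided hook will be fired after all the already registered pre-hooks. (default: False)

Returns

a handle that can be used to remove the added hook by calling handle.remove()

Return type

torch.utils.hooks.RemovableHandle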

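register_step_post_hook(hook)

Register an optimizer step post-hook which will be called after the optimizer step.

It should have the following signature:

hook(optimizer, args, kwargs) -> None

The optimizer argument is the optimizer instance being used.

Parameters: hook (Callable) – the user-defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle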

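register_step_pre_hook(hook)

Register an optimizer step pre-hook which will be called before the optimizer step.

It should have the following signature:

hook(optimizer, args, kwargs) -> None or modified args and kwargs

The optimizer argument is the optimizer instance being used. If args and kwargs are modified by the pre-hook, then the transformed values are returned as a tuple containing the new_args and new_kwargs.

Parameters: hook (Callable) – the user-defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle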

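state_dict()

Return the state of the optimizer as a dict.

It contains two entries:

state: a Dict holding the current optimization state. Its content differs between optimizer classes, but some common characteristics hold. For example, state is saved per parameter, and the parameter itself is NOT saved. state is a dictionary mapping parameter IDs to a Dict with the state corresponding to each parameter.
param_groups: a List containing all parameter groups, where each parameter group is a Dict. Each parameter group contains metadata specific to the optimizer, such as learning rate and weight decay, as well as a List of parameter IDs of the parameters in the group. If a param group was initialized with named_parameters(), the names will also be saved in the state dict.

NOTE: The parameter IDs may look like indices, but they are just IDs associating state with param_group. When loading from a state_dict, the optimizer will zip the param_group params (int IDs) and the optimizer param_groups (actual nn.Parameter s) in order to match state WITHOUT additional verification.

A returned state dict might look something like: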

{
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
        2: {'momentum_buffer': tensor(...), ...},
        3: {'momentum_buffer': tensor(...), ...}
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0],
            'param_names': ['param0']  (optional)
        },
        {
            'lr': 0.001,
            'weight_decay': 0.5,
            ...
            'params': [1, 2, 3],
            'param_names': ['param1', 'layer.weight', 'layer.bias']  (optional)
        }
    ]
}

Return type: dict[str, Any]

step(closure=None)

Perform a single optimization step.

Parameters: closure (Callable, optional) – a closure that reevaluates the model and returns the loss.

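zero_grad(set_to_none=True)

Reset the gradients of all optimized torch.Tensor s.

Parameters

set_to_none (bool, optional) –

Set the grads to None instead of setting them to zero. Default: True

This will in general have a lower memory footprint and can modestly improve performance. However, it changes certain behaviors. For example:

When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently.
If the user requests zero_grad(set_to_none=True) followed by a backward pass, the .grad of any param that did not receive a gradient is guaranteed to be None.
torch.optim optimizers behave differently depending on whether the gradient is 0 or None (in one case the step is performed with a gradient of 0, and in the other the step is skipped altogether).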
