
Pendulum: Writing your environment and transforms with TorchRL#

Created On: Nov 09, 2023 | Last Updated: Jan 27, 2025 | Last Verified: Nov 05, 2024

Author: Vincent Moens

Creating an environment (a simulator or an interface to a physical control system) is an integrative part of reinforcement learning and control engineering.

TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.

Pendulum

Simple Pendulum#

Key learnings:

  • How to design an environment in TorchRL: - writing specs (input, observation and reward); - implementing behavior: seeding, resetting and stepping.

  • Transforming your environment inputs and outputs, and writing your own transforms;

  • How to use TensorDict to carry arbitrary data structures through the codebase.

In the process, we will touch three crucial components of TorchRL:

  • environments

  • transforms

  • models (policy and value function)

To give a sense of what can be achieved with TorchRL's environments, we will be designing a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, along with the action taken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.

Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.

Another advantage of stateless environments is that they enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.

This tutorial will be structured as follows:

  • We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step(), reset() and set_seed()) and, finally, its specs.

  • After having coded our simulator, we will demonstrate how it can be used during training with transforms.

  • We will explore new avenues that follow from the TorchRL API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagation through the simulation graph.

  • Finally, we will train a simple policy to solve the system we implemented.

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import BoundedTensorSpec, CompositeSpec, UnboundedContinuousTensorSpec
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

There are four things you must take care of when designing a new environment class:

  • EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;

  • EnvBase._step(), which codes for the state transition dynamics;

  • EnvBase._set_seed(), which implements the seeding mechanism;

  • the environment specs.

Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in the upward position (angular position of 0 by convention) and have it stand still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that will constitute our objective function.

For the motion equation, we will update the angular velocity following:

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

We define our reward as

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

which will be maximized when the angle is close to 0 (pendulum in an upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.
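Put together, the two update equations and the reward can be sketched in plain Python on scalars. This is only an illustrative helper (the names pendulum_step and the default parameters g=10.0, m=1.0, l=1.0, dt=0.05 mirror the values used later in gen_params(), not an official API); the tensor-based version appears further down in _step():

```python
import math


def angle_normalize(x):
    # map any angle into [-pi, pi)
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def pendulum_step(th, thdot, u, g=10.0, m=1.0, l=1.0, dt=0.05):
    # cost (negative reward) evaluated on the current state, as in the tutorial
    cost = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * u**2
    # velocity update: thdot_{t+1} = thdot_t + (3g/(2L) sin(th_t) + 3/(mL^2) u) dt
    new_thdot = thdot + (3 * g / (2 * l) * math.sin(th) + 3.0 / (m * l**2) * u) * dt
    # position update: th_{t+1} = th_t + thdot_{t+1} dt
    new_th = th + new_thdot * dt
    return new_th, new_thdot, -cost


# at the target (upward, still, zero torque) nothing moves and the reward is 0
th, thdot, r = pendulum_step(0.0, 0.0, 0.0)
```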

Coding the effect of an action: _step()#

The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has a EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.

To facilitate the reading and writing from that tensordict and to make sure that the keys are consistent with what is expected from the library, the simulation part has been delegated to a private abstract method, _step(), which reads input data from a tensordict and writes a new tensordict with the output data.

The _step() method should do the following:

  1. Read the input keys (such as "action") and execute the simulation based on these;

  2. Retrieve observations, done state and reward;

  3. Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.

Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.

Typically, for stateful environments, this will look like this:

>>> policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Notice that the root tensordict has not changed; the only modification is the appearance of a new "next" entry that contains the new information.

In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied onto it. We compute the new angular position of the pendulum, "new_th", as the result of the previous position "th" plus the new velocity "new_thdot" over a time interval dt.

Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from "upward" and/or speeds that are far from 0.

In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed, as the state needs to be read from the environment.

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi

Resetting the simulator: _reset()#

The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, it is required that the _reset method receives a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, albeit it may perfectly well be empty or None.

The parent EnvBase.reset() does some simple checks, like EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.

For us, the only important thing to consider is whether EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".

In this example, we do not pass a done state, as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out

Environment metadata: env.*_spec#

The specs define the input and output domains of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments in multiprocessing and distributed settings. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).

There are four specs that we must code in our environment:

  • EnvBase.observation_spec: This will be a CompositeSpec instance where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs).

  • EnvBase.action_spec: It can be any type of spec, but it is required that it corresponds to the "action" entry in the input tensordict;

  • EnvBase.reward_spec: provides information about the reward space;

  • EnvBase.done_spec: provides information about the space of the done flag.

TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec containing the action and state_spec containing all the rest), and output_spec, which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be modified directly.

In other words, observation_spec and the related properties are convenient shortcuts to the content of the output and input spec containers.

TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.

Specs shape#

The leading dimensions of the environment specs must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be coded accurately in stateful settings.

For non batch-locked environments, such as the one in our example (see below), this is irrelevant, as the environment batch size will most likely be empty.

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = CompositeSpec(
        th=BoundedTensorSpec(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=BoundedTensorSpec(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` will be called supported
    self.action_spec = BoundedTensorSpec(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = UnboundedContinuousTensorSpec(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = CompositeSpec(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else UnboundedContinuousTensorSpec(
                dtype=tensor.dtype, device=tensor.device, shape=tensor.shape
            )
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

Reproducible experiments: seeding#

Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with a different pseudo-random and reproducible seed.

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng

Wrapping things together: the EnvBase class#

We can finally put together the pieces and design our environment class. The spec initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().

We add a static method, PendulumEnv.gen_params(), which deterministically generates a set of hyperparameters to be used during execution:

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

We define the environment as non-batch_locked by turning the homonymous attribute to False. This means that we will not enforce the input tensordict to have a batch size that matches the one of the environment.

The following code will just put together the pieces we have coded above.

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

Testing our environment#

TorchRL provides a simple function, check_env_specs(), to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out:

env = PendulumEnv()
check_env_specs(env)

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The BoundedTensorSpec has been deprecated and will be removed in v0.8. Please use Bounded instead.

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The UnboundedContinuousTensorSpec has been deprecated and will be removed in v0.8. Please use Unbounded instead.

/usr/local/lib/python3.10/dist-packages/torchrl/data/tensor_specs.py:6911: DeprecationWarning:

The CompositeSpec has been deprecated and will be removed in v0.8. Please use Composite instead.

2025-10-15 19:14:56,019 [torchrl][INFO]    check_env_specs succeeded! [END]

We can have a look at our specs to get a visual representation of the environment signature:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
state_spec: CompositeSpec(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: CompositeSpec(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([]),
        data_cls=None),
    device=cpu,
    shape=torch.Size([]),
    data_cls=None)
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

We can execute a couple of commands too, to check that the output structure matches what is expected.

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed, since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

Transforming an environment#

Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that needs to be read at the following iteration requires applying the inverse transform before calling step() at the next step. This is an ideal scenario to showcase all the features of TorchRL's transforms!

For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)

Writing custom transforms#

TorchRL's transforms may not cover all the operations one wants to execute after an environment has been run. Writing a transform does not require much effort. As with the environment design, there are two steps in writing a transform:

  • getting the dynamics right (forward and inverse);

  • adapting the environment specs.

A transform can be used in two contexts: on its own, it can be used as a Module. It can also be used appended to a TransformedEnv. The structure of the class allows customizing the behavior in these different contexts.

A Transform skeleton can be summarized as follows:

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

There are three entry points (forward(), _step() and inv()), which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries pointed to by Transform.out_keys if provided (otherwise the in_keys entries will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will take place, but with the Transform.inv() and Transform._inv_apply_transform() methods, and across the in_keys_inv and out_keys_inv lists of keys. The following figure summarizes this flow for environments and replay buffers.

Transform API

In some cases, a transform will not work on a subset of keys in a unitary manner, but will instead execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be rewritten, and the _apply_transform() method can be skipped.
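The in_keys/out_keys data flow described above can be illustrated with a plain-Python stand-in. The class below (ToySineTransform, a hypothetical name) mimics only the key-mapping logic on ordinary dictionaries; it is not the real Transform class, which operates on tensordicts and also handles specs:

```python
import math


class ToySineTransform:
    """Mimics the forward key-mapping logic of a transform on plain dicts."""

    def __init__(self, in_keys, out_keys=None):
        self.in_keys = in_keys
        # if no out_keys are given, the in_keys entries are overwritten in place
        self.out_keys = out_keys if out_keys is not None else in_keys

    def _apply_transform(self, value):
        # the per-entry operation, applied to each in_key
        return math.sin(value)

    def forward(self, data):
        # read each in_key, transform it, write the result to the matching out_key
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            data[out_key] = self._apply_transform(data[in_key])
        return data


t = ToySineTransform(in_keys=["th"], out_keys=["sin"])
data = t.forward({"th": 0.0})
# data now holds both the original "th" entry and the new "sin" entry
```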

Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us for learning a policy than the raw angle value:

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return BoundedTensorSpec(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))

We concatenate the observations onto an "observation" entry. del_keys=False ensures that we keep these values for the next iteration.

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

Once more, let us check that our environment specs match what is received:

check_env_specs(env)

2025-10-15 19:14:56,054 [torchrl][INFO]    check_env_specs succeeded! [END]

Executing a rollout#

Executing a rollout is a succession of simple steps:

  • reset the environment;

  • while some condition is not met:

    • compute an action given a policy;

    • execute a step given this action;

    • collect the data;

    • make an MDP step;

  • gather the data and return.

These operations have been conveniently wrapped in the rollout() method, of which we provide a simplified version below.

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)

批次計算#

本教程最後一個尚未探索的方面,是在 TorchRL 中進行批次計算的能力。因為我們的環境不對輸入資料的形狀做任何假設,所以它可以無縫地對批次資料執行。更好的是:對於像我們的 Pendulum 這樣非批次鎖定(non-batch-locked)的環境,我們可以動態更改批次大小,而無需重新建立環境。為此,只需生成所需形狀的引數即可。

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
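之所以能這樣無縫地批次執行,是因為運動方程本身可以在任意形狀的張量上逐元素廣播。下面用 NumPy 給出一個極簡示意(其中 g、m、L、dt 的數值僅為示意用的假設值,更新式與教程開頭的角速度公式一致):

```python
import numpy as np

# 角速度更新(對應教程開頭的運動方程),可在任意批次形狀上逐元素廣播。
# 引數數值(g、m、L、dt)僅為示意用的假設值。
g, m, L, dt = 10.0, 1.0, 1.0, 0.05

def new_thdot(th, thdot, u):
    return thdot + (3 * g / (2 * L) * np.sin(th) + 3.0 / (m * L**2) * u) * dt

# 標量輸入
single = new_thdot(np.pi / 4, 0.0, 0.0)

# 批次輸入:同一段程式碼,無需任何修改
th = np.random.uniform(-np.pi, np.pi, size=(10,))
batched = new_thdot(th, np.zeros(10), np.zeros(10))
print(np.shape(single), batched.shape)  # () (10,)
```

同一個函式既接受標量也接受批次輸入,這正是無狀態環境可以直接對批次資料執行的原因。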

使用批次資料執行 rollout 需要我們在 rollout 函式之外重置環境,因為我們需要動態定義 batch_size,而 rollout() 不支援此功能。

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)

訓練一個簡單策略#

在此示例中,我們將把獎勵用作可微分目標(即負損失)來訓練一個簡單的策略。我們將利用動態系統完全可微分這一事實,透過軌跡回報進行反向傳播,並直接調整策略權重以最大化該值。當然,這裡的假設(系統可微分、可以完全訪問底層動力學機制)在許多情況下並不成立。

儘管如此,這是一個非常簡單的示例,它展示瞭如何使用 TorchRL 中的自定義環境來編寫訓練迴圈。

讓我們先編寫策略網路

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

以及我們的最佳化器
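原始教程在這一步使用 Adam 最佳化器(學習率 2e-3),即後文訓練迴圈中 ``optim.step()`` 與 ``scheduler`` 所引用的 ``optim``。一個最簡草稿如下(為保持示例自包含,這裡用一個具體的小網路代替上面定義的 policy):

```python
import torch
from torch import nn

# 後文訓練迴圈中使用的 ``optim``:對策略引數使用 Adam。
# 學習率 2e-3 取自原始教程的設定;為保持自包含,
# 此處用一個具體的小網路代替上面的 policy。
policy_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))
optim = torch.optim.Adam(policy_net.parameters(), lr=2e-3)
```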

訓練迴圈#

我們將依次

  • 生成一條軌跡

  • 對獎勵求和

  • 透過由這些操作定義的圖進行反向傳播

  • 裁剪梯度範數並執行最佳化步驟

  • 重複

在訓練迴圈結束時,我們應該得到一個接近 0 的最終獎勵,這表明倒擺已達到向上且靜止的期望狀態。

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

for _ in pbar:
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
(圖:左圖為 returns,右圖為 last reward,橫軸均為 iteration)
  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<03:27,  3.01it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<02:41,  3.85it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<02:26,  4.25it/s]
...
reward: -4.9341, last reward: -4.0375, gradient norm:  17.1:  16%|█▌        | 98/625 [00:20<01:48,  4.84it/s]
reward: -5.0707, last reward: -5.9903, gradient norm:  12.01:  16%|█▌        | 98/625 [00:20<01:48,  4.84it/s]
reward: -5.0707, last reward: -5.9903, gradient norm:  12.01:  16%|█▌        | 99/625 [00:20<01:48,  4.84it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 99/625 [00:20<01:48,  4.84it/s]
reward: -4.8171, last reward: -4.1591, gradient norm:  47.69:  16%|█▌        | 100/625 [00:20<01:48,  4.84it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 100/625 [00:20<01:48,  4.84it/s]
reward: -4.8621, last reward: -4.1783, gradient norm:  9.28:  16%|█▌        | 101/625 [00:20<01:48,  4.84it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▌        | 101/625 [00:21<01:48,  4.84it/s]
reward: -4.4683, last reward: -2.4896, gradient norm:  10.58:  16%|█▋        | 102/625 [00:21<01:47,  4.84it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 102/625 [00:21<01:47,  4.84it/s]
reward: -4.5413, last reward: -5.7029, gradient norm:  8.056:  16%|█▋        | 103/625 [00:21<01:47,  4.85it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  16%|█▋        | 103/625 [00:21<01:47,  4.85it/s]
reward: -4.6580, last reward: -8.4799, gradient norm:  34.32:  17%|█▋        | 104/625 [00:21<01:47,  4.85it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 104/625 [00:21<01:47,  4.85it/s]
reward: -4.6693, last reward: -7.4469, gradient norm:  81.33:  17%|█▋        | 105/625 [00:21<01:47,  4.85it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 105/625 [00:21<01:47,  4.85it/s]
reward: -4.7061, last reward: -3.6757, gradient norm:  13.94:  17%|█▋        | 106/625 [00:21<01:46,  4.86it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 106/625 [00:22<01:46,  4.86it/s]
reward: -4.4342, last reward: -3.6883, gradient norm:  26.25:  17%|█▋        | 107/625 [00:22<01:46,  4.86it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 107/625 [00:22<01:46,  4.86it/s]
reward: -4.3992, last reward: -2.4497, gradient norm:  15.67:  17%|█▋        | 108/625 [00:22<01:46,  4.87it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 108/625 [00:22<01:46,  4.87it/s]
reward: -4.3980, last reward: -4.0425, gradient norm:  13.06:  17%|█▋        | 109/625 [00:22<01:45,  4.87it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  17%|█▋        | 109/625 [00:22<01:45,  4.87it/s]
reward: -5.2514, last reward: -4.0430, gradient norm:  8.778:  18%|█▊        | 110/625 [00:22<01:45,  4.87it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 110/625 [00:23<01:45,  4.87it/s]
reward: -5.2656, last reward: -5.0365, gradient norm:  8.68:  18%|█▊        | 111/625 [00:23<01:45,  4.86it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 111/625 [00:23<01:45,  4.86it/s]
reward: -5.2567, last reward: -5.9920, gradient norm:  11.66:  18%|█▊        | 112/625 [00:23<01:45,  4.87it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 112/625 [00:23<01:45,  4.87it/s]
reward: -5.0847, last reward: -5.2160, gradient norm:  12.61:  18%|█▊        | 113/625 [00:23<01:44,  4.88it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 113/625 [00:23<01:44,  4.88it/s]
reward: -4.8941, last reward: -5.0903, gradient norm:  14.7:  18%|█▊        | 114/625 [00:23<01:44,  4.88it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 114/625 [00:23<01:44,  4.88it/s]
reward: -4.5529, last reward: -3.4350, gradient norm:  24.5:  18%|█▊        | 115/625 [00:23<01:45,  4.85it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  18%|█▊        | 115/625 [00:24<01:45,  4.85it/s]
reward: -4.4047, last reward: -3.9059, gradient norm:  11.8:  19%|█▊        | 116/625 [00:24<01:44,  4.85it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 116/625 [00:24<01:44,  4.85it/s]
reward: -4.7905, last reward: -4.2659, gradient norm:  14.6:  19%|█▊        | 117/625 [00:24<01:44,  4.86it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▊        | 117/625 [00:24<01:44,  4.86it/s]
reward: -5.1685, last reward: -5.0558, gradient norm:  2.069:  19%|█▉        | 118/625 [00:24<01:44,  4.87it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 118/625 [00:24<01:44,  4.87it/s]
reward: -5.3224, last reward: -3.9649, gradient norm:  22.7:  19%|█▉        | 119/625 [00:24<01:43,  4.87it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 119/625 [00:24<01:43,  4.87it/s]
reward: -5.3083, last reward: -4.9055, gradient norm:  13.3:  19%|█▉        | 120/625 [00:24<01:43,  4.87it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 120/625 [00:25<01:43,  4.87it/s]
reward: -5.1928, last reward: -6.0475, gradient norm:  59.18:  19%|█▉        | 121/625 [00:25<01:43,  4.87it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  19%|█▉        | 121/625 [00:25<01:43,  4.87it/s]
reward: -5.0833, last reward: -4.8086, gradient norm:  20.01:  20%|█▉        | 122/625 [00:25<01:43,  4.87it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 122/625 [00:25<01:43,  4.87it/s]
reward: -4.6719, last reward: -8.9463, gradient norm:  54.76:  20%|█▉        | 123/625 [00:25<01:43,  4.86it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 123/625 [00:25<01:43,  4.86it/s]
reward: -4.2157, last reward: -3.4610, gradient norm:  10.41:  20%|█▉        | 124/625 [00:25<01:43,  4.86it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|█▉        | 124/625 [00:25<01:43,  4.86it/s]
reward: -4.4119, last reward: -2.9298, gradient norm:  50.3:  20%|██        | 125/625 [00:25<01:42,  4.86it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 125/625 [00:26<01:42,  4.86it/s]
reward: -4.7378, last reward: -4.1409, gradient norm:  12.45:  20%|██        | 126/625 [00:26<01:42,  4.86it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 126/625 [00:26<01:42,  4.86it/s]
reward: -4.0920, last reward: -4.0036, gradient norm:  17.08:  20%|██        | 127/625 [00:26<01:42,  4.86it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 127/625 [00:26<01:42,  4.86it/s]
reward: -4.4453, last reward: -2.8994, gradient norm:  26.63:  20%|██        | 128/625 [00:26<01:42,  4.85it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  20%|██        | 128/625 [00:26<01:42,  4.85it/s]
reward: -4.2940, last reward: -4.9240, gradient norm:  113.7:  21%|██        | 129/625 [00:26<01:42,  4.85it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 129/625 [00:26<01:42,  4.85it/s]
reward: -4.4657, last reward: -5.8249, gradient norm:  15.75:  21%|██        | 130/625 [00:26<01:42,  4.85it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 130/625 [00:27<01:42,  4.85it/s]
reward: -4.6821, last reward: -6.2320, gradient norm:  24.59:  21%|██        | 131/625 [00:27<01:41,  4.85it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 131/625 [00:27<01:41,  4.85it/s]
reward: -4.7717, last reward: -7.0348, gradient norm:  21.43:  21%|██        | 132/625 [00:27<01:41,  4.85it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██        | 132/625 [00:27<01:41,  4.85it/s]
reward: -4.5923, last reward: -9.1746, gradient norm:  38.4:  21%|██▏       | 133/625 [00:27<01:41,  4.86it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 133/625 [00:27<01:41,  4.86it/s]
reward: -4.2964, last reward: -4.3941, gradient norm:  7.475:  21%|██▏       | 134/625 [00:27<01:41,  4.85it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  21%|██▏       | 134/625 [00:27<01:41,  4.85it/s]
reward: -4.2730, last reward: -3.0781, gradient norm:  22.33:  22%|██▏       | 135/625 [00:27<01:41,  4.85it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 135/625 [00:28<01:41,  4.85it/s]
reward: -4.2718, last reward: -3.1451, gradient norm:  8.063:  22%|██▏       | 136/625 [00:28<01:40,  4.86it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 136/625 [00:28<01:40,  4.86it/s]
reward: -4.3199, last reward: -5.0931, gradient norm:  131.1:  22%|██▏       | 137/625 [00:28<01:59,  4.09it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 137/625 [00:28<01:59,  4.09it/s]
reward: -4.4474, last reward: -5.2053, gradient norm:  22.13:  22%|██▏       | 138/625 [00:28<01:53,  4.29it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 138/625 [00:28<01:53,  4.29it/s]
reward: -4.9233, last reward: -3.8841, gradient norm:  6.794:  22%|██▏       | 139/625 [00:28<01:49,  4.45it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 139/625 [00:29<01:49,  4.45it/s]
reward: -4.7412, last reward: -4.6784, gradient norm:  15.88:  22%|██▏       | 140/625 [00:29<01:46,  4.57it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  22%|██▏       | 140/625 [00:29<01:46,  4.57it/s]
reward: -4.4236, last reward: -3.8232, gradient norm:  95.06:  23%|██▎       | 141/625 [00:29<01:43,  4.66it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 141/625 [00:29<01:43,  4.66it/s]
reward: -4.2859, last reward: -5.9936, gradient norm:  19.62:  23%|██▎       | 142/625 [00:29<01:42,  4.71it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 142/625 [00:29<01:42,  4.71it/s]
reward: -4.4756, last reward: -3.0061, gradient norm:  58.42:  23%|██▎       | 143/625 [00:29<01:41,  4.76it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 143/625 [00:29<01:41,  4.76it/s]
reward: -4.6419, last reward: -2.8358, gradient norm:  21.94:  23%|██▎       | 144/625 [00:29<01:40,  4.80it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 144/625 [00:30<01:40,  4.80it/s]
reward: -4.5489, last reward: -4.8108, gradient norm:  26.27:  23%|██▎       | 145/625 [00:30<01:39,  4.82it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 145/625 [00:30<01:39,  4.82it/s]
reward: -4.4234, last reward: -6.1971, gradient norm:  24.6:  23%|██▎       | 146/625 [00:30<01:39,  4.83it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  23%|██▎       | 146/625 [00:30<01:39,  4.83it/s]
reward: -4.6739, last reward: -4.1551, gradient norm:  8.242:  24%|██▎       | 147/625 [00:30<01:38,  4.83it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 147/625 [00:30<01:38,  4.83it/s]
reward: -4.4584, last reward: -5.1256, gradient norm:  4.714:  24%|██▎       | 148/625 [00:30<01:38,  4.85it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▎       | 148/625 [00:30<01:38,  4.85it/s]
reward: -4.3930, last reward: -3.8382, gradient norm:  2.931:  24%|██▍       | 149/625 [00:30<01:38,  4.86it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 149/625 [00:31<01:38,  4.86it/s]
reward: -4.8215, last reward: -3.7751, gradient norm:  12.4:  24%|██▍       | 150/625 [00:31<01:37,  4.86it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 150/625 [00:31<01:37,  4.86it/s]
reward: -4.9927, last reward: -4.0620, gradient norm:  9.91:  24%|██▍       | 151/625 [00:31<01:37,  4.87it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 151/625 [00:31<01:37,  4.87it/s]
reward: -4.7118, last reward: -4.4055, gradient norm:  14.72:  24%|██▍       | 152/625 [00:31<01:37,  4.87it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 152/625 [00:31<01:37,  4.87it/s]
reward: -4.5860, last reward: -3.0642, gradient norm:  12.02:  24%|██▍       | 153/625 [00:31<01:36,  4.87it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  24%|██▍       | 153/625 [00:31<01:36,  4.87it/s]
reward: -4.2358, last reward: -3.0014, gradient norm:  20.68:  25%|██▍       | 154/625 [00:31<01:36,  4.87it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 154/625 [00:32<01:36,  4.87it/s]
reward: -4.3053, last reward: -4.5390, gradient norm:  14.11:  25%|██▍       | 155/625 [00:32<01:36,  4.88it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 155/625 [00:32<01:36,  4.88it/s]
reward: -4.4845, last reward: -7.6566, gradient norm:  51.89:  25%|██▍       | 156/625 [00:32<01:36,  4.87it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▍       | 156/625 [00:32<01:36,  4.87it/s]
reward: -4.7679, last reward: -8.4566, gradient norm:  19.11:  25%|██▌       | 157/625 [00:32<01:36,  4.87it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 157/625 [00:32<01:36,  4.87it/s]
reward: -4.6030, last reward: -6.4867, gradient norm:  24.21:  25%|██▌       | 158/625 [00:32<01:36,  4.86it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 158/625 [00:33<01:36,  4.86it/s]
reward: -4.3156, last reward: -4.3057, gradient norm:  26.15:  25%|██▌       | 159/625 [00:33<01:35,  4.87it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  25%|██▌       | 159/625 [00:33<01:35,  4.87it/s]
reward: -4.1515, last reward: -2.7400, gradient norm:  46.67:  26%|██▌       | 160/625 [00:33<01:35,  4.87it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 160/625 [00:33<01:35,  4.87it/s]
reward: -4.1984, last reward: -3.1343, gradient norm:  10.44:  26%|██▌       | 161/625 [00:33<01:35,  4.88it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 161/625 [00:33<01:35,  4.88it/s]
reward: -4.7794, last reward: -4.1895, gradient norm:  15.07:  26%|██▌       | 162/625 [00:33<01:34,  4.88it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 162/625 [00:33<01:34,  4.88it/s]
reward: -4.8227, last reward: -3.9495, gradient norm:  10.96:  26%|██▌       | 163/625 [00:33<01:34,  4.88it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 163/625 [00:34<01:34,  4.88it/s]
reward: -5.0627, last reward: -2.8677, gradient norm:  8.216:  26%|██▌       | 164/625 [00:34<01:34,  4.88it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▌       | 164/625 [00:34<01:34,  4.88it/s]
reward: -4.3039, last reward: -3.8106, gradient norm:  15.09:  26%|██▋       | 165/625 [00:34<01:34,  4.88it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  26%|██▋       | 165/625 [00:34<01:34,  4.88it/s]
reward: -4.2623, last reward: -3.6619, gradient norm:  22.77:  27%|██▋       | 166/625 [00:34<01:34,  4.88it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 166/625 [00:34<01:34,  4.88it/s]
reward: -4.0987, last reward: -3.0736, gradient norm:  20.92:  27%|██▋       | 167/625 [00:34<01:33,  4.88it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 167/625 [00:34<01:33,  4.88it/s]
reward: -4.3893, last reward: -5.3442, gradient norm:  9.876:  27%|██▋       | 168/625 [00:34<01:33,  4.88it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 168/625 [00:35<01:33,  4.88it/s]
reward: -4.6078, last reward: -7.7466, gradient norm:  16.06:  27%|██▋       | 169/625 [00:35<01:33,  4.89it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 169/625 [00:35<01:33,  4.89it/s]
reward: -4.5928, last reward: -6.5101, gradient norm:  20.69:  27%|██▋       | 170/625 [00:35<01:32,  4.89it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 170/625 [00:35<01:32,  4.89it/s]
reward: -4.3683, last reward: -3.9307, gradient norm:  78.59:  27%|██▋       | 171/625 [00:35<01:32,  4.89it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  27%|██▋       | 171/625 [00:35<01:32,  4.89it/s]
reward: -4.1301, last reward: -2.4966, gradient norm:  41.21:  28%|██▊       | 172/625 [00:35<01:32,  4.88it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 172/625 [00:35<01:32,  4.88it/s]
reward: -4.0062, last reward: -2.8255, gradient norm:  4.798:  28%|██▊       | 173/625 [00:35<01:32,  4.86it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 173/625 [00:36<01:32,  4.86it/s]
reward: -4.1558, last reward: -3.7388, gradient norm:  214.8:  28%|██▊       | 174/625 [00:36<01:32,  4.87it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 174/625 [00:36<01:32,  4.87it/s]
reward: -4.2803, last reward: -3.7403, gradient norm:  15.82:  28%|██▊       | 175/625 [00:36<01:32,  4.87it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 175/625 [00:36<01:32,  4.87it/s]
reward: -4.4744, last reward: -2.6246, gradient norm:  8.711:  28%|██▊       | 176/625 [00:36<01:32,  4.87it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 176/625 [00:36<01:32,  4.87it/s]
reward: -4.3930, last reward: -4.4075, gradient norm:  5.093:  28%|██▊       | 177/625 [00:36<01:32,  4.86it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 177/625 [00:36<01:32,  4.86it/s]
reward: -4.5119, last reward: -5.6155, gradient norm:  6.556:  28%|██▊       | 178/625 [00:36<01:32,  4.85it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  28%|██▊       | 178/625 [00:37<01:32,  4.85it/s]
reward: -4.4439, last reward: -4.5042, gradient norm:  4.911:  29%|██▊       | 179/625 [00:37<01:31,  4.86it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▊       | 179/625 [00:37<01:31,  4.86it/s]
reward: -3.9554, last reward: -2.5403, gradient norm:  13.88:  29%|██▉       | 180/625 [00:37<01:31,  4.86it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 180/625 [00:37<01:31,  4.86it/s]
reward: -4.3505, last reward: -2.7444, gradient norm:  4.01:  29%|██▉       | 181/625 [00:37<01:31,  4.86it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 181/625 [00:37<01:31,  4.86it/s]
reward: -4.4148, last reward: -4.6757, gradient norm:  9.661:  29%|██▉       | 182/625 [00:37<01:31,  4.86it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 182/625 [00:37<01:31,  4.86it/s]
reward: -4.7255, last reward: -4.1250, gradient norm:  13.23:  29%|██▉       | 183/625 [00:37<01:31,  4.84it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 183/625 [00:38<01:31,  4.84it/s]
reward: -4.7526, last reward: -4.5914, gradient norm:  10.12:  29%|██▉       | 184/625 [00:38<01:30,  4.85it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  29%|██▉       | 184/625 [00:38<01:30,  4.85it/s]
reward: -4.6860, last reward: -3.1830, gradient norm:  11.02:  30%|██▉       | 185/625 [00:38<01:30,  4.86it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 185/625 [00:38<01:30,  4.86it/s]
reward: -4.3758, last reward: -4.4231, gradient norm:  21.28:  30%|██▉       | 186/625 [00:38<01:30,  4.87it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 186/625 [00:38<01:30,  4.87it/s]
reward: -4.1488, last reward: -4.7337, gradient norm:  9.908:  30%|██▉       | 187/625 [00:38<01:29,  4.88it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|██▉       | 187/625 [00:38<01:29,  4.88it/s]
reward: -3.9613, last reward: -3.1772, gradient norm:  15.58:  30%|███       | 188/625 [00:38<01:29,  4.89it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 188/625 [00:39<01:29,  4.89it/s]
reward: -4.2562, last reward: -4.2022, gradient norm:  28.65:  30%|███       | 189/625 [00:39<01:29,  4.89it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 189/625 [00:39<01:29,  4.89it/s]
reward: -4.6174, last reward: -5.0209, gradient norm:  20.98:  30%|███       | 190/625 [00:39<01:29,  4.88it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  30%|███       | 190/625 [00:39<01:29,  4.88it/s]
reward: -4.5392, last reward: -6.6212, gradient norm:  26.19:  31%|███       | 191/625 [00:39<01:28,  4.89it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 191/625 [00:39<01:28,  4.89it/s]
reward: -4.4612, last reward: -5.7472, gradient norm:  25.55:  31%|███       | 192/625 [00:39<01:28,  4.89it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 192/625 [00:39<01:28,  4.89it/s]
reward: -3.7723, last reward: -2.9722, gradient norm:  55.78:  31%|███       | 193/625 [00:39<01:28,  4.87it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 193/625 [00:40<01:28,  4.87it/s]
reward: -3.7303, last reward: -4.6766, gradient norm:  57.47:  31%|███       | 194/625 [00:40<01:28,  4.87it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 194/625 [00:40<01:28,  4.87it/s]
reward: -4.5050, last reward: -3.5319, gradient norm:  12.82:  31%|███       | 195/625 [00:40<01:28,  4.85it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███       | 195/625 [00:40<01:28,  4.85it/s]
reward: -4.9510, last reward: -4.2900, gradient norm:  10.02:  31%|███▏      | 196/625 [00:40<01:28,  4.85it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  31%|███▏      | 196/625 [00:40<01:28,  4.85it/s]
reward: -4.8987, last reward: -3.8858, gradient norm:  11.21:  32%|███▏      | 197/625 [00:40<01:27,  4.86it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 197/625 [00:41<01:27,  4.86it/s]
reward: -4.7844, last reward: -4.1996, gradient norm:  16.9:  32%|███▏      | 198/625 [00:41<01:27,  4.87it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 198/625 [00:41<01:27,  4.87it/s]
reward: -4.7041, last reward: -3.7807, gradient norm:  12.8:  32%|███▏      | 199/625 [00:41<01:27,  4.87it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 199/625 [00:41<01:27,  4.87it/s]
reward: -4.5883, last reward: -3.1343, gradient norm:  5.33:  32%|███▏      | 200/625 [00:41<01:27,  4.87it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 200/625 [00:41<01:27,  4.87it/s]
reward: -4.3860, last reward: -4.1545, gradient norm:  12.24:  32%|███▏      | 201/625 [00:41<01:26,  4.88it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 201/625 [00:41<01:26,  4.88it/s]
reward: -4.3071, last reward: -5.9397, gradient norm:  70.8:  32%|███▏      | 202/625 [00:41<01:26,  4.88it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 202/625 [00:42<01:26,  4.88it/s]
reward: -3.8351, last reward: -2.9276, gradient norm:  28.92:  32%|███▏      | 203/625 [00:42<01:26,  4.88it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  32%|███▏      | 203/625 [00:42<01:26,  4.88it/s]
reward: -3.6451, last reward: -3.3669, gradient norm:  133.9:  33%|███▎      | 204/625 [00:42<01:26,  4.88it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 204/625 [00:42<01:26,  4.88it/s]
reward: -3.9093, last reward: -2.9751, gradient norm:  34.3:  33%|███▎      | 205/625 [00:42<01:26,  4.87it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 205/625 [00:42<01:26,  4.87it/s]
reward: -4.0323, last reward: -1.9548, gradient norm:  18.41:  33%|███▎      | 206/625 [00:42<01:25,  4.87it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 206/625 [00:42<01:25,  4.87it/s]
reward: -3.4461, last reward: -2.4580, gradient norm:  25.43:  33%|███▎      | 207/625 [00:42<01:25,  4.87it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 207/625 [00:43<01:25,  4.87it/s]
reward: -3.7982, last reward: -2.7564, gradient norm:  107.4:  33%|███▎      | 208/625 [00:43<01:25,  4.87it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 208/625 [00:43<01:25,  4.87it/s]
reward: -3.8554, last reward: -3.2339, gradient norm:  20.46:  33%|███▎      | 209/625 [00:43<01:25,  4.87it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  33%|███▎      | 209/625 [00:43<01:25,  4.87it/s]
reward: -3.7704, last reward: -3.8807, gradient norm:  33.34:  34%|███▎      | 210/625 [00:43<01:25,  4.87it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▎      | 210/625 [00:43<01:25,  4.87it/s]
reward: -3.9760, last reward: -4.4843, gradient norm:  25.69:  34%|███▍      | 211/625 [00:43<01:24,  4.87it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 211/625 [00:43<01:24,  4.87it/s]
reward: -3.7967, last reward: -5.2582, gradient norm:  25.03:  34%|███▍      | 212/625 [00:43<01:24,  4.87it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 212/625 [00:44<01:24,  4.87it/s]
reward: -3.7655, last reward: -4.4343, gradient norm:  46.35:  34%|███▍      | 213/625 [00:44<01:24,  4.87it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 213/625 [00:44<01:24,  4.87it/s]
reward: -4.1830, last reward: -3.9914, gradient norm:  48.97:  34%|███▍      | 214/625 [00:44<01:24,  4.86it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 214/625 [00:44<01:24,  4.86it/s]
reward: -4.3355, last reward: -4.1371, gradient norm:  10.28:  34%|███▍      | 215/625 [00:44<01:24,  4.87it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  34%|███▍      | 215/625 [00:44<01:24,  4.87it/s]
reward: -4.2021, last reward: -2.7219, gradient norm:  12.34:  35%|███▍      | 216/625 [00:44<01:24,  4.87it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 216/625 [00:44<01:24,  4.87it/s]
reward: -4.1103, last reward: -3.1725, gradient norm:  11.8:  35%|███▍      | 217/625 [00:44<01:23,  4.87it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 217/625 [00:45<01:23,  4.87it/s]
reward: -4.4244, last reward: -4.2578, gradient norm:  11.67:  35%|███▍      | 218/625 [00:45<01:23,  4.85it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▍      | 218/625 [00:45<01:23,  4.85it/s]
reward: -4.0961, last reward: -2.4116, gradient norm:  4.52:  35%|███▌      | 219/625 [00:45<01:23,  4.85it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 219/625 [00:45<01:23,  4.85it/s]
reward: -4.1262, last reward: -2.6491, gradient norm:  12.21:  35%|███▌      | 220/625 [00:45<01:23,  4.86it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 220/625 [00:45<01:23,  4.86it/s]
reward: -4.2716, last reward: -3.9329, gradient norm:  18.67:  35%|███▌      | 221/625 [00:45<01:22,  4.87it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  35%|███▌      | 221/625 [00:45<01:22,  4.87it/s]
reward: -3.8580, last reward: -3.1444, gradient norm:  52.86:  36%|███▌      | 222/625 [00:45<01:22,  4.87it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 222/625 [00:46<01:22,  4.87it/s]
reward: -4.3621, last reward: -3.7214, gradient norm:  16.0:  36%|███▌      | 223/625 [00:46<01:22,  4.86it/s]
reward: -4.4639, last reward: -5.2648, gradient norm:  24.71:  36%|███▌      | 224/625 [00:46<01:22,  4.84it/s]
reward: -4.6842, last reward: -4.6974, gradient norm:  14.15:  36%|███▌      | 225/625 [00:46<01:22,  4.84it/s]
reward: -3.8237, last reward: -3.6540, gradient norm:  21.16:  36%|███▌      | 226/625 [00:46<01:22,  4.85it/s]
reward: -4.0712, last reward: -4.1515, gradient norm:  7.923:  36%|███▋      | 227/625 [00:46<01:22,  4.85it/s]
reward: -4.0174, last reward: -3.0392, gradient norm:  16.69:  36%|███▋      | 228/625 [00:47<01:21,  4.85it/s]
reward: -4.0842, last reward: -3.7785, gradient norm:  19.62:  37%|███▋      | 229/625 [00:47<01:21,  4.85it/s]
reward: -4.0530, last reward: -4.4058, gradient norm:  16.16:  37%|███▋      | 230/625 [00:47<01:21,  4.86it/s]
reward: -4.0566, last reward: -3.0590, gradient norm:  46.33:  37%|███▋      | 231/625 [00:47<01:20,  4.86it/s]
reward: -3.8513, last reward: -2.7985, gradient norm:  47.95:  37%|███▋      | 232/625 [00:48<01:20,  4.86it/s]
reward: -3.7363, last reward: -3.3588, gradient norm:  6.625:  37%|███▋      | 233/625 [00:48<01:20,  4.87it/s]
reward: -3.7676, last reward: -4.5312, gradient norm:  5.029:  37%|███▋      | 234/625 [00:48<01:20,  4.87it/s]
reward: -3.7305, last reward: -3.6823, gradient norm:  23.2:  38%|███▊      | 235/625 [00:48<01:20,  4.87it/s]
reward: -4.1303, last reward: -4.9328, gradient norm:  19.52:  38%|███▊      | 236/625 [00:48<01:19,  4.87it/s]
reward: -4.1665, last reward: -5.0729, gradient norm:  33.78:  38%|███▊      | 237/625 [00:49<01:19,  4.87it/s]
reward: -4.1188, last reward: -5.8531, gradient norm:  36.56:  38%|███▊      | 238/625 [00:49<01:34,  4.10it/s]
reward: -3.5453, last reward: -2.3132, gradient norm:  10.89:  38%|███▊      | 239/625 [00:49<01:29,  4.31it/s]
reward: -3.2605, last reward: -2.8357, gradient norm:  13.73:  38%|███▊      | 240/625 [00:49<01:26,  4.46it/s]
reward: -3.7712, last reward: -1.9925, gradient norm:  45.24:  39%|███▊      | 241/625 [00:49<01:23,  4.59it/s]
reward: -3.7126, last reward: -2.1642, gradient norm:  6.793:  39%|███▊      | 242/625 [00:50<01:22,  4.67it/s]
reward: -3.4435, last reward: -2.1223, gradient norm:  30.3:  39%|███▉      | 243/625 [00:50<01:21,  4.71it/s]
reward: -3.8483, last reward: -1.9589, gradient norm:  76.23:  39%|███▉      | 244/625 [00:50<01:21,  4.68it/s]
reward: -3.7243, last reward: -3.9248, gradient norm:  77.73:  39%|███▉      | 245/625 [00:50<01:20,  4.71it/s]
reward: -4.7954, last reward: -3.4635, gradient norm:  13.38:  39%|███▉      | 246/625 [00:51<01:19,  4.74it/s]
reward: -4.6425, last reward: -4.7224, gradient norm:  14.12:  40%|███▉      | 247/625 [00:51<01:19,  4.78it/s]
reward: -4.2372, last reward: -4.5707, gradient norm:  21.06:  40%|███▉      | 248/625 [00:51<01:18,  4.80it/s]
reward: -3.9959, last reward: -3.4874, gradient norm:  60.15:  40%|███▉      | 249/625 [00:51<01:18,  4.81it/s]
reward: -4.0894, last reward: -3.5227, gradient norm:  14.05:  40%|████      | 250/625 [00:51<01:17,  4.82it/s]
reward: -4.5161, last reward: -6.4950, gradient norm:  135.6:  40%|████      | 251/625 [00:52<01:17,  4.81it/s]
reward: -4.0824, last reward: -3.0430, gradient norm:  18.15:  40%|████      | 252/625 [00:52<01:17,  4.82it/s]
reward: -4.6468, last reward: -3.6022, gradient norm:  16.69:  40%|████      | 253/625 [00:52<01:16,  4.84it/s]
reward: -4.0601, last reward: -3.4058, gradient norm:  30.29:  41%|████      | 254/625 [00:52<01:16,  4.85it/s]
reward: -4.2424, last reward: -3.7108, gradient norm:  19.45:  41%|████      | 255/625 [00:52<01:16,  4.86it/s]
reward: -3.5179, last reward: -2.3462, gradient norm:  127.3:  41%|████      | 256/625 [00:53<01:15,  4.87it/s]
reward: -3.5197, last reward: -4.0831, gradient norm:  17.4:  41%|████      | 257/625 [00:53<01:15,  4.87it/s]
reward: -3.8827, last reward: -4.6454, gradient norm:  13.75:  41%|████▏     | 258/625 [00:53<01:15,  4.88it/s]
reward: -3.4425, last reward: -2.8616, gradient norm:  30.91:  41%|████▏     | 259/625 [00:53<01:14,  4.89it/s]
reward: -3.3707, last reward: -1.6766, gradient norm:  89.46:  42%|████▏     | 260/625 [00:53<01:14,  4.89it/s]
reward: -3.7682, last reward: -2.7231, gradient norm:  15.74:  42%|████▏     | 261/625 [00:54<01:14,  4.89it/s]
reward: -3.9477, last reward: -3.8103, gradient norm:  14.7:  42%|████▏     | 262/625 [00:54<01:14,  4.89it/s]
reward: -3.7253, last reward: -3.3617, gradient norm:  15.5:  42%|████▏     | 263/625 [00:54<01:14,  4.89it/s]
reward: -3.8854, last reward: -2.6403, gradient norm:  46.48:  42%|████▏     | 264/625 [00:54<01:13,  4.88it/s]
reward: -2.2784, last reward: -0.3983, gradient norm:  2.552:  42%|████▏     | 265/625 [00:54<01:13,  4.88it/s]
reward: -3.3063, last reward: -1.4367, gradient norm:  12.58:  43%|████▎     | 266/625 [00:55<01:13,  4.89it/s]
reward: -2.9484, last reward: -2.5394, gradient norm:  28.81:  43%|████▎     | 267/625 [00:55<01:13,  4.88it/s]
reward: -3.4480, last reward: -4.8011, gradient norm:  69.75:  43%|████▎     | 268/625 [00:55<01:13,  4.88it/s]
reward: -3.2181, last reward: -1.7389, gradient norm:  18.54:  43%|████▎     | 269/625 [00:55<01:12,  4.88it/s]
reward: -3.5885, last reward: -2.3872, gradient norm:  1.067e+03:  43%|████▎     | 270/625 [00:55<01:12,  4.88it/s]
reward: -3.5645, last reward: -2.3470, gradient norm:  10.39:  43%|████▎     | 271/625 [00:56<01:12,  4.88it/s]
reward: -3.1180, last reward: -2.9837, gradient norm:  21.35:  44%|████▎     | 272/625 [00:56<01:12,  4.88it/s]
reward: -3.0020, last reward: -1.7848, gradient norm:  14.11:  44%|████▎     | 273/625 [00:56<01:11,  4.89it/s]
reward: -2.9024, last reward: -1.2560, gradient norm:  48.93:  44%|████▍     | 274/625 [00:56<01:11,  4.89it/s]
reward: -2.3769, last reward: -0.9803, gradient norm:  403.4:  44%|████▍     | 275/625 [00:56<01:11,  4.89it/s]
reward: -3.1577, last reward: -1.9462, gradient norm:  25.01:  44%|████▍     | 276/625 [00:57<01:11,  4.89it/s]
reward: -3.7512, last reward: -3.6302, gradient norm:  47.82:  44%|████▍     | 277/625 [00:57<01:11,  4.89it/s]
reward: -3.3241, last reward: -1.4824, gradient norm:  29.08:  44%|████▍     | 278/625 [00:57<01:10,  4.89it/s]
reward: -2.8900, last reward: -1.5340, gradient norm:  6.86:  45%|████▍     | 279/625 [00:57<01:10,  4.89it/s]
reward: -2.4089, last reward: -0.1335, gradient norm:  1.654:  45%|████▍     | 280/625 [00:58<01:10,  4.88it/s]
reward: -2.1500, last reward: -0.0078, gradient norm:  0.7977:  45%|████▍     | 281/625 [00:58<01:10,  4.89it/s]
reward: -2.8219, last reward: -0.0230, gradient norm:  1.073:  45%|████▌     | 282/625 [00:58<01:10,  4.89it/s]
reward: -3.3674, last reward: -2.5903, gradient norm:  28.51:  45%|████▌     | 283/625 [00:58<01:09,  4.89it/s]
reward: -2.6695, last reward: -1.1400, gradient norm:  9.986:  45%|████▌     | 284/625 [00:58<01:09,  4.89it/s]
reward: -3.9000, last reward: -2.8705, gradient norm:  21.76:  46%|████▌     | 285/625 [00:59<01:09,  4.89it/s]
reward: -3.3866, last reward: -2.6675, gradient norm:  25.97:  46%|████▌     | 286/625 [00:59<01:09,  4.88it/s]
reward: -3.1383, last reward: -2.5193, gradient norm:  28.38:  46%|████▌     | 287/625 [00:59<01:09,  4.89it/s]
reward: -1.9981, last reward: -1.1067, gradient norm:  22.2:  46%|████▌     | 288/625 [00:59<01:08,  4.88it/s]
reward: -2.4183, last reward: -0.6585, gradient norm:  12.21:  46%|████▌     | 289/625 [00:59<01:08,  4.89it/s]
reward: -2.2903, last reward: -0.1044, gradient norm:  1.397:  46%|████▋     | 290/625 [01:00<01:08,  4.88it/s]
reward: -2.3470, last reward: -0.0267, gradient norm:  1.381:  47%|████▋     | 291/625 [01:00<01:08,  4.88it/s]
reward: -2.4752, last reward: -0.2300, gradient norm:  0.4783:  47%|████▋     | 292/625 [01:00<01:08,  4.87it/s]
reward: -2.2931, last reward: -0.0729, gradient norm:  4.72:  47%|████▋     | 293/625 [01:00<01:08,  4.87it/s]
reward: -2.5747, last reward: -0.0695, gradient norm:  2.437:  47%|████▋     | 294/625 [01:00<01:07,  4.88it/s]
reward: -2.3089, last reward: -0.0061, gradient norm:  0.6729:  47%|████▋     | 295/625 [01:01<01:07,  4.88it/s]
reward: -2.3122, last reward: -0.0378, gradient norm:  1.651:  47%|████▋     | 296/625 [01:01<01:07,  4.87it/s]
reward: -1.8535, last reward: -0.0574, gradient norm:  2.329:  48%|████▊     | 297/625 [01:01<01:07,  4.88it/s]
reward: -2.3665, last reward: -0.0111, gradient norm:  0.9808:  48%|████▊     | 298/625 [01:01<01:07,  4.88it/s]
reward: -2.0677, last reward: -0.0970, gradient norm:  5.651:  48%|████▊     | 299/625 [01:01<01:06,  4.88it/s]
reward: -2.8268, last reward: -1.0460, gradient norm:  15.6:  48%|████▊     | 300/625 [01:02<01:06,  4.89it/s]
reward: -2.2015, last reward: -0.2860, gradient norm:  22.44:  48%|████▊     | 301/625 [01:02<01:06,  4.89it/s]
reward: -2.3683, last reward: -0.0137, gradient norm:  1.152:  48%|████▊     | 302/625 [01:02<01:06,  4.88it/s]
reward: -1.9836, last reward: -0.0664, gradient norm:  5.29:  48%|████▊     | 303/625 [01:02<01:06,  4.88it/s]
reward: -2.1668, last reward: -0.0758, gradient norm:  2.976:  49%|████▊     | 304/625 [01:02<01:05,  4.88it/s]
reward: -1.7214, last reward: -0.0275, gradient norm:  2.978:  49%|████▉     | 305/625 [01:03<01:05,  4.88it/s]
reward: -2.1655, last reward: -1.0136, gradient norm:  67.86:  49%|████▉     | 306/625 [01:03<01:05,  4.88it/s]
reward: -2.9232, last reward: -3.2623, gradient norm:  62.61:  49%|████▉     | 307/625 [01:03<01:05,  4.89it/s]
reward: -2.2422, last reward: -2.5996, gradient norm:  90.63:  49%|████▉     | 308/625 [01:03<01:04,  4.89it/s]
reward: -2.1574, last reward: -0.0119, gradient norm:  2.67:  49%|████▉     | 309/625 [01:03<01:04,  4.89it/s]
reward: -1.7745, last reward: -0.1597, gradient norm:  10.93:  50%|████▉     | 310/625 [01:04<01:04,  4.89it/s]
reward: -1.8866, last reward: -0.5739, gradient norm:  59.4:  50%|████▉     | 311/625 [01:04<01:04,  4.89it/s]
reward: -2.0082, last reward: -0.0806, gradient norm:  3.376:  50%|████▉     | 312/625 [01:04<01:04,  4.89it/s]
reward: -2.0180, last reward: -0.0130, gradient norm:  0.8043:  50%|█████     | 313/625 [01:04<01:03,  4.89it/s]
reward: -2.1591, last reward: -0.1254, gradient norm:  7.212:  50%|█████     | 314/625 [01:04<01:03,  4.89it/s]
reward: -1.9418, last reward: -0.0125, gradient norm:  0.6393:  50%|█████     | 315/625 [01:05<01:03,  4.89it/s]
reward: -2.0906, last reward: -0.0021, gradient norm:  0.7693:  51%|█████     | 316/625 [01:05<01:03,  4.89it/s]
reward: -2.1884, last reward: -0.0084, gradient norm:  0.9224:  51%|█████     | 317/625 [01:05<01:02,  4.89it/s]
reward: -2.0722, last reward: -0.0024, gradient norm:  0.6936:  51%|█████     | 318/625 [01:05<01:02,  4.89it/s]
reward: -2.2271, last reward: -0.0027, gradient norm:  0.3025:  51%|█████     | 319/625 [01:05<01:02,  4.89it/s]
reward: -2.0207, last reward: -0.0060, gradient norm:  1.949:  51%|█████     | 320/625 [01:06<01:02,  4.89it/s]
reward: -1.8973, last reward: -0.0129, gradient norm:  0.6215:  51%|█████▏    | 321/625 [01:06<01:02,  4.89it/s]
reward: -1.7585, last reward: -0.0027, gradient norm:  0.5406:  52%|█████▏    | 322/625 [01:06<01:02,  4.88it/s]
reward: -2.2886, last reward: -0.0517, gradient norm:  10.62:  52%|█████▏    | 323/625 [01:06<01:01,  4.89it/s]
reward: -1.8662, last reward: -0.0046, gradient norm:  2.198:  52%|█████▏    | 324/625 [01:07<01:01,  4.89it/s]
reward: -2.0652, last reward: -0.0135, gradient norm:  2.58:  52%|█████▏    | 325/625 [01:07<01:01,  4.88it/s]
reward: -2.0966, last reward: -0.0214, gradient norm:  1.656:  52%|█████▏    | 326/625 [01:07<01:01,  4.88it/s]
reward: -2.5183, last reward: -0.0011, gradient norm:  0.705:  52%|█████▏    | 327/625 [01:07<01:01,  4.87it/s]
reward: -2.3712, last reward: -0.0457, gradient norm:  1.244:  52%|█████▏    | 328/625 [01:07<01:00,  4.88it/s]
reward: -2.2987, last reward: -0.0218, gradient norm:  1.368:  53%|█████▎    | 329/625 [01:08<01:00,  4.88it/s]
reward: -2.3155, last reward: -0.0095, gradient norm:  0.7518:  53%|█████▎    | 330/625 [01:08<01:00,  4.89it/s]
reward: -2.1199, last reward: -0.1257, gradient norm:  5.305:  53%|█████▎    | 331/625 [01:08<01:00,  4.89it/s]
reward: -1.9859, last reward: -0.0679, gradient norm:  5.372:  53%|█████▎    | 332/625 [01:08<00:59,  4.89it/s]
reward: -2.4061, last reward: -0.6118, gradient norm:  94.16:  53%|█████▎    | 333/625 [01:08<00:59,  4.89it/s]
reward: -3.0361, last reward: -3.3765, gradient norm:  103.5:  53%|█████▎    | 334/625 [01:09<00:59,  4.89it/s]
reward: -2.2451, last reward: -0.1210, gradient norm:  3.228:  54%|█████▎    | 335/625 [01:09<00:59,  4.89it/s]
reward: -1.8761, last reward: -0.0040, gradient norm:  0.777:  54%|█████▍    | 336/625 [01:09<00:59,  4.89it/s]
reward: -2.9146, last reward: -3.2809, gradient norm:  51.08:  54%|█████▍    | 337/625 [01:09<00:58,  4.89it/s]
reward: -3.0197, last reward: -2.2499, gradient norm:  20.1:  54%|█████▍    | 338/625 [01:09<00:58,  4.89it/s]
reward: -2.9844, last reward: -2.3444, gradient norm:  18.91:  54%|█████▍    | 339/625 [01:10<00:58,  4.88it/s]
reward: -2.4492, last reward: -2.3984, gradient norm:  62.17:  54%|█████▍    | 340/625 [01:10<00:58,  4.89it/s]
reward: -2.1010, last reward: -0.0191, gradient norm:  1.736:  55%|█████▍    | 341/625 [01:10<01:09,  4.10it/s]
reward: -2.6114, last reward: -0.2858, gradient norm:  2.123:  55%|█████▍    | 342/625 [01:10<01:05,  4.32it/s]
reward: -2.4618, last reward: -0.0410, gradient norm:  2.15:  55%|█████▍    | 343/625 [01:11<01:03,  4.47it/s]
reward: -2.5515, last reward: -0.4695, gradient norm:  5.609:  55%|█████▌    | 344/625 [01:11<01:01,  4.58it/s]
reward: -2.8009, last reward: -2.1572, gradient norm:  34.87:  55%|█████▌    | 345/625 [01:11<01:00,  4.66it/s]
reward: -3.2082, last reward: -5.0086, gradient norm:  45.63:  55%|█████▌    | 346/625 [01:11<00:59,  4.72it/s]
reward: -2.8382, last reward: -3.4997, gradient norm:  50.9:  56%|█████▌    | 347/625 [01:11<00:58,  4.76it/s]
reward: -2.4106, last reward: -0.8440, gradient norm:  20.79:  56%|█████▌    | 348/625 [01:12<00:57,  4.79it/s]
reward: -1.9518, last reward: -0.0163, gradient norm:  1.572:  56%|█████▌    | 349/625 [01:12<00:57,  4.82it/s]
reward: -2.0997, last reward: -0.0540, gradient norm:  6.954:  56%|█████▌    | 350/625 [01:12<00:56,  4.84it/s]
reward: -2.0961, last reward: -0.0805, gradient norm:  2.763:  56%|█████▌    | 351/625 [01:12<00:56,  4.85it/s]
reward: -2.0131, last reward: -0.0443, gradient norm:  2.295:  56%|█████▋    | 352/625 [01:12<00:56,  4.87it/s]
reward: -1.5239, last reward: -0.0026, gradient norm:  0.9087:  56%|█████▋    | 353/625 [01:13<00:55,  4.88it/s]
reward: -2.3815, last reward: -0.0786, gradient norm:  5.712:  57%|█████▋    | 354/625 [01:13<00:55,  4.88it/s]
reward: -2.2704, last reward: -0.0027, gradient norm:  2.876:  57%|█████▋    | 355/625 [01:13<00:55,  4.89it/s]
reward: -2.2578, last reward: -0.0315, gradient norm:  1.772:  57%|█████▋    | 356/625 [01:13<00:55,  4.89it/s]
reward: -2.7637, last reward: -2.6112, gradient norm:  44.13:  57%|█████▋    | 357/625 [01:13<00:54,  4.89it/s]
reward: -2.6214, last reward: -2.8094, gradient norm:  34.44:  57%|█████▋    | 358/625 [01:14<00:54,  4.89it/s]
reward: -2.6773, last reward: -0.9341, gradient norm:  17.79:  57%|█████▋    | 359/625 [01:14<00:54,  4.89it/s]
reward: -2.0646, last reward: -0.0045, gradient norm:  0.8423:  58%|█████▊    | 360/625 [01:14<00:54,  4.89it/s]
reward: -2.2144, last reward: -0.0755, gradient norm:  2.833:  58%|█████▊    | 361/625 [01:14<00:54,  4.89it/s]
reward: -2.1301, last reward: -0.1504, gradient norm:  4.438:  58%|█████▊    | 362/625 [01:14<00:53,  4.89it/s]
reward: -2.2999, last reward: -0.1190, gradient norm:  3.388:  58%|█████▊    | 363/625 [01:15<00:53,  4.88it/s]
reward: -2.0784, last reward: -0.0349, gradient norm:  1.901:  58%|█████▊    | 364/625 [01:15<00:53,  4.88it/s]
reward: -2.2406, last reward: -0.0235, gradient norm:  1.598:  58%|█████▊    | 365/625 [01:15<00:53,  4.89it/s]
reward: -2.4914, last reward: -0.5533, gradient norm:  18.79:  59%|█████▊    | 366/625 [01:15<00:52,  4.89it/s]
reward: -2.1190, last reward: -1.1747, gradient norm:  50.33:  59%|█████▊    | 367/625 [01:15<00:52,  4.88it/s]
reward: -1.9734, last reward: -0.0011, gradient norm:  6.159:  59%|█████▉    | 368/625 [01:16<00:52,  4.88it/s]
reward: -2.4497, last reward: -0.0361, gradient norm:  1.444:  59%|█████▉    | 369/625 [01:16<00:52,  4.88it/s]
reward: -1.6725, last reward: -0.0607, gradient norm:  2.076:  59%|█████▉    | 370/625 [01:16<00:52,  4.88it/s]
reward: -2.1384, last reward: -0.0464, gradient norm:  1.567:  59%|█████▉    | 371/625 [01:16<00:52,  4.87it/s]
reward: -1.7059, last reward: -0.0138, gradient norm:  1.031:  60%|█████▉    | 372/625 [01:16<00:52,  4.86it/s]
reward: -1.9927, last reward: -0.0054, gradient norm:  0.5594:  60%|█████▉    | 373/625 [01:17<00:51,  4.86it/s]
reward: -2.4160, last reward: -0.5060, gradient norm:  29.92:  60%|█████▉    | 374/625 [01:17<00:51,  4.85it/s]
reward: -2.5828, last reward: -0.1384, gradient norm:  4.958:  60%|██████    | 375/625 [01:17<00:51,  4.86it/s]
reward: -1.9523, last reward: -0.0269, gradient norm:  1.721:  60%|██████    | 376/625 [01:17<00:51,  4.87it/s]
reward: -1.8944, last reward: -0.0003, gradient norm:  0.4466:  60%|██████    | 377/625 [01:18<00:50,  4.87it/s]
reward: -2.2882, last reward: -0.0140, gradient norm:  1.393:  60%|██████    | 378/625 [01:18<00:50,  4.87it/s]
reward: -2.2007, last reward: -0.0201, gradient norm:  0.9149:  61%|██████    | 379/625 [01:18<00:50,  4.87it/s]
reward: -2.1404, last reward: -0.2498, gradient norm:  0.7904:  61%|██████    | 380/625 [01:18<00:50,  4.87it/s]
reward: -1.9428, last reward: -0.0002, gradient norm:  0.3416:  61%|██████    | 381/625 [01:18<00:50,  4.87it/s]
reward: -1.6321, last reward: -0.0189, gradient norm:  1.258:  61%|██████    | 382/625 [01:19<00:49,  4.87it/s]
reward: -1.9240, last reward: -0.0407, gradient norm:  0.8453:  61%|██████▏   | 383/625 [01:19<00:49,  4.87it/s]
reward: -1.7657, last reward: -0.1190, gradient norm:  3.86:  61%|██████▏   | 384/625 [01:19<00:49,  4.88it/s]
reward: -2.2517, last reward: -0.0091, gradient norm:  2.363:  61%|██████▏   | 384/625 [01:19<00:49,  4.88it/s]
reward: -2.2517, last reward: -0.0091, gradient norm:  2.363:  62%|██████▏   | 385/625 [01:19<00:49,  4.87it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 385/625 [01:19<00:49,  4.87it/s]
reward: -2.3202, last reward: -0.0734, gradient norm:  6.84:  62%|██████▏   | 386/625 [01:19<00:49,  4.86it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 386/625 [01:20<00:49,  4.86it/s]
reward: -2.4757, last reward: -0.1005, gradient norm:  1.801:  62%|██████▏   | 387/625 [01:20<00:48,  4.87it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 387/625 [01:20<00:48,  4.87it/s]
reward: -2.1148, last reward: -0.4821, gradient norm:  40.67:  62%|██████▏   | 388/625 [01:20<00:48,  4.87it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 388/625 [01:20<00:48,  4.87it/s]
reward: -2.3243, last reward: -0.1138, gradient norm:  2.966:  62%|██████▏   | 389/625 [01:20<00:48,  4.85it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 389/625 [01:20<00:48,  4.85it/s]
reward: -2.1412, last reward: -0.0588, gradient norm:  2.561:  62%|██████▏   | 390/625 [01:20<00:48,  4.86it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  62%|██████▏   | 390/625 [01:20<00:48,  4.86it/s]
reward: -1.8031, last reward: -0.0051, gradient norm:  2.107:  63%|██████▎   | 391/625 [01:20<00:48,  4.87it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 391/625 [01:21<00:48,  4.87it/s]
reward: -2.2578, last reward: -2.3332, gradient norm:  44.11:  63%|██████▎   | 392/625 [01:21<00:47,  4.87it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 392/625 [01:21<00:47,  4.87it/s]
reward: -2.5711, last reward: -3.2760, gradient norm:  42.22:  63%|██████▎   | 393/625 [01:21<00:47,  4.87it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 393/625 [01:21<00:47,  4.87it/s]
reward: -2.4667, last reward: -1.7428, gradient norm:  33.16:  63%|██████▎   | 394/625 [01:21<00:47,  4.88it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 394/625 [01:21<00:47,  4.88it/s]
reward: -2.0998, last reward: -0.0158, gradient norm:  2.666:  63%|██████▎   | 395/625 [01:21<00:47,  4.89it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 395/625 [01:21<00:47,  4.89it/s]
reward: -2.4835, last reward: -0.1028, gradient norm:  6.602:  63%|██████▎   | 396/625 [01:21<00:46,  4.87it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  63%|██████▎   | 396/625 [01:22<00:46,  4.87it/s]
reward: -4.1513, last reward: -2.9719, gradient norm:  31.03:  64%|██████▎   | 397/625 [01:22<00:46,  4.86it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 397/625 [01:22<00:46,  4.86it/s]
reward: -3.8985, last reward: -5.0222, gradient norm:  215.2:  64%|██████▎   | 398/625 [01:22<00:46,  4.86it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▎   | 398/625 [01:22<00:46,  4.86it/s]
reward: -2.2914, last reward: -0.1110, gradient norm:  3.192:  64%|██████▍   | 399/625 [01:22<00:46,  4.86it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 399/625 [01:22<00:46,  4.86it/s]
reward: -1.9166, last reward: -0.0308, gradient norm:  1.668:  64%|██████▍   | 400/625 [01:22<00:46,  4.86it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 400/625 [01:22<00:46,  4.86it/s]
reward: -1.8214, last reward: -0.0065, gradient norm:  0.6156:  64%|██████▍   | 401/625 [01:22<00:46,  4.85it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 401/625 [01:23<00:46,  4.85it/s]
reward: -2.2157, last reward: -2.9038, gradient norm:  114.0:  64%|██████▍   | 402/625 [01:23<00:45,  4.86it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 402/625 [01:23<00:45,  4.86it/s]
reward: -2.2463, last reward: -3.3530, gradient norm:  120.8:  64%|██████▍   | 403/625 [01:23<00:45,  4.86it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  64%|██████▍   | 403/625 [01:23<00:45,  4.86it/s]
reward: -2.0383, last reward: -0.0227, gradient norm:  1.776:  65%|██████▍   | 404/625 [01:23<00:45,  4.86it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 404/625 [01:23<00:45,  4.86it/s]
reward: -1.7300, last reward: -0.0007, gradient norm:  0.414:  65%|██████▍   | 405/625 [01:23<00:45,  4.87it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 405/625 [01:23<00:45,  4.87it/s]
reward: -1.7968, last reward: -0.0107, gradient norm:  0.8298:  65%|██████▍   | 406/625 [01:23<00:44,  4.87it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▍   | 406/625 [01:24<00:44,  4.87it/s]
reward: -2.0079, last reward: -0.2487, gradient norm:  0.8033:  65%|██████▌   | 407/625 [01:24<00:44,  4.88it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 407/625 [01:24<00:44,  4.88it/s]
reward: -1.8478, last reward: -0.0094, gradient norm:  0.7041:  65%|██████▌   | 408/625 [01:24<00:44,  4.88it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 408/625 [01:24<00:44,  4.88it/s]
reward: -2.2375, last reward: -0.1252, gradient norm:  0.9001:  65%|██████▌   | 409/625 [01:24<00:44,  4.87it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  65%|██████▌   | 409/625 [01:24<00:44,  4.87it/s]
reward: -1.9546, last reward: -0.0039, gradient norm:  0.4175:  66%|██████▌   | 410/625 [01:24<00:44,  4.87it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 410/625 [01:24<00:44,  4.87it/s]
reward: -2.3546, last reward: -0.0282, gradient norm:  14.68:  66%|██████▌   | 411/625 [01:24<00:43,  4.87it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 411/625 [01:25<00:43,  4.87it/s]
reward: -2.1190, last reward: -0.7145, gradient norm:  47.83:  66%|██████▌   | 412/625 [01:25<00:43,  4.87it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 412/625 [01:25<00:43,  4.87it/s]
reward: -2.1732, last reward: -0.0822, gradient norm:  2.868:  66%|██████▌   | 413/625 [01:25<00:43,  4.87it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 413/625 [01:25<00:43,  4.87it/s]
reward: -2.2304, last reward: -1.3711, gradient norm:  38.48:  66%|██████▌   | 414/625 [01:25<00:43,  4.86it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▌   | 414/625 [01:25<00:43,  4.86it/s]
reward: -2.1892, last reward: -0.2867, gradient norm:  2.725:  66%|██████▋   | 415/625 [01:25<00:43,  4.86it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  66%|██████▋   | 415/625 [01:26<00:43,  4.86it/s]
reward: -1.9492, last reward: -0.0121, gradient norm:  0.8292:  67%|██████▋   | 416/625 [01:26<00:43,  4.85it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 416/625 [01:26<00:43,  4.85it/s]
reward: -1.7219, last reward: -0.0048, gradient norm:  0.6598:  67%|██████▋   | 417/625 [01:26<00:42,  4.86it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 417/625 [01:26<00:42,  4.86it/s]
reward: -2.1068, last reward: -0.0222, gradient norm:  1.108:  67%|██████▋   | 418/625 [01:26<00:42,  4.86it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 418/625 [01:26<00:42,  4.86it/s]
reward: -1.7557, last reward: -0.0238, gradient norm:  1.243:  67%|██████▋   | 419/625 [01:26<00:42,  4.86it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 419/625 [01:26<00:42,  4.86it/s]
reward: -1.8904, last reward: -0.0105, gradient norm:  27.15:  67%|██████▋   | 420/625 [01:26<00:42,  4.86it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 420/625 [01:27<00:42,  4.86it/s]
reward: -2.1159, last reward: -0.0003, gradient norm:  0.3801:  67%|██████▋   | 421/625 [01:27<00:41,  4.86it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  67%|██████▋   | 421/625 [01:27<00:41,  4.86it/s]
reward: -1.7220, last reward: -0.0169, gradient norm:  1.102:  68%|██████▊   | 422/625 [01:27<00:41,  4.87it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 422/625 [01:27<00:41,  4.87it/s]
reward: -1.8886, last reward: -0.0218, gradient norm:  1.461:  68%|██████▊   | 423/625 [01:27<00:41,  4.87it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 423/625 [01:27<00:41,  4.87it/s]
reward: -1.6002, last reward: -0.0012, gradient norm:  0.08998:  68%|██████▊   | 424/625 [01:27<00:41,  4.87it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 424/625 [01:27<00:41,  4.87it/s]
reward: -2.3313, last reward: -0.0031, gradient norm:  0.6231:  68%|██████▊   | 425/625 [01:27<00:40,  4.88it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 425/625 [01:28<00:40,  4.88it/s]
reward: -1.9866, last reward: -0.0051, gradient norm:  0.697:  68%|██████▊   | 426/625 [01:28<00:40,  4.88it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 426/625 [01:28<00:40,  4.88it/s]
reward: -2.2594, last reward: -0.0017, gradient norm:  0.5586:  68%|██████▊   | 427/625 [01:28<00:40,  4.88it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 427/625 [01:28<00:40,  4.88it/s]
reward: -2.2575, last reward: -0.0220, gradient norm:  4.928:  68%|██████▊   | 428/625 [01:28<00:40,  4.88it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  68%|██████▊   | 428/625 [01:28<00:40,  4.88it/s]
reward: -1.8807, last reward: -0.0081, gradient norm:  0.9836:  69%|██████▊   | 429/625 [01:28<00:40,  4.87it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▊   | 429/625 [01:28<00:40,  4.87it/s]
reward: -2.0147, last reward: -0.0003, gradient norm:  0.2705:  69%|██████▉   | 430/625 [01:28<00:40,  4.87it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 430/625 [01:29<00:40,  4.87it/s]
reward: -1.8529, last reward: -0.0009, gradient norm:  0.7404:  69%|██████▉   | 431/625 [01:29<00:39,  4.87it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 431/625 [01:29<00:39,  4.87it/s]
reward: -1.9336, last reward: -0.0057, gradient norm:  0.6225:  69%|██████▉   | 432/625 [01:29<00:39,  4.87it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 432/625 [01:29<00:39,  4.87it/s]
reward: -2.3085, last reward: -0.0506, gradient norm:  1.342:  69%|██████▉   | 433/625 [01:29<00:39,  4.87it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 433/625 [01:29<00:39,  4.87it/s]
reward: -2.5377, last reward: -0.0226, gradient norm:  0.4431:  69%|██████▉   | 434/625 [01:29<00:39,  4.87it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  69%|██████▉   | 434/625 [01:29<00:39,  4.87it/s]
reward: -2.1698, last reward: -0.1581, gradient norm:  2.587:  70%|██████▉   | 435/625 [01:29<00:39,  4.87it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 435/625 [01:30<00:39,  4.87it/s]
reward: -2.5718, last reward: -0.1130, gradient norm:  6.102:  70%|██████▉   | 436/625 [01:30<00:38,  4.87it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 436/625 [01:30<00:38,  4.87it/s]
reward: -2.2911, last reward: -0.3144, gradient norm:  4.01:  70%|██████▉   | 437/625 [01:30<00:38,  4.87it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|██████▉   | 437/625 [01:30<00:38,  4.87it/s]
reward: -2.7797, last reward: -0.3012, gradient norm:  2.231:  70%|███████   | 438/625 [01:30<00:38,  4.87it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 438/625 [01:30<00:38,  4.87it/s]
reward: -1.8474, last reward: -0.0199, gradient norm:  1.789:  70%|███████   | 439/625 [01:30<00:38,  4.87it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 439/625 [01:30<00:38,  4.87it/s]
reward: -2.0948, last reward: -0.0017, gradient norm:  0.3745:  70%|███████   | 440/625 [01:30<00:37,  4.87it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  70%|███████   | 440/625 [01:31<00:37,  4.87it/s]
reward: -2.0281, last reward: -0.0024, gradient norm:  0.4722:  71%|███████   | 441/625 [01:31<00:37,  4.88it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 441/625 [01:31<00:37,  4.88it/s]
reward: -2.2455, last reward: -0.0084, gradient norm:  0.9685:  71%|███████   | 442/625 [01:31<00:44,  4.10it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 442/625 [01:31<00:44,  4.10it/s]
reward: -1.9491, last reward: -0.0081, gradient norm:  0.7127:  71%|███████   | 443/625 [01:31<00:42,  4.30it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 443/625 [01:31<00:42,  4.30it/s]
reward: -2.0660, last reward: -0.0011, gradient norm:  0.4463:  71%|███████   | 444/625 [01:31<00:40,  4.46it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 444/625 [01:32<00:40,  4.46it/s]
reward: -2.0021, last reward: -0.0043, gradient norm:  0.8505:  71%|███████   | 445/625 [01:32<00:39,  4.58it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████   | 445/625 [01:32<00:39,  4.58it/s]
reward: -2.2601, last reward: -0.0044, gradient norm:  0.6368:  71%|███████▏  | 446/625 [01:32<00:38,  4.67it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  71%|███████▏  | 446/625 [01:32<00:38,  4.67it/s]
reward: -2.1654, last reward: -0.0008, gradient norm:  0.9723:  72%|███████▏  | 447/625 [01:32<00:37,  4.72it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 447/625 [01:32<00:37,  4.72it/s]
reward: -1.7645, last reward: -0.0014, gradient norm:  0.6832:  72%|███████▏  | 448/625 [01:32<00:37,  4.76it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 448/625 [01:32<00:37,  4.76it/s]
reward: -2.1802, last reward: -0.0016, gradient norm:  0.4254:  72%|███████▏  | 449/625 [01:32<00:36,  4.79it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 449/625 [01:33<00:36,  4.79it/s]
reward: -1.9047, last reward: -0.0029, gradient norm:  0.6538:  72%|███████▏  | 450/625 [01:33<00:36,  4.81it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 450/625 [01:33<00:36,  4.81it/s]
reward: -2.3640, last reward: -0.0064, gradient norm:  1.098:  72%|███████▏  | 451/625 [01:33<00:36,  4.83it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 451/625 [01:33<00:36,  4.83it/s]
reward: -2.1285, last reward: -0.0338, gradient norm:  1.303:  72%|███████▏  | 452/625 [01:33<00:35,  4.83it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 452/625 [01:33<00:35,  4.83it/s]
reward: -1.6215, last reward: -0.0049, gradient norm:  2.223:  72%|███████▏  | 453/625 [01:33<00:35,  4.85it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  72%|███████▏  | 453/625 [01:33<00:35,  4.85it/s]
reward: -1.5373, last reward: -0.0090, gradient norm:  1.162:  73%|███████▎  | 454/625 [01:33<00:35,  4.86it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 454/625 [01:34<00:35,  4.86it/s]
reward: -1.8666, last reward: -0.0247, gradient norm:  1.893:  73%|███████▎  | 455/625 [01:34<00:34,  4.86it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 455/625 [01:34<00:34,  4.86it/s]
reward: -1.9899, last reward: -0.0080, gradient norm:  1.12:  73%|███████▎  | 456/625 [01:34<00:34,  4.88it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 456/625 [01:34<00:34,  4.88it/s]
reward: -2.1262, last reward: -0.1049, gradient norm:  10.91:  73%|███████▎  | 457/625 [01:34<00:34,  4.87it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 457/625 [01:34<00:34,  4.87it/s]
reward: -2.1425, last reward: -0.0472, gradient norm:  2.676:  73%|███████▎  | 458/625 [01:34<00:34,  4.88it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 458/625 [01:34<00:34,  4.88it/s]
reward: -2.2573, last reward: -0.0005, gradient norm:  0.3421:  73%|███████▎  | 459/625 [01:34<00:33,  4.89it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  73%|███████▎  | 459/625 [01:35<00:33,  4.89it/s]
reward: -1.5790, last reward: -0.0079, gradient norm:  0.8352:  74%|███████▎  | 460/625 [01:35<00:33,  4.89it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▎  | 460/625 [01:35<00:33,  4.89it/s]
reward: -1.8268, last reward: -0.0108, gradient norm:  0.8433:  74%|███████▍  | 461/625 [01:35<00:33,  4.89it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 461/625 [01:35<00:33,  4.89it/s]
reward: -1.8524, last reward: -0.0019, gradient norm:  0.4605:  74%|███████▍  | 462/625 [01:35<00:33,  4.88it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 462/625 [01:35<00:33,  4.88it/s]
reward: -1.9559, last reward: -0.0026, gradient norm:  2.404:  74%|███████▍  | 463/625 [01:35<00:33,  4.87it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 463/625 [01:35<00:33,  4.87it/s]
reward: -2.3517, last reward: -2.4639, gradient norm:  109.4:  74%|███████▍  | 464/625 [01:35<00:33,  4.88it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 464/625 [01:36<00:33,  4.88it/s]
reward: -2.8051, last reward: -4.1254, gradient norm:  80.4:  74%|███████▍  | 465/625 [01:36<00:32,  4.88it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  74%|███████▍  | 465/625 [01:36<00:32,  4.88it/s]
reward: -2.2793, last reward: -3.5528, gradient norm:  133.8:  75%|███████▍  | 466/625 [01:36<00:32,  4.88it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 466/625 [01:36<00:32,  4.88it/s]
reward: -2.4257, last reward: -0.0111, gradient norm:  0.8815:  75%|███████▍  | 467/625 [01:36<00:32,  4.87it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 467/625 [01:36<00:32,  4.87it/s]
reward: -2.0900, last reward: -0.0090, gradient norm:  0.5581:  75%|███████▍  | 468/625 [01:36<00:32,  4.87it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▍  | 468/625 [01:37<00:32,  4.87it/s]
reward: -2.0726, last reward: -0.0278, gradient norm:  0.9816:  75%|███████▌  | 469/625 [01:37<00:31,  4.88it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 469/625 [01:37<00:31,  4.88it/s]
reward: -2.2132, last reward: -0.0311, gradient norm:  1.074:  75%|███████▌  | 470/625 [01:37<00:31,  4.89it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 470/625 [01:37<00:31,  4.89it/s]
reward: -2.2571, last reward: -0.0172, gradient norm:  0.7882:  75%|███████▌  | 471/625 [01:37<00:31,  4.88it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  75%|███████▌  | 471/625 [01:37<00:31,  4.88it/s]
reward: -2.0257, last reward: -0.0171, gradient norm:  0.715:  76%|███████▌  | 472/625 [01:37<00:31,  4.88it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 472/625 [01:37<00:31,  4.88it/s]
reward: -2.7457, last reward: -0.0086, gradient norm:  11.82:  76%|███████▌  | 473/625 [01:37<00:31,  4.88it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 473/625 [01:38<00:31,  4.88it/s]
reward: -2.3554, last reward: -0.2600, gradient norm:  3.902:  76%|███████▌  | 474/625 [01:38<00:30,  4.89it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 474/625 [01:38<00:30,  4.89it/s]
reward: -1.9478, last reward: -0.0921, gradient norm:  6.198:  76%|███████▌  | 475/625 [01:38<00:30,  4.89it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 475/625 [01:38<00:30,  4.89it/s]
reward: -1.8998, last reward: -0.0534, gradient norm:  2.329:  76%|███████▌  | 476/625 [01:38<00:30,  4.88it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▌  | 476/625 [01:38<00:30,  4.88it/s]
reward: -2.2714, last reward: -0.0140, gradient norm:  0.7061:  76%|███████▋  | 477/625 [01:38<00:30,  4.88it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 477/625 [01:38<00:30,  4.88it/s]
reward: -1.8072, last reward: -0.0004, gradient norm:  0.2785:  76%|███████▋  | 478/625 [01:38<00:30,  4.88it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  76%|███████▋  | 478/625 [01:39<00:30,  4.88it/s]
reward: -1.9878, last reward: -0.0031, gradient norm:  0.5887:  77%|███████▋  | 479/625 [01:39<00:29,  4.88it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 479/625 [01:39<00:29,  4.88it/s]
reward: -1.9777, last reward: -0.0108, gradient norm:  1.364:  77%|███████▋  | 480/625 [01:39<00:29,  4.88it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 480/625 [01:39<00:29,  4.88it/s]
reward: -2.2559, last reward: -0.0164, gradient norm:  0.69:  77%|███████▋  | 481/625 [01:39<00:29,  4.88it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 481/625 [01:39<00:29,  4.88it/s]
reward: -1.9692, last reward: -0.0161, gradient norm:  0.7074:  77%|███████▋  | 482/625 [01:39<00:29,  4.89it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 482/625 [01:39<00:29,  4.89it/s]
reward: -1.9088, last reward: -0.0093, gradient norm:  0.5972:  77%|███████▋  | 483/625 [01:39<00:29,  4.89it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 483/625 [01:40<00:29,  4.89it/s]
reward: -1.6735, last reward: -0.0022, gradient norm:  0.6743:  77%|███████▋  | 484/625 [01:40<00:28,  4.88it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  77%|███████▋  | 484/625 [01:40<00:28,  4.88it/s]
reward: -1.5895, last reward: -0.0004, gradient norm:  0.1763:  78%|███████▊  | 485/625 [01:40<00:28,  4.88it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 485/625 [01:40<00:28,  4.88it/s]
reward: -2.2496, last reward: -0.0066, gradient norm:  0.5032:  78%|███████▊  | 486/625 [01:40<00:28,  4.88it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 486/625 [01:40<00:28,  4.88it/s]
reward: -2.1070, last reward: -0.0170, gradient norm:  0.8796:  78%|███████▊  | 487/625 [01:40<00:28,  4.88it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 487/625 [01:40<00:28,  4.88it/s]
reward: -2.1649, last reward: -0.0368, gradient norm:  1.901:  78%|███████▊  | 488/625 [01:40<00:28,  4.88it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 488/625 [01:41<00:28,  4.88it/s]
reward: -2.3717, last reward: -0.0190, gradient norm:  0.6673:  78%|███████▊  | 489/625 [01:41<00:27,  4.89it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 489/625 [01:41<00:27,  4.89it/s]
reward: -2.4690, last reward: -0.0244, gradient norm:  2.987:  78%|███████▊  | 490/625 [01:41<00:27,  4.88it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  78%|███████▊  | 490/625 [01:41<00:27,  4.88it/s]
reward: -3.9800, last reward: -2.4005, gradient norm:  84.83:  79%|███████▊  | 491/625 [01:41<00:27,  4.88it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 491/625 [01:41<00:27,  4.88it/s]
reward: -3.9788, last reward: -3.1078, gradient norm:  61.26:  79%|███████▊  | 492/625 [01:41<00:27,  4.88it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▊  | 492/625 [01:41<00:27,  4.88it/s]
reward: -2.8486, last reward: -0.2049, gradient norm:  2.378:  79%|███████▉  | 493/625 [01:41<00:27,  4.87it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 493/625 [01:42<00:27,  4.87it/s]
reward: -2.3804, last reward: -0.2427, gradient norm:  8.888:  79%|███████▉  | 494/625 [01:42<00:26,  4.87it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 494/625 [01:42<00:26,  4.87it/s]
reward: -2.7383, last reward: -0.0216, gradient norm:  0.3409:  79%|███████▉  | 495/625 [01:42<00:26,  4.87it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 495/625 [01:42<00:26,  4.87it/s]
reward: -2.2972, last reward: -0.0008, gradient norm:  0.1397:  79%|███████▉  | 496/625 [01:42<00:26,  4.88it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  79%|███████▉  | 496/625 [01:42<00:26,  4.88it/s]
reward: -1.7317, last reward: -0.4504, gradient norm:  431.0:  80%|███████▉  | 497/625 [01:42<00:26,  4.87it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 497/625 [01:42<00:26,  4.87it/s]
reward: -1.9472, last reward: -0.0047, gradient norm:  0.4756:  80%|███████▉  | 498/625 [01:42<00:26,  4.88it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 498/625 [01:43<00:26,  4.88it/s]
reward: -2.6030, last reward: -0.0010, gradient norm:  0.7292:  80%|███████▉  | 499/625 [01:43<00:25,  4.87it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|███████▉  | 499/625 [01:43<00:25,  4.87it/s]
reward: -1.8096, last reward: -0.0002, gradient norm:  0.4949:  80%|████████  | 500/625 [01:43<00:25,  4.86it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 500/625 [01:43<00:25,  4.86it/s]
reward: -1.6683, last reward: -0.0004, gradient norm:  0.4736:  80%|████████  | 501/625 [01:43<00:25,  4.87it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 501/625 [01:43<00:25,  4.87it/s]
reward: -1.9906, last reward: -0.0021, gradient norm:  0.673:  80%|████████  | 502/625 [01:43<00:25,  4.87it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 502/625 [01:43<00:25,  4.87it/s]
reward: -2.2903, last reward: -0.0044, gradient norm:  0.5502:  80%|████████  | 503/625 [01:43<00:25,  4.87it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  80%|████████  | 503/625 [01:44<00:25,  4.87it/s]
reward: -1.9797, last reward: -0.0132, gradient norm:  7.029:  81%|████████  | 504/625 [01:44<00:24,  4.86it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 504/625 [01:44<00:24,  4.86it/s]
reward: -2.2245, last reward: -0.0062, gradient norm:  0.3676:  81%|████████  | 505/625 [01:44<00:24,  4.87it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 505/625 [01:44<00:24,  4.87it/s]
reward: -1.7487, last reward: -0.0040, gradient norm:  0.3802:  81%|████████  | 506/625 [01:44<00:24,  4.87it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 506/625 [01:44<00:24,  4.87it/s]
reward: -1.9054, last reward: -0.0013, gradient norm:  0.4617:  81%|████████  | 507/625 [01:44<00:24,  4.87it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████  | 507/625 [01:45<00:24,  4.87it/s]
reward: -1.9537, last reward: -0.0003, gradient norm:  0.4139:  81%|████████▏ | 508/625 [01:45<00:24,  4.87it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 508/625 [01:45<00:24,  4.87it/s]
reward: -1.9811, last reward: -0.0037, gradient norm:  0.4968:  81%|████████▏ | 509/625 [01:45<00:23,  4.87it/s]
...
reward: -2.0543, last reward: -0.0045, gradient norm:  0.2265: 100%|██████████| 625/625 [02:09<00:00,  4.84it/s]

Conclusion#

In this tutorial, we learned how to code a stateless environment from scratch. We touched upon the following topics:

  • The four essential components to take care of when coding an environment (step, reset, seeding, and building the specs). We saw how these methods and classes interact with the TensorDict class;

  • How to test that an environment is properly coded using check_env_specs();

  • How to append transforms in the context of a stateless environment, and how to write custom transformations;

  • How to train a policy to solve the task on a fully differentiable simulator.
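As a recap, the stateless transition rule at the heart of this tutorial can be sketched in a few lines of NumPy: the current state is passed in alongside the action, and the next state is returned, so the same function works on scalars or on batches of pendulums. The constants (g, m, l, dt, max_speed) and the clipping of the angular velocity below follow the classic Gym pendulum convention and are illustrative assumptions, not the exact code used above:

```python
import numpy as np


def pendulum_step(th, thdot, u, g=10.0, m=1.0, l=1.0, dt=0.05, max_speed=8.0):
    """One stateless transition: takes the current state (th, thdot) and the
    action u, and returns the next state. Works on scalars or batched arrays."""
    # Angular-velocity update from the equation of motion:
    # thdot' = thdot + (3g/(2L) sin(th) + 3/(mL^2) u) * dt
    new_thdot = thdot + (3 * g / (2 * l) * np.sin(th) + 3.0 / (m * l**2) * u) * dt
    new_thdot = np.clip(new_thdot, -max_speed, max_speed)
    # Integrate the angle with the updated velocity.
    new_th = th + new_thdot * dt
    return new_th, new_thdot


# Batched call: three pendulums advanced in one vectorized operation.
th = np.array([np.pi, 0.5, -0.5])
thdot = np.zeros(3)
u = np.zeros(3)
next_th, next_thdot = pendulum_step(th, thdot, u)
```

Because the function holds no internal state, resetting an experiment or running many transitions in parallel amounts to calling it with different inputs, which is exactly what made the batched, differentiable training loop above possible.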

Total running time of the script: (2 minutes 9.745 seconds)