Reward2GoTransform¶

根據回合獎勵和折扣因子計算回合獎勵。

由於 Reward2GoTransform 僅是一個逆變換，因此 in_keys 將直接用於 in_keys_inv。回合獎勵只能在回合結束後計算。因此，該變換應應用於回放緩衝區，而不是收集器或環境內部。

引數:

gamma (float 或 torch.Tensor) – 折扣因子。預設為 1.0。
in_keys (NestedKey 序列) – 要重新命名的條目。如果未提供，則預設為 ("next", "reward")。
out_keys (NestedKey 序列) – 要重新命名的條目。如果未提供，則預設為 in_keys 的值。
done_key (NestedKey) – done 條目。預設為 "done"。
truncated_key (NestedKey) – truncated 條目。預設為 "truncated"。如果未找到 truncated 條目，則僅使用 "done"。

示例

>>> # Using this transform as part of a replay buffer
>>> from torchrl.data import ReplayBuffer, LazyTensorStorage
>>> torch.manual_seed(0)
>>> r2g = Reward2GoTransform(gamma=0.99, out_keys=["reward_to_go"])
>>> rb = ReplayBuffer(storage=LazyTensorStorage(100), transform=r2g)
>>> batch, timesteps = 4, 5
>>> done = torch.zeros(batch, timesteps, 1, dtype=torch.bool)
>>> for i in range(batch):
...     while not done[i].any():
...         done[i] = done[i].bernoulli_(0.1)
>>> reward = torch.ones(batch, timesteps, 1)
>>> td = TensorDict(
...     {"next": {"done": done, "reward": reward}},
...     [batch, timesteps],
... )
>>> rb.extend(td)
>>> sample = rb.sample(1)
>>> print(sample["next", "reward"])
tensor([[[1.],
         [1.],
         [1.],
         [1.],
         [1.]]])
>>> print(sample["reward_to_go"])
tensor([[[4.9010],
         [3.9404],
         [2.9701],
         [1.9900],
         [1.0000]]])

也可以直接將此變換與收集器一起使用：確保追加變換的 inv 方法。

示例

>>> from torchrl.envs.utils import RandomPolicy        >>> from torchrl.collectors import SyncDataCollector
>>> from torchrl.envs.libs.gym import GymEnv
>>> t = Reward2GoTransform(gamma=0.99, out_keys=["reward_to_go"])
>>> env = GymEnv("Pendulum-v1")
>>> collector = SyncDataCollector(
...     env,
...     RandomPolicy(env.action_spec),
...     frames_per_batch=200,
...     total_frames=-1,
...     postproc=t.inv
... )
>>> for data in collector:
...     break
>>> print(data)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        collector: TensorDict(
            fields={
                traj_ids: Tensor(shape=torch.Size([200]), device=cpu, dtype=torch.int64, is_shared=False)},
            batch_size=torch.Size([200]),
            device=cpu,
            is_shared=False),
        done: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([200, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([200]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([200, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        reward: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        reward_to_go: Tensor(shape=torch.Size([200, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([200]),
    device=cpu,
    is_shared=False)

將此變換用作環境的一部分將引發異常

示例

>>> t = Reward2GoTransform(gamma=0.99)
>>> TransformedEnv(GymEnv("Pendulum-v1"), t)  # crashes

注意

在存在多個 done 條目的設定中，應為每個 done-reward 對構建一個單獨的 Reward2GoTransform。

forward(tensordict: TensorDictBase) → TensorDictBase[原始碼]¶

讀取輸入 tensordict，並對選定的鍵應用轉換。

預設情況下，此方法

直接呼叫 _apply_transform()。
不呼叫 _step() 或 _call()。

此方法不會在任何時候在 env.step 中呼叫。但是，它會在 sample() 中呼叫。

注意

forward 也可以使用 dispatch 將引數名稱轉換為鍵，並使用常規關鍵字引數。

示例

>>> class TransformThatMeasuresBytes(Transform):
...     '''Measures the number of bytes in the tensordict, and writes it under `"bytes"`.'''
...     def __init__(self):
...         super().__init__(in_keys=[], out_keys=["bytes"])
...
...     def forward(self, tensordict: TensorDictBase) -> TensorDictBase:
...         bytes_in_td = tensordict.bytes()
...         tensordict["bytes"] = bytes
...         return tensordict
>>> t = TransformThatMeasuresBytes()
>>> env = env.append_transform(t) # works within envs
>>> t(TensorDict(a=0))  # Works offline too.

Reward2GoTransform¶

文件

教程

資源