KLRewardTransform
- class torchrl.envs.transforms.KLRewardTransform(actor: ProbabilisticTensorDictModule, coef=1.0, in_keys=None, out_keys=None, requires_grad=False, log_prob_key: NestedKey = 'sample_log_prob', action_key: NestedKey | None = None, functional: bool | None = None, device: torch.device | None = None)[source]
A transform that adds a KL[pi_current||pi_0] correction term to the reward.
This transform is used to constrain the policy to stay close to its original configuration, which limits overfitting when fine-tuning via RLHF.
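The effect can be sketched with plain torch distributions (a toy illustration under assumed Gaussian policies, not the transform's internal code): the per-step task reward is penalised by a single-sample estimate of KL[pi_current||pi_0], following the usual RLHF formulation.

>>> import torch
>>> from torch.distributions import Normal
>>> # current policy pi and frozen reference policy pi_0 (assumed Gaussian here)
>>> pi = Normal(torch.tensor(0.5), torch.tensor(1.0))
>>> pi_0 = Normal(torch.tensor(0.0), torch.tensor(1.0))
>>> coef = 1.0
>>> action = pi.sample()
>>> reward = torch.tensor(1.0)  # task reward at this step
>>> # single-sample KL estimate, evaluated at the sampled action
>>> kl_estimate = pi.log_prob(action) - pi_0.log_prob(action)
>>> shaped_reward = reward - coef * kl_estimate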
- Parameters:
  - actor (ProbabilisticTensorDictModule) – a probabilistic actor. It must have the following features: it must have a set of input keys (in_keys) and output keys (out_keys), and it must have a get_dist method that outputs the distribution of the action.
  - coef (float) – the coefficient of the KL term. Defaults to 1.0.
  - in_keys (str or list of str/tuples of str) – the input key where the reward is fetched. Defaults to "reward".
  - out_keys (str or list of str/tuples of str) – the output key where the reward should be written. Defaults to "reward".
  - requires_grad (bool, optional) – if True, the frozen parameters will consist of differentiable clones of the original parameters. Defaults to False.
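As a minimal constructor sketch (with hypothetical observation/action sizes; the full environment example below builds a similar actor), the raw reward can be kept under "reward" while the KL-penalised value is written to a separate "reward_kl" entry, with a reduced weight on the KL term:

>>> from torch import nn
>>> from tensordict.nn import TensorDictModule, NormalParamExtractor
>>> from torchrl.modules import ProbabilisticActor
>>> from torchrl.modules.distributions import TanhNormal
>>> from torchrl.envs.transforms import KLRewardTransform
>>> n_obs, n_act = 3, 1  # hypothetical dimensions
>>> module = TensorDictModule(
...     nn.Sequential(nn.Linear(n_obs, n_act * 2), NormalParamExtractor()),
...     in_keys=["observation"],
...     out_keys=["loc", "scale"],
... )
>>> actor = ProbabilisticActor(
...     module,
...     in_keys=["loc", "scale"],
...     distribution_class=TanhNormal,
...     return_log_prob=True,
... )
>>> transform = KLRewardTransform(actor, coef=0.1, in_keys="reward", out_keys="reward_kl")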
Note
If the parameters are not differentiable (the default), they will *not* follow the module when dtype or device casting operations such as cuda(), to(), etc. are called. When requires_grad=True, casting operations work as expected.
Examples
>>> import torch
>>> from torchrl.envs.libs.gym import GymEnv
>>> from torchrl.envs import TransformedEnv
>>> from tensordict.nn import TensorDictModule as Mod, NormalParamExtractor
>>> from torchrl.modules import ProbabilisticActor
>>> from tensordict import TensorDict
>>> from torchrl.modules.distributions import TanhNormal
>>> from torch import nn
>>> base_env = GymEnv("Pendulum-v1")
>>> n_obs = base_env.observation_spec["observation"].shape[-1]
>>> n_act = base_env.action_spec.shape[-1]
>>> module = Mod(
...     nn.Sequential(nn.Linear(n_obs, n_act * 2), NormalParamExtractor()),
...     in_keys=["observation"],
...     out_keys=["loc", "scale"],
... )
>>> actor = ProbabilisticActor(
...     module,
...     in_keys=["loc", "scale"],
...     distribution_class=TanhNormal,
...     return_log_prob=True,
... )
>>> transform = KLRewardTransform(actor, out_keys="reward_kl")
>>> env = TransformedEnv(base_env, transform)
>>> with torch.no_grad():
...     # modify the actor parameters
...     _ = TensorDict(dict(actor.named_parameters()), []).apply_(lambda x: x.data.copy_(x.data + 1))
...     td = env.rollout(3, actor)
>>> # check that rewards have been modified
>>> assert (td.get(("next", "reward")) != td.get(("next", "reward_kl"))).all()
Note
Because the KL formula is not always available and the parameters of the original distribution may not have been recorded, we use a stochastic estimate of the KL divergence.
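One standard single-sample estimate of this divergence (a reasonable reading of the note above, not a statement of the exact code path) evaluates the log-probability gap at the action a_t drawn from the current policy:

KL[pi_current||pi_0] = E_{a ~ pi_current}[log pi_current(a|s) - log pi_0(a|s)] ≈ log pi_current(a_t|s_t) - log pi_0(a_t|s_t)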
- forward(tensordict: TensorDictBase) → TensorDictBase[source]
Reads the input tensordict, and for the selected keys, applies the transform.
By default, this method directly calls _apply_transform(). It does not call _step() or _call().
This method is not called within env.step at any point. However, it is called within sample().
Note
forward also works with regular keyword arguments, using dispatch to cast the argument names to keys.
Examples
>>> class TransformThatMeasuresBytes(Transform):
...     '''Measures the number of bytes in the tensordict, and writes it under `"bytes"`.'''
...     def __init__(self):
...         super().__init__(in_keys=[], out_keys=["bytes"])
...
...     def forward(self, tensordict: TensorDictBase) -> TensorDictBase:
...         bytes_in_td = tensordict.bytes()
...         tensordict["bytes"] = bytes_in_td
...         return tensordict
>>> t = TransformThatMeasuresBytes()
>>> env = env.append_transform(t)  # works within envs
>>> t(TensorDict(a=0))  # Works offline too.