TensorDictTokenizer
- class torchrl.data.TensorDictTokenizer(tokenizer, max_length, key='text', padding='max_length', truncation=True, return_tensordict=True, device=None)[source]
Factory for a process function that applies a tokenizer over a text example.
- Parameters:
  tokenizer (tokenizer from the transformers library) – the tokenizer to use.
  max_length (int) – maximum length of the sequence.
  key (str, optional) – the key where the text is stored. Defaults to "text".
  padding (str, optional) – type of padding. Defaults to "max_length".
  truncation (bool, optional) – whether the sequences should be truncated to max_length.
  return_tensordict (bool, optional) – if True, a TensorDict is returned. Otherwise, the original data will be returned.
  device (torch.device, optional) – the device where to store the data. This option is ignored if return_tensordict=False.
- See the transformers library for more information about tokenizers.
  Padding and truncation: https://huggingface.tw/docs/transformers/pad_truncation
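  A rough sketch of what the padding and truncation arguments control, shown here on the transformers tokenizer directly; the assumption is that TensorDictTokenizer forwards these arguments to the tokenizer call, which is how transformers tokenizers consume them:

  >>> from transformers import AutoTokenizer
  >>> tok = AutoTokenizer.from_pretrained("gpt2")
  >>> tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
  >>> # padding="max_length" pads every sequence up to max_length,
  >>> # truncation=True cuts longer sequences down to max_length
  >>> out = tok("I am a little worried", padding="max_length", truncation=True, max_length=10)
  >>> len(out["input_ids"])
  10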
Returns: a tensordict.TensorDict instance with the same batch size as the input data.
Examples:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.pad_token = 100
>>> process = TensorDictTokenizer(tokenizer, max_length=10)
>>> # example with a single input
>>> example = {"text": "I am a little worried"}
>>> process(example)
TensorDict(
    fields={
        attention_mask: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)
>>> # example with multiple inputs
>>> example = {"text": ["Let me reassure you", "It will be ok"]}
>>> process(example)
TensorDict(
    fields={
        attention_mask: Tensor(shape=torch.Size([2, 10]), device=cpu, dtype=torch.int64, is_shared=False),
        input_ids: Tensor(shape=torch.Size([2, 10]), device=cpu, dtype=torch.int64, is_shared=False)},
    batch_size=torch.Size([2]),
    device=None,
    is_shared=False)
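A minimal sketch of the remaining constructor options, assuming a CUDA device is available; the expected device placement is inferred from the device argument description above and this snippet is not part of the library's documented examples:

>>> import torch
>>> # store the resulting tensors on the requested device (hypothetical CUDA choice)
>>> process_cuda = TensorDictTokenizer(tokenizer, max_length=10, device=torch.device("cuda"))
>>> td = process_cuda({"text": "I am a little worried"})
>>> td["input_ids"].device.type
'cuda'
>>> # with return_tensordict=False the output is not wrapped in a TensorDict
>>> process_raw = TensorDictTokenizer(tokenizer, max_length=10, return_tensordict=False)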