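TorchRL LLM: Building Tool-Enabled Environments¶

Author: Vincent Moens

This tutorial demonstrates how to build and compose LLM environments with tool capabilities in TorchRL. We show how to create a complete environment that can execute tools, format responses, and handle interactions between the LLM and external tools.

The tutorial uses web browsing as its running example, but the concepts apply to any tool integration in TorchRL's LLM framework.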

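Key takeaways

Understanding TorchRL's LLM environment composition
Creating and appending tool transforms
Formatting tool responses and LLM interactions
Handling tool execution and state management

Prerequisites: basic familiarity with TorchRL's environment concepts.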

Installation¶

First, install TorchRL with LLM support. If you are working in a Jupyter notebook, you can install the package with the following command:

%pip install "torchrl[llm]"    # Install TorchRL with all LLM dependencies

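The "torchrl[llm]" extra includes the dependencies needed for the LLM features, including transformers, vllm, and playwright (for browser automation).

After installation, you need to set up the browser automation components: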

!playwright install            # Install browser binaries

Note: The "!" and "%pip" prefixes only work in Jupyter notebooks. In a regular terminal, run these commands without the prefixes.
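
Before moving on, it can help to confirm that the optional dependencies are importable. The following is a minimal sketch using only the standard library; the helper name check_llm_dependencies is hypothetical and not part of TorchRL, and the default package list simply mirrors the dependencies named above.

```python
from importlib.util import find_spec


def check_llm_dependencies(
    packages=("torchrl", "transformers", "vllm", "playwright"),
):
    """Return a mapping from package name to whether it can be imported.

    Hypothetical helper for illustration; it only checks that a top-level
    module can be located, not that its version is compatible.
    """
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, available in check_llm_dependencies().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

A missing entry usually means the install step above did not complete in the current Python environment.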

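Environment Setup¶

TorchRL's LLM interface is built around composable environments and transforms. The key components are:

A base environment (ChatEnv)
Tool execution transforms
Data loading transforms
Reward computation transforms

Let's import the necessary components and set up our environment.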

from __future__ import annotations

import warnings

import torch

from tensordict import set_list_to_stack, TensorDict
from torchrl import torchrl_logger
from torchrl.data import CompositeSpec, Unbounded
from torchrl.envs import Transform
from torchrl.envs.llm import ChatEnv
from torchrl.envs.llm.transforms.browser import BrowserTransform
from transformers import AutoTokenizer

warnings.filterwarnings("ignore")

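Step 1: Basic Environment Configuration¶

We will create a ChatEnv and configure it with browser automation capabilities. First, we enable list-to-stack conversion for TensorDict, which is required for proper batch handling in LLM environments.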

# Enable list-to-stack conversion for TensorDict
set_list_to_stack(True).set()

Now we create the tokenizer and the base environment. The environment requires a batch size, even when we run only a single instance.

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
env = ChatEnv(
    batch_size=(1,),
    tokenizer=tokenizer,
    apply_template=True,
    system_prompt=(
        "You are a helpful assistant that can use tools to accomplish tasks. "
        "Tools will be executed and their responses will be added to our conversation."
    ),
)

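Next, we append a browser transform with a safety configuration. This transform enables browsing capabilities while restricting navigation to a list of allowed domains.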

browser_transform = BrowserTransform(
    allowed_domains=["google.com", "github.com"],
    headless=False,  # Set to False to see the browser actions
)
env = env.append_transform(browser_transform)
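
To illustrate what a domain restriction might involve, here is a minimal sketch of an allowlist check. It is an assumption for illustration only: the function is_domain_allowed is hypothetical, and BrowserTransform's actual matching rules (for example, how it treats subdomains) may differ.

```python
from __future__ import annotations

from urllib.parse import urlparse


def is_domain_allowed(url: str, allowed_domains: list[str]) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical illustration; not BrowserTransform's implementation.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )


allowed = ["google.com", "github.com"]
print(is_domain_allowed("https://www.google.com/search?q=x", allowed))
print(is_domain_allowed("https://evil-google.com", allowed))
```

Comparing the parsed hostname, rather than searching for the domain as a substring of the URL, avoids accepting look-alike hosts such as evil-google.com.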

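We can also design a transform that assigns rewards to the environment. For example, we can parse the output of the browser transform and assign a reward whenever a specific goal is reached. In this example, the reward is 2 if the LLM finds the answer to the question (Paris), 1 if it reaches the target website (google.com), and 0 otherwise.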

class RewardTransform(Transform):
    """A transform that assigns rewards based on the LLM's responses.

    This transform parses the browser responses in the environment's history and assigns
    rewards based on specific achievements:

    - Finding the correct answer (Paris): reward = 2.0
    - Successfully reaching Google: reward = 1.0
    - Otherwise: reward = 0.0

    """

    def _call(self, tensordict: TensorDict) -> TensorDict:
        """Process the tensordict and assign rewards based on the LLM's response.

        Args:
            tensordict (TensorDict): The tensordict containing the environment state.
                Must have a "history" key containing the conversation history.

        Returns:
            TensorDict: The tensordict with an added "reward" key containing the
                computed reward with shape (B, 1) where B is the batch size.
        """
        # ChatEnv has created a history item. We just pick up the last item,
        # and check if `"Paris"` is in the response.
        # We use index 0 because we are in a single-instance environment.
        history = tensordict[0]["history"]
        last_item = history[-1]
        if "Paris" in last_item.content:
            torchrl_logger.info("Found the answer to the question: Paris")
            # Recall that rewards have a trailing singleton dimension.
            tensordict["reward"] = torch.full((1, 1), 2.0)
        # Check if we successfully reached the website
        elif (
            "google.com" in last_item.content
            and "executed successfully" in last_item.content
        ):
            torchrl_logger.info("Reached the website google.com")
            tensordict["reward"] = torch.full((1, 1), 1.0)
        else:
            tensordict["reward"] = torch.full((1, 1), 0.0)
        return tensordict

    def transform_reward_spec(self, reward_spec: CompositeSpec) -> CompositeSpec:
        """Transform the reward spec to include our custom reward.

        This method is required to override the reward spec since the environment
        is initially reward-agnostic.

        Args:
            reward_spec (CompositeSpec): The original reward spec from the environment.

        Returns:
            CompositeSpec: The transformed reward spec with our custom reward definition.
                The reward will have shape (B, 1) where B is the batch size.
        """
        reward_spec["reward"] = Unbounded(
            shape=reward_spec.shape + (1,), dtype=torch.float32
        )
        return reward_spec


# We append the reward transform to the environment.
env = env.append_transform(RewardTransform())
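
The branching logic inside RewardTransform._call can be checked in isolation, without a browser or tokenizer. The sketch below extracts the same three rules into a plain function; the name compute_reward is hypothetical and introduced only for illustration.

```python
def compute_reward(content: str) -> float:
    """Mirror the reward rules of the RewardTransform on a single message.

    Hypothetical helper: takes the text of the last history item and
    returns the scalar reward that would be assigned.
    """
    if "Paris" in content:
        # The answer to the question was found.
        return 2.0
    if "google.com" in content and "executed successfully" in content:
        # The target website was reached.
        return 1.0
    return 0.0


print(compute_reward("The capital of France is Paris."))
print(compute_reward("navigate to https://google.com executed successfully"))
print(compute_reward("Nothing useful here."))
```

Because the check for "Paris" comes first, a message that mentions both the answer and the website receives the higher reward.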

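Step 2: Tool Execution Helper¶

To keep our interactions with the tools organized, we create a helper function that executes a tool action and displays the result.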

def execute_tool_action(
    env: ChatEnv,
    current_state: TensorDict,
    action: str,
    verbose: bool = True,
) -> tuple[TensorDict, TensorDict]:
    """Execute a tool action and show the formatted interaction."""
    s = current_state.set("text_response", [action])
    s, s_ = env.step_and_maybe_reset(s)

    if verbose:
        print("\nLLM Action:")
        print("-----------")
        print(action)
        print("\nEnvironment Response:")
        print("--------------------")
        torchrl_logger.info(s_["history"].apply_chat_template(tokenizer=env.tokenizer))

    return s, s_

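Step 3: Starting the Interaction¶

Let's start by resetting the environment with a question, then navigate to a search engine. Note that the tensordict passed to the environment must have the same batch size as the environment. The text query is wrapped in a list of length 1 so that it is compatible with the environment's batch size.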

reset = env.reset(
    TensorDict(
        text=["What is the capital of France?"],
        batch_size=(1,),
    )
)

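Now we use the browser transform to navigate to Google. The transform expects the action in a specific JSON format, wrapped in tool tags. In practice, this action would be the output of our LLM, which writes its response string under the "text_response" key.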

s, s_ = execute_tool_action(
    env,
    reset,
    """
    Let me search for that:
    <tool>browser
    {
        "action": "navigate",
        "url": "https://google.com"
    }
    </tool><|im_end|>
    """,
)
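
To make the action format concrete, the sketch below shows one way such a tagged block could be parsed from a response string. This is an assumption for illustration: parse_tool_calls and TOOL_PATTERN are hypothetical names, and the parsing performed inside TorchRL's transforms may work differently.

```python
from __future__ import annotations

import json
import re

# Matches "<tool>NAME {json} </tool>", where the JSON payload may span lines.
TOOL_PATTERN = re.compile(r"<tool>\s*(\w+)\s*(\{.*?\})\s*</tool>", re.DOTALL)


def parse_tool_calls(text: str) -> list[tuple[str, dict]]:
    """Extract (tool_name, arguments) pairs from a response string.

    Hypothetical parser for illustration; not TorchRL's implementation.
    """
    return [
        (name, json.loads(payload))
        for name, payload in TOOL_PATTERN.findall(text)
    ]


response = """
    Let me search for that:
    <tool>browser
    {
        "action": "navigate",
        "url": "https://google.com"
    }
    </tool><|im_end|>
"""
print(parse_tool_calls(response))
```

The tool name on the first line selects which tool handles the call, and the JSON object carries its arguments.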

Step 4: Performing the Search¶

With the browser open, we can type our query and run the search. First, we type the search query into Google's search box.

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Let me type the search query:
    <tool>browser
    {
        "action": "type",
        "selector": "[name='q']",
        "text": "What is the capital of France?"
    }
    </tool><|im_end|>
    """,
)

Next, we click the search button to run the search. Note how CSS selectors are used to identify elements on the page.

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Now let me click the search button:
    <tool>browser
    {
        "action": "click",
        "selector": "[name='btnK']"
    }
    </tool><|im_end|>
    """,
)
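
Since every action in this tutorial follows the same template, the strings can also be generated rather than written by hand. The helper below is a minimal sketch under that assumption; make_tool_action is a hypothetical name, and the "<|im_end|>" terminator is copied from the examples in this tutorial rather than required by any general rule.

```python
import json


def make_tool_action(preamble: str, tool: str = "browser", **payload) -> str:
    """Format a tool call in the tagged JSON layout used in this tutorial.

    Hypothetical helper: `preamble` is the free-text part of the response,
    and the keyword arguments become the JSON payload.
    """
    body = json.dumps(payload, indent=4)
    return f"{preamble}\n<tool>{tool}\n{body}\n</tool><|im_end|>"


action = make_tool_action(
    "Now let me click the search button:",
    action="click",
    selector="[name='btnK']",
)
print(action)
```

Serializing the payload with json.dumps ensures that quotes inside selectors or queries are escaped correctly.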

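Step 5: Extracting the Results¶

Finally, we extract the search results from the page. The browser transform can extract both text content and HTML from specified elements.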

s, s_ = execute_tool_action(
    env,
    s_,
    """
    Let me extract the results:
    <tool>browser
    {
        "action": "extract",
        "selector": "#search",
        "extract_type": "text"
    }
    </tool><|im_end|>
    """,
)
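
The four calls above follow a single pattern: each step takes the next state returned by the previous one. As a minimal sketch, that pattern can be written as a loop over a scripted list of actions. The function run_script is hypothetical, and the step callable is an assumption standing in for execute_tool_action.

```python
def run_script(step, state, actions):
    """Apply scripted actions in order, threading the next state through.

    `step(state, action)` must return a (transition, next_state) pair,
    the same shape of result as the helper defined in Step 2.
    """
    transitions = []
    for action in actions:
        transition, state = step(state, action)
        transitions.append(transition)
    return transitions, state


# Stand-in step function so the sketch runs without a browser:
# the "state" is just the list of actions seen so far.
def fake_step(state, action):
    new_state = state + [action]
    return new_state, new_state


transitions, final_state = run_script(fake_step, [], ["navigate", "type", "click"])
print(final_state)
```

With the objects from this tutorial, the step argument could be `lambda st, a: execute_tool_action(env, st, a)`, with the output of env.reset as the initial state.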

Let's close the environment.

env.close()

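Conclusion¶

This tutorial demonstrated how to build and compose LLM environments with tool capabilities in TorchRL. We showed how to create a complete environment that can execute tools, format responses, and handle interactions between the LLM and external tools.

The key concepts are:

Understanding TorchRL's LLM environment composition
Creating and appending tool transforms
Formatting tool responses and LLM interactions
Handling tool execution and state management
Integration with LLM wrappers (vLLM, Transformers)

For more information on building tool-enabled environments with TorchRL, see the ref_llms tutorial.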
