Real Time Inference on Raspberry Pi 4 and 5 (40 fps!)

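Created on: Feb 08, 2022 | Last updated: Sep 30, 2025 | Last verified: Nov 05, 2024

Author: Tristan Rice

PyTorch supports the Raspberry Pi 4 and 5 out of the box. This tutorial shows how to set up a Raspberry Pi to run PyTorch and how to run a MobileNet v2 classification model in real time (30-40 fps) on the CPU.

Everything here was tested on a Raspberry Pi 4 Model B 4GB, but it should also work on the 2GB variant, and on the 3B with reduced performance.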

https://user-images.githubusercontent.com/909104/153093710-bc736b6f-69d9-4a50-a3e8-9f2b2c9e04fd.gif

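Prerequisites

To follow this tutorial you will need a Raspberry Pi 4 or 5, a camera for it, and the other standard accessories.

Raspberry Pi 4 Model B 2GB+
Raspberry Pi Camera Module
Heat sinks and fan (optional but recommended)
5V 3A USB-C power supply
SD card (at least 8GB)
SD card reader/writer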

Raspberry Pi Setup

PyTorch only provides pip packages for 64-bit Arm (aarch64), so you need to install a 64-bit version of the OS on your Raspberry Pi.

Use the official rpi-imager to install Raspberry Pi OS.

The 32-bit Raspberry Pi OS will not work.

https://user-images.githubusercontent.com/909104/152866212-36ce29b1-aba6-4924-8ae6-0a283f1fca14.gif

Installation takes at least a few minutes, depending on your internet speed and SD card speed. Once it is done, it should look like this:

https://user-images.githubusercontent.com/909104/152867425-c005cff0-5f3f-47f1-922d-e0bbb541cd25.png

Now insert the SD card into your Raspberry Pi, connect the camera, and boot it up.

https://user-images.githubusercontent.com/909104/152869862-c239c980-b089-4bd5-84eb-0a1e5cf22df2.png
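Once the Pi has booted, it can be worth confirming that the installed OS really is 64-bit before trying to install PyTorch. The following is a minimal sketch, not part of the original tutorial: the helper name `is_64bit_arm` is hypothetical, and it checks both the machine string and the Python interpreter's pointer size, on the assumption that a 64-bit kernel may be paired with a 32-bit userland.

```python
import platform
import struct


def is_64bit_arm(machine: str, pointer_bits: int) -> bool:
    """Return True for a 64-bit Arm machine running a 64-bit Python interpreter."""
    # Linux reports "aarch64"; some other systems use "arm64" for the same architecture.
    return machine.lower() in {"aarch64", "arm64"} and pointer_bits == 64


if __name__ == "__main__":
    machine = platform.machine()
    pointer_bits = struct.calcsize("P") * 8  # size of a C pointer in this interpreter
    if is_64bit_arm(machine, pointer_bits):
        print(f"{machine}, {pointer_bits}-bit Python: looks like a 64-bit Arm install")
    else:
        print(f"{machine}, {pointer_bits}-bit Python: not a 64-bit Arm install")
```

If the second message appears on the Pi, the 32-bit image was probably installed and the SD card needs to be re-imaged with the 64-bit OS.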

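Raspberry Pi 4 Config

If you are using a Raspberry Pi 4, you need some additional config changes. These changes are not required on the Raspberry Pi 5.

Once the OS boots and you complete the initial setup, edit the `/boot/config.txt` file to enable the camera.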

# This enables the extended features such as the camera.
start_x=1

# This needs to be at least 128M for the camera processing, if it's bigger you can just leave it as is.
gpu_mem=128

Then reboot.
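To double-check the two settings above, a small script can scan the config text for them. This is only a sketch under stated assumptions: the function names are hypothetical, and the parser ignores section filters such as `[pi4]` or `[all]`, so it reports what is written in the file rather than what the firmware actually applies.

```python
def parse_config(text: str) -> dict[str, str]:
    """Parse key=value lines from config.txt-style text, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()  # later lines override earlier ones
    return settings


def camera_config_problems(text: str) -> list[str]:
    """Return a list of human-readable problems with the camera settings."""
    settings = parse_config(text)
    problems = []
    if settings.get("start_x") != "1":
        problems.append("start_x=1 is missing")
    gpu_mem = settings.get("gpu_mem", "")
    if not gpu_mem.isdigit() or int(gpu_mem) < 128:
        problems.append("gpu_mem should be at least 128")
    return problems


example = """
# This enables the extended features such as the camera.
start_x=1
gpu_mem=128
"""
print(camera_config_problems(example))  # prints []
```

On the Pi itself, the same check could be run on the contents of `/boot/config.txt`, for example with `camera_config_problems(open("/boot/config.txt").read())`.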

Installing PyTorch and picamera2

PyTorch and all the other libraries we need have ARM 64-bit/aarch64 builds, so you can install them via pip and they work just as on any other Linux system.

$ sudo apt install -y python3-picamera2 python3-libcamera
$ pip install torch torchvision --break-system-packages

https://user-images.githubusercontent.com/909104/152874260-95a7a8bd-0f9b-438a-9c0b-5b67729e233f.png

We can now check that everything installed correctly:

$ python -c "import torch; print(torch.__version__)"

https://user-images.githubusercontent.com/909104/152874271-d7057c2d-80fd-4761-aed4-df6c8b7aa99f.png
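The one-liner above only checks `torch`. A slightly broader check, sketched here with a hypothetical helper name `missing_packages`, asks Python whether each of the three packages can be located without actually importing them.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found on the Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["torch", "torchvision", "picamera2"])
    if missing:
        print(f"missing: {', '.join(missing)}")
    else:
        print("all packages found")
```

Finding a package this way does not prove that it imports cleanly, so the `torch.__version__` check is still worth running.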

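Video Capture

First, test that the camera works by running `libcamera-hello` in a terminal.

For video capture, we will use picamera2 to capture the video frames.

The model we are using (MobileNetV2) takes images of size `224x224`, so we can request that size directly from picamera2 at 36 fps. We are targeting 30 fps for the model, but we request a slightly higher frame rate so that there are always enough frames.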

from picamera2 import Picamera2

picam2 = Picamera2()

# print available sensor modes
print(picam2.sensor_modes)

config = picam2.create_still_configuration(main={
    "size": (224, 224),
    "format": "BGR888",
}, display="main")
picam2.configure(config)
picam2.set_controls({"FrameRate": 36})
picam2.start()

To capture a frame, we can call `capture_image`, which returns a `PIL.Image` object that can be used with PyTorch.

# read frame
image = picam2.capture_image("main")

# show frame for testing
image.show()

Reading and processing this data takes about `3.5ms`.
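The frame rates quoted in this section translate into a per-frame time budget. The arithmetic can be sketched as follows; the helper name `frame_budget_ms` is hypothetical, and the only inputs are the 36 fps, 30 fps, and 3.5 ms figures from the text.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


camera_interval = frame_budget_ms(36)  # the camera delivers a frame about every 27.8 ms
model_budget = frame_budget_ms(30)     # a 30 fps target allows about 33.3 ms per frame
capture_ms = 3.5                       # capture cost quoted in the text

print(f"camera interval: {camera_interval:.1f} ms")
print(f"budget at 30 fps: {model_budget:.1f} ms")
print(f"left for preprocessing and the model: {model_budget - capture_ms:.1f} ms")
```

Because the camera interval is shorter than the 30 fps budget, a new frame should be ready whenever the loop asks for one.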

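Image Preprocessing

We need to convert the frames into the format the model expects. This is the same processing you would do on any machine, using the standard torchvision transforms.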

from torchvision import transforms

preprocess = transforms.Compose([
    # convert the PIL frame to a CHW float tensor with values scaled to [0, 1]
    transforms.ToTensor(),
    # normalize the colors to the range that mobilenet_v2/3 expect
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(image)
# The model can handle multiple images simultaneously so we need to add an
# empty dimension for the batch.
# [3, 224, 224] -> [1, 3, 224, 224]
input_batch = input_tensor.unsqueeze(0)
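To make the two transforms concrete, the per-pixel arithmetic can be reproduced without torch. This is an illustrative sketch only: `normalize_pixel` is a hypothetical helper, and it mirrors what `ToTensor` (scale 0..255 to 0..1) followed by `Normalize` (subtract the mean, divide by the standard deviation) does to a single RGB pixel.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb: tuple[int, int, int]) -> list[float]:
    """Apply the ToTensor scaling and the Normalize step to one 8-bit RGB pixel."""
    out = []
    for value, mean, std in zip(rgb, MEAN, STD):
        scaled = value / 255.0             # ToTensor maps 0..255 to 0.0..1.0
        out.append((scaled - mean) / std)  # Normalize: (x - mean) / std per channel
    return out


print([round(v, 3) for v in normalize_pixel((255, 255, 255))])  # brightest pixel
print([round(v, 3) for v in normalize_pixel((0, 0, 0))])        # darkest pixel
```

With these constants the model input ends up roughly between -2.1 and 2.6 rather than between 0 and 255.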

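Model Choices

There are a number of models to choose from, with different performance characteristics. Not all models provide a `qnnpack` pretrained variant, so for testing purposes you should pick one that does; if you train and quantize your own model, you can use any of them.

In this tutorial we use `mobilenet_v2`, since it offers good performance and accuracy.

Raspberry Pi 4 Benchmark Results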

Model	FPS	Total Time (ms/frame)	Model Time (ms/frame)	qnnpack Pretrained
mobilenet_v2	33.7	29.7	26.4	True
mobilenet_v3_large	29.3	34.1	30.7	True
resnet18	9.2	109.0	100.3	False
resnet50	4.3	233.9	225.2	False
resnext101_32x8d	1.1	892.5	885.3	False
inception_v3	4.9	204.1	195.5	False
googlenet	7.4	135.3	132.0	False
shufflenet_v2_x0_5	46.7	21.4	18.2	False
shufflenet_v2_x1_0	24.4	41.0	37.7	False
shufflenet_v2_x1_5	16.8	59.6	56.3	False
shufflenet_v2_x2_0	11.6	86.3	82.7	False
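The columns of the table are related in a simple way, which the sketch below illustrates for three rows. It assumes, based on the numbers themselves rather than on any stated definition, that FPS is 1000 divided by the total time per frame and that the gap between total time and model time is capture and preprocessing overhead.

```python
def fps_from_total_ms(total_ms: float) -> float:
    """Frames per second implied by a total per-frame time in milliseconds."""
    return 1000.0 / total_ms


# (total ms/frame, model ms/frame) copied from the benchmark table
rows = {
    "mobilenet_v2": (29.7, 26.4),
    "resnet18": (109.0, 100.3),
    "shufflenet_v2_x0_5": (21.4, 18.2),
}

for name, (total_ms, model_ms) in rows.items():
    fps = fps_from_total_ms(total_ms)
    overhead = total_ms - model_ms
    print(f"{name}: {fps:.1f} fps, {overhead:.1f} ms outside the model")
```

For the small models the overhead is only about 3 ms per frame, so the model itself dominates the total.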

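MobileNetV2: Quantization and JIT

For optimal performance we want a model that is quantized and fused. Quantized means that it does its computation in int8, which is much faster than standard float32 math. Fused means that consecutive operations have been merged into a single, more performant operation where possible. Commonly, activations (`ReLU`) can be merged into the preceding layer (`Conv2d`) at inference time.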
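As a rough illustration of what "using int8" means, the sketch below maps floats to 8-bit integers and back with a generic affine scheme. It is not PyTorch's or qnnpack's implementation: the function names, the scale, and the zero point are all hypothetical values chosen for the example, and real quantized kernels differ in details such as signedness and per-channel parameters.

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to a signed 8-bit integer using an affine scheme."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an 8-bit integer back to an approximate float."""
    return (q - zero_point) * scale


scale, zero_point = 0.05, 0  # hypothetical parameters chosen for illustration
for x in (0.0, 0.12, 1.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:>5} -> {q:>4} -> {dequantize(q, scale, zero_point):.2f}")
```

Values are rounded to the nearest multiple of the scale, and anything outside the representable range is clamped, which is the precision that is traded for speed.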

The aarch64 build of PyTorch requires the `qnnpack` engine.

import torch
torch.backends.quantized.engine = 'qnnpack'

For this example, we use the prequantized and fused version of MobileNetV2 that torchvision provides out of the box.

from torchvision import models
net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)

Next, we JIT-compile the model to reduce Python overhead and fuse any remaining ops. With JIT we get ~30 fps, compared with ~20 fps without it.

net = torch.jit.script(net)

Putting It Together

We can now put all the pieces together and run it:

import time

import torch
from torchvision import models, transforms
from picamera2 import Picamera2

torch.backends.quantized.engine = 'qnnpack'

picam2 = Picamera2()

# print available sensor modes
print(picam2.sensor_modes)

config = picam2.create_still_configuration(main={
    "size": (224, 224),
    "format": "BGR888",
}, display="main")
picam2.configure(config)
picam2.set_controls({"FrameRate": 36})
picam2.start()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

net = models.quantization.mobilenet_v2(pretrained=True, quantize=True)
# jit model to take it from ~20fps to ~30fps
net = torch.jit.script(net)

started = time.time()
last_logged = time.time()
frame_count = 0

with torch.no_grad():
    while True:
        # read frame
        image = picam2.capture_image("main")


        # preprocess
        input_tensor = preprocess(image)

        # create a mini-batch as expected by the model
        input_batch = input_tensor.unsqueeze(0)

        # run model
        output = net(input_batch)
        # do something with output ...
        print(output.argmax())

        # log model performance
        frame_count += 1
        now = time.time()
        if now - last_logged > 1:
            print(f"{frame_count / (now-last_logged)} fps")
            last_logged = now
            frame_count = 0

Running it shows that we hover at ~30 fps on a Raspberry Pi 4 and ~41 fps on a Raspberry Pi 5.

https://user-images.githubusercontent.com/909104/152892609-7d115705-3ec9-4f8d-beed-a51711503a32.png

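This is with all the default settings in Raspberry Pi OS. If you disable the UI and the other background services that are enabled by default, it is more performant and stable.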
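The frame-rate logging in the main loop can be factored into a small helper, which also makes it testable without a camera. This is a sketch rather than the tutorial's code: the class name `FpsCounter` and its `clock` parameter are hypothetical, and it uses `time.monotonic` instead of `time.time` so that wall-clock adjustments cannot distort the measurement.

```python
import time


class FpsCounter:
    """Counts frames and reports the average rate once per reporting interval."""

    def __init__(self, interval_s: float = 1.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.last_logged = clock()
        self.frame_count = 0

    def tick(self):
        """Record one frame; return the fps when the interval has elapsed, else None."""
        self.frame_count += 1
        now = self.clock()
        elapsed = now - self.last_logged
        if elapsed <= self.interval_s:
            return None
        fps = self.frame_count / elapsed
        self.last_logged = now
        self.frame_count = 0
        return fps


if __name__ == "__main__":
    counter = FpsCounter(interval_s=0.2)
    deadline = time.monotonic() + 0.5
    while time.monotonic() < deadline:
        time.sleep(0.01)  # stand-in for capture, preprocessing, and inference
        fps = counter.tick()
        if fps is not None:
            print(f"{fps:.1f} fps")
```

In the inference loop, one call to `counter.tick()` after each `net(input_batch)` would replace the manual bookkeeping with `frame_count` and `last_logged`.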

If we check `htop`, we see close to 100% CPU utilization.

https://user-images.githubusercontent.com/909104/152892630-f094b84b-19ba-48f6-8632-1b954abc59c7.png

To verify that it works end to end, we can compute the class probabilities and use the ImageNet class labels to print the detections.

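# `classes` is assumed to be a list of the 1000 ImageNet class names indexed by
# class id; it is not defined in this tutorial, so load it from a labels file first.
top = list(enumerate(output[0].softmax(dim=0)))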
top.sort(key=lambda x: x[1], reverse=True)
for idx, val in top[:10]:
    print(f"{val.item()*100:.2f}% {classes[idx]}")
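The same softmax and top-k logic can be tried without a model or camera. The sketch below uses only the standard library; the function names are hypothetical, and the three labels and scores are made-up stand-ins, not ImageNet labels or real model output.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: list[float], labels: list[str], k: int) -> list[tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in ranked[:k]]


labels = ["orange", "coffee mug", "banana"]  # hypothetical stand-in labels
logits = [4.0, 1.0, 0.5]                     # made-up scores for illustration
for name, prob in top_k(logits, labels, k=2):
    print(f"{prob * 100:.2f}% {name}")
```

Sorting indices rather than the values themselves keeps each probability paired with its label, which is what the `enumerate` call does in the torch version.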

`mobilenet_v3_large` running in real time:

Detecting an orange:

https://user-images.githubusercontent.com/909104/153092153-d9c08dfe-105b-408a-8e1e-295da8a78c19.jpg

Detecting a mug:

https://user-images.githubusercontent.com/909104/153092155-4b90002f-a0f3-4267-8d70-e713e7b4d5a0.jpg

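Troubleshooting: Performance

By default, PyTorch uses all available cores. If anything is running in the background on the Raspberry Pi, it can contend with model inference and cause latency spikes. To alleviate this, you can reduce the number of threads, which lowers peak latency at a small performance cost.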

torch.set_num_threads(2)

For `shufflenet_v2_x1_5`, using `2 threads` instead of `4 threads` increases the best-case latency from `60ms` to `72ms` but eliminates the `128ms` latency spikes.
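To compare thread settings on your own device, it helps to look at the best, typical, and worst latency rather than a single average. The sketch below shows one way to summarize per-frame timings; the helper name is hypothetical, and the two sample lists are made-up numbers shaped like the behaviour described above, not measurements.

```python
import statistics


def summarize_latencies(samples_ms: list[float]) -> dict[str, float]:
    """Return the best, median, and worst latency from a list of samples."""
    return {
        "best": min(samples_ms),
        "median": statistics.median(samples_ms),
        "worst": max(samples_ms),
    }


# Made-up samples for illustration only; replace them with timings recorded on the Pi.
four_threads = [60, 61, 60, 128, 62, 60, 127, 61]
two_threads = [72, 73, 72, 74, 73, 72, 75, 73]

for label, samples in (("4 threads", four_threads), ("2 threads", two_threads)):
    s = summarize_latencies(samples)
    print(f"{label}: best {s['best']} ms, median {s['median']} ms, worst {s['worst']} ms")
```

A setting with a slightly higher best case but a much lower worst case is usually the better choice for a real-time loop.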

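Next Steps

You can create your own model or fine-tune an existing one. If you fine-tune one of the models from torchvision.models.quantization, most of the fusing and quantization work has already been done for you, so you can deploy directly to a Raspberry Pi with good performance.

Learn more:

Quantization, for more information on how to quantize and fuse your model.
Transfer Learning Tutorial, for how to use transfer learning to fine-tune a pre-existing model to your dataset.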