光流：使用 RAFT 模型預測運動¶

注意

嘗試在 Colab 上執行，或前往末尾下載完整的示例程式碼。

光流是指預測兩幅影像之間的運動，通常是影片的兩個連續幀。光流模型以兩幅影像作為輸入，並預測一個流：該流指示第一幅影像中每個畫素的位移，並將其對映到第二幅影像中的相應畫素。流是 (2, H, W) 維度的張量，其中第一個軸對應於預測的水平和垂直位移。

以下示例說明了如何使用 torchvision 和我們實現的 RAFT 模型來預測流。我們還將演示如何將預測的流轉換為 RGB 影像進行視覺化。

import numpy as np
import torch
import matplotlib.pyplot as plt
import torchvision.transforms.functional as F


plt.rcParams["savefig.bbox"] = "tight"


def plot(imgs, **imshow_kwargs):
    if not isinstance(imgs[0], list):
        # Make a 2d grid even if there's just 1 row
        imgs = [imgs]

    num_rows = len(imgs)
    num_cols = len(imgs[0])
    _, axs = plt.subplots(nrows=num_rows, ncols=num_cols, squeeze=False)
    for row_idx, row in enumerate(imgs):
        for col_idx, img in enumerate(row):
            ax = axs[row_idx, col_idx]
            img = F.to_pil_image(img.to("cpu"))
            ax.imshow(np.asarray(img), **imshow_kwargs)
            ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])

    plt.tight_layout()

使用 Torchvision 讀取影片¶

我們首先使用 read_video() 讀取影片。或者，也可以使用新的 VideoReader API（如果 torchvision 是從原始碼構建的）。我們將在此處使用的影片來自 pexels.com 免費提供，版權歸 Pavel Danilyuk 所有。

import tempfile
from pathlib import Path
from urllib.request import urlretrieve


video_url = "https://download.pytorch.org/tutorial/pexelscom_pavel_danilyuk_basketball_hd.mp4"
video_path = Path(tempfile.mkdtemp()) / "basketball.mp4"
_ = urlretrieve(video_url, video_path)

read_video() 返回影片幀、音訊幀以及與影片關聯的元資料。在本例中，我們只需要影片幀。

此處我們將僅在兩個預先選擇的幀對之間進行兩次預測，即幀 (100, 101) 和 (150, 151)。這些幀對中的每一對都對應一個模型輸入。

from torchvision.io import read_video
frames, _, _ = read_video(str(video_path), output_format="TCHW")

img1_batch = torch.stack([frames[100], frames[150]])
img2_batch = torch.stack([frames[101], frames[151]])

plot(img1_batch)

/pytorch/vision/torchvision/io/_video_deprecation_warning.py:9: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
/pytorch/vision/torchvision/io/video.py:199: UserWarning: The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.
  warnings.warn("The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.")

RAFT 模型接受 RGB 影像。我們首先從 read_video() 中獲取幀，並調整它們的大小以確保它們的尺寸可被 8 整除。請注意，我們明確使用了 antialias=False，因為這些模型就是這樣訓練的。然後，我們使用與權重捆綁在一起的變換來預處理輸入，並將值重新縮放到所需的 [-1, 1] 區間。

from torchvision.models.optical_flow import Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
transforms = weights.transforms()


def preprocess(img1_batch, img2_batch):
    img1_batch = F.resize(img1_batch, size=[520, 960], antialias=False)
    img2_batch = F.resize(img2_batch, size=[520, 960], antialias=False)
    return transforms(img1_batch, img2_batch)


img1_batch, img2_batch = preprocess(img1_batch, img2_batch)

print(f"shape = {img1_batch.shape}, dtype = {img1_batch.dtype}")

shape = torch.Size([2, 3, 520, 960]), dtype = torch.float32

使用 RAFT 估計光流¶

我們將使用來自 raft_large() 的 RAFT 實現，它遵循與原始論文中描述的相同的架構。我們還提供了 raft_small() 模型構建器，它更小、執行速度更快，但準確性稍有犧牲。

from torchvision.models.optical_flow import raft_large

# If you can, run this example on a GPU, it will be a lot faster.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = raft_large(weights=Raft_Large_Weights.DEFAULT, progress=False).to(device)
model = model.eval()

list_of_flows = model(img1_batch.to(device), img2_batch.to(device))
print(f"type = {type(list_of_flows)}")
print(f"length = {len(list_of_flows)} = number of iterations of the model")

Downloading: "https://download.pytorch.org/models/raft_large_C_T_SKHT_V2-ff5fadd5.pth" to /root/.cache/torch/hub/checkpoints/raft_large_C_T_SKHT_V2-ff5fadd5.pth
type = <class 'list'>
length = 12 = number of iterations of the model

RAFT 模型輸出預測流的列表，其中每個條目都是一個 (N, 2, H, W) 的預測流批次，對應於模型中的給定“迭代”。有關模型迭代特性的更多詳細資訊，請參閱原始論文。在這裡，我們只對最終預測的流（它們是最準確的）感興趣，所以我們只檢索列表中的最後一個條目。

如上所述，流是維度為 (2, H, W)（或對於流批次為 (N, 2, H, W)）的張量，其中每個條目對應於從第一幅影像到第二幅影像的每個畫素的水平和垂直位移。請注意，預測的流以“畫素”為單位，它們並未根據影像的尺寸進行歸一化。

predicted_flows = list_of_flows[-1]
print(f"dtype = {predicted_flows.dtype}")
print(f"shape = {predicted_flows.shape} = (N, 2, H, W)")
print(f"min = {predicted_flows.min()}, max = {predicted_flows.max()}")

dtype = torch.float32
shape = torch.Size([2, 2, 520, 960]) = (N, 2, H, W)
min = -3.8997151851654053, max = 6.400382995605469

視覺化預測的流¶

Torchvision 提供了 flow_to_image() 工具，用於將流轉換為 RGB 影像。它也支援流批次。流中的每個“方向”都將對映到給定的 RGB 顏色。在下面的影像中，模型假定顏色相似的畫素以相似的方向移動。模型能夠正確預測球和球員的運動。請特別注意第一個影像中球（向左移動）和第二個影像中球（向上移動）的預測方向不同。

from torchvision.utils import flow_to_image

flow_imgs = flow_to_image(predicted_flows)

# The images have been mapped into [-1, 1] but for plotting we want them in [0, 1]
img1_batch = [(img1 + 1) / 2 for img1 in img1_batch]

grid = [[img1, flow_img] for (img1, flow_img) in zip(img1_batch, flow_imgs)]
plot(grid)

附加：建立預測流的 GIF¶

在上面的示例中，我們只展示了 2 對幀的預測流。應用光流模型的一種有趣方式是讓模型處理整個影片，並從所有預測的流建立一個新影片。下面是一段可以幫助你入門的程式碼片段。我們註釋掉了程式碼，因為此示例正在沒有 GPU 的機器上渲染，執行它會花費太長時間。

# from torchvision.io import write_jpeg
# for i, (img1, img2) in enumerate(zip(frames, frames[1:])):
#     # Note: it would be faster to predict batches of flows instead of individual flows
#     img1, img2 = preprocess(img1, img2)

#     list_of_flows = model(img1.to(device), img2.to(device))
#     predicted_flow = list_of_flows[-1][0]
#     flow_img = flow_to_image(predicted_flow).to("cpu")
#     output_folder = "/tmp/"  # Update this to the folder of your choice
#     write_jpeg(flow_img, output_folder + f"predicted_flow_{i}.jpg")

一旦 .jpg 流影像被儲存，你就可以使用 ffmpeg 將它們轉換為影片或 GIF，例如：

ffmpeg -f image2 -framerate 30 -i predicted_flow_%d.jpg -loop -1 flow.gif

指令碼總執行時間： (0 分鐘 9.066 秒)

由 Sphinx-Gallery 生成的畫廊