Trying out examples/vid2vid in StreamDiffusion.

Picking up from last time, I gave vid2vid a try as well.

With a longer video, it crashed with a MemoryError.

av.error.MemoryError: [Errno 12] Cannot allocate memory

I tried again with a shorter video and got an error like this.

RuntimeError: The expanded size of the tensor (1920) must match the existing size (1080) at non-singleton dimension 1.  Target sizes: [1080, 1920, 3].  Tensor sizes: [1920, 1080, 3]

There seems to be a bug somewhere.
It looks like height and width got mixed up,
so I edited the code around line 89.

    #video_result = torch.zeros(video.shape[0], width, height, 3)
    video_result = torch.zeros(video.shape[0], height, width, 3)

With that, I at least got past the spot where the error had occurred.
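For what it's worth, the same kind of mismatch can be reproduced with tiny tensors. This is only a minimal sketch, not the example script itself: the sizes are made up, and it assumes each processed frame comes back in (height, width, channels) order, which is what the error message suggests.

```python
import torch

# Hypothetical small sizes standing in for a 1920x1080 portrait clip.
num_frames, height, width = 2, 6, 4

# A processed frame in (height, width, channels) order.
frame = torch.rand(height, width, 3)

# Buffer allocated as (frames, width, height, 3): the spatial axes are swapped.
swapped = torch.zeros(num_frames, width, height, 3)
try:
    swapped[0] = frame
    swapped_ok = True
except RuntimeError as err:
    swapped_ok = False
    print("swapped buffer:", err)

# Buffer allocated as (frames, height, width, 3): the assignment goes through.
result = torch.zeros(num_frames, height, width, 3)
result[0] = frame
print(tuple(result.shape))
```

A square video would never hit this, since width and height are interchangeable there.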

It uses quite a lot of RAM, though that probably depends on the settings.

This is what the command prompt looked like while it was running.

A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
D:\Program\StreamDiffusion\.venv\Lib\site-packages\torchvision\io\video.py:161: UserWarning: The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.
  warnings.warn("The pts_unit 'pts' gives wrong results. Please use pts_unit 'sec'.")
text_encoder\model.safetensors not found
Loading pipeline components...:  14%|███████▍                                            | 1/7 [00:01<00:08,  1.36s/it]D:\Program\StreamDiffusion\.venv\Lib\site-packages\transformers\models\clip\feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:03<00:00,  1.99it/s]
D:\Program\StreamDiffusion\.venv\Lib\site-packages\diffusers\loaders\lora.py:952: FutureWarning: `fuse_text_encoder_lora` is deprecated and will be removed in version 0.25. You are using an old version of LoRA backend. This will be deprecated in the next releases in favor of PEFT make sure to install the latest PEFT and transformers packages in the future.
  deprecate("fuse_text_encoder_lora", "0.25", LORA_DEPRECATION_MESSAGE)
 73%|██████████████████████████████████████████████████████████▍                     | 272/372 [04:42<01:43,  1.03s/it]

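The video is about 6 seconds long at 60 FPS, 372 frames in total. It looks like it writes things out frame by frame.
Roughly one frame per second, so a 6-second video took about 6 minutes.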

For a 37.8 MB video file, it used around 25.8 GB of RAM.
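A rough estimate makes that number less surprising. The sketch below assumes the whole clip is decoded into memory as 8-bit values, converted to a float32 copy for the model, and paired with the float32 result buffer from torch.zeros. The first two are my assumptions about the script. Only the result buffer is visible in the code above.

```python
# Back-of-the-envelope RAM estimate for holding a whole clip in memory.
frames, height, width, channels = 372, 1920, 1080, 3


def tensor_bytes(bytes_per_value: int) -> int:
    """Size of one (frames, height, width, channels) tensor in bytes."""
    return frames * height * width * channels * bytes_per_value


decoded_uint8 = tensor_bytes(1)   # assumed: decoded frames as 8-bit values
input_float32 = tensor_bytes(4)   # assumed: a float32 copy fed to the model
result_float32 = tensor_bytes(4)  # torch.zeros allocates float32 by default

total = decoded_uint8 + input_float32 + result_float32
print(f"{total / 1e9:.1f} GB")
```

Under those assumptions the tensors alone come to about 20.8 GB, so 25.8 GB including the model and everything else is in the same ballpark. The compressed file size says very little about how much memory the decoded frames need.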

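Then I looked at the output video, and it was rotated 90 degrees...
Apparently the part around line 89 that I edited was fine as it was, and the mistake was somewhere else.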

Time to look at the failing code again.

    for i in tqdm(range(video.shape[0])):
        output_image = stream(video[i].permute(2, 0, 1))
        video_result[i] = output_image.permute(1, 2, 0)

The permute calls are probably the culprit.
So I changed the code as follows.

    for i in tqdm(range(video.shape[0])):
        output_image = stream(video[i].permute(2, 0, 1))
#        video_result[i] = output_image.permute(1, 2, 0)
        video_result[i] = output_image.permute(2, 1, 0)

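I wanted to test quickly, so I trimmed the video further, down to under 1 second.
I tried again, but... the video dimensions are right, yet the picture inside is still lying on its side...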

So, one more attempt.

    for _ in range(stream.batch_size):
#        stream(image=video[0].permute(2, 0, 1))
        stream(image=video[0].permute(0, 2, 1))

    for i in tqdm(range(video.shape[0])):
#        output_image = stream(video[i].permute(2, 0, 1))
        output_image = stream(video[i].permute(0, 2, 1))
        video_result[i] = output_image.permute(1, 2, 0)

But this also failed.

RuntimeError: Given groups=1, weight of size [64, 3, 3, 3], expected input[1, 1080, 1920, 1080] to have 3 channels, but got 1080 channels instead
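To see why this one fails, it helps to print what each permute does to a frame. This is a minimal sketch with made-up sizes, assuming the frame starts out as (height, width, channels). The Conv2d layer is just a stand-in with the same weight shape as in the error message, not the actual model.

```python
import torch

# Hypothetical tiny frame in (height, width, channels) order.
height, width = 6, 4
frame = torch.rand(height, width, 3)

chw = frame.permute(2, 0, 1)  # (3, 6, 4): channels first, rows and columns kept
hcw = frame.permute(0, 2, 1)  # (6, 3, 4): channels are not first
cwh = frame.permute(2, 1, 0)  # (3, 4, 6): channels first, image transposed

print(tuple(chw.shape), tuple(hcw.shape), tuple(cwh.shape))

# Stand-in layer whose weight has shape [64, 3, 3, 3].
conv = torch.nn.Conv2d(3, 64, kernel_size=3)
conv(chw.unsqueeze(0))  # accepted: 3 input channels

try:
    conv(hcw.unsqueeze(0))
    hcw_ok = True
except RuntimeError as err:
    hcw_ok = False
    print(err)
```

With permute(0, 2, 1) the first axis stays where it is, so the layer treats a spatial dimension as the channel count, which is the complaint in the error.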

After some trial and error, the version below ran without errors and produced output with the correct orientation.

    for _ in range(stream.batch_size):
#        stream(image=video[0].permute(2, 0, 1))
        stream(image=video[0].permute(2, 1, 0))

    for i in tqdm(range(video.shape[0])):
#        output_image = stream(video[i].permute(2, 0, 1))
        output_image = stream(video[i].permute(2, 1, 0))
#        video_result[i] = output_image.permute(1, 2, 0)
        video_result[i] = output_image.permute(2, 1, 0)
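Shape-wise, this works because permute(2, 1, 0) undoes itself. It also means that whatever sits in between is handed each frame with its rows and columns swapped. Here is a small sketch with made-up sizes. The model call is replaced by a pass-through, which is purely an assumption for illustration.

```python
import torch

# Hypothetical tiny frame in (height, width, channels) order.
height, width = 6, 4
frame = torch.rand(height, width, 3)

# What the model receives: (channels, width, height).
model_input = frame.permute(2, 1, 0)

# Pass-through stand-in for the model, then the same permute on the way out.
restored = model_input.permute(2, 1, 0)
round_trip_ok = torch.equal(restored, frame)

# The model input equals the usual channels-first frame with its
# two spatial axes swapped, i.e. a transposed image.
is_transposed = torch.equal(
    model_input, frame.permute(2, 0, 1).transpose(1, 2)
)

print(tuple(model_input.shape), round_trip_ok, is_transposed)
```

Whether processing a transposed image has anything to do with the output quality is something I can't tell from this alone.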

But the output quality is so bad that it's unwatchable...
Hmm. There is a lot of block noise, so it could also be an encoding problem.

I tried again with a longer video, but the quality was the same.
Either my code changes are wrong, or the script may simply assume 512×512 input.

I was going to make a video cropped to 512×512 in a video editor, but I'd have to start by installing some free software, so maybe next time.
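As an alternative to a video editor, the crop could probably be done on the decoded tensor directly. Here is a minimal sketch, assuming the video is a (frames, height, width, channels) tensor. center_crop is a hypothetical helper of my own, not something from StreamDiffusion.

```python
import torch


def center_crop(video: torch.Tensor, size: int = 512) -> torch.Tensor:
    """Crop a (frames, height, width, channels) tensor to size x size around the centre."""
    _, height, width, _ = video.shape
    if height < size or width < size:
        raise ValueError("frame is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return video[:, top:top + size, left:left + size, :]


# Dummy two-frame portrait clip standing in for the real video.
clip = torch.zeros(2, 1920, 1080, 3, dtype=torch.uint8)
cropped = center_crop(clip)
print(tuple(cropped.shape))
```

This only keeps the middle of each frame. Scaling the clip down first would preserve more of the picture, and I haven't checked whether 512×512 input actually fixes the quality.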