Given an input video of a person and a new garment, SwiftTry can synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. We reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets.
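The abstract mentions augmenting an image diffusion backbone with temporal attention layers. Below is a minimal PyTorch sketch of such a layer, assuming video features of shape (B, C, F, H, W) and attention taken across the frame axis at each spatial location; the module name, head count, and residual placement are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a temporal attention layer: features (B, C, F, H, W) are
# rearranged so self-attention runs over the frame axis F at each (h, w).
# Names and hyperparameters are illustrative, not the paper's code.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, H, W) -> attend over F independently at each spatial location.
        b, c, f, h, w = x.shape
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        tokens = tokens + out  # residual connection keeps the image prior intact
        return tokens.reshape(b, h, w, f, c).permute(0, 4, 3, 1, 2)


# Usage: y = TemporalAttention(320)(torch.randn(1, 320, 8, 32, 32))
```

Such a layer would typically be interleaved with the existing spatial attention blocks so the pretrained image weights remain usable.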
To achieve temporal coherence and smoothness without recomputing overlapping regions, we propose a shifting mechanism at inference time.
(a) The long video is divided into non-overlapping chunks (\(S=0\), \(N=8\)). At each DDIM sampling timestep \(t\), we shift these chunks by a predefined offset \(\Delta=4\) between two consecutive sampling steps, so the model processes a different composition of noisy chunks at each step. To further accelerate inference, we can skip a random chunk to reduce redundant computation during denoising. However, naively dropping chunks without adjustment can lead to abrupt changes in noise levels in the final results.
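The shifting and chunk-skipping schedule above can be summarized in pseudocode. The following Python sketch assumes a per-chunk denoiser `denoise_chunk(latents, t)` and a simple cache that reuses the previous prediction for a skipped chunk; the chunk size \(N=8\), shift \(\Delta=4\), and skip probability are illustrative parameters, and the DDIM update itself is elided.

```python
# Sketch of the shifting/skipping schedule, assuming a denoiser
# denoise_chunk(latents, t) that processes one chunk of frames.
# Not the paper's reference implementation.
import random
import torch


def shifted_chunks(num_frames: int, chunk_size: int, shift: int):
    """Yield (start, end) frame indices for chunks whose boundaries are offset by `shift`."""
    start = 0
    first = chunk_size - shift if shift > 0 else chunk_size
    end = min(first, num_frames)
    while start < num_frames:
        yield start, end
        start, end = end, min(end + chunk_size, num_frames)


def ddim_loop(latents, timesteps, denoise_chunk, chunk_size=8, shift_step=4, skip_prob=0.25):
    num_frames = latents.shape[0]          # latents: (F, C, H, W)
    cache = torch.zeros_like(latents)      # last prediction per frame, reused when a chunk is skipped
    for i, t in enumerate(timesteps):
        shift = (i * shift_step) % chunk_size   # chunk boundaries move by Delta between steps
        chunks = list(shifted_chunks(num_frames, chunk_size, shift))
        skipped = -1
        if i > 0 and random.random() < skip_prob:   # never skip on the first step (cache is empty)
            skipped = random.randrange(len(chunks))
        for j, (s, e) in enumerate(chunks):
            if j == skipped:
                pred = cache[s:e]          # reuse the cached prediction instead of recomputing
            else:
                pred = denoise_chunk(latents[s:e], t)
                cache[s:e] = pred
            latents[s:e] = pred            # stand-in for the actual DDIM update rule
    return latents
```

Because the chunk boundaries move between sampling steps, frames near a boundary at one step fall inside a chunk at the next, which smooths seams without the cost of overlapping windows.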
@article{hung2025SwiftTry,
  author  = {Hung Nguyen* and Quang Qui-Vinh Nguyen* and Khoi Nguyen and Rang Nguyen},
  title   = {SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models},
  journal = {AAAI},
  year    = {2025},
}