Text-To-4D Dynamic Scene Generation
Abstract
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional
dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF),
which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V)
diffusion-based model. The dynamic video output generated from the provided text can be viewed from any
camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D
or 4D data, and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the
effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an
improvement over previously established internal baselines. To the best of our knowledge, our method is
the first to generate 3D dynamic scenes given a text description.
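
To make the optimization concrete, here is a minimal sketch of a score-distillation-style training loop, in which a differentiable render of the dynamic NeRF is pushed toward the denoising prediction of a frozen T2V diffusion model. This is an illustrative assumption about the setup described in the abstract, not the released implementation; DynamicNeRF, T2VDiffusion, sample_random_camera, and all method names on them are hypothetical placeholders.

import torch

# Hypothetical stand-ins for the paper's components: a trainable 4D
# scene representation (appearance, density, motion) and a frozen
# Text-to-Video diffusion model used as the critic.
nerf = DynamicNeRF()                      # hypothetical trainable 4D NeRF
t2v = T2VDiffusion.load_pretrained()      # hypothetical frozen T2V model
opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)

prompt_emb = t2v.encode_text("a squirrel playing the saxophone")

for step in range(10_000):
    cam = sample_random_camera()          # random viewpoint each step
    times = torch.linspace(0.0, 1.0, 16)  # 16 frames spanning the clip
    video = nerf.render(cam, times)       # differentiable (T, H, W, 3) render

    # Score-distillation update: noise the rendered video, let the frozen
    # T2V model predict the noise conditioned on the prompt, and use the
    # residual as a gradient on the render.
    t = torch.randint(0, t2v.num_timesteps, (1,))
    noise = torch.randn_like(video)
    noisy = t2v.add_noise(video, noise, t)
    with torch.no_grad():
        pred_noise = t2v.predict_noise(noisy, t, prompt_emb)
    grad = pred_noise - noise             # gradient w.r.t. the rendered video

    opt.zero_grad()
    video.backward(gradient=grad)         # backpropagate through the renderer
    opt.step()

Sampling a fresh camera and time range at every step is what enforces that the learned scene looks plausible from any viewpoint, rather than only from the poses seen during training.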
Text-to-4D
A squirrel playing the saxophone.
A humanoid robot playing the violin.
A kangaroo cooking a meal.
A baby panda eating ice cream.
A yorkie dog eating a donut.
Chihuahua running on the grass.
Shark swimming in the desert.
A monkey eating a candy bar.
Image-to-4D
Input Image → Generated Video (four example pairs; each input image is shown alongside the video generated from it).
Citation
@article{singer2023text4d,
author = {Singer, Uriel and Sheynin, Shelly and Polyak, Adam and Ashual, Oron and
Makarov, Iurii and Kokkinos, Filippos and Goyal, Naman and Vedaldi, Andrea and
Parikh, Devi and Johnson, Justin and Taigman, Yaniv},
title = {Text-To-4D Dynamic Scene Generation},
journal = {arXiv:2301.11280},
year = {2023},
}