Model Description
ModelScope Text-to-Video Synthesis is a diffusion model that generates videos from English text prompts. It consists of three sub-networks: a text feature extraction module, a text-feature-to-video latent space diffusion model, and a video latent space to video visual space module. With approximately 1.7 billion parameters, the model adopts a UNet3D structure and generates video through an iterative denoising process that starts from pure Gaussian noise.
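To make the three-stage structure concrete, the following is a purely illustrative sketch of the inference flow (text encoding, iterative latent denoising, decoding). Every module below is a hypothetical stand-in rather than the actual ModelScope implementation, and the tensor shapes are invented for illustration.

# Conceptual sketch only: all modules and shapes are hypothetical stand-ins,
# not the real ModelScope text-to-video components.
import torch

# 1) Text feature extraction (stand-in for the real text encoder)
def encode_text(prompt: str) -> torch.Tensor:
    # Hypothetical: map the prompt to a fixed-size conditioning vector.
    torch.manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(1, 768)

# 2) Text-conditioned denoiser over the video latent space (stand-in for UNet3D)
def denoise_step(latent: torch.Tensor, text_emb: torch.Tensor, t: int) -> torch.Tensor:
    # Hypothetical update: nudge the latent toward a text-dependent direction.
    guidance = text_emb.mean() * 0.01
    return latent - 0.05 * latent + guidance

# 3) Video latent space -> video visual space (stand-in for the decoder)
def decode_latent(latent: torch.Tensor) -> torch.Tensor:
    # Hypothetical decoder: here we simply clamp latents into pixel range.
    return latent.clamp(-1, 1)

def generate_video(prompt: str, num_frames: int = 16, steps: int = 50) -> torch.Tensor:
    text_emb = encode_text(prompt)
    # Start from pure Gaussian noise in the video latent space.
    latent = torch.randn(1, num_frames, 4, 32, 32)
    for t in reversed(range(steps)):
        latent = denoise_step(latent, text_emb, t)
    return decode_latent(latent)  # (batch, frames, channels, height, width)

video = generate_video("A panda eating bamboo on a rock.")
print(video.shape)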
Model Limitations and Biases
While ModelScope Text-to-Video Synthesis is a powerful tool, it is important to understand its limitations and biases. The model is trained on public datasets such as WebVid, so its outputs may reflect deviations tied to the distribution of the training data. It cannot achieve film- and television-quality generation, and it cannot render clear, legible text. In addition, the model is trained primarily on an English corpus and does not currently support other languages, and its performance on complex compositional generation tasks still needs improvement.
Usage
How to Use
Experience ModelScope Text-to-Video Synthesis directly on ModelScope Studio and Hugging Face, or refer to the Colab page to build it yourself. Users can also follow the Aliyun Notebook Tutorial to get started with the Text-to-Video model quickly. The demo requires about 16 GB of CPU RAM and 16 GB of GPU RAM. Under the ModelScope framework, the model is used by calling a simple pipeline, where the input must be a dictionary whose only valid key is 'text' and whose value is a short text prompt. The model currently supports inference only on GPU.
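Based on the pipeline call described above, a minimal sketch might look like the following. The task name 'text-to-video-synthesis' and model ID 'damo/text-to-video-synthesis' are assumptions drawn from ModelScope's published examples and may differ in your environment.

from modelscope.pipelines import pipeline
from modelscope.outputs import OutputKeys

# Task name and model ID follow ModelScope's published text-to-video example;
# treat them as assumptions and adjust to your environment if needed.
pipe = pipeline('text-to-video-synthesis', 'damo/text-to-video-synthesis')

# The input must be a dictionary whose only valid key is 'text'.
test_text = {'text': 'A panda eating bamboo on a rock.'}

# Run GPU inference; the output contains the path to the generated video file.
output_video_path = pipe(test_text)[OutputKeys.OUTPUT_VIDEO]
print('output_video_path:', output_video_path)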
Operating Environment
The operating environment requires Python packages such as modelscope, open_clip_torch, and pytorch-lightning.
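As a quick sanity check of the environment, the packages can be imported as follows. Note that open_clip_torch is imported under the module name open_clip; exact package versions are not pinned here.

# Verify that the required packages are importable.
# They are typically installed via pip as modelscope, open_clip_torch, and pytorch-lightning.
import modelscope          # ModelScope framework (pipelines, model hub access)
import open_clip           # provided by the open_clip_torch package
import pytorch_lightning   # listed as a required package for the operating environment

print('modelscope', modelscope.__version__)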
Training Data
The training data includes LAION5B, ImageNet, WebVid, and other public datasets. Image and video filtering, based on criteria such as aesthetic score, watermark score, and deduplication, is performed after pre-training.
Citation
When using the model, cite it according to the citation information provided on the model page.
Spaces
The model is used in various Hugging Face Spaces, including saifytechnologies/ai-text-to-video-generation-saify-technologies, ali-vilab/modelscope-text-to-video-synthesis, NeuralInternet/Text-to-video_Playground, Abidlabs/cinemascope, Libra7578/Image-to-video, Heathlia/modelscope-text-to-video-synthesis, masbejo99/modelscope-text-to-video-synthesis, zekewilliams/video, adamirus/VideoGEN, monkeybird420/modelscope-text-to-video-synthesis, and raoyang111/modelscope-text-to-video-synthesis.