Text-Guided Synthesis of Crowd Animation

Xuebo Ji1, Zherong Pan2, Xifeng Gao2, Jia Pan1
1The University of Hong Kong, Centre for Transformative Garment Production (TransGP), 2LightSpeed Studios
SIGGRAPH 2024 (Conference Track)

Given an environment map, we illustrate two crowd animation scenarios generated from text prompts. Each prompt is first canonicalized by an LLM; the canonical sentences then guide a diffusion model to generate several crowd distribution fields and velocity fields. These velocity fields steer distinct groups of agents, forming the crowd animation.

Abstract

Creating vivid crowd animations is core to immersive virtual environments in digital games. This work tackles the challenge of crowd behavior generation. Existing approaches are labor-intensive, relying on practitioners to manually craft complex behavior systems. We propose a machine learning approach that synthesizes diversified dynamic crowd animation scenarios for a given environment from a text description. We first train two conditional diffusion models that generate text-guided agent distribution fields and velocity fields. Assisted by local navigation algorithms, these fields are then used to control multiple groups of agents. We further employ a Large Language Model (LLM) to canonicalize the general script into structured sentences for more stable training and better scalability. To train our diffusion models, we devise a constructive method that generates random environments and crowd animations. We show that our trained diffusion models can generate crowd animations for both unseen environments and novel scenario descriptions. Our method paves the way toward automatic generation of crowd behaviors for virtual environments.

Pipeline

We introduce the first pipeline targeting language-guided generation of environment-compatible scenarios in which a large number of agents navigate in real time.

The overall pipeline:

We assume agents are divided into multiple groups, each controlled by a common velocity field. Our method takes as input a map of the environment and a general script describing the behavior of these groups.

  • We use a Large Language Model (LLM) to canonicalize the script into a set of structured sentences, one for each agent group (a prompt sketch follows this list).
  • Each structured sentence is then input to Latent Diffusion Models (LDMs) to predict the start/goal agent distributions and the guiding velocity field for its agent group.
  • Agents are sampled according to the distribution map, and the velocity field guides the RVO agent simulator (see the simulation sketch below).
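To make the canonicalization step concrete, here is a minimal sketch assuming an OpenAI-style chat API; the sentence template, prompt wording, and the helper name `canonicalize_script` are illustrative assumptions, not the exact prompt or schema used in the paper.

```python
# Minimal canonicalization sketch (prompt and template are assumptions).
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Rewrite the user's crowd-scenario script as one structured sentence "
    "per agent group, each of the form:\n"
    "  'Group <id>: <count> agents move from <source region> to "
    "<target region>.'\n"
    "Output one sentence per line and nothing else."
)

def canonicalize_script(script: str) -> list[str]:
    """Turn a free-form scenario script into per-group canonical sentences."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": script},
        ],
        temperature=0.0,  # keep the canonical form deterministic
    )
    return response.choices[0].message.content.strip().splitlines()

# Each returned sentence conditions the diffusion models for one group.
sentences = canonicalize_script(
    "A crowd leaves the subway exit while two tour groups cross the plaza."
)
```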
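The following sketch illustrates the last step, assuming the `rvo2` Python bindings of the RVO2 library; the grid resolution, simulator parameters, and nearest-cell velocity lookup are our assumptions, and obstacle handling is omitted for brevity.

```python
# Field-guided simulation sketch: sample agents from the distribution
# field, then use the velocity field as each agent's preferred velocity.
import numpy as np
import rvo2

def sample_agents(dist_field: np.ndarray, n_agents: int) -> np.ndarray:
    """Sample agent start positions from a 2D distribution field."""
    probs = dist_field.ravel() / dist_field.sum()
    cells = np.random.choice(dist_field.size, size=n_agents, p=probs)
    ys, xs = np.unravel_index(cells, dist_field.shape)
    return np.stack([xs, ys], axis=1).astype(float)  # grid coordinates

def simulate(dist_field, vel_field, n_agents=100, n_steps=500, max_speed=1.5):
    """Step RVO agents whose preferred velocity is read from vel_field
    (an (H, W, 2) array). Parameter values here are illustrative."""
    sim = rvo2.PyRVOSimulator(0.1, 5.0, 10, 2.0, 2.0, 0.4, max_speed)
    starts = sample_agents(dist_field, n_agents)
    ids = [sim.addAgent((float(p[0]), float(p[1]))) for p in starts]
    for _ in range(n_steps):
        for i in ids:
            x, y = sim.getAgentPosition(i)
            # Nearest-cell lookup; bilinear interpolation would be smoother.
            gx = int(np.clip(round(x), 0, vel_field.shape[1] - 1))
            gy = int(np.clip(round(y), 0, vel_field.shape[0] - 1))
            vx, vy = vel_field[gy, gx]
            sim.setAgentPrefVelocity(i, (float(vx), float(vy)))
        sim.doStep()  # RVO resolves local collisions between agents
    return [sim.getAgentPosition(i) for i in ids]
```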

Dataset Construction

As no existing dataset provides the complete set of environment map, group-wise descriptions, and behaviors, we propose a constructive method for data generation.

We randomly sample environments populated with multiple entities.
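As a minimal illustration of this sampling step, the sketch below rasterizes random rectangular obstacles into an occupancy grid; the shape family, counts, and sizes are our assumptions, and the actual generator places richer entity types than this.

```python
# Illustrative environment sampler (all parameters are assumptions).
import numpy as np

def sample_environment(size=128, n_obstacles=8, rng=None):
    """Return a boolean occupancy grid with random rectangular obstacles."""
    rng = rng or np.random.default_rng()
    occ = np.zeros((size, size), dtype=bool)  # True = blocked
    for _ in range(n_obstacles):
        w, h = rng.integers(5, size // 4, size=2)
        x, y = rng.integers(0, size - w), rng.integers(0, size - h)
        occ[y:y + h, x:x + w] = True
    return occ
```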

With agent group paths sampled in the environment, we generate the ground-truth canonical sentences.
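A minimal sketch of this step, assuming a hypothetical template-fill helper; the region names and template wording are illustrative, not the paper's exact schema.

```python
# Sketch of ground-truth sentence generation from sampled path metadata.
def canonical_sentence(group_id: int, n_agents: int,
                       source: str, target: str) -> str:
    """Fill the canonical template used as training text."""
    return (f"Group {group_id}: {n_agents} agents move "
            f"from {source} to {target}.")

# canonical_sentence(1, 40, "the subway exit", "the plaza")
# -> "Group 1: 40 agents move from the subway exit to the plaza."
```

Using the same template for the training sentences and for the LLM's canonicalized output keeps the diffusion models' text conditioning in-distribution at inference time.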

Guided by the sampled path, we construct the velocity field and propose a simulation-assisted velocity adjustment procedure.

(Figure: initial velocity field, left, vs. adjusted velocity field, right.)
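A hedged sketch of what such an adjustment loop could look like: simulate agents under the current field, then blend each visited cell's guiding velocity toward the velocity agents actually realized there. The blending rule, iteration count, and the `simulate_fn` callback are our assumptions; see the paper for the exact procedure.

```python
# Illustrative simulation-assisted adjustment loop (not the exact method).
import numpy as np

def adjust_velocity_field(vel_field, simulate_fn, n_iters=5, alpha=0.5):
    """vel_field: (H, W, 2) array. simulate_fn runs the crowd under the
    field and yields (position, realized_velocity) samples per step."""
    for _ in range(n_iters):
        for (x, y), v in simulate_fn(vel_field):
            gy = int(np.clip(round(y), 0, vel_field.shape[0] - 1))
            gx = int(np.clip(round(x), 0, vel_field.shape[1] - 1))
            # Pull the guiding velocity toward what agents can realize,
            # so the field stays consistent with collision-free motion.
            vel_field[gy, gx] = ((1 - alpha) * vel_field[gy, gx]
                                 + alpha * np.asarray(v))
    return vel_field
```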

Qualitative Results

We show the diverse dynamic crowd animation scenarios generated by our method. Please refer to the main paper and supplementary material for the detailed text descriptions and velocity fields.

  • Bridge
  • Subway
  • Circle
  • Garden
  • Crosswalk
  • Larger crosswalk
  • Evacuation
  • Larger evacuation