MMM: Generative Masked Motion Model (CVPR 2024, Highlight)

Ekkasit Pinyoanuntapong°, Pu Wang°, Minwoo Lee°, Chen Chen†

°University of North Carolina at Charlotte, †University of Central Florida

(a) MMM generates motion from text in only 0.081 seconds per sentence and achieves the lowest FID score (0.089) on HumanML3D, outperforming current state-of-the-art methods (the closer to the origin, the better), while remaining editable across multiple tasks, e.g., (b) upper body editing, (c) motion in-betweening, and (d) long sequence generation. Red indicates generated parts and blue indicates conditioned parts. Green indicates the transitions between short motion sequences generated from different text prompts in long sequence generation.

Abstract

Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models.
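The masking step in component (2) can be illustrated with a minimal sketch. It only shows how a token sequence might be corrupted before the transformer is asked to recover it; the mask id, the mask ratio, and the toy token values are assumptions made for illustration and are not taken from the paper.

```python
import random

MASK_ID = -1  # hypothetical mask token id, outside the codebook index range


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of motion tokens with MASK_ID.

    Returns the corrupted sequence and the sorted masked positions; a masked
    model would be trained to predict the original tokens at those positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, positions


rng = random.Random(0)
tokens = [12, 7, 33, 7, 90, 4, 18, 55]  # toy token ids, not real codebook output
corrupted, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
targets = [tokens[p] for p in positions]  # what the model would have to predict
```

Because every position can attend to every other position and to the text, the prediction targets at masked positions can draw on context from both sides, which is what the editing tasks rely on.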

Fast, High Quality, and Editable

Comparison of inference speed and generation quality on text-to-motion, along with the editing capability of each model. "✓" means the model is editable, "✗" means it is not, and "-" means the model has the capability but no application is provided. We calculate the Average Inference Time per Sentence (AITS) on the test set of HumanML3D, excluding model and data loading. All tests are performed on a single NVIDIA RTX A5000.
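As a rough sketch of how an average per-sentence time like AITS could be measured, the snippet below times only the generation calls, so model and data loading stay outside the measurement. The function name and the `generate` callable are hypothetical stand-ins, not the authors' benchmarking code.

```python
import time


def average_inference_time(generate, prompts):
    """Mean wall-clock seconds per prompt, timing only the generate() calls."""
    total = 0.0
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)  # model and data are assumed to be loaded already
        total += time.perf_counter() - start
    return total / len(prompts)


# Toy usage with a stand-in generator that does no real work.
seconds = average_inference_time(lambda text: len(text), ["a person walks forward"])
```

On a GPU, kernels run asynchronously, so a real measurement would also need to synchronize the device before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch).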

Overall architecture of MMM

(a) Motion Tokenizer transforms the raw motion sequence into discrete motion tokens according to a learned codebook. (b) Conditional Masked Transformer learns to predict masked motion tokens, conditioned on word and sentence tokens obtained from CLIP text encoders. (c) Motion Generation starts from an empty canvas and the masked transformer concurrently and progressively predicts multiple high-confidence motion tokens.
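Steps (a) and (c) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: `quantize` does a nearest-neighbor codebook lookup, `iterative_decode` starts from a fully masked canvas and commits the most confident predictions at each pass, `predict` is a hypothetical stand-in for the masked transformer, and the linear commit schedule is chosen for simplicity and may differ from the one used in the paper.

```python
MASK_ID = -1  # hypothetical mask token id, outside the codebook index range


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def iterative_decode(predict, length, steps):
    """Fill `length` masked slots over at most `steps` passes.

    `predict(tokens)` returns one (token_id, confidence) pair per position.
    """
    tokens = [MASK_ID] * length  # the "empty canvas"
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        if not masked:
            break
        guesses = predict(tokens)
        # Linear schedule (an assumption): commit an equal share per pass,
        # and everything that is still masked on the final pass.
        n_keep = len(masked) if step == steps - 1 else max(1, length // steps)
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_keep]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


def toy_predict(tokens):
    # Stand-in model: position i proposes token i * 10, confidence grows with i.
    return [(i * 10, (i + 1) / (len(tokens) + 1)) for i in range(len(tokens))]


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
token_ids = quantize([[0.1, -0.1], [1.9, 0.2], [0.9, 1.2]], codebook)
decoded = iterative_decode(toy_predict, length=6, steps=3)
```

Several tokens are committed per pass, so the number of transformer calls is set by `steps` rather than by the sequence length, which is where the speed advantage over token-by-token decoding comes from.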

Motion Editing

Motion Editing. (Left) Motion in-betweening. (Middle) Long sequence generation. (Right) Upper body editing. “M” denotes a mask token, “T” denotes text-conditioned tokens, and “L” denotes lower-body conditioned tokens.
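The temporal editing tasks differ only in where the mask tokens are placed. The sketch below builds the masked input for in-betweening, completion, and outpainting from a toy token sequence; the function name, the mask id, and the fraction-based interface are assumptions for illustration.

```python
MASK_ID = -1  # hypothetical mask token id


def mask_span(tokens, start_frac, end_frac, invert=False):
    """Mask the tokens inside [start_frac, end_frac) of the sequence.

    With invert=True the span is kept and everything outside it is masked.
    """
    n = len(tokens)
    lo, hi = round(start_frac * n), round(end_frac * n)
    return [
        MASK_ID if (lo <= i < hi) != invert else t
        for i, t in enumerate(tokens)
    ]


tokens = list(range(100, 108))  # toy token ids for a reference motion

in_between = mask_span(tokens, 0.25, 0.75)            # regenerate the middle 50%
complete_start = mask_span(tokens, 0.0, 0.5)          # regenerate the first 50%
outpaint = mask_span(tokens, 0.25, 0.75, invert=True) # regenerate both ends
```

The masked positions would then be filled by the masked transformer, conditioned on the new text prompt and on the tokens that were kept.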

Compared to SOTA

Text to Motion 1:

"a person walks forward then turns completely around and does a cartwheel"

MMM (ours)

(Realistic Motion)

Ground Truth

T2M-GPT

(Turns around too much)

MLD

(Unrealistic motion; lacks a complete cartwheel)

MDM

(Does not execute cartwheel motion)

Text to Motion 2:

"a man walks forward, stumbles to the right, and then regains his balance and keeps walking forwards."

MMM (ours)

(Stronger correlation to the text than the ground truth is observed, as the model has seen many similar text-motion relationships during training.)

Ground Truth

T2M-GPT

(Stumbles to the left instead of right)

MLD

(No "stumble" motion)

MDM

(No "stumble" motion)

Motion Temporal Inpainting (Motion In-betweening):

Generating the middle 50% of the motion based on the text “A person jumps forward” conditioned on the first 25% and last 25% of the motion of “a person walks forward, then is pushed to their left and then returns to walking in the line they were.”

MMM (ours)

(Realistic Motion)

MDM

(Does not jump)

Generating the middle 50% of the motion based on the text “a man throws a ball” conditioned on the first 25% and last 25% of the motion of “a person walks backward, turns around and walks backward the other way.”

MMM (ours)

(Realistic Motion)

MDM

(Transition is not continuous)

Upper body editing:

Generating the upper body based on the text “a man throws a ball” conditioned on the lower body of “a man rises from the ground, walks in a circle and sits back down on the ground.”

MMM (ours)

(Realistic Motion)

MDM

(Unrealistic body in the last frame)

More Results

Text to Motion:

MMM (ours)

a person bouncing around while throwing jabs and upper cuts.

MMM (ours)

a person start to dance with legs

MMM (ours)

a person steps forward and leans over; they grab a cup with their left hand and empty it before putting it down and stepping back to their original position.

MMM (ours)

walking forward and kicking foot.

Long Sequence Generation:

Generating a long motion sequence by combining multiple motions as follows: 'a person walks forward then turn left.', 'a person crawling from left to right', 'a person dribbles a basketball then shoots it.', 'the person is walking in a counter counterclockwise circle.', 'a person is sitting in a chair, wobbles side to side, stands up, and then start walking.' Red frames indicate the generated short motion sequences. Blue frames indicate transition frames.

MMM (ours)
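One way to picture the layout of a long sequence is as short token sequences joined by masked transition slots that the model fills in afterwards. The sketch below only shows that layout; the function name, the mask id, and the number of transition tokens are assumptions, and how the transitions are conditioned is not covered here.

```python
MASK_ID = -1  # hypothetical mask token id


def stitch_with_transitions(segments, n_transition):
    """Concatenate token segments, inserting masked transition slots between them."""
    canvas = []
    for k, segment in enumerate(segments):
        if k > 0:
            canvas.extend([MASK_ID] * n_transition)  # slots to be generated
        canvas.extend(segment)
    return canvas


# Toy token ids standing in for three short motions generated from three prompts.
canvas = stitch_with_transitions([[1, 2], [3], [4, 5]], n_transition=2)
```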

Motion Completion:

MMM (ours)

Completing the first 50% of the motion based on the text “a person performs jumping jacks.” conditioned on the last 50% of the motion of “a person crawling from left to right”

MMM (ours)

Completing the last 50% of the motion based on the text “a person performs jumping jacks.” conditioned on the first 50% of the motion of “a person crawling from left to right”

Motion Temporal Outpainting:

MMM (ours)

Generating the first 25% and last 25% of the motion based on the text “A person sits down” conditioned on the middle 50% of the motion of “a person is running in place at a medium pace.”

MMM (ours)

Generating the first 25% and last 25% of the motion based on the text “person walks backward.” conditioned on the middle 50% of the motion of “a person walks forward in a straight line.”