ControlMM: Controllable Masked Motion Generation

Ekkasit Pinyoanuntapong¹, Muhammad Usama Saleem¹, Korrawe Karunratanakul², Pu Wang¹, Hongfei Xue¹, Chen Chen³, Chuan Guo⁴, Junli Cao⁴, Jian Ren⁴, Sergey Tulyakov⁴
¹University of North Carolina at Charlotte, ²ETH Zurich, ³University of Central Florida, ⁴Snap Inc.

Abstract

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, despite achieving acceptable control precision, these models are limited in generation speed and fidelity. To address these challenges, we propose ControlMM, a novel approach that incorporates spatial control signals into a generative masked motion model. ControlMM simultaneously achieves real-time, high-fidelity, and high-precision controllable motion generation. Our approach introduces two key innovations. First, we propose masked consistency modeling, which ensures high-fidelity motion generation via random masking and reconstruction while minimizing the inconsistency between the input control signals and the control signals extracted from the generated motion. Second, to further enhance control precision, we introduce inference-time logit editing, which manipulates the predicted conditional motion distribution so that the generated motion, sampled from the adjusted distribution, closely adheres to the input control signals. During inference, ControlMM enables parallel and iterative decoding of multiple motion tokens, allowing for high-speed motion generation. Extensive experiments show that, compared to the state of the art, ControlMM delivers superior motion quality, with a better FID score (0.061 vs. 0.271), and higher control precision (average error 0.0091 vs. 0.0108). ControlMM also generates motions 20 times faster than diffusion-based methods. Additionally, ControlMM unlocks diverse applications such as any-joint, any-frame control, body part timeline control, and obstacle avoidance.
Comparison of FID score, spatial control error, and motion generation speed (circle size) between our accurate and fast models and state-of-the-art models. The closer a point is to the origin and the smaller its circle, the better the performance.
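
The idea behind inference-time logit editing, as the abstract describes it, can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the ControlMM implementation: the one-dimensional `codebook` of joint positions, the use of the expected position under the softmax as the "extracted" control signal, the squared-error objective, and the plain gradient-descent update with `step_size` and `num_steps`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_position(logits, codebook):
    """Expected joint position under the token distribution (toy assumption)."""
    return sum(p * c for p, c in zip(softmax(logits), codebook))


def edit_logits(logits, codebook, target, step_size=2.0, num_steps=100):
    """Nudge logits by gradient descent so the expected position nears `target`.

    Hypothetical helper: minimizes (expected_position - target)^2.
    """
    logits = list(logits)
    for _ in range(num_steps):
        probs = softmax(logits)
        mean = sum(p * c for p, c in zip(probs, codebook))
        err = mean - target
        # d(mean)/d(logit_k) = p_k * (c_k - mean); chain rule with d(err^2) = 2*err
        grads = [2.0 * err * p * (c - mean) for p, c in zip(probs, codebook)]
        logits = [l - step_size * g for l, g in zip(logits, grads)]
    return logits


# Usage: four hypothetical codebook entries, uniform initial prediction.
codebook = [0.0, 0.5, 1.0, 1.5]
initial = [0.0, 0.0, 0.0, 0.0]
edited = edit_logits(initial, codebook, target=1.2)
print(expected_position(initial, codebook), expected_position(edited, codebook))
```

In this sketch, tokens sampled from the edited distribution land closer to the control target on average than tokens sampled from the original one, which is the effect the abstract attributes to logit editing.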

Method

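
The parallel and iterative decoding of motion tokens mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration of confidence-based masked decoding in general, not the ControlMM code: `toy_predictor` is a hypothetical stand-in for the masked motion model, and the cosine unmasking schedule, `MASK` sentinel, and vocabulary size are assumptions.

```python
import math
import random

MASK = -1  # sentinel for a not-yet-decoded token (assumption)


def toy_predictor(tokens, rng, vocab_size=8):
    """Hypothetical stand-in for the model: one (token, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(seq_len, num_iters, predictor, seed=0):
    """Start fully masked; each iteration predicts all positions in parallel
    and commits only the most confident masked ones."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for it in range(num_iters):
        preds = predictor(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Cosine schedule (assumption): how many tokens stay masked afterwards.
        ratio = math.cos(math.pi / 2 * (it + 1) / num_iters)
        keep_masked = 0 if it == num_iters - 1 else int(ratio * seq_len)
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence predictions among masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
        if MASK not in tokens:
            break
    return tokens


# Usage: decode 16 tokens in 4 parallel iterations.
print(iterative_decode(seq_len=16, num_iters=4, predictor=toy_predictor))
```

Because many tokens are committed per iteration, the number of model calls is `num_iters` rather than `seq_len`, which is the kind of saving that makes masked decoding fast.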

Compared to SOTA - Multiple Joints

a person crosses their arms for chest fly

ControlMM (our)

OmniControl

a person jumps in the air once

ControlMM (our)

OmniControl

a person walks in a circle clockwise

ControlMM (our)

OmniControl

a person walks forward and waves his hands

ControlMM (our)

OmniControl

Compared to SOTA - Pelvis Only

a person walks forward and waves his hands

ControlMM (our)

GMD

a person dances to salsa music

ControlMM (our)

GMD

a person walks forward and come back to the same position from where we started

ControlMM (our)

GMD

Dense Signals

the person draws a heart with hand

person walks down and up in a figure 8 pattern

A figure walks forward in a zig zag pattern

a person waves both his arms

someone is lifting something up

a person is doing pullups

a person dances to salsa music

a person stands and waving

person is doing squats

a person stands and bows

a man walks in a curved line with his hands at his sides

person is walking as if injured

a person walks through queue

a person walks with support

the person walks in a oval anticlockwise while waving hands

a person walks

a person walks in a circle

a person walks forward carrying something

Sparse Signals

A person walks forward with their hands up in a surrender pose

person walks over and sits down in a chair.

A person jumps and kicks a football in the air with their head

a person walks slowly

A person walks forward, casually greeting others with a wave or hello

A person walks forward and raises both arms high.

a man walks left and right

A person walks forward giving a high five

A person walks, pauses, and performs a high kick in the air.

Body Part Timeline Control

Upper Body: a person puts hands in the air.
Left Foot : a person kicks left legs.
Lower Body: a person jumps forward.
Timeline (frames): 0, 60, 120
Upper-body motion is generated for frames 0 to 120 from the prompt “a person puts hands in the air.” Lower-body motion is generated in two parts: frames 0 to 60 from the prompt “a person kicks left legs.” and frames 60 to 120 from the prompt “a person jumps forward.”
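
A timeline like the one above can be represented as a per-frame prompt schedule for each body part. The sketch below only shows one possible data structure; the function name `build_timeline`, the part labels, and the half-open frame ranges are hypothetical and do not come from the ControlMM code.

```python
def build_timeline(segments, num_frames):
    """Map each body part to a per-frame list of prompts.

    `segments` holds (body_part, start_frame, end_frame, prompt) tuples,
    with end_frame exclusive. Overlapping segments for one part are rejected.
    """
    timeline = {}
    for part, start, end, prompt in segments:
        if not 0 <= start < end <= num_frames:
            raise ValueError(f"bad frame range for {part}: {start}-{end}")
        row = timeline.setdefault(part, [None] * num_frames)
        for frame in range(start, end):
            if row[frame] is not None:
                raise ValueError(f"overlapping prompts for {part} at frame {frame}")
            row[frame] = prompt
    return timeline


# Usage: the schedule described in the caption above.
segments = [
    ("upper_body", 0, 120, "a person puts hands in the air."),
    ("lower_body", 0, 60, "a person kicks left legs."),
    ("lower_body", 60, 120, "a person jumps forward."),
]
timeline = build_timeline(segments, num_frames=120)
print(timeline["lower_body"][59], "|", timeline["lower_body"][60])
```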
Upper Body: the person is bending over forward
Left Foot : shake with their left leg
Timeline (frames): 0, 60, 120
Upper-body motion is generated for frames 0 to 120 from the prompt "the person is bending over forward". Simultaneously, lower-body motion is generated for frames 0 to 120 from the prompt "shake with their left leg".

Obstacle Avoidance

the man walks zig zag.
the man walks forward in a straight line.