EMMA: Extracting Multiple physical parameters from Multimodal Data

CVPR 2026

Farhat Shaikh, Ayan Banerjee, Sandeep K. S. Gupta

IMPACT Lab, School of Computing & Augmented Intelligence (SCAI), Arizona State University

Overview

EMMA is a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, and assumptions about known initial conditions, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model.

The user supplies the parametric structure of the governing ODE; EMMA solves the inverse problem of recovering its parameters, along with any latent forcing and invariants, from multimodal observations.
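As a toy illustration of this inverse problem (not EMMA's actual optimizer), the sketch below fixes the structure of a damped-pendulum ODE, generates a noise-free synthetic trajectory, and recovers the length and damping time constant by a brute-force grid search over simulated trajectories. The damping form `theta'/tau`, the known initial angle, the grids, and all function names are assumptions made for this example only.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_pendulum(length, tau, theta0=0.5, omega0=0.0, dt=0.01, steps=200):
    """Integrate theta'' = -(G/length)*sin(theta) - theta'/tau with classic RK4."""

    def deriv(theta, omega):
        return omega, -(G / length) * math.sin(theta) - omega / tau

    theta, omega = theta0, omega0
    trajectory = [theta]
    for _ in range(steps):
        k1t, k1w = deriv(theta, omega)
        k2t, k2w = deriv(theta + 0.5 * dt * k1t, omega + 0.5 * dt * k1w)
        k3t, k3w = deriv(theta + 0.5 * dt * k2t, omega + 0.5 * dt * k2w)
        k4t, k4w = deriv(theta + dt * k3t, omega + dt * k3w)
        theta += dt * (k1t + 2 * k2t + 2 * k3t + k4t) / 6.0
        omega += dt * (k1w + 2 * k2w + 2 * k3w + k4w) / 6.0
        trajectory.append(theta)
    return trajectory


def fit_pendulum(observed, length_grid, tau_grid):
    """Return the (length, tau) pair whose simulation best matches `observed`."""
    best = None
    for length in length_grid:
        for tau in tau_grid:
            simulated = simulate_pendulum(length, tau, steps=len(observed) - 1)
            loss = sum((s - o) ** 2 for s, o in zip(simulated, observed))
            if best is None or loss < best[0]:
                best = (loss, length, tau)
    return best[1], best[2]


# Synthetic, noise-free "observation" generated with known parameters.
observed = simulate_pendulum(length=0.90, tau=5.0)
length_grid = [round(0.70 + 0.01 * i, 2) for i in range(41)]  # 0.70 .. 1.10 m
tau_grid = [float(t) for t in range(1, 11)]                    # 1 .. 10 s
length_est, tau_est = fit_pendulum(observed, length_grid, tau_grid)
print(length_est, tau_est)
```

Because the data are noise-free and the true values lie on the grid, the search lands on them exactly. Real observations would call for a gradient-based or otherwise more careful fit.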

Key contributions

Multi-modal dynamical parameter extraction from video, audio, and time-series reconstructed from visual charts.
Recovery under unobserved forcing inputs by inferring latent actuation (e.g. wheel speed) from audio.
Estimation of implicit dynamics associated with unmeasured physical effects (e.g. frictional drag).
Invariant calibration from raw video, eliminating assumptions about known initial conditions or coordinate frames.
Extensive validation on 100+ scenarios: Delfys benchmark (75 videos), real-world rover and quadrotor, and simulation charts.
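To give a feel for how an unobserved actuation signal might be read from audio, the sketch below finds the dominant frequency of a short synthetic tone with a naive DFT. This is only an assumed stand-in: the README does not describe EMMA's audio pipeline, and turning a frequency into a wheel speed would need a platform-specific calibration that is not shown here.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate / n


# Synthetic "motor hum": a 120 Hz tone plus a weaker 240 Hz harmonic.
SAMPLE_RATE = 2000  # Hz (assumed for the example)
tone = [
    math.sin(2 * math.pi * 120 * i / SAMPLE_RATE)
    + 0.3 * math.sin(2 * math.pi * 240 * i / SAMPLE_RATE)
    for i in range(400)
]
freq = dominant_frequency(tone, SAMPLE_RATE)
print(freq)
```

Applying the same function to consecutive short windows of a recording would give a frequency track over time, which is the kind of signal a latent forcing estimate could be built on.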

Architecture

EMMA architecture

EMMA follows a three-step pipeline: Sense · Learn · Verify.

Sense. Video, audio, and chart images are converted into time-aligned signals through modality-specific pipelines.
Learn. A Liquid Time-Constant (LTC) network models the system's latent dynamics in continuous time.
Verify. A differentiable ODE solver simulates the recovered parameters and checks them against the observations under a physics-informed loss.
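The Learn step names a Liquid Time-Constant network. As a rough illustration of what "continuous time" means there, the sketch below implements one fused semi-implicit Euler update of a single LTC neuron, following the published LTC formulation (Hasani et al.). The weights, time constant, and input signal are made-up values; EMMA's actual network configuration is not described in this README.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ltc_step(x, u, dt, tau=1.0, A=1.0, w_x=0.5, w_u=1.0, bias=0.0):
    """One fused semi-implicit Euler step of a single LTC neuron.

    Dynamics: dx/dt = -(1/tau + f) * x + f * A,
    with gate f = sigmoid(w_x * x + w_u * u + bias).
    All parameter defaults are illustrative assumptions.
    """
    f = sigmoid(w_x * x + w_u * u + bias)
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))


# Drive the neuron with a slow sinusoidal input and record its state.
x = 0.0
states = []
for step in range(200):
    u = math.sin(0.1 * step)
    x = ltc_step(x, u, dt=0.05)
    states.append(x)
print(min(states), max(states))
```

The gate `f` changes the effective time constant with the input, and the fused update keeps the state between 0 and `A` when it starts there, which is one reason this solver form is commonly used for LTC cells.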

Results

EMMA delivers accurate multi-parameter recovery across diverse physical systems. Full tables and ablations are in the paper.

System	Parameters recovered	EMMA result	Baseline comparison
Pendulum (90 cm)	Length L, damping τ	L = 0.86 ± 0.07 m (GT 0.90)	Delfys, PySINDy
Torricelli (med.)	Drainage k	0.0132 ± 0.0008 (GT 0.0128)	matches Delfys
Sliding block (med.)	Angle α, friction μ	α = 24.72°, μ = 0.205 (GT 25°, 0.20)	Delfys, PySINDy
LED decay (med.)	γ	0.91 ± 0.0 (GT 0.92)	matches Delfys
Rover	9 params (5 with known ground truth)	8.8 % ± 1.7 % mean error	first work under hidden forcing
Quadrotor	12 params (7 with known ground truth)	15.9 % ± 7.4 % mean error	first work under hidden forcing
Simulation charts	Lotka-Volterra, Lorenz, F8 Crusader, HIV, AID	>10× lower error than PySINDy on implicit dynamics	PySINDy

EMMA is compared against PAIG, NIRPI, and Delfys on the video benchmarks and against PySINDy on the chart-based simulations.
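For the rows that list an estimate next to its ground truth, a relative error can be computed directly from the table. The snippet below does this arithmetic for the central estimates. It assumes error means absolute relative error in percent, which may differ from the exact metric used in the paper.

```python
def relative_error_pct(estimate, ground_truth):
    """Absolute relative error in percent."""
    return 100.0 * abs(estimate - ground_truth) / abs(ground_truth)


# (estimate, ground truth) pairs copied from the results table above.
reported = {
    "Pendulum L (m)": (0.86, 0.90),
    "Torricelli k": (0.0132, 0.0128),
    "Sliding block alpha (deg)": (24.72, 25.0),
    "Sliding block mu": (0.205, 0.20),
    "LED gamma": (0.91, 0.92),
}

errors = {name: relative_error_pct(est, gt) for name, (est, gt) in reported.items()}
for name, err in errors.items():
    print(f"{name}: {err:.2f} %")
```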

Supported systems

Category	Systems
Delfys benchmark	Pendulum, Torricelli drainage, Sliding block, LED decay, Free fall
Real-world platforms	Differential-drive rover (9 params), 6-DoF quadrotor (12 params)
Simulation charts	Lotka-Volterra, Chaotic Lorenz, F8 Crusader, HIV therapy, AID (Type-1 diabetes)

Installation

Tested with Python 3.10+ on macOS and Linux.

git clone https://github.com/ImpactLabASU/EMMA-CVPR2026.git
cd EMMA-CVPR2026
python3 -m venv .venv && source .venv/bin/activate   # optional but recommended
pip install -r requirements.txt

System tools

FFmpeg on your PATH (MoviePy uses it for audio extraction): brew install ffmpeg (macOS) or sudo apt install ffmpeg (Ubuntu).
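YOLO weights (default yolo11m.pt): run pip install ultralytics; Ultralytics downloads official weights such as yolo11m.pt automatically the first time the model is loaded by name. Alternatively, download the file from the Ultralytics releases page.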
A CUDA GPU is optional; every script falls back to CPU automatically.
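Before running any pipeline, it can help to confirm these prerequisites in one go. The following preflight sketch is not part of the repository; the function name and report keys are invented for illustration, and it only uses standard-library calls plus `torch.cuda.is_available()` when PyTorch is installed.

```python
import importlib.util
import shutil
from pathlib import Path


def preflight(weights_path="yolo11m.pt"):
    """Report which prerequisites are present (hypothetical helper)."""
    report = {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "weights_found": Path(weights_path).is_file(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install is treated the same as "no GPU".
            report["cuda_available"] = False
    return report


report = preflight()
for key, ok in report.items():
    print(f"{key}: {'yes' if ok else 'no'}")
```

A missing GPU is not an error here, since the scripts fall back to CPU; a missing FFmpeg or weights file is worth fixing before a long run.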

Repository layout

Folder	Purpose	Entry points
`Baseline/`	Physics-informed EMMA pipelines (Free Fall, LED, Pendulum, Sliding Block, Torricelli) plus ablation utilities.	`FreeFall/free_fall.py`, `LED/led.py`, `Pendulum/run-.py`, `Sliding block/sliding_block.py`, `Torricelli/toricelli*.py`, `architecture_ablation.py`, `run_additional_ablations.py`
`Rover/`	Rover perception, parameter estimation, multimodal ablations, helper shell script.	`run.py`, `rover-ablation.py`, `rover_multimodal_ablation.py`, `run_rover_ablation.sh`
`Drone/`	Drone pipeline orchestrator (vision + audio + EMMA optimization).	`new_run.py`
`CGM/`	Continuous glucose monitor chart digitizer.	`extract_cgm_data.py`

Data

Baseline datasets come from the Delfys "Physical Parameter Prediction" set on Kaggle (https://www.kaggle.com/datasets/jaswar/physical-parameter-prediction). Download it and copy the experiment folders into Baseline/; the scripts discover the data automatically.
Sample rover and drone videos are hosted on Dropbox (linked from the repository page). Place the rover videos under Rover/ and the drone videos under Drone/.

Usage

Baseline pipelines

Each baseline follows the same recipe:

1. cd Baseline/<Experiment>/
2. Edit the configuration block inside main():
   - video_path: path to the source video; leave empty to reuse existing data files.
   - weights_path: YOLO weights (yolo11m.pt by default).
   - pixel_to_meter (Free Fall, Torricelli, Sliding Block): set from your calibration grid.
   - output_folder: a unique run directory (e.g. run_01); the script creates output/ and data/ under it.
3. Run python3 <script>.py.
4. Optional: python3 <script>.py --simulation-only skips retraining and reuses the latest *_coefficients.csv and *_emma_final_model.pth (Free Fall, LED, Pendulum).
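The configuration fields above can be pictured as a small settings object. The sketch below is a hypothetical mirror of that block, not the code inside the scripts: `RunConfig`, `pixel_to_meter_from_grid`, and `prepare_run_dirs` are invented names, and it assumes pixel_to_meter means meters per pixel, derived from a calibration-grid segment of known length.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical container mirroring the settings edited inside main()."""

    video_path: str = ""             # empty string: reuse existing data files
    weights_path: str = "yolo11m.pt"
    pixel_to_meter: float = 1.0      # assumed unit: meters per pixel
    output_folder: str = "run_01"


def pixel_to_meter_from_grid(known_length_m, measured_length_px):
    """Scale factor from a grid segment of known length measured in the frame."""
    return known_length_m / measured_length_px


def prepare_run_dirs(config, base_dir="."):
    """Create <output_folder>/output and <output_folder>/data; return the run dir."""
    run_dir = Path(base_dir) / config.output_folder
    for sub in ("output", "data"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    return run_dir


# Example: a 10 cm grid square spans 250 pixels in the video frame.
with tempfile.TemporaryDirectory() as tmp:
    cfg = RunConfig(pixel_to_meter=pixel_to_meter_from_grid(0.10, 250.0))
    run_dir = prepare_run_dirs(cfg, base_dir=tmp)
    created = sorted(p.name for p in run_dir.iterdir())
print(cfg.pixel_to_meter, created)
```

Using a fresh output_folder per run keeps earlier results from being overwritten, which matters when comparing calibrations.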

Experiment	Script	Key outputs
Free Fall	`FreeFall/free_fall.py` (`free_fall-m.py` for the medium set)	trajectory CSV, `free_fall_coefficients.csv`, trained model, annotated video
LED decay	`LED/led.py`	trajectory CSV, `led_coefficients.csv`, trained model, intensity figures
Pendulum	`Pendulum/run-45.py`, `run-90.py`, `run-150.py`	`thetaData.txt`, `omegaData.txt`, `pendulum_coefficients.csv`, trained model
Sliding block	`Sliding block/sliding_block.py` (`-low`, `-med` variants)	trajectory CSVs, `sliding_block_coefficients.csv`, trained model
Torricelli	`Torricelli/toricelli.py` (`toricelli-m.py`, `torricelli-sm.py`)	height trajectories, `torricelli_coefficients.csv`, trained model
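The --simulation-only mode described above reuses the latest coefficient and model files. One plausible way to locate "the latest" file is by modification time, as sketched below. This is an assumption about the selection rule, not a description of the scripts, and the `run_XX/output/` placement of the CSV in the demo is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def latest_match(folder, pattern):
    """Return the most recently modified file under `folder` matching `pattern`, or None."""
    candidates = [p for p in Path(folder).rglob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


# Demonstration on a throwaway directory with two hypothetical runs.
with tempfile.TemporaryDirectory() as tmp:
    for run, mtime in (("run_01", 1_000_000_000), ("run_02", 1_000_000_500)):
        path = Path(tmp) / run / "output" / "pendulum_coefficients.csv"
        path.parent.mkdir(parents=True)
        path.write_text("placeholder\n")
        os.utime(path, (mtime, mtime))  # force a known modification time
    newest = latest_match(tmp, "*_coefficients.csv")
    newest_run = newest.parent.parent.name
print(newest_run)
```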

PySINDy baselines. Each experiment folder has pysindy_results/pysindy.py; run it from that folder (after the main pipeline has written the EMMA-formatted CSVs) for sparse-regression baselines.

Ablations. From Baseline/: python3 architecture_ablation.py and python3 run_additional_ablations.py (require pendulum datasets under Baseline/Pendulum-EMMA/<angle>_v*/data/).

Rover

cd Rover
# set video_path and weights_path in run.py (see the CONFIGURATION SECTION)
python3 run.py

Outputs: rover_coefficients.csv, rover_EMMA_final_model.pth, plots, GIF. Ablations: python3 rover-ablation.py, python3 rover_multimodal_ablation.py, or bash run_rover_ablation.sh (edit variables first). If you already have processed data/*.txt, set video_path = "" to skip detection.

Drone

cd Drone
EMMA_RUN_ORCHESTRATOR=1 python3 new_run.py --video /path/to/DroneVideo.mp4 --weights /path/to/yolo11m.pt

Note: Full orchestration also needs an external Dronepipeline/ folder containing droneExtract.py, droneExtractAudio.py, and EMMA_drone_torch_ltc_optimized.py. These are not bundled here; without them, new_run.py falls back to the local vision-only pipeline.
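To see in advance which path a run will take, the three required files can be checked before launching. The sketch below is a guess at equivalent logic rather than the code in new_run.py: the function names are invented, the location of Dronepipeline/ is passed in because the README does not say where it must live, and treating an unset EMMA_RUN_ORCHESTRATOR as vision-only is an assumption.

```python
import os
from pathlib import Path

REQUIRED_FILES = (
    "droneExtract.py",
    "droneExtractAudio.py",
    "EMMA_drone_torch_ltc_optimized.py",
)


def missing_drone_files(pipeline_dir):
    """List the required orchestration files that are absent from `pipeline_dir`."""
    folder = Path(pipeline_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


def choose_mode(pipeline_dir, env=None):
    """Return 'full-orchestration' only if requested and all files are present."""
    env = os.environ if env is None else env
    wants_orchestrator = env.get("EMMA_RUN_ORCHESTRATOR") == "1"
    if wants_orchestrator and not missing_drone_files(pipeline_dir):
        return "full-orchestration"
    return "vision-only"


print(missing_drone_files("Dronepipeline"))
print(choose_mode("Dronepipeline"))
```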

CGM chart digitizer

cd CGM
python3 extract_cgm_data.py   # reads CGMData.png, writes cgm_data.txt + a visualization
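Chart digitization ultimately comes down to mapping pixel coordinates back to axis values. The sketch below shows that step as a linear two-point calibration. It is an assumed, generic approach, not a description of extract_cgm_data.py, and the pixel positions and glucose values are hypothetical.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Build a linear pixel-to-value map from two reference points on one axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration: y-pixel 400 is the 40 mg/dL gridline,
# y-pixel 50 is the 400 mg/dL gridline (image y grows downward).
y_to_glucose = make_axis_mapper(400, 40.0, 50, 400.0)

glucose = y_to_glucose(225)  # a curve pixel halfway between the gridlines
print(glucose)
```

The same mapper works for the time axis by calibrating on two x-pixel positions with known timestamps.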

Troubleshooting

Module not found: re-run pip install -r requirements.txt in the active virtual environment. For torch/torchvision, use the install selector on the PyTorch website to get the command that matches your platform and CUDA version.
YOLO weights missing: download yolo11m.pt and point weights_path to it.
FFmpeg errors: install FFmpeg (brew install ffmpeg / sudo apt install ffmpeg).

Citation

@InProceedings{Shaikh_2026_CVPR,
    author    = {Shaikh, Farhat and Banerjee, Ayan and Gupta, Sandeep},
    title     = {EMMA: Extracting Multiple physical parameters from Multimodal Data},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {1716-1725}
}

Also on arXiv.