Audio-Visual VR Viewport Prediction

PyTorch · Extended Kalman Filter · LSTM · First-Order Ambisonics · D-SAV360

Overview

This project predicts head viewport position 2.5 seconds into the future using an EKF + LSTM hybrid architecture with a 3D unit vector representation and a cosine similarity loss. The system processes head tracking data from the D-SAV360 dataset, extracts spatial audio features from first-order ambisonics (FOA), and fuses the trajectory features with audio cues. The EKF provides a physics-based motion prior while the LSTM learns non-linear corrections for saccades and intent-driven head movements. Data is split by participant to prevent leakage, with proper train/validation/test sets.
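The unit-vector representation mentioned above can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and it assumes head orientation arrives as yaw/pitch angles in radians.

```python
import numpy as np

def angles_to_unit_vector(yaw, pitch):
    """Map yaw/pitch (radians) to a 3D unit gaze vector."""
    return np.array([
        np.cos(pitch) * np.cos(yaw),
        np.cos(pitch) * np.sin(yaw),
        np.sin(pitch),
    ])

def tangent_velocity(p_prev, p_curr, dt):
    """Finite-difference velocity projected onto the tangent plane at p_curr,
    so the velocity describes motion along the unit sphere, not through it."""
    v = (p_curr - p_prev) / dt
    return v - np.dot(v, p_curr) * p_curr  # remove the radial component
```

Projecting onto the tangent plane keeps velocities consistent with the spherical state, which is what makes them usable alongside the unit-vector positions.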

Key Features

  • 2.5-second viewport prediction using LSTM-EKF hybrid
  • 3D unit vector representation with tangent space velocities
  • First-order ambisonic audio feature extraction (9D)
  • Participant-based train/validation/test splitting
  • Cosine similarity loss with multi-horizon evaluation
  • Gated fusion of EKF physics prior with LSTM corrections
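The participant-based split from the feature list can be sketched like this. A hedged illustration: the function name and the seeded shuffle are assumptions, but the 70/15/15 ratio and the no-participant-overlap guarantee follow the description above.

```python
import random

def split_participants(participant_ids, train=0.70, val=0.15, seed=0):
    """Shuffle participant IDs deterministically and split 70/15/15 so that
    no participant appears in more than one set (prevents identity leakage)."""
    ids = sorted(participant_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```

Splitting at the participant level (rather than per-recording) matters because head-motion style is idiosyncratic; the same person in both train and test would inflate accuracy.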
Architecture

[Diagram] Head tracking and FOA audio features (9D) are combined into a 15D input sequence; data is split by participant into 70% train / 15% validation / 15% test. An EKF supplies the physics-based prior, an LSTM (128 hidden units × 2 layers) learns corrections, and a gated fusion stage combines the two into the 2.5 s prediction.
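The hybrid described above could be implemented roughly as below. This is a sketch under stated assumptions, not the project's actual module: the class and layer names are hypothetical, and it assumes the EKF prior is computed outside the network and passed in as a 3D unit vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionPredictor(nn.Module):
    """LSTM-EKF hybrid sketch: a 2-layer, 128-unit LSTM reads the 15D input
    sequence; a learned gate blends a precomputed EKF physics prior with the
    LSTM's correction; the output is renormalized onto the unit sphere."""

    def __init__(self, input_dim=15, hidden_dim=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.correction = nn.Linear(hidden_dim, 3)   # LSTM's 3D correction
        self.gate = nn.Linear(hidden_dim + 3, 3)     # per-axis fusion gate

    def forward(self, x, ekf_prior):
        # x: (B, T, 15) fused trajectory + audio features
        # ekf_prior: (B, 3) unit vector extrapolated by the EKF
        h, _ = self.lstm(x)
        h_last = h[:, -1]                            # last time step, (B, 128)
        corr = self.correction(h_last)               # (B, 3)
        g = torch.sigmoid(self.gate(torch.cat([h_last, ekf_prior], dim=-1)))
        fused = g * ekf_prior + (1.0 - g) * corr     # gated convex-style blend
        return F.normalize(fused, dim=-1)            # back onto the unit sphere
```

When head motion is smooth, the gate can lean on the EKF prior; during saccades it can shift weight to the learned correction.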
Training

Training minimizes a cosine similarity loss between the predicted and ground-truth unit vectors, tracked per epoch on both the train and validation sets:

L = 1 - (p̂ · p)
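The loss above translates directly into code. A minimal sketch (the function name is an assumption); it expects both inputs to already be unit vectors, so the dot product equals the cosine similarity.

```python
import torch

def cosine_similarity_loss(pred, target):
    """L = 1 - (p̂ · p), averaged over the batch; pred and target are
    (B, 3) unit vectors, so the dot product is the cosine similarity."""
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```

The loss is 0 when prediction and ground truth coincide and grows to 2 when they point in opposite directions.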
MAE by Horizon

Horizon   MAE
0.5s      —
1s        13°
1.5s      18°
2s        25°
2.5s      32°
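The per-horizon errors are angular distances between predicted and true unit vectors. A sketch of how such a metric could be computed (the function name is an assumption; the great-circle angle via the clipped dot product is a standard formulation):

```python
import numpy as np

def angular_error_deg(pred, target):
    """Great-circle angle in degrees between predicted and true unit vectors.
    pred, target: (..., 3) arrays of unit vectors."""
    dots = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(dots))
```

Averaging this quantity over all test samples at a fixed prediction horizon yields the MAE values in the table.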