# Audio-Visual VR Viewport Prediction

PyTorch · Extended Kalman Filter · LSTM · First-Order Ambisonics · D-SAV360

## Overview
This project predicts the head viewport direction 2.5 seconds into the future using a hybrid of an extended Kalman filter (EKF) and an LSTM, with a 3D unit vector representation and a cosine similarity loss. The system processes head-tracking data from the D-SAV360 dataset, extracts spatial audio features from first-order ambisonics (FOA), and fuses the trajectory with audio cues. The EKF provides a physics-based motion prior, while the LSTM learns non-linear corrections for saccades and intent-driven head movements. Data is split by participant into train/validation/test sets to prevent leakage.
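Representing the viewport as a 3D unit vector makes the cosine similarity loss natural: the loss depends only on the angle between the predicted and ground-truth directions. A minimal sketch of such a loss in PyTorch (the function and tensor names here are illustrative, not the project's actual API):

```python
import torch
import torch.nn.functional as F


def cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean of 1 - cos(angle) between predicted and true viewport directions.

    pred, target: (batch, 3) tensors. Targets are assumed to be unit vectors;
    predictions are re-normalized so the loss depends only on direction.
    """
    pred = F.normalize(pred, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```

A perfect prediction yields zero loss, an orthogonal direction yields 1, and the opposite direction yields 2, so the loss is bounded and scale-invariant in the prediction magnitude.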
## Key Features
- 2.5-second viewport prediction using LSTM-EKF hybrid
- 3D unit vector representation with tangent space velocities
- First-order ambisonic audio feature extraction (9D)
- Participant-based train/validation/test splitting
- Cosine similarity loss with multi-horizon evaluation
- Gated fusion of EKF physics prior with LSTM corrections
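The gated fusion in the last bullet can be sketched as a learned convex blend between the EKF's predicted direction and an LSTM-derived correction. This is a hypothetical illustration under assumed shapes and layer sizes, not the project's actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Blend an EKF physics prior with an LSTM correction via a learned gate.

    Illustrative sketch: input features and layer sizes are assumptions.
    """

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        # The gate sees both the EKF-predicted direction and the LSTM state.
        self.gate = nn.Sequential(nn.Linear(3 + hidden_dim, 1), nn.Sigmoid())
        # Projects the LSTM hidden state to a candidate 3D direction.
        self.correction = nn.Linear(hidden_dim, 3)

    def forward(self, ekf_dir: torch.Tensor, lstm_h: torch.Tensor) -> torch.Tensor:
        # ekf_dir: (batch, 3) unit vectors from the EKF prior
        # lstm_h:  (batch, hidden_dim) final LSTM hidden state
        g = self.gate(torch.cat([ekf_dir, lstm_h], dim=-1))        # (batch, 1)
        fused = g * ekf_dir + (1.0 - g) * self.correction(lstm_h)  # convex blend
        return F.normalize(fused, dim=-1)  # project back onto the unit sphere
```

When the gate saturates near 1 the model falls back to the pure physics prior, which keeps predictions stable during smooth pursuit; lower gate values let the LSTM take over around saccades.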