Audio-Visual VR Viewport Prediction

PyTorch · Extended Kalman Filter · LSTM · First-Order Ambisonics · D-SAV360

Overview

This project predicts head viewport position 2.5 seconds into the future using an EKF + LSTM hybrid architecture with a 3D unit vector representation and a cosine similarity loss. The system processes head tracking data from the D-SAV360 dataset, extracts spatial audio features from first-order ambisonics (FOA), and fuses the trajectory features with audio cues. The EKF provides a physics-based motion prior while the LSTM learns non-linear corrections for saccades and intent-driven head movements. Data is split by participant to prevent leakage, with proper train/validation/test sets.
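The unit-vector representation mentioned above can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and it assumes head orientation arrives as yaw/pitch angles in radians.

```python
import numpy as np

def angles_to_unit_vector(yaw, pitch):
    """Map yaw/pitch (radians) to a 3D unit gaze vector."""
    return np.array([
        np.cos(pitch) * np.cos(yaw),
        np.cos(pitch) * np.sin(yaw),
        np.sin(pitch),
    ])

def tangent_velocity(p_prev, p_curr, dt):
    """Finite-difference velocity projected onto the tangent plane at p_curr,
    so the velocity describes motion along the unit sphere, not through it."""
    v = (p_curr - p_prev) / dt
    return v - np.dot(v, p_curr) * p_curr  # remove the radial component
```

Projecting onto the tangent plane keeps velocities consistent with the spherical state, which is what makes them usable alongside the unit-vector positions.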

Key Features

  • 2.5-second viewport prediction using LSTM-EKF hybrid
  • 3D unit vector representation with tangent space velocities
  • First-order ambisonic audio feature extraction (9D)
  • Participant-based train/validation/test splitting
  • Cosine similarity loss with multi-horizon evaluation
  • Gated fusion of EKF physics prior with LSTM corrections
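The participant-based split from the feature list can be sketched like this. A hedged illustration: the function name and the seeded shuffle are assumptions, but the 70/15/15 ratio and the no-participant-overlap guarantee follow the description above.

```python
import random

def split_participants(participant_ids, train=0.70, val=0.15, seed=0):
    """Shuffle participant IDs deterministically and split 70/15/15 so that
    no participant appears in more than one set (prevents identity leakage)."""
    ids = sorted(participant_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```

Splitting at the participant level (rather than per-recording) matters because head-motion style is idiosyncratic; the same person in both train and test would inflate accuracy.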
Architecture

[Diagram] Head tracking and FOA audio features (9D) are combined into a 15D input sequence; data is split by participant into 70% train / 15% validation / 15% test. An EKF supplies the physics-based prior, an LSTM (128 hidden units × 2 layers) learns corrections, and a gated fusion stage combines the two into the 2.5 s prediction.
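The hybrid described above could be implemented roughly as below. This is a sketch under stated assumptions, not the project's actual module: the class and layer names are hypothetical, and it assumes the EKF prior is computed outside the network and passed in as a 3D unit vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusionPredictor(nn.Module):
    """LSTM-EKF hybrid sketch: a 2-layer, 128-unit LSTM reads the 15D input
    sequence; a learned gate blends a precomputed EKF physics prior with the
    LSTM's correction; the output is renormalized onto the unit sphere."""

    def __init__(self, input_dim=15, hidden_dim=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.correction = nn.Linear(hidden_dim, 3)   # LSTM's 3D correction
        self.gate = nn.Linear(hidden_dim + 3, 3)     # per-axis fusion gate

    def forward(self, x, ekf_prior):
        # x: (B, T, 15) fused trajectory + audio features
        # ekf_prior: (B, 3) unit vector extrapolated by the EKF
        h, _ = self.lstm(x)
        h_last = h[:, -1]                            # last time step, (B, 128)
        corr = self.correction(h_last)               # (B, 3)
        g = torch.sigmoid(self.gate(torch.cat([h_last, ekf_prior], dim=-1)))
        fused = g * ekf_prior + (1.0 - g) * corr     # gated convex-style blend
        return F.normalize(fused, dim=-1)            # back onto the unit sphere
```

When head motion is smooth, the gate can lean on the EKF prior; during saccades it can shift weight to the learned correction.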
Training

Training minimizes a cosine similarity loss between the predicted and ground-truth unit vectors, tracked per epoch on both the train and validation sets:

L = 1 - (p̂ · p)
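The loss above translates directly into code. A minimal sketch (the function name is an assumption); it expects both inputs to already be unit vectors, so the dot product equals the cosine similarity.

```python
import torch

def cosine_similarity_loss(pred, target):
    """L = 1 - (p̂ · p), averaged over the batch; pred and target are
    (B, 3) unit vectors, so the dot product is the cosine similarity."""
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```

The loss is 0 when prediction and ground truth coincide and grows to 2 when they point in opposite directions.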
MAE by Horizon

Horizon   MAE
0.5s      —
1s        13°
1.5s      18°
2s        25°
2.5s      32°
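The per-horizon errors are angular distances between predicted and true unit vectors. A sketch of how such a metric could be computed (the function name is an assumption; the great-circle angle via the clipped dot product is a standard formulation):

```python
import numpy as np

def angular_error_deg(pred, target):
    """Great-circle angle in degrees between predicted and true unit vectors.
    pred, target: (..., 3) arrays of unit vectors."""
    dots = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(dots))
```

Averaging this quantity over all test samples at a fixed prediction horizon yields the MAE values in the table.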