HRTF Spatialization: Mono to Stereo
I took a plain mono audio file and transformed it into a stereo track that sounds like it’s coming from a real direction in space, using HRTF data from a SOFA file.
An HRTF (Head-Related Transfer Function) describes how sound arriving from a specific 3D location is filtered by the shape of the head, ears, and torso before it reaches the eardrum. Applying these direction-dependent filters to a signal recreates the cues, such as interaural time and level differences, that humans use to localize sound.
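Concretely, the time-domain form of the HRTF is the head-related impulse response (HRIR), and spatializing a mono signal x comes down to one convolution per ear with the HRIRs h_L and h_R measured for the target direction (azimuth, elevation):

    y_L(t) = (x * h_L)(t),    y_R(t) = (x * h_R)(t)

This pair of convolutions is exactly what the script below computes with scipy's fftconvolve.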
After setting the parameters and loading both the mono audio and the HRTF (SOFA) file, the code finds the measurement position closest to the requested azimuth and elevation, then convolves the mono signal with that position's left-ear and right-ear impulse responses. It saves the result as a stereo file and plots the left and right waveforms, along with their difference, to visualize the spatial effect.
Note: This is a basic implementation. Try different SOFA files and parameters to explore how the spatialization changes; results can vary significantly depending on the HRTF dataset used (two short exploration sketches follow the listing below).
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve
import pysofaconventions as sofa
import matplotlib.pyplot as plt

# --- SETTINGS ---
input_audio_path = 'yourMonoAudio.wav'  # mono WAV file
sofa_file_path = 'subject_165.sofa'     # HRTF in SOFA format
output_path = 'outputStereo.wav'        # spatialized output
azimuth_target = 20    # degrees; the SOFA convention is counterclockwise (0 = front, 90 = left), but check your dataset
elevation_target = 15  # degrees (0 = horizontal plane)

# --- Load mono audio and SOFA HRTF file ---
signal, samplerate = sf.read(input_audio_path)
if signal.ndim > 1:
    raise ValueError("Input audio must be mono")

hrtf = sofa.SOFAFile(sofa_file_path, 'r')
# Note: this assumes the HRIRs share the audio's sample rate; resample if your files differ

# --- Find the closest available azimuth/elevation ---
positions = hrtf.getVariableValue('SourcePosition')  # shape: (N, 3) = azimuth, elevation, distance
azimuths = positions[:, 0]
elevations = positions[:, 1]

# Index of the closest match; the azimuth difference is wrapped to [-180, 180]
# so that e.g. 350 degrees and -10 degrees count as neighbours
az_diff = (azimuths - azimuth_target + 180) % 360 - 180
distances = az_diff**2 + (elevations - elevation_target)**2
idx = int(np.argmin(distances))

# --- Get HRIRs for left and right ears ---
hrir_l = hrtf.getDataIR()[idx, 0, :]  # left ear
hrir_r = hrtf.getDataIR()[idx, 1, :]  # right ear

# --- Convolve input signal with HRIRs ---
left = fftconvolve(signal, hrir_l, mode='full')
right = fftconvolve(signal, hrir_r, mode='full')

# Match lengths and interleave into a stereo array
min_len = min(len(left), len(right))
stereo = np.stack((left[:min_len], right[:min_len]), axis=1)

# Convolution can push the peak above 1.0; normalize to avoid clipping on save
peak = np.max(np.abs(stereo))
if peak > 1.0:
    stereo = stereo / peak

# --- Save stereo output ---
sf.write(output_path, stereo, samplerate)

# --- Plot waveforms ---
times = np.arange(min_len) / samplerate
plt.figure(figsize=(12, 6))

# Left and right channels
plt.subplot(2, 1, 1)
plt.plot(times, stereo[:, 0], label='Left', alpha=0.75)
plt.plot(times, stereo[:, 1], label='Right', alpha=0.75)
plt.title('Stereo Output Waveforms')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.legend()
plt.grid(True)

# Difference waveform (L - R) highlights the interaural differences
plt.subplot(2, 1, 2)
plt.plot(times, stereo[:, 0] - stereo[:, 1], color='purple')
plt.title('Difference Waveform (Left - Right)')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.grid(True)

plt.tight_layout()
plt.show()
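Different SOFA datasets sample different grids of directions, so before choosing azimuth_target and elevation_target it helps to see what a file actually contains. A minimal sketch using the same pysofaconventions calls as above (the file name is the same placeholder):

import numpy as np
import pysofaconventions as sofa

hrtf = sofa.SOFAFile('subject_165.sofa', 'r')  # placeholder path, as above
positions = hrtf.getVariableValue('SourcePosition')  # (N, 3): azimuth, elevation, distance

print(f"{positions.shape[0]} measured directions")
print("Azimuths:  ", np.unique(np.round(positions[:, 0], 1)))
print("Elevations:", np.unique(np.round(positions[:, 1], 1)))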
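And to hear how the cues change with direction, one option is to render the same clip at several azimuths and compare the results. A sketch under the same assumptions (the angle list and output names are arbitrary choices for illustration, not from the script above):

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve
import pysofaconventions as sofa

signal, samplerate = sf.read('yourMonoAudio.wav')  # placeholder mono file
hrtf = sofa.SOFAFile('subject_165.sofa', 'r')      # placeholder SOFA file
positions = hrtf.getVariableValue('SourcePosition')
hrirs = np.asarray(hrtf.getDataIR())  # (M, 2, N): measurement, ear, sample

elevation = 0.0
for azimuth in (0, 45, 90, 135, 180, 225, 270, 315):  # arbitrary sweep
    # Nearest measured direction, with azimuth wrap-around handled
    d_az = (positions[:, 0] - azimuth + 180) % 360 - 180
    d_el = positions[:, 1] - elevation
    idx = int(np.argmin(d_az**2 + d_el**2))

    left = fftconvolve(signal, hrirs[idx, 0, :])
    right = fftconvolve(signal, hrirs[idx, 1, :])
    stereo = np.stack((left, right), axis=1)

    peak = np.max(np.abs(stereo))
    if peak > 1.0:
        stereo = stereo / peak  # avoid clipping on save

    sf.write(f'outputStereo_az{azimuth:03d}.wav', stereo, samplerate)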