HRTF Spatialization: Mono to Stereo
I took a plain mono audio file and transformed it into a stereo track that sounds like it’s coming from a real direction in space, using HRTF data from a SOFA file.
An HRTF (Head-Related Transfer Function) describes how sound arriving from a specific 3D location is filtered by the shape of the head, ears, and torso before it reaches the eardrum. Applying these direction-dependent filters to a signal recreates the cues, such as interaural time and level differences, that humans use to localize sound.
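Concretely, the time-domain form of the HRTF is the head-related impulse response (HRIR), and spatializing a mono signal x comes down to one convolution per ear with the HRIRs h_L and h_R measured for the target direction (azimuth, elevation):

    y_L(t) = (x * h_L)(t),    y_R(t) = (x * h_R)(t)

This pair of convolutions is exactly what the script below computes with scipy's fftconvolve.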
After setting the parameters and loading both the mono audio and the HRTF (SOFA) file, the code finds the measurement position closest to the requested azimuth and elevation, then convolves the mono signal with that position's left-ear and right-ear impulse responses. It saves the result as a stereo file and plots the left and right waveforms, along with their difference, to visualize the spatial effect.
Note: This is a basic implementation. Try different SOFA files and parameters to explore how the spatialization changes; results can vary significantly depending on the HRTF dataset used (two short exploration sketches follow the listing below).
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve
import pysofaconventions as sofa
import matplotlib.pyplot as plt

# --- SETTINGS ---
input_audio_path = 'yourMonoAudio.wav'  # mono WAV file
sofa_file_path = 'subject_165.sofa'     # HRTF in SOFA format
output_path = 'outputStereo.wav'        # spatialized output
azimuth_target = 20    # degrees; the SOFA convention is counterclockwise (0 = front, 90 = left), but check your dataset
elevation_target = 15  # degrees (0 = horizontal plane)

# --- Load mono audio and SOFA HRTF file ---
signal, samplerate = sf.read(input_audio_path)
if signal.ndim > 1:
    raise ValueError("Input audio must be mono")

hrtf = sofa.SOFAFile(sofa_file_path, 'r')
# Note: this assumes the HRIRs share the audio's sample rate; resample if your files differ

# --- Find the closest available azimuth/elevation ---
positions = hrtf.getVariableValue('SourcePosition')  # shape: (N, 3) = azimuth, elevation, distance
azimuths = positions[:, 0]
elevations = positions[:, 1]

# Index of the closest match; the azimuth difference is wrapped to [-180, 180]
# so that e.g. 350 degrees and -10 degrees count as neighbours
az_diff = (azimuths - azimuth_target + 180) % 360 - 180
distances = az_diff**2 + (elevations - elevation_target)**2
idx = int(np.argmin(distances))

# --- Get HRIRs for left and right ears ---
hrir_l = hrtf.getDataIR()[idx, 0, :]  # left ear
hrir_r = hrtf.getDataIR()[idx, 1, :]  # right ear

# --- Convolve input signal with HRIRs ---
left = fftconvolve(signal, hrir_l, mode='full')
right = fftconvolve(signal, hrir_r, mode='full')

# Match lengths and interleave into a stereo array
min_len = min(len(left), len(right))
stereo = np.stack((left[:min_len], right[:min_len]), axis=1)

# Convolution can push the peak above 1.0; normalize to avoid clipping on save
peak = np.max(np.abs(stereo))
if peak > 1.0:
    stereo = stereo / peak

# --- Save stereo output ---
sf.write(output_path, stereo, samplerate)

# --- Plot waveforms ---
times = np.arange(min_len) / samplerate
plt.figure(figsize=(12, 6))

# Left and right channels
plt.subplot(2, 1, 1)
plt.plot(times, stereo[:, 0], label='Left', alpha=0.75)
plt.plot(times, stereo[:, 1], label='Right', alpha=0.75)
plt.title('Stereo Output Waveforms')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.legend()
plt.grid(True)

# Difference waveform (L - R) highlights the interaural differences
plt.subplot(2, 1, 2)
plt.plot(times, stereo[:, 0] - stereo[:, 1], color='purple')
plt.title('Difference Waveform (Left - Right)')
plt.xlabel('Time (seconds)')
plt.ylabel('Amplitude')
plt.grid(True)

plt.tight_layout()
plt.show()
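Different SOFA datasets sample different grids of directions, so before choosing azimuth_target and elevation_target it helps to see what a file actually contains. A minimal sketch using the same pysofaconventions calls as above (the file name is the same placeholder):

import numpy as np
import pysofaconventions as sofa

hrtf = sofa.SOFAFile('subject_165.sofa', 'r')  # placeholder path, as above
positions = hrtf.getVariableValue('SourcePosition')  # (N, 3): azimuth, elevation, distance

print(f"{positions.shape[0]} measured directions")
print("Azimuths:  ", np.unique(np.round(positions[:, 0], 1)))
print("Elevations:", np.unique(np.round(positions[:, 1], 1)))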
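And to hear how the cues change with direction, one option is to render the same clip at several azimuths and compare the results. A sketch under the same assumptions (the angle list and output names are arbitrary choices for illustration, not from the script above):

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve
import pysofaconventions as sofa

signal, samplerate = sf.read('yourMonoAudio.wav')  # placeholder mono file
hrtf = sofa.SOFAFile('subject_165.sofa', 'r')      # placeholder SOFA file
positions = hrtf.getVariableValue('SourcePosition')
hrirs = np.asarray(hrtf.getDataIR())  # (M, 2, N): measurement, ear, sample

elevation = 0.0
for azimuth in (0, 45, 90, 135, 180, 225, 270, 315):  # arbitrary sweep
    # Nearest measured direction, with azimuth wrap-around handled
    d_az = (positions[:, 0] - azimuth + 180) % 360 - 180
    d_el = positions[:, 1] - elevation
    idx = int(np.argmin(d_az**2 + d_el**2))

    left = fftconvolve(signal, hrirs[idx, 0, :])
    right = fftconvolve(signal, hrirs[idx, 1, :])
    stereo = np.stack((left, right), axis=1)

    peak = np.max(np.abs(stereo))
    if peak > 1.0:
        stereo = stereo / peak  # avoid clipping on save

    sf.write(f'outputStereo_az{azimuth:03d}.wav', stereo, samplerate)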