Audio AI Reading Notes

A curated list of papers and articles I’ve read recently, focused on neural audio, DSP, and music AI. Each entry includes the original source and my key takeaways.

This page is being updated regularly. More notes and resources will be added soon.

Audio Signal Processing in the Artificial Intelligence Era: Challenges and Directions

– Jul 2025, ResearchGate link

*Authors: Christian J Steinmetz, Christian Uhle, Flavio Everardo, Christopher Mitcheltree, James Keith McElveen, Jean-Marc Jot, Gordon Wichern

*Key Terms: Differentiable digital signal processing, problem taxonomy, perceptual loss function, data bottleneck

What I learned:

  • AI’s role in audio spans three distinct tasks: labelling, processing, and generation
  • Differentiable digital signal processing (DDSP) as a cornerstone of the field’s future
  • Human-centric challenges matter as much as technical ones
  • The data bottleneck as a major limiting factor
  • The necessity of perceptual loss functions for tasks like audio synthesis and effects (a loss sketch follows this list)
  • Gap between a functioning AI model and a usable tool for a creative professional
  • Crucial distinction between technically “correct” processing and artistically “good” processing
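
The point about perceptual losses is easiest to see in code. Below is a minimal sketch of a multi-resolution STFT loss, a common proxy for perceptual similarity in neural audio work; the resolutions and weighting are illustrative assumptions, not values from the paper.

```python
import torch

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a batch of mono signals, shape (batch, samples)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(pred, target, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Compare signals at several time-frequency resolutions instead of sample by sample.

    Sample-wise losses (e.g. waveform MSE) can be large for perceptually identical
    audio (small phase shifts); spectral magnitudes are closer to how we hear.
    """
    loss = 0.0
    for n_fft, hop in resolutions:
        p, t = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        loss = loss + torch.mean(torch.abs(p - t))                                       # linear term
        loss = loss + torch.mean(torch.abs(torch.log(p + 1e-7) - torch.log(t + 1e-7)))   # log term
    return loss / len(resolutions)
```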

 

Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

– Jun 2025, https://arxiv.org/abs/2506.22661

*Authors: R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

*Key Terms: Audio fingerprinting, audio degradation, triplet loss function

What I learned:

  • Importance of realistic training data (with acoustic background noise and impulse responses applied) for real-world performance
  • Preference for the simpler triplet loss over the NT-Xent loss (sketched after this list)
  • Interaction between the loss function and the training data when multiple degraded versions of each training example are added
  • Importance of keeping bass frequencies (down to 160 Hz), especially in noisy environments where bass travels better
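
A hedged sketch of the triplet idea the authors prefer over NT-Xent: pull the fingerprint of a clean segment towards its degraded copy and push it away from an unrelated segment. The margin, distance measure, and shapes are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def triplet_fingerprint_loss(anchor, positive, negative, margin=0.5):
    """anchor/positive/negative: (batch, dim) fingerprint embeddings.

    positive = embedding of the same segment after degradation (noise, reverb),
    negative = embedding of an unrelated segment.
    """
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)   # distance to the degraded copy
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)   # distance to unrelated audio
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```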

 

Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning

– Feb 2021, https://arxiv.org/abs/2010.11910

*Authors: Sungkyun Chang, Donmoon Lee, Jeongsoo Park, Hyungui Lim, Kyogu Lee, Karam Ko, Yoonchang Han

*Key Terms: Acoustic fingerprint, self-supervised learning, data augmentation, music information retrieval

What I learned:

  • Contrastive learning that compares the original “clean” audio with its augmented “noisy” version (an NT-Xent-style sketch follows this list)
  • Matching short queries against their corresponding segments in a large database
  • Advantage of neural audio fingerprints being compact, small-sized data
  • Using unrelated segments to learn more unique features for each fingerprint, directly resulting in higher identification accuracy
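
A minimal sketch of an NT-Xent-style contrastive loss over clean/augmented pairs, in the spirit of the paper's setup; the temperature and batch construction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ntxent_loss(clean_emb, noisy_emb, temperature=0.1):
    """Each clean fingerprint should match its own noisy (augmented) version
    and mismatch every other segment in the batch.

    clean_emb, noisy_emb: (batch, dim) embeddings of the same segments.
    """
    z1 = F.normalize(clean_emb, dim=1)
    z2 = F.normalize(noisy_emb, dim=1)
    logits = z1 @ z2.t() / temperature                   # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)  # diagonal = true clean/noisy pairs
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```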

 

Latent Granular Resynthesis Using Neural Audio Codecs

– Jul 2025, https://www.arxiv.org/abs/2507.19202

*Authors: Nao Tokui, Tom Baker

*Key Terms: Latent granular resynthesis, granular codebook, timbre transfer

What I learned:

  • A timbre-transfer technique that applies granular synthesis in the latent space of a neural audio codec
  • The granular codebook built from an encoded sound source acts like a palette of its unique sonic textures
  • Using pre-trained neural audio codecs for the source and target sounds, so new audio can be generated immediately
  • The key role of implicit interpolation in avoiding the audible clicks and pops caused by granular and concatenative synthesis
  • How temperature controls the matching between source and target inputs (a rough matching sketch follows this list)
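
A rough numpy sketch of how I picture the core matching step: each source grain's latent is swapped for a temperature-weighted nearby latent from the target's granular codebook, and the result is then decoded by the codec. The sampling rule is my simplification; encoding/decoding with the actual codec is not shown.

```python
import numpy as np

def latent_granular_resynthesis(source_latents, target_latents, temperature=0.1, rng=None):
    """source_latents: (n_src, dim), target_latents: (n_tgt, dim) per-grain latents
    from a pre-trained neural audio codec.

    For each source grain, sample a target grain with probability that decays with
    latent distance; temperature controls how strict the matching is.
    """
    rng = rng or np.random.default_rng(0)
    out = np.empty_like(source_latents)
    for i, z in enumerate(source_latents):
        dists = np.linalg.norm(target_latents - z, axis=1)   # distance to every target grain
        probs = np.exp(-dists / temperature)
        probs /= probs.sum()
        out[i] = target_latents[rng.choice(len(target_latents), p=probs)]
    return out  # decode with the codec to obtain audio carrying the target's timbre
```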

 

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

– Jul 2025, https://www.arxiv.org/abs/2507.12701

*Authors: Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, Minje Kim

*Key Terms: Neural audio codec, task specific quantization, model layer pipeline

What I learned:

  • Distinction between traditional and neural audio codecs
  • Control over the quantization point: quantizing a deeper layer (lower bitrate) versus an earlier layer (higher bitrate)
  • An ML model split into two parts for data transmission: a sender (encoder + quantizer) and a receiver (decoder), as in the toy sketch below
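
A toy sketch of that sender/receiver split, using a hypothetical conv-stack task model of my own: the sender runs the first blocks and quantizes their output, the receiver finishes the task from the transmitted code. Moving the split deeper shrinks what has to be transmitted, as the notes above describe.

```python
import torch
import torch.nn as nn

# Hypothetical task model: a small stack of conv blocks plus a classifier head.
blocks = nn.ModuleList([
    nn.Sequential(nn.Conv1d(1 if i == 0 else 32, 32, kernel_size=9, stride=4, padding=4), nn.ReLU())
    for i in range(4)
])
head = nn.Linear(32, 10)

def coarse_quantize(x, n_levels=16):
    """Stand-in quantizer: uniform rounding of the latent features."""
    return torch.round(x * n_levels) / n_levels

def sender(audio, split_layer=2):
    """Encoder + quantizer: run the first `split_layer` blocks, then quantize."""
    z = audio
    for block in blocks[:split_layer]:
        z = block(z)
    return coarse_quantize(z)          # a deeper split yields a smaller, more task-specific code

def receiver(code, split_layer=2):
    """Decoder: finish the task from the transmitted code."""
    z = code
    for block in blocks[split_layer:]:
        z = block(z)
    return head(z.mean(dim=-1))        # pooled features -> task logits

logits = receiver(sender(torch.randn(1, 1, 16000)))
```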

 

A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication

– Apr 2025, https://arxiv.org/abs/2504.06561

*Authors: Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Zhen-Hua Ling

*Key Terms: Streamable neural audio codec, residual scalar-vector quantizer, real-time communication

What I learned:

  • The role of neural codecs in various fields, such as real-time communication and speech language models (SLMs)
  • Basic structure of a neural codec = an encoder + a decoder + a residual vector quantizer (RVQ) (see the sketch after this list)
  • How the quantizer affects the latency in real time audio communication
  • Trade-off between data size reduction and loss of precision
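
To make the quantizer concrete, here is a minimal numpy sketch of plain residual vector quantization (the paper's residual scalar-vector variant differs); the codebook sizes and dimensions are arbitrary.

```python
import numpy as np

def residual_vector_quantize(z, codebooks):
    """z: (dim,) latent vector; codebooks: list of (n_codes, dim) arrays.

    Each stage picks the nearest code for the current residual and subtracts it,
    so later stages refine the error left by earlier ones.
    """
    residual = z.copy()
    indices, quantized = [], np.zeros_like(z)
    for cb in codebooks:
        idx = np.argmin(np.linalg.norm(cb - residual, axis=1))  # nearest code to the residual
        indices.append(idx)          # only these indices need to be transmitted
        quantized += cb[idx]
        residual -= cb[idx]
    return indices, quantized        # fewer stages -> lower bitrate, larger residual error

rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 8)) for _ in range(4)]   # 4 stages, 8 bits each
idx, z_hat = residual_vector_quantize(rng.standard_normal(8), codebooks)
```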

 

Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

– May 2025, https://arxiv.org/abs/2505.12863

*Authors: Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, Dasaem Jeong

*Key Terms: Cross-modal music translation, optical music recognition, image-to-audio, MIDI-to-audio, music information retrieval

What I learned:

  • How optical music recognition systems (OMR) work
  • Bottlenecks in OMR (e.g., limited training data for various styles, the complexity of orchestral scores)
  • Machine-readable notation formats such as MusicXML
  • Use of RQVAE (Residual Quantized Variational Autoencoder) for image tokenization
  • The reasons for data augmentation and slide segmentation

 

Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

– Dec 2024, https://arxiv.org/abs/2409.12346

*Authors: Tornike Karchkhadze, Mohammad Rasool Izadi, Shlomo Dubnov

*Key Terms: Source separation, music generation, latent diffusion models

What I learned:

  • The central insight that music source separation and music generation can be unified into a single, cohesive framework
  • The MSG-LD (Music Separation and Generation with Latent Diffusion) model as the core technical innovation
  • The power of structured latent spaces (created by a VAE) for high-level reasoning

 

Compositional Audio Representation Learning

– Dec 2024, https://arxiv.org/abs/2505.12863

*Authors: Sripathi Sridhar, Mark Cartwright

*Key Terms: Source-centric learning, audio representation learning, audio classification

What I learned:

  • Design choices for compositional audio representation learning (CARL)
  • Gap between the performance of the supervised and unsupervised models

 

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

– May 2025, https://arxiv.org/abs/2505.09439

*Authors: Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

*Key Terms: Audio Large Language Models (LLMs)

What I learned:

  • Producing human-labelled or transcribed audio data requires significant resources
  • Audio LLMs can be trained effectively using text-only data and auto-generated questions

 

Cue Point Estimation Using Object Detection

– Jul 2024, https://arxiv.org/abs/2407.06823

*Authors: Giulia Argüello, Luca A. Lanzendörfer, Roger Wattenhofer

*Key Terms: Cue point, Mel spectrograms, DJ dataset, object detection

What I learned:

  • Musical moments can be treated as a computer-vision problem via Mel spectrograms (see the sketch after this list)
  • The need for a massive DJ dataset to train AI to think like a DJ
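
A minimal sketch of the step that turns audio into the “image” an object detector can work on: a log-scaled Mel spectrogram. The frame and mel-band parameters here are my assumptions, not the paper's settings.

```python
import librosa
import numpy as np

def mel_image(path, sr=22050, n_mels=128):
    """Load a track and convert it to a log-mel spectrogram: a 2D array
    that can be fed to a computer-vision model like an image."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # (n_mels, frames), dB-scaled
```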

 

Acoustic Wave Modeling Using 2D FDTD: Applications in Unreal Engine for Dynamic Sound Rendering

– Jul 2025, https://arxiv.org/abs/2507.09376

*Authors: Bilkent Samsurya

*Key Terms: Finite-difference time-domain (FDTD), Unreal Engine, exponential sine sweep, impulse response

What I learned:

  • Two-dimensional finite-difference time-domain (FDTD) framework that simulates sound propagation as a wave-based model in Unreal Engine
  • The model’s high accuracy at low frequencies
  • Use of an exponential sine sweep (ESS) to excite the environment with equal energy per octave band
  • The model’s computational cost, driven by differential-equation solving, impulse-response and convolution calculations (a bare-bones FDTD sketch follows this list)
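
A bare-bones numpy sketch of a 2D FDTD pressure-field update, to show why the cost grows with grid size and step count. Grid size, time step, source, and the periodic boundaries are simplifying assumptions, not the paper's Unreal Engine setup.

```python
import numpy as np

def fdtd_2d(n=200, steps=500, c=343.0, dx=0.05):
    """Leapfrog update of the 2D wave equation: every cell is updated from its
    four neighbours at every time step, which is where the cost comes from."""
    dt = dx / (c * np.sqrt(2.0))                 # CFL-stable time step
    coeff = (c * dt / dx) ** 2
    p_prev = np.zeros((n, n))
    p = np.zeros((n, n))
    p[n // 2, n // 2] = 1.0                      # impulse source at the centre of the grid
    for _ in range(steps):
        lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
               np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)   # periodic boundaries for brevity
        p_next = 2.0 * p - p_prev + coeff * lap  # wave-equation update
        p_prev, p = p, p_next
    return p                                     # pressure field after `steps` updates
```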

 

End-to-end optical music recognition for piano form sheet music

– May 2023, ResearchGate link

*Authors: Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, José M. Iñesta

*Key Terms: Optical music recognition, polyphonic music scores, GrandStaff, neural networks

What I learned: 

  • Rotating the score image by 90 degrees turns the task into well-understood sequential, multi-line text recognition
  • The value of a specialised dataset like GrandStaff, combined with data augmentation
  • The KERN notation as a machine-readable encoding alongside MusicXML
  • The superiority of transformers over RNNs for the long, complex sequences involved in music score analysis

 

MIDISpace: Finding Linear Directions in Latent Space for Music Generation

– June 2022, https://dl.acm.org/doi/fullHtml/10.1145/3527927.3532790

*Authors: Meliksah Turker, Alara Dirik, Pinar Yanardag

*Key Terms: Latent space manipulation, generative models, VAE, sequence models

What I learned:

  • Finding simple, straight-line directions within the latent space that correspond to intuitive musical concepts
  • Manipulating a segment with simple vector arithmetic once its primary direction in the latent space has been discovered (sketched after this list)
  • Choosing monophonic music over complex structures to keep the approach simple and linear
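
A rough sketch of how a linear latent direction can be found and applied. PCA is my assumption for the “simple math”; the encoder/decoder calls around it are omitted.

```python
import numpy as np

def find_direction(latents):
    """latents: (n_samples, dim) encodings of music segments.
    Return the first principal component as a candidate 'musical' direction."""
    centered = latents - latents.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]                          # unit vector of maximum variance

def edit(z, direction, strength=2.0):
    """Shift one latent code along the direction; decode the result to hear the change."""
    return z + strength * direction
```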

 

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

– Dec 2021, https://arxiv.org/abs/2111.05011

*Authors: Antoine Caillon, Philippe Esling

*Key Terms: Variational Autoencoder (VAE), Generative Adversarial Network (GAN), latent space, multiband decomposition, timbre transfer

What I learned:

  • Two-stage training: first, a variational autoencoder (VAE) learns a compact, meaningful latent representation; second, adversarial (GAN) training fine-tunes only the audio generator for high-fidelity output
  • Multiband decomposition for both high-fidelity audio and real-time speed
  • Directly controlling the trade-off between audio quality and file size after training
  • A smaller model can achieve better results (fewer parameters, less memory)
  • The versatility of RAVE for timbre transfer (usage sketched after this list)
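
A sketch of how timbre transfer is usually run with RAVE, assuming a model exported to TorchScript that exposes encode and decode (as in the project's examples); the file name and input audio are placeholders.

```python
import torch

# Assumed: a RAVE model exported to TorchScript via the project's export tooling.
model = torch.jit.load("rave_violin.ts").eval()   # hypothetical model trained on violin

audio = torch.randn(1, 1, 48000)                  # stand-in for loaded voice/drum audio
with torch.no_grad():
    z = model.encode(audio)                       # compress the input into the latent space
    out = model.decode(z)                         # decode through the violin model -> timbre transfer
```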

 

Semi-Automatic Mono to Stereo Up-Mixing Using Sound Source Formation

– May 2007, ResearchGate link

*Authors: Mathieu Lagrange, Luis Gustavo Martins, George Tzanetakis

*Key Terms: Sinusoidal modelling, up-mixing, sound source formation, resynthesis

What I learned: 

  • Analysis process represented as a collection of sinusoids, each defined by its frequency, amplitude, and phase
  • Treating all the sinusoidal components as nodes in a graph before clustering them into sound sources
  • A semi-automatic approach to deciding the sound sources and their panning
  • Masking some frequencies with a Fourier-based method to build the left and right channels (roughly sketched after this list)
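
A crude illustration of that final masking step, not the authors' sinusoidal-model pipeline: assume the clustering stage has already produced a boolean mask over FFT bins for one source, then pan those bins and resynthesize two channels.

```python
import numpy as np

def mask_upmix(mono, source_mask, pan=0.8):
    """mono: (n,) time-domain signal; source_mask: boolean (n // 2 + 1,) over rfft bins
    marking which frequency bins belong to one extracted source.

    The selected bins are panned towards the left, the rest towards the right,
    then both channels are resynthesized by inverse FFT.
    """
    spec = np.fft.rfft(mono)
    left_gain = np.where(source_mask, pan, 1.0 - pan)      # bin-wise gains per channel
    right_gain = np.where(source_mask, 1.0 - pan, pan)
    left = np.fft.irfft(spec * left_gain, n=len(mono))
    right = np.fft.irfft(spec * right_gain, n=len(mono))
    return np.stack([left, right])                         # (2, n) stereo output
```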

 

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modelling

– Dec 2014, https://arxiv.org/abs/1412.3555

*Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

*Key Terms: Recurrent neural networks, gated recurrent unit, long short term memory, tanh

What I learned:

  • Ineffectiveness of standard recurrent neural networks to learn long-term dependencies
  • The advantage of gating mechanisms like LSTM and GRU: selectively remembering important features over long spans and forgetting irrelevant ones
  • GRU, with its simpler architecture, as a strong and efficient alternative to LSTM (see the parameter-count check below)
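
A quick check of the “simpler architecture” point (mine, not the paper's): a GRU layer of the same size carries roughly three quarters of an LSTM's parameters, because it has three gates instead of four.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=64, hidden_size=128)   # four gates' worth of weights
gru = nn.GRU(input_size=64, hidden_size=128)     # three gates' worth of weights
print(n_params(lstm), n_params(gru))             # GRU ~ 3/4 of the LSTM's parameter count
```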