Audio AI Reading Notes

A curated list of papers and articles I’ve read recently, focused on neural audio, DSP, and music AI. Each entry includes the original source and my key takeaways.

This page is being updated regularly. More notes and resources will be added soon.

Audio Signal Processing in the Artificial Intelligence Era: Challenges and Directions

– Jul 2025, ResearchGate link

*Authors: Christian J Steinmetz, Christian Uhle, Flavio Everardo, Christopher Mitcheltree, James Keith McElveen, Jean-Marc Jot, Gordon Wichern

*Key Terms: Differentiable digital signal processing, problem taxonomy, perceptual loss function, data bottleneck

What I learned:

  • AI’s role in audio spans three distinct tasks: labelling, processing, and generation
  • Differentiable digital signal processing (DDSP) as a cornerstone of future audio tools
  • The equal importance of human-centric challenges and technical ones
  • The data bottleneck as a major limiting factor
  • The necessity of perceptual loss functions for tasks like audio synthesis, effects, etc.
  • Gap between a functioning AI model and a usable tool for a creative professional
  • Crucial distinction between technically “correct” processing and artistically “good” processing

 

Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

– Jun 2025, https://arxiv.org/abs/2506.22661

*Authors: R. Oguz Araz, Guillem Cortès-Sebastià, Emilio Molina, Joan Serrà, Xavier Serra, Yuki Mitsufuji, Dmitry Bogdanov

*Key Terms: Audio fingerprinting, audio degradation, triplet loss function

What I learned:

  • Importance of realistic training data (with acoustic background noise and impulse responses applied) for real-world performance
  • Use of the simpler triplet loss over the NT-Xent loss (see the loss sketch below)
  • Interaction between the loss function and the training data when multiple degraded versions of each training example are added
  • Importance of keeping bass frequencies (down to 160 Hz), especially in noisy environments where bass travels better
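
A minimal sketch of the triplet-loss idea, assuming L2-normalized fingerprint embeddings; the margin value and embedding size are illustrative, not the paper’s settings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss on L2-normalized fingerprint embeddings.

    anchor:   embedding of a clean audio segment
    positive: embedding of a degraded (noise + reverb) version of the same segment
    negative: embedding of an unrelated segment
    """
    d_pos = np.linalg.norm(anchor - positive)   # pull the degraded version close
    d_neg = np.linalg.norm(anchor - negative)   # push unrelated audio away
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(0)
emb = lambda: rng.normal(size=128)
a = emb(); a /= np.linalg.norm(a)
p = a + 0.05 * emb(); p /= np.linalg.norm(p)    # slightly perturbed "degraded" copy
n = emb(); n /= np.linalg.norm(n)               # unrelated segment
print(triplet_loss(a, p, n))
```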

 

Neural Audio Fingerprint for High-Specific Audio Retrieval Based on Contrastive Learning

– Feb 2021, https://arxiv.org/abs/2010.11910

*Authors: Sungkyun Chang, Donmoon Lee, Jeongsoo Park, Hyungui Lim, Kyogu Lee, Karam Ko, Yoonchang Han

*Key Terms: Acoustic fingerprint, self-supervised learning, data augmentation, music information retrieval

What I learned:

  • Contrastive learning that compares the original “clean” audio with its augmented “noisy” version (see the NT-Xent sketch below)
  • Matching short queries against their corresponding segments in a large database
  • Advantage of neural audio fingerprints in their small data size
  • Using unrelated segments as negatives to learn more distinctive features for each fingerprint, directly improving identification accuracy
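
A rough sketch of the clean-vs-augmented contrastive setup with an NT-Xent-style loss; batch size, embedding size, and temperature are made up for illustration, and this simplified version only scores one direction:

```python
import numpy as np

def nt_xent(clean, noisy, tau=0.1):
    """Simplified NT-Xent-style contrastive loss for fingerprinting.

    clean, noisy: (N, D) L2-normalized embeddings; row i of `noisy` is an
    augmented (noise/reverb) version of row i of `clean`. Every other row in
    the batch acts as a negative, pushing unrelated fingerprints apart.
    """
    sim = clean @ noisy.T / tau                      # (N, N) scaled similarities
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # matched pairs should dominate their row

rng = np.random.default_rng(1)
clean = rng.normal(size=(8, 64))
clean /= np.linalg.norm(clean, axis=1, keepdims=True)
noisy = clean + 0.1 * rng.normal(size=(8, 64))       # stand-in for degraded versions
noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
print(nt_xent(clean, noisy))
```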

 

Latent Granular Resynthesis Using Neural Audio Codecs

– Jul 2025, https://www.arxiv.org/abs/2507.19202

*Authors: Nao Tokui, Tom Baker

*Key Terms: Latent granular resynthesis, granular codebook, timbre transfer

What I learned:

  • A latent-space technique for timbre transfer built on granular synthesis
  • The codebook built from an encoded sound source acts like a palette of its unique sonic textures
  • Using pre-trained neural audio codecs on a source and a target sound, so new audio can be generated immediately without any training
  • Key role of implicit interpolation in avoiding the audible clicks and pops typical of granular and concatenative synthesis
  • A temperature parameter controls how strictly grains from the source codebook are matched to the target (see the sketch below)
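
A toy sketch of the grain-matching step, with random vectors standing in for codec latents (the real system would encode source and target audio with a pre-trained neural audio codec); the temperature-controlled softmax is my reading of how matching strictness could work, not the paper’s exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for grain latents; each row is the latent vector of one short grain.
source_grains = rng.normal(size=(500, 64))   # granular codebook ("palette") of the source sound
target_grains = rng.normal(size=(200, 64))   # grains of the target sound driving the resynthesis

def match_grains(target, codebook, temperature=0.5):
    """For each target grain, blend codebook grains weighted by similarity.

    Low temperature -> close to nearest-neighbour lookup (strict matching);
    high temperature -> softer blends across the whole codebook.
    """
    sims = target @ codebook.T / temperature          # (T, S) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(sims)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ codebook                          # blended latents for the codec decoder

resynth_latents = match_grains(target_grains, source_grains, temperature=0.2)
print(resynth_latents.shape)  # (200, 64): one new latent per target grain
```

Blending in latent space rather than overlapping raw audio grains is also what gives the smooth, click-free transitions mentioned above.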

 

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine

– Jul 2025, https://www.arxiv.org/abs/2507.12701

*Authors: Anastasia Kuznetsova, Inseon Jang, Wootaek Lim, Minje Kim

*Key Terms: Neural audio codec, task specific quantization, model layer pipeline

What I learned:

  • Distinction between traditional and neural audio codecs
  • Control of the quantization point: quantizing a deeper layer (lower bitrate) versus quantizing an earlier layer
  • Splitting the model into two parts for transmission: a sender (encoder + quantizer) and a receiver (decoder), as sketched below
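
A toy illustration of choosing the quantization point in a sender/receiver split; the two-layer encoder and the uniform scalar quantizer are placeholders, not the paper’s learned modules:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 2-layer encoder: each layer shrinks the feature size, so quantizing a
# deeper layer means transmitting fewer values (lower bitrate), while
# quantizing an earlier layer keeps more general-purpose detail.
W1 = rng.normal(size=(256, 128)) * 0.05
W2 = rng.normal(size=(128, 32)) * 0.05

def encode(x, split_layer):
    h1 = np.tanh(x @ W1)            # earlier layer: 128 features per frame
    if split_layer == 1:
        return h1
    return np.tanh(h1 @ W2)         # deeper layer: 32 features per frame

def quantize(h, n_levels=256):
    """Sender side: crude uniform scalar quantizer standing in for a learned one."""
    lo, hi = h.min(), h.max()
    codes = np.round((h - lo) / (hi - lo) * (n_levels - 1)).astype(np.uint8)
    return codes, (lo, hi)

def dequantize(codes, bounds, n_levels=256):
    """Receiver side: reconstruct features and hand them to the task decoder."""
    lo, hi = bounds
    return codes.astype(np.float32) / (n_levels - 1) * (hi - lo) + lo

x = rng.normal(size=(10, 256))      # 10 frames of input features
for layer in (1, 2):
    codes, bounds = quantize(encode(x, layer))
    print(f"split after layer {layer}: {codes.size} bytes for 10 frames")
```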

 

A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication

– Apr 2025, https://arxiv.org/abs/2504.06561

*Authors: Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Zhen-Hua Ling

*Key Terms: Streamable neural audio codec, residual scalar-vector quantizer, real-time communication

What I learned:

  • The role of neural codecs in various fields, such as real-time communication and speech language models (SLMs)
  • Basic structure of a neural codec = an encoder + a decoder + a residual vector quantizer (RVQ); see the RVQ sketch below
  • How the quantizer affects latency in real-time audio communication
  • Trade-off between data size reduction and loss of precision
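
A minimal sketch of plain residual vector quantization, just to show the stage-by-stage idea (the paper’s quantizer is a residual scalar-vector variant, which this does not reproduce); codebook sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def rvq_encode(z, codebooks):
    """Residual vector quantization: each stage quantizes what the previous
    stages could not represent, so precision improves with every codebook."""
    residual = z.copy()
    indices, quantized = [], np.zeros_like(z)
    for cb in codebooks:                                   # cb: (K, D) codewords
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(dists))                        # index that gets transmitted
        indices.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]                      # leftover error goes to the next stage
    return indices, quantized

D, K, n_stages = 16, 64, 4
codebooks = [rng.normal(size=(K, D)) for _ in range(n_stages)]
z = rng.normal(size=D)                                     # one latent frame from the encoder
idx, z_hat = rvq_encode(z, codebooks)
print(idx, float(np.linalg.norm(z - z_hat)))               # transmitted indices + remaining error
```

Dropping later codebooks lowers the bitrate at the cost of precision, which is exactly the trade-off noted above.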

 

Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

– May 2025, https://arxiv.org/abs/2505.12863

*Authors: Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, Dasaem Jeong

*Key Terms: Cross-modal music translation, optical music recognition, image-to-audio, MIDI-to-audio, music information retrieval

What I learned:

  • How optical music recognition (OMR) systems work
  • Bottlenecks in OMR (e.g., limited training data for various styles, the complexity of orchestral scores)
  • Machine-readable notation formats such as MusicXML
  • Use of RQVAE (Residual Quantized Variational Autoencoder) for image tokenization
  • The reasons for data augmentation and slide segmentation

 

Simultaneous Music Separation and Generation Using Multi-Track Latent Diffusion Models

– Dec 2024, https://arxiv.org/abs/2409.12346

*Authors: Tornike Karchkhadze, Mohammad Rasool Izadi, Shlomo Dubnov

*Key Terms: Source separation, music generation, latent diffusion models

What I learned:

  • The fundamental insight of unifying music source separation and music generation in a single, cohesive framework
  • MSG-LD (Music Separation and Generation with Latent Diffusion) model as a core technical innovation
  • The power of structured latent spaces (created by VAE) for high level reasoning

 

Compositional Audio Representation Learning

– Dec 2024, https://arxiv.org/abs/2505.12863

*Authors: Sripathi Sridhar, Mark Cartwright

*Key Terms: Source-centric learning, audio representation learning, audio classification

What I learned:

  • Design choices for compositional audio representation learning (CARL)
  • Gap between the performance of the supervised and unsupervised models

 

Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

– May 2025, https://arxiv.org/abs/2505.09439

*Authors: Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

*Key Terms: Audio Large Language Models (LLMs)

What I learned:

  • Human-labelled or transcribed audio data requires significant resources
  • Audio LLMs can be trained effectively using text-only data and auto-generated questions

 

Cue Point Estimation Using Object Detection

– Jul 2024, https://arxiv.org/abs/2407.06823

*Authors: Giulia Argüello, Luca A. Lanzendörfer, Roger Wattenhofer

*Key Terms: Cue point, Mel spectrograms, DJ dataset, object detection

What I learned:

  • Musical moments can be treated as a computer-vision problem by rendering audio as Mel spectrograms (see the sketch below)
  • The need for a massive DJ dataset to train AI to think like a DJ
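
A small sketch of turning audio into a Mel-spectrogram “image”, using librosa as a common choice (not necessarily the authors’ pipeline); the synthetic tone stands in for a track excerpt:

```python
import numpy as np
import librosa  # widely used audio library; an assumption, not the paper's toolchain

# Synthetic stand-in for a track excerpt: 5 seconds of a 440 Hz tone plus noise.
sr = 22050
t = np.linspace(0, 5.0, int(5.0 * sr), endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t) + 0.05 * np.random.randn(t.size)

# Render the audio as a Mel spectrogram (frequency x time). This 2D array is
# what lets an object detector look for cue-point regions the same way it
# would look for objects in a photo.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log scaling, like pixel intensities
print(mel_db.shape)                             # (128 mel bands, ~216 time frames)
```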

 

Acoustic Wave Modeling Using 2D FDTD: Applications in Unreal Engine for Dynamic Sound Rendering

– Jul 2025, https://arxiv.org/abs/2507.09376

*Authors: Bilkent Samsurya

*Key Terms: Finite-difference time-domain (FDTD), wave-based sound propagation, exponential sine sweep, impulse response, Unreal Engine

What I learned:

  • A two-dimensional finite-difference time-domain (FDTD) framework that simulates sound propagation as a wave-based model in Unreal Engine (see the update sketch below)
  • The model’s high accuracy at low frequencies
  • Use of an exponential sine sweep (ESS) to excite the environment with equal energy per octave band
  • The model’s computational cost, driven by solving differential equations and by impulse-response and convolution calculations
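
A bare-bones 2D FDTD update for the acoustic wave equation, just to show the leapfrog scheme; the grid, boundaries (periodic here via np.roll, where a real room needs reflective or absorbing ones), and impulse source are illustrative and far simpler than the paper’s Unreal Engine setup:

```python
import numpy as np

c = 343.0                    # speed of sound (m/s)
dx = 0.05                    # grid spacing (m)
dt = dx / (c * np.sqrt(2))   # time step at the 2D stability (CFL) limit
n = 200                      # pressure grid is n x n cells

p_prev = np.zeros((n, n))
p = np.zeros((n, n))
p[n // 2, n // 2] = 1.0      # impulse source in the middle of the domain

coeff = (c * dt / dx) ** 2
for step in range(300):
    # Discrete Laplacian of the pressure field (periodic edges via np.roll).
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
           np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
    p_next = 2 * p - p_prev + coeff * lap   # leapfrog update of the wave equation
    p_prev, p = p, p_next

print(float(np.abs(p).max()))               # pressure field after 300 steps
```

Recording the pressure at a listener cell over time gives the impulse response that is later convolved with dry audio, which is where the convolution cost mentioned above comes from.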

 

End-to-end optical music recognition for piano form sheet music

– May 2023, ResearchGate link

*Authors: Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, José M. Iñesta

*Key Terms: Optical music recognition, polyphonic music scores, GrandStaff, neural networks

What I learned: 

  • A 90-degree rotation of the score image turns the task into well-understood sequential, multi-line text recognition
  • The value of a dedicated dataset such as GrandStaff, combined with data augmentation
  • KERN notation for MusicXML
  • Superiority of Transformers over RNNs on the long, complex sequences found in music score analysis

 

MIDISpace: Finding Linear Directions in Latent Space for Music Generation

– June 2022, https://dl.acm.org/doi/fullHtml/10.1145/3527927.3532790

*Authors: Meliksah Turker, Alara Dirik, Pinar Yanardag

*Key Terms: Latent space manipulation, generative models, VAE, sequence models

What I learned:

  • Finding simple, straight-line directions in the latent space that correspond to intuitive musical concepts
  • Manipulating a segment with simple vector arithmetic once its primary direction in the latent space has been found (see the sketch below)
  • Choosing monophonic music over complex structures to keep the approach simple and linear
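
A sketch of the vector-arithmetic idea with random stand-ins for VAE latent codes; estimating the direction as a difference of group means is one simple way to find a linear direction and may not be the paper’s exact estimator:

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for VAE latent codes of encoded MIDI segments.
# Group A: segments with "more" of some musical attribute (e.g. note density),
# Group B: segments with "less" of it.
group_a = rng.normal(loc=0.5, size=(100, 32))
group_b = rng.normal(loc=-0.5, size=(100, 32))

# A simple linear direction: the difference of the group means, normalized.
direction = group_a.mean(axis=0) - group_b.mean(axis=0)
direction /= np.linalg.norm(direction)

# Manipulation is then just vector arithmetic on a single latent code.
z = rng.normal(size=32)            # latent code of the segment to edit
z_more = z + 2.0 * direction       # push the attribute up
z_less = z - 2.0 * direction       # push it down (decode these to hear the change)
print(float(direction @ (z_more - z_less)))
```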

 

RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

– Dec 2021, https://arxiv.org/abs/2111.05011

*Authors: Antoine Caillon, Philippe Esling

*Key Terms: Variational Autoencoder (VAE), Generative Adversarial Network (GAN), latent space, multiband decomposition, timbre transfer

What I learned:

  • Two-stage training: first, a variational autoencoder (VAE) learns a compact and meaningful latent representation; second, adversarial (GAN) fine-tuning of only the audio generator for high-fidelity output
  • Multiband decomposition for both high-fidelity audio and real-time speed
  • Directly controlling the trade-off between audio quality and file size after training
  • A smaller model achieving better results (fewer parameters, less memory)
  • RAVE’s versatility, e.g. for timbre transfer

 

Semi-Automatic Mono to Stereo Up-Mixing Using Sound Source Formation

– May 2007, ResearchGate link

*Authors: Mathieu Lagrange, Luis Gustavo Martins, George Tzanetakis

*Key Terms: Sinusoidal modelling, up-mixing, sound source formation, resynthesis

What I learned: 

  • Analysis process represented as a collection of sinusoids, each defined by its frequency, amplitude, and phase
  • Using all the sinusoidal components as nodes in a graph just before clustering them
  • A semi-automatic approach to deciding sound sources and assigning panning to them
  • Masking some frequencies for the left and right channels using a Fourier-based method

 

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modelling

– Dec 2014, https://arxiv.org/abs/1412.3555

*Authors: Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

*Key Terms: Recurrent neural networks, gated recurrent unit, long short-term memory, tanh

What I learned:

  • Ineffectiveness of standard recurrent neural networks at learning long-term dependencies
  • Advantage of gating mechanisms such as LSTM and GRU: selectively remembering important features over long periods and forgetting irrelevant ones
  • GRU, with its simpler architecture, as a strong and efficient alternative to LSTM (see the cell sketch below)
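
A numpy sketch of a single GRU cell step using the standard update/reset-gate equations (biases omitted); the weights are random just to make it runnable:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step. The update gate z decides how much of the old state to
    keep, and the reset gate r decides how much of it to use when proposing
    new content; this gating is what lets the network remember over long spans."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1.0 - z) * h + z * h_tilde        # blend old state and candidate

rng = np.random.default_rng(6)
d_in, d_hid = 8, 16
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_in, d_hid), (d_hid, d_hid)] * 3]
h = np.zeros(d_hid)
for t in range(20):                            # run over a short input sequence
    h = gru_cell(rng.normal(size=d_in), h, params)
print(h.shape)
```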

 

A Fourier Explanation of AI music Artifacts

– Jun 2025, https://arxiv.org/abs/2506.19108

*Authors: Darius Afchar, Gabriel Meseguer-Brocal, Kamil Akesbi, Romain Hennequin

*Key Terms: Fourier analysis, frequency artifacts, deconvolution, AI music detection

What I learned:

  • Generative AI models create systematic frequency artifacts in their outputs, known as checkerboard artifacts
  • These artifacts are caused by the chosen model architecture rather than by the training data or learned model weights
  • “Periodization” of the signal’s spectrum during the deconvolution process, particularly due to zero-insertion upsampling, is what causes the artifacts (see the sketch below)
  • An interpretable linear model leveraging these artifact fingerprints reaches detection accuracy on par with deep learning-based approaches
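
A small numpy experiment showing the spectrum periodization: zero-insertion upsampling mirrors a signal’s spectrum into the upper band, and those spectral images are the seed of the checkerboard-style artifacts if later layers do not filter them out perfectly. The test signal is made up:

```python
import numpy as np

# A band-limited "feature map": a low-frequency signal, as a decoder layer might produce.
n = 256
t = np.arange(n)
x = np.sin(2 * np.pi * 5 * t / n) + 0.5 * np.sin(2 * np.pi * 12 * t / n)

# Zero-insertion upsampling by 2 (transposed-conv style): a zero between samples.
y = np.zeros(2 * n)
y[::2] = x

# The upsampled spectrum contains a mirrored copy ("image") of the original
# spectrum in the upper half of the band: the periodization described above.
Y = np.abs(np.fft.rfft(y))
low, high = Y[: n // 2], Y[-(n // 2):]
print(np.allclose(low, high[::-1]))   # True: the upper band mirrors the lower band
```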

 

Single Channel Speech Enhancement: Using Wiener Filtering with Recursive Noise Estimation

– May 2016, ResearchGate link

*Authors: Navneet Upadhyay, Rahul Jaiswal

*Key Terms: Fourier analysis, speech denoising, Wiener filter, noise estimation, musical noise

What I learned:

  • Minimizing mean-squared error via Wiener filtering, combined with a recursive method for estimating the noise (sketched below)
  • Importance of accurate noise power estimation on the quality of enhanced speech
  • Musical noise caused by traditional spectral subtraction methods
  • The direct effect of noise type and speaker gender on the results
  • The important role of the smoothing parameter in the recursive noise estimation
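
A simplified sketch of per-frame Wiener gains with a recursive noise-power update; the noise-detection rule and the smoothing-parameter value are placeholders, not the paper’s exact estimator:

```python
import numpy as np

def enhance(frames_power, alpha=0.98, update_factor=2.0):
    """Per-frame spectral Wiener gains with recursive noise-power estimation.

    frames_power: (T, F) magnitude-squared spectra of the noisy speech frames.
    alpha: smoothing parameter of the recursive noise estimate; too small and
           the estimate chases speech, too large and it adapts slowly.
    """
    noise_power = frames_power[0].copy()             # assume the first frame is noise only
    gains = np.zeros_like(frames_power)
    for t, p_x in enumerate(frames_power):
        # Recursively update the noise estimate in bins that look noise-dominated.
        noise_like = p_x < update_factor * noise_power
        noise_power = np.where(noise_like,
                               alpha * noise_power + (1 - alpha) * p_x,
                               noise_power)
        # Wiener gain: estimated clean power over noisy power.
        clean_power = np.maximum(p_x - noise_power, 0.0)
        gains[t] = clean_power / np.maximum(p_x, 1e-12)
    return gains   # multiply with the noisy spectrum, then inverse-STFT

rng = np.random.default_rng(8)
noisy = rng.random((100, 257)) + 1.0                  # toy stand-in for |STFT|^2 frames
print(enhance(noisy).mean())
```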

 

Can Large Language Models Predict Audio Effects Parameters from Natural Language?

– Jul 2025, https://arxiv.org/abs/2505.20770

*Authors: Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Juhan Nam, Yuki Mitsufuji

*Key Terms: Effect parameter, LLM, DSP, audio processing

What I learned:

  • LLMs can act as “zero-shot” audio engineers, drawing on the relationships between descriptive words and audio engineering concepts learned during their vast pre-training on general text data
  • Providing specialized context, such as DSP code, audio features, and few-shot examples, for higher performance
  • Importance of model size and task complexity
  • Benefits of such a model for human-computer interaction in music production, rather than requiring users to manipulate complex knobs and sliders

 

FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

– Sep 2024, https://arxiv.org/abs/2409.07614

*Authors: Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang

*Key Terms: FlowSep, generative AI, Rectified Flow Matching

What I learned:

  • FlowSep’s “generative” approach reconstructs the desired sound from scratch, guided by the text prompt and the original mix
  • Efficiency of the Rectified Flow Matching technique compared to other advanced generative methods such as diffusion models
  • Robustness of the FlowSep approach on complex, overlapping sounds for source separation

 

On the importance of initialization and momentum in deep learning

– Jan 2013, ResearchGate link

*Authors: Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton

*Key Terms: Optimizer, generative AI, neural networks

What I learned:

  • That the difficulty of training deep networks lay in the tools, not the task itself
  • The necessity of thoughtful initialization, together with a specific momentum strategy, for successful training
  • Superiority of Nesterov Accelerated Gradient (NAG) over classical momentum for deep learning (see the update sketch below)
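
A tiny comparison of classical momentum and NAG on a stiff 1-D quadratic; the curvature, learning rate, and momentum values are chosen only to make the difference visible, not taken from the paper:

```python
def run(nesterov, steps=50):
    """Minimize f(theta) = 0.5 * curvature * theta**2 with (Nesterov) momentum."""
    theta, v = 1.0, 0.0
    curvature, mu, eps = 100.0, 0.9, 0.005   # stiff quadratic, momentum, learning rate
    for _ in range(steps):
        # NAG evaluates the gradient at the "lookahead" point theta + mu * v,
        # i.e. after the momentum step; classical momentum evaluates it at theta.
        lookahead = theta + mu * v if nesterov else theta
        grad = curvature * lookahead
        v = mu * v - eps * grad
        theta = theta + v
    return abs(theta)                         # distance from the optimum at 0

print("classical momentum:", run(nesterov=False))  # oscillates, decays slowly
print("Nesterov (NAG):    ", run(nesterov=True))   # the lookahead damps the oscillation
```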

 

Long-form music generation with latent diffusion

– Apr 2024, https://arxiv.org/abs/2404.10301

*Authors: Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

*Key Terms: Generative AI, neural networks, latent diffusion architecture

What I learned:

  • Benefit of full-context generation – instead of generating piece by piece – for producing full-length and structurally coherent music
  • Need for powerful autoencoders to compress the audio into a highly compact “latent” representation
  • That the model can learn the principles of musical structure implicitly by training on a long enough time window (the full track)

 

ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

– Jul 2025, https://arxiv.org/abs/2506.16889

*Authors: Junghyun Koo, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Giorgio Fabbro, Michele Mancusi, Yuki Mitsufuji

*Key Terms: Music mastering process, style transfer, inference-time optimization

What I learned:

  • A mastering style transfer approach that analyses a reference track’s sonic elements and then applies them to an unmastered track
  • A novel fine-tuning technique called Inference-Time Optimization (ITO)
  • Trade-off between “black-box” and “white-box” AI models

 

Attention Is All You Need

– Aug 2023, https://arxiv.org/abs/1706.03762#

*Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

*Key Terms: Transformer, self-attention, positional encoding

What I learned:

  • The attention mechanism lets the model assign a different weight to each token in the input sequence, based on its importance, when generating an output
  • The Transformer’s parallelizable design, allowing it to be trained much more quickly and on vastly larger datasets
  • Positional Encoding, a unique mathematical signal (a vector built from sine and cosine functions) that tells the model where each token sits in the sequence (see the sketch below)
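
A sketch of the sinusoidal positional encoding from the paper; the sequence length and model dimension are illustrative:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding: one d_model-dim vector per position.

    Each dimension pair uses a sine and a cosine at a different wavelength,
    so every position gets a unique pattern the model can use to tell token
    order apart, with no learned parameters.
    """
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(n_positions=50, d_model=64)
print(pe.shape)   # (50, 64): added to the token embeddings before attention
```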

 

General Purpose Audio Effect Removal

– Aug 2023, https://arxiv.org/abs/2308.16177

*Authors: Matthew Rice, Christian J. Steinmetz, George Fazekas, Joshua D. Reiss

*Key Terms: Audio effects, deep learning, audio engineering

What I learned:

  • The real-world complexity of removing chains of effects: unknown effects applied in an unknown order
  • Inefficiency of a single “do-it-all” model
  • That the best approach is a “detect, then remove” method handling effects one by one, paired with a clever training strategy
  • The limitations that still make the task extremely hard to solve