Optimizer Comparison
Choosing the right optimizer is essential when training neural networks, especially for audio tasks like denoising or filter tuning. The optimizer controls how model parameters are updated, affecting convergence speed, stability, and final performance. A poor choice can slow training or produce suboptimal results, while the right one helps the model learn efficiently and preserve audio quality. In practice, some optimizers reduce error quickly but leave artifacts, while others may take longer yet maintain natural sound.
Different optimizers have different strengths. SGD is simple but can be slow on complex problems. Nesterov Accelerated Gradient (NAG) adds momentum with a lookahead step for faster convergence. Adam adapts the learning rate for each parameter, balancing speed and precision, which makes it well suited to audio tasks that depend on fine waveform and spectral detail. Picking the right optimizer typically gives faster, more stable training and better final results in a deep learning audio project.
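All three of these optimizers are available in torch.optim. As a quick reference, the snippet below is a minimal sketch showing how each one is typically constructed; the learning rates and momentum value are illustrative choices, not tuned settings.

import torch
import torch.optim as optim

# A toy parameter list standing in for model.parameters()
params = [torch.randn(8, requires_grad=True)]

sgd = optim.SGD(params, lr=0.05)                                # plain stochastic gradient descent
nag = optim.SGD(params, lr=0.05, momentum=0.9, nesterov=True)   # NAG = SGD with Nesterov momentum
adam = optim.Adam(params, lr=0.01)                              # per-parameter adaptive learning rates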

The demonstration below compares these three optimizers on a real audio denoising task, showing their impact on learning speed and audio quality:
import torch
import torch.nn as nn
import torch.optim as optim
import soundfile as sf
import matplotlib.pyplot as plt
# ----- Load real audio -----
x_np, sr = sf.read("testMonoAudio.wav")
waveform = torch.tensor(x_np, dtype=torch.float32)
if waveform.ndim == 1:
    waveform = waveform.unsqueeze(0)  # (channel=1, samples)
# use short segment for speed (2 seconds)
waveform = waveform[:, :sr*2]
# add synthetic noise
noisy = waveform + 0.05 * torch.randn_like(waveform)
noisy = torch.clamp(noisy, -1.0, 1.0)
# add batch dimension
x = noisy.unsqueeze(0) # (batch=1, channel=1, time)
y = waveform.unsqueeze(0)
# ----- Define a small 1D CNN denoiser -----
class DenoiserCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 9, padding=4),
            nn.ReLU(),
            nn.Conv1d(32, 32, 9, padding=4),
            nn.ReLU(),
            nn.Conv1d(32, 1, 9, padding=4)
        )

    def forward(self, x):
        return self.net(x)
# ----- Training function -----
def train_model(optimizer_name, epochs=15):
    model = DenoiserCNN()
    criterion = nn.MSELoss()
    if optimizer_name == 'SGD':
        optimizer = optim.SGD(model.parameters(), lr=0.05)
    elif optimizer_name == 'NAG':
        optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
    elif optimizer_name == 'Adam':
        optimizer = optim.Adam(model.parameters(), lr=0.01)
    losses = []
    for epoch in range(epochs):
        optimizer.zero_grad()
        out = model(x)
        loss = criterion(out, y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
        print(f"[{optimizer_name}] Epoch {epoch+1}/{epochs}, Loss={loss.item():.6f}")
    return model, losses, out.detach()
# ----- Train all optimizers -----
model_sgd, loss_sgd, out_sgd = train_model('SGD')
model_nag, loss_nag, out_nag = train_model('NAG')
model_adam, loss_adam, out_adam = train_model('Adam')
# ----- Save denoised outputs with soundfile -----
for name, out in zip(
["denoised_SGD.wav", "denoised_NAG.wav", "denoised_Adam.wav"],
[out_sgd, out_nag, out_adam]
):
out_np = out.squeeze().cpu().numpy() # remove batch/channel, convert to NumPy
out_np = out_np.clip(-1.0, 1.0) # clamp to [-1,1]
sf.write(name, out_np, sr)
print("\nSaved: denoised_SGD.wav, denoised_NAG.wav, denoised_Adam.wav")
# ----- Plot losses -----
plt.figure(figsize=(8,4))
plt.plot(loss_sgd, label='Stochastic Gradient Descent')
plt.plot(loss_nag, label='Nesterov Accelerated Gradient')
plt.plot(loss_adam, label='Adaptive Moment Estimation')
plt.xlabel('Epoch')
plt.ylabel('MSE Loss')
plt.title('Real Audio Denoising: Optimizer Comparison')
plt.legend()
plt.show()
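Beyond the loss curves, a simple way to compare audio quality is a signal-to-noise-ratio check against the clean reference. The snippet below is a minimal sketch that reuses waveform, noisy, and the out_* tensors from the script above; the snr_db helper is an ad-hoc illustration, not a library function.

def snr_db(clean, estimate):
    # Crude SNR in dB: power of the clean signal over power of the residual error
    residual = clean - estimate
    return 10 * torch.log10(clean.pow(2).mean() / residual.pow(2).mean())

print(f"Noisy input SNR : {snr_db(waveform, noisy).item():.2f} dB")
for name, out in [("SGD", out_sgd), ("NAG", out_nag), ("Adam", out_adam)]:
    print(f"{name:<5s} output SNR: {snr_db(waveform, out.squeeze(0)).item():.2f} dB")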
Here are simple Python functions demonstrating single-step updates for three common optimizers used in deep learning and audio tasks: SGD, Nesterov Accelerated Gradient (NAG), and Adam.
def sgd_step(params, grads, lr=0.01):
"""
Single step of plain SGD.
params: list of torch tensors (model parameters)
grads: list of torch tensors (gradients corresponding to params)
lr: learning rate
"""
with torch.no_grad():
for p, g in zip(params, grads):
p -= lr * g
def nag_step(params, grads_func, v, lr=0.01, momentum=0.9):
"""
Single step of Nesterov Accelerated Gradient.
params: list of torch tensors (model parameters)
grads_func: function returning gradients evaluated at lookahead position
v: list of torch tensors storing previous velocity (same shape as params)
lr: learning rate
momentum: momentum factor
"""
with torch.no_grad():
# Lookahead step: compute gradients at (param + momentum*v)
grads = grads_func([p + momentum * vi for p, vi in zip(params, v)])
for i, (p, vi, g) in enumerate(zip(params, v, grads)):
vi.mul_(momentum).sub_(lr * g) # update velocity
p.add_(vi) # update parameters
v[i] = vi
def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
"""
Single step of Adam optimizer.
params: list of torch tensors (model parameters)
grads: list of torch tensors (gradients)
m: list of first moment estimates (same shape as params)
v: list of second moment estimates (same shape as params)
t: current timestep (int)
lr: learning rate
beta1, beta2: decay rates
eps: small number to avoid division by zero
"""
with torch.no_grad():
for i, (p, g) in enumerate(zip(params, grads)):
m[i] = beta1 * m[i] + (1 - beta1) * g
v[i] = beta2 * v[i] + (1 - beta2) * g.pow(2)
m_hat = m[i] / (1 - beta1**t)
v_hat = v[i] / (1 - beta2**t)
            p -= lr * m_hat / (v_hat.sqrt() + eps)
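To sanity-check these step functions outside the audio pipeline, they can be driven by hand on a toy problem. The sketch below is illustrative only: the quadratic objective, target, and grads_at helper are made up for the example, and each optimizer should end up close to [3., 3.].

# Toy objective f(p) = ((p - target)**2).sum(), whose gradient is 2*(p - target)
target = torch.tensor([3.0, 3.0])

def grads_at(params):
    # Analytic gradient of the toy quadratic at the given parameter values
    return [2 * (p - target) for p in params]

# Plain SGD: only needs the current gradients
params_sgd = [torch.zeros(2)]
for _ in range(100):
    sgd_step(params_sgd, grads_at(params_sgd), lr=0.1)

# NAG: needs a velocity buffer and a gradient *function* for the lookahead evaluation
params_nag = [torch.zeros(2)]
vel = [torch.zeros_like(p) for p in params_nag]
for _ in range(100):
    nag_step(params_nag, grads_at, vel, lr=0.1, momentum=0.9)

# Adam: needs first/second moment buffers and a 1-based timestep for bias correction
params_adam = [torch.zeros(2)]
m = [torch.zeros_like(p) for p in params_adam]
v = [torch.zeros_like(p) for p in params_adam]
for t in range(1, 101):
    adam_step(params_adam, grads_at(params_adam), m, v, t, lr=0.1)

print("SGD :", params_sgd[0])
print("NAG :", params_nag[0])
print("Adam:", params_adam[0])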