Pretrained Mono CNN in JUCE Plugin

*Testing Stage*

I tried adding a pretrained mono CNN to a JUCE plugin to experiment with real-time audio processing, with a vocal denoiser as the main use case in mind. The process turned out simpler than I expected, and it was fun to see AI running inside a plugin.

I started with a medium-sized CNN trained on mono audio frames. To make it run fast enough for real time, I applied int8 quantization, which made the model smaller and much quicker without losing much accuracy. Then I exported it with TorchScript, which produces a standalone file that LibTorch can load from C++ (the full script is below).

On the JUCE side, once LibTorch is linked into the plugin, loading the model is straightforward: a single call to torch::jit::load with the path to the .pt file. Since the model was trained on mono audio, any stereo input has to be downmixed first, or you can feed it mono only. I split the audio into frames matching the CNN's input size, ran each frame through the model, and wrote the output back to the buffer. The int8 model kept up with real time comfortably; a sketch of that callback follows.
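
To make that concrete, here is a minimal sketch of what the callback could look like. It is illustrative only: MyPluginProcessor is a hypothetical plugin class that owns the loaded module as a member, the block size is assumed to be a multiple of the 128-sample frame, and the model is assumed to return a frame the same length as its input (a denoiser, unlike the classifier in the script below). A real plugin would buffer leftover samples in a FIFO and would likely move inference off the audio thread for real-time safety.

#include <torch/script.h>
#include <juce_audio_processors/juce_audio_processors.h>
#include <cstring>

// torch::jit::script::Module model; // member of the hypothetical MyPluginProcessor

void MyPluginProcessor::processBlock (juce::AudioBuffer<float>& buffer,
                                      juce::MidiBuffer&)
{
    const int numSamples = buffer.getNumSamples();
    const int frameSize  = 128; // must match the CNN's input length

    // Downmix stereo to mono in place: average the first two channels
    if (buffer.getNumChannels() > 1)
    {
        buffer.addFrom (0, 0, buffer, 1, 0, numSamples);
        buffer.applyGain (0, 0, numSamples, 0.5f);
    }

    float* mono = buffer.getWritePointer (0);
    torch::NoGradGuard noGrad; // inference only, no autograd

    // Run each complete frame through the model and write it back;
    // leftover samples would go into a FIFO in a real plugin
    for (int start = 0; start + frameSize <= numSamples; start += frameSize)
    {
        torch::Tensor input  = torch::from_blob (mono + start, {1, 1, frameSize});
        torch::Tensor output = model.forward ({input}).toTensor().contiguous();
        std::memcpy (mono + start, output.data_ptr<float>(),
                     sizeof (float) * frameSize);
    }

    // Copy the processed mono signal to the remaining channels
    for (int ch = 1; ch < buffer.getNumChannels(); ++ch)
        buffer.copyFrom (ch, 0, buffer, 0, 0, numSamples);
}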

A few things I learned along the way: calibrate quantization with real audio frames when possible, double-check that frame sizes match the CNN's input, and compare float32 and int8 outputs to confirm quantization has not hurt accuracy (a minimal comparison harness is sketched below). Overall, this workflow makes it pretty easy to integrate AI-powered audio processing into a JUCE plugin without taxing the CPU.
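
That comparison can be done offline in a few lines. The sketch below assumes a float32 TorchScript export saved as audio_cnn_fp32.pt before quantization (a hypothetical file name, not produced by the script below) and reports the largest output difference on a random frame; real audio frames would give a more meaningful number.

#include <torch/script.h>
#include <iostream>

int main()
{
    torch::NoGradGuard noGrad;

    // "audio_cnn_fp32.pt" is a hypothetical float32 export saved before
    // quantization; "audio_cnn_int8.pt" comes from the script below
    auto fp32Model = torch::jit::load("audio_cnn_fp32.pt");
    auto int8Model = torch::jit::load("audio_cnn_int8.pt");
    fp32Model.eval();
    int8Model.eval();

    // Push the same frame through both models
    torch::Tensor input = torch::randn({1, 1, 128});
    torch::Tensor a = fp32Model.forward({input}).toTensor();
    torch::Tensor b = int8Model.forward({input}).toTensor();

    // Largest per-element deviation as a quick accuracy check
    std::cout << "max |fp32 - int8| = "
              << (a - b).abs().max().item<float>() << std::endl;
    return 0;
}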

import torch
import torch.nn as nn
import torch.quantization

# --- Example medium CNN for audio frame classification ---

class AudioCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Stubs mark where tensors enter and leave int8; eager-mode
        # static quantization fails at runtime without them
        self.quant = torch.quantization.QuantStub()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(16)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(32)
        self.fc1 = nn.Linear(32 * 128, 64)  # assumes 128-sample input frames
        self.fc2 = nn.Linear(64, 10)  # example: 10 classes
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = torch.relu(self.bn1(self.conv1(x)))
        x = torch.relu(self.bn2(self.conv2(x)))
        x = x.reshape(x.size(0), -1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)
        
# --- Convert to int8 (Post-Training Quantization) ---

# Load the previously trained weights into AudioCNN
model = AudioCNN()
model.load_state_dict(torch.load("audio_cnn.pth")) 
model.eval()

# Fuse conv+bn pairs so BatchNorm folds into the conv weights
# (fusion requires eval mode)
torch.quantization.fuse_modules(
    model, [["conv1", "bn1"], ["conv2", "bn2"]], inplace=True)

# Set quantization configuration (fbgemm targets x86 CPUs)
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.quantization.prepare(model, inplace=True)

# Calibrate the observers with representative data; real audio frames
# are recommended, random noise is only a stand-in here
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(32, 1, 128))

# Convert to int8
torch.quantization.convert(model, inplace=True)

# --- TorchScript export for C++ ---
scripted_model = torch.jit.script(model)
scripted_model.save("audio_cnn_int8.pt")
print("Quantized TorchScript model saved: audio_cnn_int8.pt")

// --- C++ side (LibTorch) ---

#include <torch/script.h> // LibTorch
#include <iostream>
#include <vector>

int main() 
{
    // Load scripted model
    torch::jit::script::Module model = torch::jit::load("audio_cnn_int8.pt");
    model.eval();

    // Disable autograd for inference
    torch::NoGradGuard noGrad;

    // Example audio frame (float) from JUCE callback
    std::vector<float> audioFrame(128, 0.0f);

    // Wrap the frame in a tensor; from_blob does not copy, so the
    // vector must outlive the tensor
    torch::Tensor input = torch::from_blob(audioFrame.data(), {1, 1, 128});

    // Run inference
    torch::Tensor output = model.forward({input}).toTensor();

    // Print output shape
    std::cout << "Output shape: " << output.sizes() << std::endl;

    return 0;
}
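
In an actual plugin, the load would happen once at startup rather than in the audio callback, with error handling since torch::jit::load throws c10::Error on failure. A hypothetical constructor for the MyPluginProcessor sketched earlier (model and modelLoaded are assumed members):

#include <torch/script.h>

MyPluginProcessor::MyPluginProcessor()
{
    try
    {
        // Load once, never on the audio thread
        model = torch::jit::load ("audio_cnn_int8.pt");
        model.eval();
        modelLoaded = true;
    }
    catch (const c10::Error&)
    {
        modelLoaded = false; // bypass processing if the model fails to load
    }
}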