How Audio Denoisers Work

April 28, 2025 · 4 min read


Audio denoising is one of the core challenges in digital signal processing (DSP), machine learning, and real-time communication systems.
If you’ve ever wondered how audio denoisers work under the hood — this is your deep dive.

We’ll cover both traditional DSP approaches and modern AI-based models, with references to key papers and practical resources.

The Problem: What Exactly Is Audio Noise?

In digital recordings, “noise” refers to any unwanted signal that corrupts the original audio.
Common noise types include:

  • Broadband noise (white noise, hiss, fan noise)
  • Narrowband noise (powerline interference, specific tones)
  • Transient noise (keyboard clicks, barking dogs)
  • Reverberation and environmental sounds

An audio denoiser’s job is to suppress or remove this noise while preserving the underlying “clean” signal — usually speech or music.


Traditional DSP-Based Audio Denoising

Before the rise of deep learning, audio denoising was handled almost entirely with classical DSP techniques.
The two most popular approaches were:

1. Spectral Subtraction

Spectral subtraction operates on the basic assumption that noise is additive.
It estimates the noise spectrum during silent portions of the recording, then subtracts this noise profile from the overall signal.

Pipeline:

  • Perform Short-Time Fourier Transform (STFT) to get time-frequency representation
  • Estimate noise spectrum (during non-speech periods)
  • Subtract noise magnitude from the signal magnitude
  • Inverse STFT to reconstruct the waveform
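The four steps above can be sketched with NumPy and SciPy. The window size, the zero floor after subtraction, and the assumption that the first half-second of the recording is noise-only are illustrative choices, not a production recipe:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs, noise_seconds=0.5, nperseg=512):
    # 1. STFT: move to the time-frequency domain
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)

    # 2. Estimate the noise spectrum from an assumed-silent leading segment
    hop = nperseg // 2
    n_noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # 3. Subtract the noise magnitude, flooring negative values at zero
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # 4. Inverse STFT, reusing the noisy phase for reconstruction
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return clean
```

Reusing the noisy phase in step 4 is standard for magnitude-domain methods, and the hard zero floor in step 3 is exactly what produces the "musical noise" artifacts discussed below.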

Key paper:
Boll, Steven F. “Suppression of acoustic noise in speech using spectral subtraction.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979.

Limitations:

  • Musical noise artifacts (tonal warbling sounds)
  • Sensitive to inaccurate noise estimation
  • Doesn’t generalize well to non-stationary noise

2. Wiener Filtering

Wiener filters optimize the tradeoff between removing noise and minimizing signal distortion.

They are based on the Minimum Mean Squared Error (MMSE) criterion, which yields the frequency-domain gain:

H(f) = S(f) / (S(f) + N(f))

Where:

  • S(f) is the speech power spectral density
  • N(f) is the noise power spectral density

Instead of completely removing certain frequencies, the filter applies a frequency-dependent "soft subtraction": bins dominated by speech pass nearly unchanged, while bins dominated by noise are strongly attenuated.
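A minimal NumPy sketch of this gain rule; estimating the speech PSD by subtracting the noise PSD from the noisy PSD is an illustrative assumption (real systems use smoother estimators such as the decision-directed approach):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, eps=1e-12):
    # Estimate the speech PSD as what remains after removing the noise PSD
    speech_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    # H(f) = S(f) / (S(f) + N(f)): a per-frequency gain between 0 and 1
    return speech_psd / (speech_psd + noise_psd + eps)
```

Multiplying the noisy magnitude spectrum by this gain attenuates noise-dominated bins toward zero while leaving speech-dominated bins nearly untouched.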

Reference:
Lim, Jae S., and Alan V. Oppenheim. “Enhancement and bandwidth compression of noisy speech.” Proceedings of the IEEE, 1979.

Modern Deep Learning-Based Denoising

Today, most cutting-edge denoisers are built using deep learning models.
Instead of manually estimating noise profiles, the model learns to predict clean audio from noisy input through training.

Common architectures:

1. Deep Denoising Autoencoders (DDAE)

Autoencoders learn to compress noisy inputs into a lower-dimensional representation and reconstruct clean outputs.

Training:

  • Input: Noisy audio spectrogram
  • Target: Clean audio spectrogram
  • Loss function: typically L2 (mean squared error) or L1
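Under simplifying assumptions (synthetic toy data, a single tanh hidden layer, plain gradient descent, no bias terms), the training setup above can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram" frames: 256 frames of 64 frequency bins (synthetic data)
clean = rng.random((256, 64))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

# One-hidden-layer denoising autoencoder: 64 -> 16 bottleneck -> 64
W1 = 0.1 * rng.standard_normal((64, 16))
W2 = 0.1 * rng.standard_normal((16, 64))

def forward(x):
    h = np.tanh(x @ W1)   # encoder: compress into the bottleneck
    return h, h @ W2      # decoder: reconstruct the clean frame

def mse(pred, target):
    return np.mean((pred - target) ** 2)

initial_loss = mse(forward(noisy)[1], clean)

lr = 0.05
for _ in range(500):
    h, pred = forward(noisy)
    err = (pred - clean) / len(noisy)               # scaled L2 gradient
    gW2 = h.T @ err                                 # decoder gradient
    gW1 = noisy.T @ ((err @ W2.T) * (1 - h ** 2))   # encoder gradient
    W1 -= lr * gW1
    W2 -= lr * gW2

final_loss = mse(forward(noisy)[1], clean)
```

The key point is the input/target pairing: the network sees the noisy frame but is penalized against the clean one, so the bottleneck is forced to keep signal structure and discard noise.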

Reference paper:
Vincent, Pascal, et al. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of Machine Learning Research, 2010.

2. Time-Domain Speech Enhancement with CNNs

Recent models like Wave-U-Net and Demucs operate directly on raw waveforms instead of spectrograms.
This removes the need for STFT and inverse STFT steps and sidesteps the phase-reconstruction problem that spectrogram-based methods face.

Key ideas:

  • Encoder-decoder architecture
  • Skip connections (U-Net style)
  • Specialized loss functions like multi-resolution STFT loss
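As a toy illustration of the encoder-decoder-with-skips idea, with average-pooling and sample repetition standing in for the strided and transposed convolutions a real model would learn:

```python
import numpy as np

def downsample(x):
    # Stand-in for a strided Conv1d encoder layer: halve the resolution
    return x.reshape(-1, 2).mean(axis=1)

def upsample_with_skip(h, skip):
    # Stand-in for a transposed-conv decoder layer: double the resolution,
    # then add the saved encoder activation via the U-Net skip connection
    return np.repeat(h, 2) + skip

x = np.arange(8, dtype=float)     # a tiny "waveform"
skip = x                          # activation saved on the way down
h = downsample(x)                 # coarse bottleneck representation
y = upsample_with_skip(h, skip)   # decoder output, same length as input
```

The skip connection is what lets the decoder restore fine temporal detail that the bottleneck discarded; Wave-U-Net and Demucs stack several such levels with learned filters.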

Reference paper:
Défossez, Alexandre, et al. “Real Time Speech Enhancement in the Waveform Domain.” Interspeech 2020.

3. Transformer-Based Denoisers

Inspired by success in NLP, transformers are now being applied to audio tasks.
Models like SpeechT5 and SE-Conformer combine attention mechanisms with convolutional layers to model long-term dependencies and local audio features simultaneously.

Reference paper:
Chen, Guoliang, et al. “SE-Conformer: Time-Domain Speech Enhancement Using Conformer.” arXiv, 2022.

Common Loss Functions in Deep Learning Denoisers

  • Mean Squared Error (MSE): Between clean and predicted spectrograms
  • Spectral Convergence Loss: Focuses on matching frequency structure
  • Perceptual Loss: Based on high-level feature distances extracted from pre-trained models (e.g., using VGG or wav2vec2.0)
  • STFT Magnitude Loss: Directly penalizes mismatch in frequency domain magnitudes
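The STFT magnitude loss, evaluated at several window sizes as in the multi-resolution STFT loss mentioned earlier, can be sketched as follows; the resolutions chosen are illustrative:

```python
import numpy as np
from scipy.signal import stft

def stft_mag_loss(pred, target, fs, nperseg):
    # L1 distance between STFT magnitudes at one resolution
    _, _, P = stft(pred, fs=fs, nperseg=nperseg)
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    return float(np.mean(np.abs(np.abs(P) - np.abs(T))))

def multires_stft_loss(pred, target, fs, resolutions=(128, 256, 512)):
    # Average the magnitude loss over several window sizes, so that both
    # fine temporal detail and fine frequency detail are penalized
    return float(np.mean([stft_mag_loss(pred, target, fs, n)
                          for n in resolutions]))
```

Averaging over resolutions matters because any single window size trades time resolution against frequency resolution; combining several keeps the model honest on both axes.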

Challenges in Audio Denoising

Despite advances, audio denoising is still tricky because:

  • Generalization: Models trained on synthetic noise often fail in real-world scenarios.
  • Trade-off: Aggressive noise removal risks destroying important audio content.
  • Real-time constraints: Models need to be small and fast for real-time applications like video calls and streaming.
  • Non-stationary noise: Barking dogs, keyboard clicks, laughter — these are harder to predict than steady background hums.

Real-World Implementations

Some practical libraries and APIs for denoising:

  • RNNoise: a lightweight RNN-based noise suppressor designed by Jean-Marc Valin (available on GitHub)
  • Open-Unmix: an open-source music separation and denoising model (available on GitHub)
  • PlayAI Audio Cleaner: an AI-powered tool that automatically removes background noise, reverb, and crowd sounds

Audio denoisers have come a long way — from simple spectral subtraction methods to cutting-edge transformer-based models operating in the raw waveform domain.

For developers and engineers, understanding both traditional DSP techniques and modern deep learning models is crucial if you’re building, fine-tuning, or evaluating denoising systems.

The next evolution? Likely even tighter integration between speech synthesis, separation, and enhancement — where systems not only clean audio but fully reconstruct natural-sounding speech even from heavily degraded inputs.
