Audio denoising is one of the core challenges in digital signal processing (DSP), machine learning, and real-time communication systems.
If you’ve ever wondered how audio denoisers work under the hood — this is your deep dive.
We’ll cover both traditional DSP approaches and modern AI-based models, with references to key papers and practical resources.
In digital recordings, “noise” refers to any unwanted signal that corrupts the original audio.
Common noise types include:
- Stationary background noise such as hiss, hum, and fan or air-conditioner rumble
- Transient noise such as clicks, pops, and keyboard taps
- Environmental noise such as traffic, wind, or crowd babble
- Reverberation and echo from the recording space
An audio denoiser’s job is to suppress or remove this noise while preserving the underlying “clean” signal — usually speech or music.
Before the rise of deep learning, audio denoising was handled almost entirely with classical DSP techniques.
The two most popular approaches were:
Spectral subtraction operates on the basic assumption that noise is additive.
It estimates the noise spectrum during silent portions of the recording, then subtracts this noise profile from the overall signal.
Pipeline (sketched in the code below):
1. Compute the short-time Fourier transform (STFT) of the noisy signal.
2. Estimate the noise magnitude spectrum from frames that contain no speech (for example, the first few hundred milliseconds or detected pauses).
3. Subtract the noise estimate from the magnitude spectrum of every frame, flooring negative values at zero.
4. Recombine the cleaned magnitudes with the original phase and apply the inverse STFT.
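To make that pipeline concrete, here is a minimal sketch in Python using NumPy and SciPy. It assumes the start of the recording is noise-only (a stand-in for a real voice-activity detector); the function name and parameters are illustrative, not taken from any particular library.

```python
# A minimal spectral-subtraction sketch using NumPy and SciPy.
# Assumes the first `noise_seconds` of the recording contain only noise;
# real systems would use a voice-activity detector instead.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, sr, noise_seconds=0.5, nperseg=512):
    # 1. STFT of the noisy signal
    _, _, Z = stft(noisy, fs=sr, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)

    # 2. Estimate the noise magnitude spectrum from the leading noise-only frames
    hop = nperseg // 2
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # 3. Subtract the noise estimate, flooring negative values at zero
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    # 4. Recombine with the original phase and invert the STFT
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return clean[: len(noisy)]
```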
Key paper:
Boll, Steven F. “Suppression of acoustic noise in speech using spectral subtraction.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979.
Limitations:
- Over-subtraction leaves isolated spectral peaks that are heard as "musical noise" artifacts.
- It assumes the noise is roughly stationary, so it struggles with non-stationary noise such as keyboard clicks or babble.
- The noise estimate depends on correctly detecting noise-only segments; errors there distort the speech.
Wiener filters optimize the tradeoff between removing noise and minimizing signal distortion.
They are based on the Minimum Mean Squared Error (MMSE) criterion. In the frequency domain, the Wiener gain applied to each bin is:

H(f) = P_s(f) / (P_s(f) + P_n(f))

Where:
- P_s(f) is the power spectral density of the clean speech
- P_n(f) is the power spectral density of the noise (estimated, e.g., from noise-only frames)
- H(f) is a gain between 0 and 1 that multiplies the noisy spectrum at frequency f
You basically apply a “soft subtraction” instead of completely removing certain frequencies.
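As an illustration of that soft subtraction, the sketch below applies a Wiener-style gain per STFT bin. The pre-computed noise power estimate, function name, and parameters are assumptions for the example rather than any specific library's API.

```python
# A minimal Wiener-style gain sketch: attenuate each STFT bin in proportion
# to its estimated signal-to-noise ratio instead of hard-zeroing it.
import numpy as np
from scipy.signal import stft, istft

def wiener_gain_denoise(noisy, sr, noise_psd, nperseg=512, eps=1e-10):
    # noise_psd: per-bin noise power estimate, e.g. averaged from noise-only frames
    _, _, Z = stft(noisy, fs=sr, nperseg=nperseg)
    noisy_psd = np.abs(Z) ** 2

    # Estimated speech power: noisy power minus noise power, floored at zero
    speech_psd = np.maximum(noisy_psd - noise_psd[:, None], 0.0)

    # Wiener gain H(f) = P_s / (P_s + P_n): close to 1 where speech dominates,
    # close to 0 where noise dominates
    gain = speech_psd / (speech_psd + noise_psd[:, None] + eps)

    _, clean = istft(gain * Z, fs=sr, nperseg=nperseg)
    return clean[: len(noisy)]
```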
Reference:
Lim, Jae S., and Alan V. Oppenheim. “Enhancement and bandwidth compression of noisy speech.” Proceedings of the IEEE, 1979.
Today, most cutting-edge denoisers are built using deep learning models.
Instead of manually estimating noise profiles, the model learns to predict clean audio from noisy input through training.
Common architectures:
- Denoising autoencoders
- Waveform-domain encoder-decoder models (e.g., Wave-U-Net, Demucs)
- Transformer- and Conformer-based models
Autoencoders learn to compress noisy inputs into a lower-dimensional representation and reconstruct clean outputs.
Training (sketched below):
- Create pairs of noisy and clean audio by mixing clean recordings with noise at various signal-to-noise ratios.
- Feed the noisy version to the network and compute a reconstruction loss (e.g., L1 or MSE) against the clean target.
- Repeat over a large, varied dataset so the model generalizes to unseen noise.
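Here is a minimal training-loop sketch in PyTorch. The layer sizes, placeholder batches (noisy_batch, clean_batch), and hyperparameters are illustrative assumptions; a real system would operate on spectrogram frames or learned features and train on a large paired corpus.

```python
# A minimal denoising-autoencoder sketch in PyTorch.
# Input: flattened spectrogram frames (or any fixed-size feature vector).
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features=257, hidden=128):
        super().__init__()
        # Encoder compresses the noisy frame into a lower-dimensional code
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        # Decoder reconstructs a clean frame from that code
        self.decoder = nn.Sequential(nn.Linear(hidden, n_features), nn.ReLU())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch of paired noisy/clean magnitude frames (assumed shapes)
noisy_batch = torch.rand(32, 257)
clean_batch = torch.rand(32, 257)

for step in range(100):
    optimizer.zero_grad()
    reconstructed = model(noisy_batch)
    loss = loss_fn(reconstructed, clean_batch)  # predict clean from noisy
    loss.backward()
    optimizer.step()
```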
Reference paper:
Vincent, Pascal, et al. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of Machine Learning Research, 2010.
Recent models like Wave-U-Net and Demucs operate directly on raw waveforms instead of spectrograms.
This removes the need for STFT and inverse STFT steps, allowing cleaner phase preservation.
Key ideas (illustrated below):
- An encoder-decoder (U-Net-style) structure built from 1-D convolutions applied directly to the waveform
- Skip connections between matching encoder and decoder layers to preserve fine temporal detail
- Downsampling and upsampling across layers so the network sees both local and longer-range context
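The toy model below illustrates the idea of a waveform-domain encoder-decoder with skip connections. It is a heavily simplified sketch, not the actual Wave-U-Net or Demucs architecture; layer counts, channel sizes, and kernel sizes are arbitrary assumptions.

```python
# A toy waveform-domain encoder-decoder with a skip connection (PyTorch).
# Heavily simplified compared to Wave-U-Net / Demucs.
import torch
import torch.nn as nn

class TinyWaveDenoiser(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # Encoder: strided 1-D convolutions downsample the raw waveform
        self.enc1 = nn.Conv1d(1, channels, kernel_size=8, stride=4, padding=2)
        self.enc2 = nn.Conv1d(channels, channels * 2, kernel_size=8, stride=4, padding=2)
        # Decoder: transposed convolutions upsample back to the waveform
        self.dec2 = nn.ConvTranspose1d(channels * 2, channels, kernel_size=8, stride=4, padding=2)
        self.dec1 = nn.ConvTranspose1d(channels, 1, kernel_size=8, stride=4, padding=2)
        self.act = nn.ReLU()

    def forward(self, x):                    # x: (batch, 1, samples)
        e1 = self.act(self.enc1(x))
        e2 = self.act(self.enc2(e1))
        d2 = self.act(self.dec2(e2)) + e1    # skip connection preserves detail
        return self.dec1(d2)                 # predicted clean waveform

model = TinyWaveDenoiser()
noisy = torch.randn(1, 1, 16000)             # one second of fake audio at 16 kHz
clean_estimate = model(noisy)
```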
Reference paper:
Défossez, Alexandre, et al. “Real Time Speech Enhancement in the Waveform Domain.” Interspeech 2020.
Inspired by success in NLP, transformers are now being applied to audio tasks.
Models like SE-Conformer use Conformer blocks, which interleave self-attention with convolution so the network can model long-range dependencies and local audio features simultaneously, while models like SpeechT5 apply a unified encoder-decoder transformer across multiple speech tasks, including enhancement.
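A rough sketch of that attention-plus-convolution pattern is shown below. It is loosely inspired by the Conformer block rather than a faithful reproduction, and all dimensions are assumptions.

```python
# A rough attention-plus-convolution block (loosely Conformer-inspired, PyTorch).
import torch
import torch.nn as nn

class AttentionConvBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        # Self-attention captures long-range dependencies across time
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # Depthwise 1-D convolution captures local structure in the sequence
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, time, dim)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)            # residual + attention
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)         # residual + local convolution

block = AttentionConvBlock()
frames = torch.randn(2, 100, 256)               # (batch, time, feature dim)
out = block(frames)
```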
Reference paper:
Kim, Eesung, and Hyeji Seo. “SE-Conformer: Time-Domain Speech Enhancement Using Conformer.” Interspeech, 2021.
Despite advances, audio denoising is still tricky because:
- Real-world noise is highly varied and often non-stationary, so models must generalize to conditions they were never trained on.
- Aggressive suppression can distort or “muffle” the very speech it is meant to protect.
- Real-time use cases (calls, conferencing) impose tight latency and compute budgets, especially on mobile and embedded devices.
- Objective metrics such as PESQ and STOI do not always agree with how listeners actually perceive quality.
Some practical libraries and APIs for denoising:
- RNNoise (Xiph.Org): a lightweight hybrid DSP/recurrent-network denoiser suitable for real-time use
- noisereduce: a Python library implementing spectral-gating noise reduction
- denoiser (Facebook Research): the open-source implementation of the Demucs-based real-time enhancer
- SpeechBrain: a PyTorch toolkit with pretrained speech-enhancement recipes
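For instance, a quick pass with the noisereduce Python package looks roughly like this (file names are placeholders, and a mono recording is assumed):

```python
# Quick spectral-gating denoise with the noisereduce package (pip install noisereduce).
import noisereduce as nr
import soundfile as sf

audio, sr = sf.read("noisy_recording.wav")    # placeholder file name, assumed mono
cleaned = nr.reduce_noise(y=audio, sr=sr)     # spectral gating under the hood
sf.write("cleaned_recording.wav", cleaned, sr)
```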
Audio denoisers have come a long way — from simple spectral subtraction methods to cutting-edge transformer-based models operating in the raw waveform domain.
For developers and engineers, understanding both traditional DSP techniques and modern deep learning models is crucial if you’re building, fine-tuning, or evaluating denoising systems.
The next evolution? Likely even tighter integration between speech synthesis, separation, and enhancement — where systems not only clean audio but fully reconstruct natural-sounding speech even from heavily degraded inputs.