Introducing Peregrine: Text to Speech Model with Emotion and Laughter

By Hammad Syed in TTS

September 19, 2022

Generate AI Voices, Indistinguishable from Humans

The future is AI generated

Content creation and creative work are changing forever with the advent of generative ML models like GPT-3 & BLOOM (text generation), DALL-E & Stable Diffusion (image generation), and RunwayML (video generation).

Today we are introducing our first model, Peregrine, an ultra-realistic Text to Speech model that fills the missing modality in this new set of models: generative AI voice.

We believe in a future where all content will be generated by AI but guided by humans, and the most creative work will depend on people's ability to articulate their desired creation to the model.

The challenges of creating human-like Text to Speech

Text to speech (TTS) synthesizers have advanced greatly since the introduction of neural networks.

As a result, TTS systems can now synthesize high-quality multilingual, multi-speaker, multi-style speech.

However, despite these achievements, current TTS systems usually demand high-quality, studio-recorded, annotated audio from different speakers with different styles and emotions in order to meet the needs of commercial applications.

Furthermore, adding a new speaker to the model usually requires at least 30 minutes of clean, studio-recorded data with phonetic annotations.

And yet the synthesized speech still sounds mostly unnatural because its prosody (tempo, rhythm, power) lacks expressiveness.

Our approach moves beyond the current technology by introducing a novel TTS method that synthesizes speech with a higher degree of realism, making it practically indistinguishable from natural human speech.

To achieve this, we rely not on high-quality annotated data but on audio itself, as it is naturally uttered.

Unlike most standard Speech Synthesis ML models and Text to Speech APIs that are designed to trade quality and expressiveness for compute, Peregrine was designed from the ground up to generate the most expressive and emotional speech and imitate a human voice vividly.

Peregrine employs the same concept as large language models such as DALL-E and GPT-2.

As a result, Peregrine can not only speak in thousands of voices but has also learned the intricacies of human speech, such as emotion, tone, and even laughter, all in a self-supervised manner.

Aside from the great improvement in naturalness, voice cloning can be done with less than 30 seconds of recorded audio from a single speaker, without the need for transcripts, bringing the multi-speaker, multi-style capability of TTS-based applications to another level of performance.

And because it is a Large Language Model, it can compress hundreds of thousands of voices into a few GB of knowledge, which it can then use to generate a practically unlimited number of voice variations, emotions, and styles.

We believe it is a stepping stone in the field of AI Voice Generation and Voice Cloning.

Generating Expressive Speech with Emotions

Based on the context of the text, Peregrine is able to generate emotional, expressive speech that demonstrates its understanding of the text. In some cases it can be quite dramatic, as you will hear.

Making Text to Speech laugh

Interestingly, Peregrine is also capable of laughing when it encounters words like ‘haha’ or ‘ahhaha’. And since it understands the language and the context, it can choose a tone of voice just as a human would, all in a self-supervised manner.

The model was given no hint about which tone to use when generating the samples above. Based on the text, the model ‘chose’ a tone of its own.

This is the first model in a set of models and tools we are building to help unlock truly expressive, human-like voice generation at scale.