Introducing Peregrine: A Truly Realistic Text to Speech Model with Emotion and Laughter

 

The future is AI generated

Content creation and creative work are changing forever with the advent of generative ML models like GPT-3 & BLOOM (text generation), DALL-E & Stable Diffusion (image generation), and RunwayML (video generation).

Today we are introducing our first model, Peregrine, an ultra-realistic Text to Speech model that addresses the modality missing from this new set of models: Generative Audio.

We believe in a future where all content will be generated by AI but guided by humans, and the most creative work will depend on the human ability to articulate the desired creation to the model.

 

 

The challenges of creating human-like Text to Speech

Text to speech (TTS) synthesizers have advanced greatly since the introduction of neural networks.

As a result, TTS systems are now able to synthesize high-quality speech in multiple languages, speakers, and styles.

However, despite these achievements, current TTS systems usually demand high-quality, studio-recorded, annotated audio from different speakers with different styles and emotions to meet the needs of commercial applications.

Furthermore, adding a new speaker to the model usually requires at least 30 minutes of clean, studio-recorded data with phonetic annotations.

And yet, the synthesized speech still sounds mostly unnatural because its prosody (tempo, rhythm, power) lacks expressiveness.

Our approach moves beyond the current technology by introducing a novel TTS method that synthesizes speech with a higher degree of realism, making it nearly indistinguishable from natural human speech.

To achieve this, we don’t rely on high-quality annotated data but on the audio itself, as it is naturally uttered.

 

 

A Generative Audio approach with Peregrine

Unlike most standard Speech Synthesis ML models and Text to Speech APIs, which are designed to trade quality and expressiveness for lower compute, Peregrine was designed from the ground up to generate the most expressive and emotional speech and to imitate a human voice vividly.

Peregrine employs the same concept as large language models such as DALL-E and GPT-2.

As a result, Peregrine can not only speak in thousands of voices but has also learned the intricacies of human speech, such as emotion, tone, and even laughter, all in a self-supervised manner.

Aside from the large improvement in naturalness, voice cloning can be done with less than 30 seconds of recorded audio from a single speaker, with no need for transcripts, taking the multi-speaker, multi-style capability of TTS-based applications to another level.

And because it is a Large Language Model, it can compress hundreds of thousands of voices into a few GB of learned knowledge, from which it can generate a practically unlimited number of voice variations, emotions, and styles.

We believe it is a stepping stone in the field of AI Voice Generation and Voice Cloning.

 

 

Ultra-Realistic Voice Cloning

Here we demonstrate the impressive voice cloning abilities of Peregrine.

The following audio clips were generated by the model with just a handful of audio samples and were directly exported without edits.

Elon Musk

 

Kevin Hart

Joe Rogan

 

Tom Hanks

 

Gary Vaynerchuk

 

Nick Offerman

 

John F. Kennedy

 

 

Generating Expressive Speech with Emotions

Based on the context of the text, Peregrine is able to generate emotional, expressive speech that demonstrates its understanding of the text. In some cases it can be quite dramatic, as you will hear:

A female voice narrating excerpts from a book in neutral, sad and excited tones

A male voice narrating the same excerpts in the same tones

Making Text to Speech laugh

Interestingly, Peregrine is also capable of laughing when it encounters words like ‘haha’ or ‘ahhaha’. And since it understands the language and the context, it is capable of choosing a voice tone just as a human would, all in a self-supervised manner.

The model was given no hint about which tone to use while generating the above samples. Based on the text alone, the model ‘chose’ a tone of its own.

Why Text to Speech

Speech has immediate practical use in real-world applications.

In industries such as gaming, animation, film and eLearning, voice plays a crucial role.

But creating voice has always been a challenge, whether in terms of cost, time, or the countless back-and-forth in editing.

With novel technologies like Peregrine, we are able to reduce voice production costs, save time, and provide instant access to a library of voices that can narrate, explain, engage, and captivate the listener's attention like never before.

This is the first in a set of models and tools we are building to help unlock truly expressive, human-like voice generation at scale.

Try Peregrine today

Peregrine is available in Beta for all users. You can try Peregrine on our web application for free by following these steps:

1. Create an account at Play.ht

2. From the dashboard, access the ‘Ultra-realistic voices’ tab

3. Open the text to speech editor by clicking on the ‘Create Audio’ button

4. Type text, select a voice and create speech using Peregrine!

If you have any feedback or questions, reach out to us at support at play.ht.