Can modern Text to Speech voices replace professional voice over actors?
One area of technology seeing rapid development is Text to Speech. Four tech giants (Amazon, Google, Microsoft, and IBM), along with many open-source projects, are quietly competing to create better and more realistic voices. But are these voices realistic enough to replace professional voice-over actors? Let’s find out.
There are four key things one would consider when hiring a voice-over actor:
- Quality of voice
- Cost
- Delivery time
- Commercial rights
Let’s compare these things with what Text to Speech voices have to offer.
Quality of Voice
Early Text to Speech Voices
The early Text to Speech (TTS) voices sounded extremely robotic because they were generated using a concatenative approach, in which the sounds of words were first recorded and later stitched together to create audio. The resulting voice sounded monotonous and lacked intonation or expression.
Here is a sample of a male TTS voice using the concatenative approach –
Here is a female TTS voice reading an excerpt from an article using the same concatenative approach –
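The stitching idea behind the concatenative approach can be shown with a toy sketch. Everything here is an assumption made for illustration: the `RECORDED_UNITS` table, the `SILENCE` gap, and the `synthesize` function are hypothetical, and the numbers stand in for real recorded audio.

```python
# Toy "recordings": each word maps to a short list of made-up audio samples.
# Real systems store recorded speech units, not hand-typed numbers.
RECORDED_UNITS = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}
SILENCE = [0.0]  # short gap inserted between words


def synthesize(text):
    """Stitch prerecorded units together in the order the words appear."""
    audio = []
    for word in text.lower().split():
        if word not in RECORDED_UNITS:
            raise KeyError(f"no recording for {word!r}")
        if audio:
            audio.extend(SILENCE)
        audio.extend(RECORDED_UNITS[word])
    return audio


print(synthesize("Hello world"))  # prints [0.1, 0.3, 0.2, 0.0, 0.4, 0.1]
```

Each unit is played back exactly as it was recorded, whatever sentence it appears in, which is why the result has no sentence-level intonation.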
Modern Text to Speech Voices
The latest TTS voices, however, are generated dynamically using neural text to speech, a process based on machine learning. A computer model is first trained on a high-quality dataset and then learns to predict speech based on the context of the input text.
The resulting voices sound shockingly real.
Here’s one of the neural TTS male voices –
A neural TTS female voice reading out the same article excerpt –
Here’s a TTS voice that’s designed to read out news –
As you can hear, the newer Text to Speech voices sound nothing like the older ones. In some cases they sound so real that it’s hard to tell whether you’re listening to a machine or a human.
Cost of creating audio
On average, a professional voice actor charges $10 for 100 words. Text to Speech, on the other hand, costs a fraction of that price.
Two types of Text to Speech voices are available: standard and neural. Standard voices cost around $0.04 per 1,000 words and neural voices cost around $0.16 per 1,000 words.
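To put these rates side by side, here is a rough calculation for a single piece of content. It uses only the average figures quoted above; the `estimate_costs` function and the 1,500-word article length are assumptions chosen for illustration, and actual prices will vary by provider and actor.

```python
# Rates taken from the figures quoted in this article; treat them as rough averages.
VOICE_ACTOR_PER_100_WORDS = 10.00
STANDARD_TTS_PER_1000_WORDS = 0.04
NEURAL_TTS_PER_1000_WORDS = 0.16


def estimate_costs(word_count):
    """Return the estimated cost in dollars for each option."""
    return {
        "voice_actor": word_count / 100 * VOICE_ACTOR_PER_100_WORDS,
        "standard_tts": word_count / 1000 * STANDARD_TTS_PER_1000_WORDS,
        "neural_tts": word_count / 1000 * NEURAL_TTS_PER_1000_WORDS,
    }


costs = estimate_costs(1500)  # hypothetical 1,500-word blog post
for option, cost in costs.items():
    print(f"{option}: ${cost:.2f}")
```

At these rates, a 1,500-word post works out to about $150 with a voice actor, compared with roughly $0.06 for a standard voice and $0.24 for a neural voice.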
Delivery time
A voice actor typically takes around 3-4 days to create and deliver the audio. With Text to Speech technology, you can create the audio in almost real-time.
You also get the benefit of unlimited revisions, which are limited and time-consuming with a voice actor.
Broadcast and Commercial rights
Although a voice actor can grant all the rights you need to use the audio commercially, they typically charge extra for them.
With Text to Speech voices, though, you don’t have to worry about rights or pay any extra fees to use the audio commercially.
Applications more suited for Text to Speech voices
Here are some of the applications that are better suited to Text to Speech technology than to hiring voice actors:
- Create audio versions of articles and blog posts to repurpose content and boost user engagement.
- Create voice-over audio for YouTube videos.
- Create voice-over audio for presentations and product demos.
- Create announcements.
- Create audio for avatars for VR or video games.
- Create audio content for courses and eLearning material.
All in all, the significant improvements in today’s Text to Speech voices have made them ready for end users and opened up a plethora of applications. They are still not suited to certain use cases, however, such as creating commercials or narrating audiobooks, where a human voice is needed to convey emotion.
We believe it’s just a matter of time before TTS voices catch up and sound just like, or even better than, professional voice actors.