Can modern Text to Speech voices replace professional voice over actors?

One area of technology that is developing rapidly is Text to Speech. Four tech giants (Amazon, Google, Microsoft, and IBM), along with many open-source projects, are quietly competing to create better and more realistic voices. But are the voices realistic enough to replace professional voice-over actors? Let’s find out.

There are four key things one would consider while hiring a voice-over actor:

  1. Quality of voice
  2. Cost
  3. Delivery time
  4. Commercial rights

Let’s compare these things with what Text to Speech voices have to offer.

Quality of Voice

Early Text to Speech Voices

The early Text to Speech (TTS) voices sounded extremely robotic because they were generated with a concatenative approach, in which the sounds of words were first recorded and later stitched together to create audio. The resulting voice sounded monotonous and lacked intonation or expression.
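To make the stitching idea concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions: the `RECORDED_UNITS` table, the sample values, and the `synthesize` function are all hypothetical, and real concatenative systems work with recorded waveforms rather than a handful of numbers.

```python
# Hypothetical "recordings": each word maps to a short list of audio samples.
# Real systems store recorded waveforms of words or smaller sound units.
RECORDED_UNITS = {
    "hello": [0.1, 0.3, 0.2],
    "my": [0.0, 0.4],
    "name": [0.2, 0.5, 0.1],
    "is": [0.3, 0.1],
}
PAUSE = [0.0]  # a one-sample silence inserted after each word


def synthesize(text):
    """Stitch the stored units together in the order the words appear."""
    audio = []
    for word in text.lower().split():
        # The same recording is reused every time, whatever the context.
        audio.extend(RECORDED_UNITS[word])
        audio.extend(PAUSE)
    return audio


samples = synthesize("Hello my name is")
print(len(samples))
```

Because every word is played back from the same fixed recording regardless of the sentence around it, the output cannot adapt its intonation, which is one way to see why these voices sound flat.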

Here is a sample of a male TTS voice using the concatenative approach –

TTS male voice introducing himself

Here is a female TTS voice reading an excerpt from an article using the same concatenative approach –

TTS female voice reading out an article paragraph

 

Modern Text to Speech Voices

The latest TTS voices, however, are generated dynamically by neural networks, a form of machine learning. A computer model is first trained on a high-quality dataset, and it then learns to predict the speech based on the context of the input text.

The resulting voices sound shockingly real.

Here’s one of the neural TTS male voices –

A male TTS voice introducing himself

A neural TTS female voice reading out the same article excerpt –

A female TTS voice reading a paragraph

Here’s a TTS voice that’s designed to read out news –

A male TTS voice reading out the news

As you can hear, the newer Text to Speech voices sound nothing like the older ones. In some cases they sound so real that it’s hard to tell whether you are listening to a machine or a human.

Cost of creating audio

On average, a professional voice actor charges $10 for 100 words, which works out to $100 for 1,000 words. Text to Speech, on the other hand, costs a fraction of that price.

Two types of Text to Speech voices are available: standard and neural. Standard voices cost around $0.04 per 1,000 words, and neural voices cost around $0.16 per 1,000 words.
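For a rough sense of scale, the figures above can be plugged into a short Python sketch. The rates are the ones quoted in this article and will vary by provider; the `estimate_costs` function and the 1,500-word example are hypothetical, chosen only for illustration.

```python
# Rates quoted in this article (USD); real prices vary by provider and over time.
VOICE_ACTOR_PER_100_WORDS = 10.00
STANDARD_TTS_PER_1000_WORDS = 0.04
NEURAL_TTS_PER_1000_WORDS = 0.16


def estimate_costs(word_count):
    """Return the estimated cost in USD of voicing `word_count` words."""
    return {
        "voice_actor": word_count / 100 * VOICE_ACTOR_PER_100_WORDS,
        "standard_tts": word_count / 1000 * STANDARD_TTS_PER_1000_WORDS,
        "neural_tts": word_count / 1000 * NEURAL_TTS_PER_1000_WORDS,
    }


costs = estimate_costs(1500)  # a hypothetical 1,500-word blog post
for option, usd in costs.items():
    print(f"{option}: ${usd:.2f}")
```

At these rates, a 1,500-word script comes to about $150 with a voice actor, about $0.06 with a standard voice, and about $0.24 with a neural voice.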

Delivery time

A voice actor typically takes around 3-4 days to create and deliver the audio. With Text to Speech technology, you can create the audio in almost real time.

You can also make unlimited revisions, which are limited and time-consuming with a voice actor.

Broadcast and Commercial rights

Although a voice actor can grant all the rights you need to use the audio commercially, they typically charge extra to provide these rights.

With Text to Speech voices, though, you don’t have to worry about rights or pay any extra fees to use the audio commercially.


Applications more suited for Text to Speech voices

Some applications are better suited to Text to Speech technology than to hiring voice actors:

  1. Create audio versions of articles and blog posts to repurpose content and boost user engagement.
  2. Create voice-over audio for YouTube videos.
  3. Create voice-over audio for presentations and product demos.
  4. Create announcements.
  5. Create audio for avatars for VR or video games.
  6. Create audio content for courses and eLearning material.

All in all, the significant improvements in today’s Text to Speech voices have made them ready for end users and have opened up a plethora of applications. They are still not suited to certain use cases, however, such as creating commercials or narrating audiobooks, where a human voice is needed to convey emotion.

We believe it’s just a matter of time before TTS voices catch up and sound as good as, or even better than, professional voice actors.
