AI Voice Acting
Do we rage against the machine?
This is a sensitive subject in the industry right now. I’ve spent a lot of time talking about it with my wife, Helen Kennedy Turner, who is a professional voice actor & vocal coach with credits ranging from Disney to World of Warcraft and other AAA games.
She’s helped me temper my Stanford-born enthusiasm for all things in emerging technology with a humanistic understanding of how AI seems like a serious existential danger to an entire profession and people’s livelihoods.
Given my background in art & design, and coming from a family where my mother is a playwright and theatre professor and my father is an actor and director, I’m especially sensitive to helping creatives navigate this new world, rather than suggesting that we let AI ride roughshod over them and their careers.
My perspective is that it doesn’t help to turn a blind eye to emerging technology. We’ve got to understand the state-of-the-art, imagine where it is going, and then find how we need to evolve, and hopefully take advantage of the new opportunities.
A word of hope for humans
Let me pre-empt my conclusion by saying:
As a voice actor, I believe your craft will not be diminished by the emergence of AI
However, the emphasis is on craft. The danger to voice actors, and to creative professionals as a whole, is that:
Paint-by-the-numbers creativity absolutely WILL be subsumed by AI
But if you take your craft seriously as an actor — that is something that won’t readily be reproduced by AI, and I’ll explain why in my actual conclusion.
State-of-the-art today
Let’s assess what AI is able to do today. I’ve framed my deep-dive into AI by trying to reproduce an episode of The Hitchhiker’s Guide to the Galaxy using AI tools.
TEXT-TO-VOICE GENERATION
As the video above shows, you can get a passable read of a piece of text with voice generation tools.
We’re at the tweak-it-and-see phase of AI tooling
Right now, if you plug a piece of text into a voice generator, you may get some controls or the ability to specify a general emotion like ‘Angry’ — but fine-grained control is not possible.
VOICE CLONING
This is maybe a more worrying topic for voice actors. Recently, James Earl Jones approved the use of AI to generate his voice as Darth Vader:
…the actor signed off on using his archival voice recordings to keep Vader alive and vital even by artificial means…
While Jones was treated with respect and kept in the loop, it’s not difficult to imagine a less ethical production studio using archived audio from previous voiceover performances to create new material.
And then we have VALL-E
The demo (which now seems to be offline) let you play with examples in which a 3-second voice sample was used to generate a reading of any length in that voice.
Where before you had to read very specific scripts to capture a voice, now a very small snippet can capture the essence of your voice.
There’s a lot of potential for abuse here, and while most services post an ‘ethical statement’ — the cat’s out of the bag. You can clone existing voices pretty easily, and as an actor I’m not sure you can easily prove that your own voice was the source material.
Legislate, Regulate, or Capitulate?
Every rapidly emerging, industry-disruptive technology has its watershed moment where that industry realizes it has to respond, and usually the initial response is: Stop this thing!
We’ve seen this with MP3s and the music industry, with Uber/Lyft and the taxi industry, and currently with streaming services and the movie industry’s theatrical release windows.
Popular demand always wins
Once the general public has a taste of the new technology, you can try to restrict it, but I firmly believe that if it truly meets a need, that need will win out. Now — just because something is popular doesn’t mean it’s right, but it’s a lot harder to put the genie back in the bottle unless you make it completely illegal, and even then it still won’t go away.
Embrace the version of the future that includes you
Rather than swimming upstream, I think it’s better to accept the most likely outcome, plan for it, and even work towards it. My belief is that this all can play out in a way that values original creativity more than ever before. So with that in mind — where’s it all going in the future?
Worst-case scenario — and why it’s unlikely to happen
AI replaces voice actors entirely
Here’s the fear: a producer or director puts the script into the AI tool, chooses whatever character voices they want, and the AI performs the script perfectly. The job is done, and everyone goes home early.
But — if you’ve witnessed a voice recording session, there’s no such thing as a perfect read, even with a human voice actor. The director will want different readings, to convey the right voice and tone of the work.
Let’s say that the director can instruct the AI precisely, adjusting intonation, emphasis, and emotion until they get the exact reading that they want.
This is the digital equivalent of a director giving an actor a line-reading
There’s a lot of literature on why that’s a bad idea — it essentially boils down to the director stepping into the shoes of the actor. So, in effect, the director is acting all of the roles, which is as problematic as a single actor acting all of the roles, which leads us to the next scenario.
Slightly better, but still bad scenario
One actor does all the roles with vocals replaced by cloning
So if we’ve decided that we’ll never get an actor’s performance from the AI, then the next natural assumption might be to let an actor perform all of the parts, and then replace their voice with whatever character voice you need. The actor is essentially puppeteering a digital voice.
Now it’s worth thinking more deeply about an actor’s craft. What is an actor bringing to a role, to their performance and to the reading of their lines?
What’s my motivation?
A good actor will be thinking about their character deeply. Their backstory, the context of the scene and their character’s relationship to the other characters. That informs the nuanced reading of their lines.
If you’re asking a single actor to embody multiple characters, you’re multiplying their ‘homework’ (and cognitive load) with every additional character.
Outcome: performances will degrade with every additional character voiced by a single actor
Presumably, the goal in using a single actor would be to cut down on production time, so if you’re not giving your actor additional time, it stands to reason they’ll be able to do less of their ‘homework’ and bring less understanding to each of their roles.
The Fine-Crafted Future
This is a concept I’ve toyed with for well over 10 years now. It boils down to this:
As information becomes commoditized by technology, craft will be the currency we value most
As we value craft, and understand craft more deeply, we begin to understand the layers of expertise that a voice actor brings to their work.
But this is all better said in the words of my wife, Helen, as she reviews the latest in voice generation technology:
Right, let’s talk about this audio. Honestly, it’s not great. It sounds like the speaker is standing a long way from the mic and it’s coming to you through an old radio or a recording of an old VHS your Dad dug out from the attic. I do not honestly feel that I am listening to someone interact in a masterful way in front of the mic — there are so many weird and wonderful elements that actors can bring by changing how close or far etc they get to the mic and that’s just a tiny aspect of what a human can deliver.
The inflection is all over the place and the rhythm is just too mechanical and repetitive, like it’s stuck in a loop or something giving me a sense of rising panic. Then, there’s the pronunciation issue which doesn’t ring true for certain accents so it immediately screams “fake!!!”
The additional non-verbal vocalisations (we call it “Pre-Life” in the VO industry) are way too smooth and forced rather than coming from an organic, human impulse; this is the stuff that grounds all human performance. Breath! Most people don’t actually breathe properly at the best of times and certainly not when “sad”.
Speaking of sad, it’s got this one-note “I’m sad” voice going on, which is just not enough. People are complex, we laugh when we’re hurting then cry then laugh again. It’s not enough to simply adopt a “boo hoo” voice and expect the audience to feel something. We need to feel the emotion, the ups and downs.
And the “metatags”? They’re off. When it’s supposed to be “SAD”, half the time, it doesn’t even sound sad. It’s like it can’t pick up on sarcasm or any other subtle stuff. Unlike TV or Film, VO needs to provide so many rich levels of vocal nuance and unfortunately that’s just not there. Left me cold.
What should a voice actor or SAG-AFTRA be doing?
In my opinion, we need to accept that we’re heading towards a great ‘Creative Divergence’.
The ‘Creative Divergence’ will result in quite a lot of mediocre mainstream media, much of which will be AI facilitated, separated from the truly exceptional original work that will push the industry forward.
Some people like to eat candy. Some people eat lots and lots of candy. But candy doesn’t put the fine-dining restaurant industry out of business. AI-driven media is the equivalent of candy-on-tap for everyone 24 hours a day. So if that’s the case, you want to be on the fine-dining side of the food industry.
As a creative, continually deepen your understanding of your craft and promote the depth of experience your field requires — this is how you build a ‘moat’ between your abilities and what AI is capable of doing.
But you don’t have to take my word for it…
Here are some further discussions and thoughtful investigations on the topic of voice cloning & AI: