Using AI for training voiceover means crafting specialized scripts.
I was writing scripts to teach educators how to provide for at-risk students when the client let me know they’d be using AI to read the voiceover.
“Oh no,” I thought. “That’s going to be a problem.”
Not because AI voiceover is inherently suboptimal… although it generally is. And anyhow, it’s the client’s choice to do what they think is best. They hire me to write, not to consult on production.
But “Oh no,” I thought, because:
AIs have a bad habit of not understanding context while simultaneously thinking they understand context.
Now, lots (if not most) of my eLearning writing is designed for use by a human narrator or actor. And most of the techniques I use to provide optimum readability work great. Humans need some guidance on where to put emphasis, when to pause, or how exactly to pronounce a word. Especially if the narrator isn’t an industry pro familiar with industry-specific jargon.
But in my experience, humans are smart. AI voiceover technology is not.
This means my writing has to cater to the computer’s limitations, rather than to grammar or spelling or the capable insight of a clever and professional voice artist, so that my client gets the narration they need.
Of course, not every AI voiceover is suboptimal in exactly the same way, so I asked which software they’d be using. Then I ran off to do some tests.
And the results were interesting.
EMPHASIS
In View from the Top, Mike Myers has a classic line: “You put the wrong emPHAsis on the wrong syLLAble.”
Funny then. Not so funny in an important and expensive video project.
I find that with an AI narrator, emphasis is put in odd places… for instance, if the word “now” is in the sentence, AI assumes an emphasis on it for urgency even if the sentence doesn’t require it. “Dinner is ready now” becomes “Dinner is ready NOW!” It changes the subtext of the delivery, making it sound like there had somehow been disappointment that dinner wasn’t ready earlier.
But that’s not the only way “now” causes confusion. In the sentence “Every now and then I’m right,” the AI adds a slight pause after the word “now” (what I’ll call a “comma,” even though it’s infinitesimally shorter than that) and puts emphasis on “and then I’m right”… like it’s saying “I’m wrong first, and then I’m right.” Which seems to suggest that the software is looking ahead at the second half of the sentence, interpreting “and then” as an indication that something new has happened, and backfilling the reading of the first half to fit that intention (adding the invisible comma).
Hm. Maybe here the problem is “and then,” not “now.”
But my point stands.
It’s trying to be smart and predictive.
And failing.
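For anyone who wants to run similar listening tests, here is a minimal Python sketch that flags script lines containing phrases that tripped up the voice. The phrase list and the function name `flag_trigger_lines` are my own assumptions for illustration, not part of any voiceover software, and the list would need to be rebuilt for each AI voice you test.

```python
import re

# Assumption: phrases that caused odd emphasis in my own tests.
# Other voices will likely need a different list.
TRIGGER_PHRASES = ["now", "and then"]

def flag_trigger_lines(script, phrases=TRIGGER_PHRASES):
    """Return (line number, phrases found, line text) for lines worth a listen."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    flagged = []
    for number, line in enumerate(script.splitlines(), start=1):
        # Collect each distinct phrase once, in a stable order.
        hits = sorted({m.group(1).lower() for m in pattern.finditer(line)})
        if hits:
            flagged.append((number, hits, line.strip()))
    return flagged
```

It only points at lines to listen to. Deciding whether the emphasis is actually wrong is still a job for human ears.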
ROMAN NUMERALS AREN’T NUMBERS
Roman numerals are used in everything from law to library science. And there’s a big difference between how they’re styled and how they’re pronounced.
This particular client often refers to US federal civil rights law Title IX… which is styled as IX, but pronounced “nine.” Now, it might be reasonable to assume that human narrators know how to pronounce Roman numerals… but to ensure they get it right, I will write it as either “9” or “nine.”
AI, on the other hand, for sure will say “Title Eye Ecks” or “Title Icks” depending on which software the client’s using.
The point is, clarity should be the default. But especially when AI is involved. Because it’s not as smart as real people.
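If the script lives in a text file, this swap can be roughed out in a few lines of Python. This is only a sketch under my own assumptions: the function names are hypothetical, it only looks at numerals that follow the word “Title,” and it spells out values up to twenty (anything larger falls back to digits).

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
                "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
                "nineteen", "twenty"]

def roman_to_int(numeral):
    """Convert a well-formed Roman numeral such as 'IX' to an integer."""
    total, previous = 0, 0
    for symbol in reversed(numeral):
        value = ROMAN_VALUES[symbol]
        # A smaller symbol to the left of a larger one is subtracted (IX = 9).
        if value < previous:
            total -= value
        else:
            total += value
        previous = value
    return total

def spell_title_numerals(text):
    """Rewrite 'Title IX' as 'Title Nine' so the voice cannot misread it."""
    def replace(match):
        number = roman_to_int(match.group(1))
        if number < len(NUMBER_WORDS):
            return "Title " + NUMBER_WORDS[number].capitalize()
        return "Title " + str(number)
    return re.sub(r"\bTitle ([IVXLCDM]+)\b", replace, text)
```

The pattern is deliberately narrow. A broader one would start “correcting” ordinary capitalized words like “I” or “MIX,” which is its own kind of trying to be smart and failing.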
SOMETIMES ACRONYMS ARE WORDS
In general, acronyms are spelled without periods between the letters. But there are exceptions… U.S.S.R. was/is often styled with periods. The AP Stylebook prefers the abbreviation U.S. with periods, while the Chicago Manual of Style prefers US without periods… although it also allows U.S. in some situations. Because reasons.
If I were just writing prose, I’d follow the style guide. But I’m not… I’m writing a narration script, and someone or something is going to read it aloud.
An acronym, with or without periods, can be read aloud in different ways: how the organization has officially styled it to be read, how the community has collectively standardized its common pronunciation, or both. It’s possible, for instance, that the Society for Proper Attribution in Marketing has a style guide specifying that the acronym is always read with each letter independently as “S” “P” “A” “M” while its members and the industry at large colloquially refer to it as a noun: “spam.”
This particular client uses a lot of acronyms.
They refer to the U.S. Department of Housing and Urban Development, or HUD, styled without periods and pronounced like a noun… “hud”.
Now, guidance on pronouncing acronyms is necessary whether we’re writing for a human or an AI, so that’s not the problem. I’ll need to be explicit for either.
In this case, the AI does the right thing, and reads it as “hud.”
Yay.
On the other hand, the National Council for History Education is styled as NCHE. In this professional community, unlike HUD, NCHE is not pronounced like a noun… instead, each letter is pronounced individually. And the letters are likely run together without pausing between them: “ensee-eightchee.”
But of course, the AI is AI, and doesn’t recognize the concept of acronyms at all. So it pronounces it as “enchee.”
If I were writing for a human narrator, I’d likely use periods to clarify pronunciation: N.C.H.E.
But I can’t use periods to define the letters for AI, because it sees the period as the end of a sentence, and it’ll say each letter as though it’s the end of a sentence: “The organization is called N. (pause) C. (pause) H. (pause) E.” Which is absurd. So I need to write it with spaces but no periods: N C H E. This gets the AI closer… I get smaller pauses between letters. But the delivery is odd in the context of the whole sentence because it doesn’t “phrase” the letters together (ensee-eightchee). It gives each letter the same value, all in a row.
It’s just… weird. Uncanny Valley, even.
AI narration is the uncanny valley of the spoken word.
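The acronym decisions above can be kept in one small table and applied to the whole script at once. Here is a minimal Python sketch. The table contents come from this client’s preferences, while the names `ACRONYM_STYLE` and `voice_acronyms` are hypothetical, and how any given voice reads the result still has to be tested by ear.

```python
import re

# Assumption: one entry per acronym, confirmed with the project lead.
# "word" leaves the acronym alone; "letters" spaces it out with no periods.
ACRONYM_STYLE = {"HUD": "word", "NCHE": "letters"}

def voice_acronyms(text, styles=ACRONYM_STYLE):
    """Space out the letters of any acronym marked 'letters' in the table."""
    def replace(match):
        acronym = match.group(0)
        if styles.get(acronym) == "letters":
            return " ".join(acronym)  # "NCHE" becomes "N C H E"
        return acronym  # unknown or "word" acronyms stay as written
    return re.sub(r"\b[A-Z]{2,}\b", replace, text)
```

Acronyms missing from the table are left untouched on purpose, so nothing gets respelled before the client has weighed in.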
STAKEHOLDERS WILL GET CONFUSED
Making sure the narrator — whether an AI or a real breathing person — pronounces things correctly means communicating with the client. As the writer, it’s on me to confirm with my project lead how they want the AI to read an acronym or a Roman numeral, and craft the script accordingly.
Then I prepare for confusion.
Because invariably, now that I’ve styled the law reference as “Title Nine” and the acronym as “N C H E” the stakeholders will review my writing and correct all the perceived mistakes.
They’ll change “Title Nine” back to “Title IX” and they’ll remove commas and replace colons with periods. Then I’ll have to change it all back with a comment that the script is a phonetic document designed to define pronunciation and pacing, not to adhere to grammatical stylings.
Then the next stakeholder will ignore my comments and change my “Title Nine” back to “Title IX” again.
I don’t want stakeholders — who are usually further up the food chain — to think I don’t know their industry or what I’m doing. And more importantly, I don’t want them to think my project lead isn’t communicating properly with me. So the best I can do is include an explanation at the very top of the document:
“DIALOG: As spoken. Written and punctuated phonetically to guide voiceover AI. Not subject to standard editing concerns.”
I highlight it bright orange, take a deep breath, and prepare to patiently correct the inevitable corrections.
But hopefully, I can help reduce at least some internal confusion and frustration among stakeholders.
So everybody wins.
EVEN HUMAN NARRATORS GET CONFUSED
Like I said, even breathing human readers need guidance. But the AI narrator comes with some specifically AI quirks that are valuable to consider. And because each AI model acts slightly differently, it’s worth doing some testing to decide exactly how to format the script accordingly.
My scripts don’t adhere to literary standards when it comes to spelling, grammar, or punctuation, because I’ve learned to modify those to help guide the talent and streamline production. I’ll put more commas than some might like, so the narrator knows to put emphasis on what follows. I’ll use ellipses to imply a different kind of pause than I might get from a comma or period. And if there’s a line about calling for emergency medical assistance, I’ll write “Call nine-one-one” so that it doesn’t get pronounced “Call nine-eleven.”
Shouldn’t the context tell a human reader that “911” is not the same as “9-11”? Of course. But production moves quickly, and every minute in a booth or editing bay costs money. By working with the client to confirm these variables, then baking them into the script, I’m making sure time isn’t wasted by a question, a Zoom call among stakeholders to confirm, and then a search for a pen so the reader can make a manual notation through the rest of the script.
Time. Is. Money.
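Baking in the “nine-one-one” decision can also be roughed out in code. This is a sketch under my own assumptions: the function names are hypothetical, and the list of numbers to read digit by digit is something I’d confirm with the client rather than guess.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_digits(number):
    """Turn '911' into 'nine-one-one'."""
    return "-".join(DIGIT_WORDS[int(digit)] for digit in number)

def voice_listed_numbers(text, numbers=("911",)):
    """Spell out only the numbers the client wants read digit by digit."""
    pattern = r"\b(" + "|".join(re.escape(n) for n in numbers) + r")\b"
    return re.sub(pattern, lambda m: spell_digits(m.group(1)), text)
```

Only whole-number matches are rewritten, so a year like 1911 is left alone.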
WHAT I DO
This might all feel a little nitpicky, or maybe a little obvious.
Until it isn’t.
When an awkward pronunciation or a strange emPHAsis draws attention to the script, it’s drawing attention away from the message. Your message, your teaching, and your value get diluted.
Yes, words matter. But their delivery matters even more.
It’s not enough for a writer to write a video script and make sure all the words are there. They need to think about how the words are delivered, what they need to be communicating, and who’s hearing them.
I’ve spent years writing not only for marketing and eLearning but also for film and literary projects. I think A LOT about how people talk, what they hear, and how to communicate ideas efficiently and artfully. Maybe to a fault… but it’s what has always brought me joy and satisfaction.
If you’d like to learn more about working with me, please check out my writing services here.
