Drunk speech but sober captions: How manner captions do the heavy lifting

A frame from Scott Pilgrim vs. the World featuring Michael Cera, Ellen Wong and the caption: (In slow-motion): Wow!

How writing homogenizes speech and how the non-speech manner caption attempts to re-embody speech.

If a character’s manner of speaking is significant to the narrative, it needs to be described in the closed captions. In a previous post on manner captions, I gave a few examples of potentially noteworthy ways of pronouncing spoken words: with more volume (yelling, shouting), with less volume (whispering, quietly, indistinctly), at an unusual or altered pitch (in a deep voice), with emotion (angrily, with sarcasm, sobbing), in an altered state (drunken slurring), with an emphasis on certain words, in a different voice (mocking, baby talk, impersonation), or with a thick accent (an Irish brogue).

Non-speech manner captions tell readers how speech is pronounced. Adverbs and adverbial phrases are common ways of describing manner of speaking:

Non-speech Manner Caption | Accompanying Speech Caption
(drunken slurring):       | It’s a little late, isn’t it?
(WHISPERS)                | Don’t go!
(DISTORTED)               | Who do you think you are, Pilgrim?
(sobbing deeply):         | She’s dead! She’s dead…
(DISTANTLY)               | He’s in. Jake, can you hear me?
(SCORNFULLY)              | their deity.
(IN BELLA’S VOICE)        | Gran, I’d like you to meet…
(IN SILLY VOICE)          | And I’ll tell you what the first thing
(HEAVILY ACCENTED)        | Dr. Gopnik, I believe the results

Every speech caption could theoretically be accompanied by a manner caption, because everyone speaks with an accent. “An accent is a way of pronouncing a language. It is therefore impossible to speak without an accent” (Linguist List). But just because someone speaks with an accent doesn’t mean that accent should be named in the captions. Rather than riddle a caption track with parenthetical references to the manner in which every word is pronounced, the captioner must decide when manner of speaking is significant within the context of the narrative.

Typically, the non-speech manner caption will carry all or nearly all of the responsibility for informing caption readers about the manner in which words are pronounced. That’s because speech captions will most likely follow conventions of standard English, with some allowance for informal forms that are characteristic of speech (g-dropping, contractions and reductions [“gonna”], vocalized pauses and exclamations [“like,” “hmmm,” “um,” “ohhh”], types of aphaeresis [’bout, ’cause]). Experienced caption readers read captions quickly in part because they are intimately familiar with the standard conventions of English orthography, grammar, syntax, style, etc.

Speech captions are readable and accessible to the extent that they reflect standard or conventional forms, especially well-worn linguistic conventions of spelling, punctuation, grammar, syntax, etc. Captions that approximate or reflect standard forms (i.e. captioned speech that is written in standard English) will be more accessible and easier to read than captions that endeavor (e.g. through unusual or nonstandard spellings) to display phonetically or typographically how that speech sounds. Nonstandard spellings are almost never used in closed captioning.

But standard written English — and here’s what I’ve been working up to — tends to squeeze out so much of the linguistic variation that distinguishes one speaker from another: gender, dialect, age, pitch, intonation, quality, timbre, reverberation, speed, etc. Sylvie Dubois & Barbara H. Horvath (2002), in “Sounding Cajun: The Rhetorical Use of Dialect in Speech and Writing,” suggest that writing homogenizes speech:

People can often use their conscious or unconscious knowledge of dialectal variation to achieve some rhetorical effect: friendliness, humor, earthiness, honesty, nostalgia, and a host of other possibilities. But in writing, standardization imposes a special problem for using linguistic variation rhetorically. Written languages homogenize much of the linguistic variation that identifies a speaker’s background, and if writers want readers to know a narrator’s or a character’s social and geographic background, they either have to state it explicitly or break the rules — primarily, but certainly not exclusively, the spelling rules. (Dubois & Horvath 2002: 264)

Violating spelling rules rarely happens in closed captioned speech. Even informal forms of speech (’cause, gonna) follow well-known conventions when written down. As a result, writing tends to homogenize speech. Every speaker tends to “sound” the same in writing, unless the transcriber “breaks the rules.” But breaking the rules can slow readers down, making texts less accessible by introducing unfamiliar spellings or new variants that take longer to process. Nonstandard spellings may also lead readers to assume the speaker is “rustic” (p. 265), which may or may not be warranted. Within this context, the non-speech manner caption — e.g. “(drunken slurring)” — makes sense as a solution to the problem of putting into writing how words sound or are pronounced. But the manner caption carries a heavy burden, rarely getting help from the speech captions associated with it (because speech is almost always captioned in standard English using formal spellings and punctuation).

Drunk speech (like all captioned speech) will almost always look completely sober when captioned, because captioned speech, regardless of accent or manner, is almost always formal and sober in its approximation of standard written English. Remove the non-speech manner caption and try to guess which speech captions go with which manner of speaking. You might get some help from an exclamation point (is that sobbing? or perhaps yelling?) or multiple ellipses close together (which might indicate halting, drunk speech). But to a large extent, speech captions do not betray the manner in which words are uttered. The non-speech manner caption alone does that. Tearful speech rarely looks tearful when captioned. Slowed speech doesn’t look slowed down in the captions. Distorted speech looks exactly like the perfectly formed (undistorted) speech captions that surround it.
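
The thought experiment above can be tried out in a few lines of Python. This is only a rough sketch, not a description of any real captioning tool: it assumes a simplified caption format in which an optional parenthetical manner caption (sometimes followed by a colon) comes before the speech, and the function name split_manner and the regular expression are my own hypothetical choices for illustration.

```python
import re

# Assumed, simplified format: an optional parenthetical manner caption,
# optionally followed by a colon, then the speech caption itself.
MANNER_RE = re.compile(r"^\s*\(([^)]*)\)\s*:?\s*(.*)$")


def split_manner(caption):
    """Return (manner, speech); manner is None when no parenthetical leads the line."""
    match = MANNER_RE.match(caption)
    if match is None:
        return None, caption.strip()
    return match.group(1).strip().lower(), match.group(2).strip()


# Captions taken from the table earlier in this post.
captions = [
    "(drunken slurring): It’s a little late, isn’t it?",
    "(WHISPERS) Don’t go!",
    "(DISTORTED) Who do you think you are, Pilgrim?",
    "(sobbing deeply): She’s dead! She’s dead…",
]

for caption in captions:
    manner, speech = split_manner(caption)
    # With the manner caption set aside, only standard written English remains.
    print(f"{speech!r:45} <- {manner}")
```

Read down the left-hand column of the output and nothing in the speech itself separates the slurred line from the distorted one; the information about manner sits entirely in the right-hand column.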

1. Drunk speech

In captioning, there’s usually only one difference between drunk speech and sober speech: a parenthetical non-speech caption (e.g. [drunken slurring]) that precedes the first drunk speech caption. In a scene from It’s Always Sunny in Philadelphia (Season 4, Episode 4: “Mac’s Banging the Waitress,” 2008), Charlie is slurring his words because he’s very drunk. (Warning: The clip may be offensive to some viewers.)

The non-speech manner caption tells us that he is “drunken[ly] slurring” his words, but the words themselves appear to be quite sober. There’s a hesitation or two (marked by ellipses) and a repetition of “my” and “we.” Otherwise, the speech resembles any other speech captions — drunk or sober — we’re likely to encounter, perhaps with just a bit more marked hesitation. I don’t see any significant differences between Mac’s sober speech captions and Charlie’s drunk speech captions. The slurring is carried entirely on the back of the manner caption. Indeed, every speech caption, regardless of manner, accent, or level of emotion, tends to look the same. Put another way, Charlie’s drunk speech captions could also be recited sober. There’s nothing inherently sotty about the way his speech is captioned. The drunkenness comes from the manner caption that introduces his speech.

What marks speech as drunk (slurring, hesitant, irregular, slowed, incoherent, garbled) is evacuated from the speech captions themselves, reduced to a single description, and carried by that description at the start of the drunk sequence.

2. Distorted speech

The same thing happens to distorted speech, which doesn’t look distorted when captioned. Distortion is carried by a vague non-speech manner caption (e.g. “[in distorted voice]”). In this clip from District 9 (2009), Wikus hears his own speech echo back to him, and then similarly hears the indistinct speech of others.

Two more examples of distorted speech, the first from Scott Pilgrim vs. the World (2010) and the second from Alien vs. Predator: Requiem (2007).

3. Impersonating the speech of others

When speakers attempt to impersonate others, or assume the voices of other speakers (as in the Twilight example below), these changes need to be closed captioned. But as with other examples of manner of speaking, special speech (drunk, distorted, impersonated, etc.) doesn’t look any different from regular/normal speech. Only the non-speech manner caption alerts caption readers to the special status of the speech that follows it.

Three examples of impersonation: An Education (2009), Twilight: New Moon (2009), and The Office (2009, Season 6, Episode 12: “Scott’s Tots”).

The absence of impersonation (i.e. normal speech) may also need a non-speech caption if caption readers were expecting an impersonation. (The same is true for silences, which need to be captioned if caption readers were expecting sound.) In this example from Family Guy (Season 8, Episode 3: “Spies Reminiscent of Us”), Peter attempts to imitate John Wayne but his impersonation sounds so much like the way Peter’s voice usually sounds that a non-speech manner caption is needed. If we expect an impersonation, and one is not given, perhaps for humorous effect, a “normal voice” non-speech caption is needed:

4. Heavily accented speech

Any relevant, unusual, or particularly thick accent may need to be indicated with a non-speech manner caption. In this example from A Serious Man (2009), one of Larry Gopnik’s students tries to appeal his failing grade. The student’s thick Korean accent is noted by the non-speech manner caption — “(heavily accented)” — and also spills over into the speech captions.

In this example, the speech captions actually do provide small clues that the speaker is not a native speaker of English. The formal phrasing (“the failing grade”), misuse of articles, stilted vocabulary (“unjust”), subject/verb disagreement (“it cover”), and other grammatical infelicities (“I was unaware to be examined…”) suggest, even without the non-speech manner caption, that the student does not have a strong command of spoken English. Nevertheless, the accented words, when captioned, continue to conform to standard conventions of spelling, which leaves the manner caption with the responsibility of conveying how those words sound when pronounced. In this example, it might have been more effective to name the origin of Clive Park’s accent too, e.g. “(thick Korean accent).”

5. Slow motion speech

When spoken words are part of some highly stylized, slow-motion sequence, they may need to be accompanied by a non-speech manner caption (e.g. “in slow-motion”). The speech captions in such a slow-motion sequence will not look any different from normal-motion (unmarked) speech captions. Take away the non-speech manner caption and there’s nothing left in the speech captions to indicate slowed-down speech, because speech captions almost always approximate standard written English. A slowed-down “Woooooooooooow!” is still captioned as “Wow!”, as in this example from Scott Pilgrim vs. the World (2010).

6. Whispered speech

Speech that is louder or quieter than normal may need some help from a non-speech manner caption. In this example from Scott Pilgrim vs. the World (2010), the whispered words are preceded by “(whispers)” and followed by an exclamation mark (“Don’t go!”). The exclamation mark in English is associated with strong feelings and/or loud volume, leading to the potential for conflict between whispering and exclaiming that makes this example particularly interesting to me. Put differently, if you take away the non-speech manner caption, the speech captions not only give no indication of whispering but could actually mean the opposite of whispering (i.e. shouting). Hence my point about the heavy responsibility that non-speech manner captions carry, because captioned speech may give few clues about manner.

One more example: A compilation of decreases in volume (gently, softly, weakly) from Avatar (2009):

7. Sobbing speech

From Alien vs. Predator: Requiem (2007), this time featuring tearful speech:

8. Distant speech

From Avatar (2009), as Jake enters the avatar body for the first time and hears the doctors’ voices calling out to him:

9. Bored speech

When an announcer’s tone suddenly shifts, it may need to be described with a non-speech caption, as in this parody of Superfriends from Family Guy (2009, Season 8, Episode 2: “Family Goy”):

Of course, caption readers don’t have a baseline for interpreting the announcer’s tone or level of enthusiasm prior to the “bored” caption (i.e. the voiceover is simply prefaced with a generic SpeakerID: “ANNOUNCER”), but “bored” may be sufficient here to establish that the preceding tone has not been bored at all (and perhaps even heroic and enthusiastic in order to complement the “bold theme”).

A related example from The Big Year (2011) of “unenthusiastic” speech:

10. Wavering speech

From American Dad (2010, Season 5, Episode 18: “Great Space Roaster”). Because speech captions, which tend to approximate standard English regardless of manner of speaking, erase the distinctive qualities of the speaker’s voice, those qualities need to be described by an accompanying non-speech manner caption. Here, Roger’s voice is full of emotion, but that’s not apparent from an analysis of the speech captions alone:

This is just a sampling of non-speech manner captions. Examples abound because it is often necessary to alert caption readers to how words are pronounced. This very small collection has tried to show similarities in captioned speech across a pretty broad spectrum of manner types.

Captioning as rhetorical transcription

When transcribing speech, the linguist faces a challenge similar to that of the captioner describing manner of speech. In a 1991 article in American Speech, Ronald Macaulay discusses the inherent limitations of phonetic transcription: “Any transcription, no matter how detailed, is an interpretation of the tape and necessarily selective in what it includes or leaves out” (p. 282). The researcher can strive to offer an “exhaustive” transcript, but such a transcript would almost certainly be less accessible. Because a detailed phonetic transcript “is not easy to read,” it would take “longer to comprehend” (p. 282). Moreover, it would contain quite a bit of annoying “code-noise” because it “would provide a massive amount of information that is largely irrelevant” (Labov & Fanshel 1977, qtd in Macaulay 1991: 282). It “may look more authentic, but the value of any transcription depends upon its effectiveness for the reader” (p. 289). Macaulay suggests, among other things, that “the purpose of the transcription” (p. 282) should drive the process of selecting what to include in the transcript. In the case of the southwest Scottish variety of English that he studied in this article, Macaulay says he “wanted the transcription to be as readable as possible while providing some information on how the speakers sounded” (p. 286). A readable transcript of speech, according to Macaulay, “should not slow the reader’s eye, unless there is a particular purpose for doing so that is relevant to the transcribed passage” (p. 289).

Macaulay’s article includes a list of six proposed guidelines for displaying dialect in writing (p. 287), three of which seem highly applicable to closed captioning:

  • A transcript [caption file] should be appropriate for the specific purpose for which it is to be used.
  • The aim of any transcription [caption track] is to make the reader’s task as simple as possible.
  • The success of a transcription [caption track] is not to be judged on how much the transcriber [captioner] has managed to include but on how much the reader succeeds in getting out of it. (Macaulay 1991: 287; each bracketed term is my substitution for the word before it)

Unlike the linguist, the captioner is not trying to produce a phonetic transcript of speech, but like the linguist, the captioner is, ideally, driven by the purpose of the narrative, text, or scene being captioned when making decisions about how or whether to indicate manner of speech. The captioner is committed to making captions as readable as possible. And the captioner is concerned with ensuring that captions do “not slow the reader’s eye.” No wonder that speech captions, regardless of the manner in which that speech is pronounced, reflect conventions of standard written English, which are simply more accessible to literate readers than a system of new spellings that aims to produce a phonetic transcription.

So what?

The manner of speaking non-speech caption is a powerful tool for signaling how words are pronounced. The speech captions themselves do not usually betray the manner in which words are spoken, because writing tends to homogenize and formalize speech. As a result, the typical non-speech manner caption carries a heavy burden. In the absence of (m)any significant cues in the speech captions that someone is speaking differently, the caption reader must depend on the non-speech captions (along with the visual clues in the faces and bodies of speakers) to interpret homogenized speech.

When a significant manner of speaking extends over multiple captions, it is usually only indicated once in the captions. For example, the first caption of a drunk sequence will be preceded by “(drunken slurring)” or something similar, but subsequent drunk speech captions will not be preceded by reminders that the speech is drunk. Readers must remember (and of course, they have visual reminders in the video too) that sober-looking speech is actually the speech of a drunk.

I’m not proposing an alternative method of captioning but simply calling attention to 1) how writing homogenizes speech, and 2) the absolutely vital role that the non-speech manner caption plays in re-embodying speech.


  • Dubois, Sylvie & Barbara Horvath (2002) “Sounding Cajun: The rhetorical use of dialect in speech and writing.” American Speech 77.3: 264-287.
  • Macaulay, Ronald (1991) “’Coz it izny spelt when they say it’: Displaying dialect in writing.” American Speech 66.3: 280-291.


S. Zdenek

Dr. Sean Zdenek is an associate professor of technical and professional writing at the University of Delaware. He is the author of Reading Sounds: Closed-Captioned Media and Popular Culture (University of Chicago Press, 2015).

