Stylistic standards for closed captioning and data mining

When speaker IDs, musical lyrics, and sound descriptions have their own distinctive stylistic treatments, they can be extracted from closed caption files and studied as separate units of discourse. The only efficient and practical way to study hundreds or thousands of sound descriptions at one time is to use a program to separate speech from non-speech.

Such a program would seek out specific patterns: marks of punctuation (e.g. brackets, parentheses, musical notes, and/or colons) perhaps used in conjunction with specific typographic treatments. For example, a word in all caps immediately preceding a colon is always a speaker ID. A word in all caps surrounded by parentheses or brackets is always a sound description. A lower case word in parentheses or brackets is always a sound description. Well, that’s been my experience so far, anyway.
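The patterns just described can be sketched as two regular expressions. This is only a minimal illustration under assumptions, not the program used in the study: the names `SPEAKER_ID`, `SOUND_DESCRIPTION`, and `find_nsi` are hypothetical, and the speaker ID pattern assumes names made of capital letters, spaces, periods, apostrophes, and hyphens.

```python
import re

# Assumed pattern: one or more all-caps words immediately before a colon.
SPEAKER_ID = re.compile(r"\b([A-Z][A-Z .'-]*):")

# Assumed pattern: anything inside matching parentheses or brackets,
# in upper or lower case.
SOUND_DESCRIPTION = re.compile(r"\(([^()]+)\)|\[([^\[\]]+)\]")


def find_nsi(caption):
    """Return the speaker IDs and sound descriptions found in one caption."""
    speaker_ids = [m.group(1) for m in SPEAKER_ID.finditer(caption)]
    # Only one of the two groups is set, depending on the delimiter used.
    sounds = [m.group(1) or m.group(2) for m in SOUND_DESCRIPTION.finditer(caption)]
    return {"speaker_ids": speaker_ids, "sound_descriptions": sounds}


if __name__ == "__main__":
    print(find_nsi("ARTHUR: I'll be out in a minute."))
    print(find_nsi("[POWERS UP AND RUMBLES]"))
```

A pattern this simple will misfire on captions that break the conventions, so any real run would need spot checks against the original files.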

Research on sound descriptions and other types of non-speech information (NSI) is complicated by a lack of uniformity across the industry. Still, I think it’s possible to undertake a large-scale study of sound descriptions despite the differences in companies’ captioning styles. My study, tentatively titled Anatomy of a Sound Description, sets out to provide the first such study. At this point, I’m using a software program written by my son to 1) extract all the NSI from each movie caption file, 2) separate sound descriptions from speaker IDs, and then 3) organize each type into its own HTML table. In some cases, a caption will contain both a speaker ID and a sound description. These captions are copied into both tables. The full list of NSI can be ordered chronologically or alphabetically. Each example of NSI can be linked to its location in the full caption file, which is useful for seeing a sound description in its original context.
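The three steps can be sketched roughly as follows. This is not the program described above, whose internals are not shown here. It is a hypothetical outline that assumes the captions have already been read into a list of strings, and the function names, the `caption-N` anchor scheme, and the regular expressions are all assumptions made for illustration.

```python
import html
import re

# Assumed patterns: an all-caps name before a colon, and any text
# enclosed in parentheses or brackets.
SPEAKER_ID = re.compile(r"\b[A-Z][A-Z .'-]*:")
SOUND = re.compile(r"\([^()]+\)|\[[^\[\]]+\]")


def split_nsi(captions):
    """Steps 1 and 2: return (speaker_rows, sound_rows) as (index, text) pairs."""
    speaker_rows, sound_rows = [], []
    for index, text in enumerate(captions):
        if SPEAKER_ID.search(text):
            speaker_rows.append((index, text))
        # A caption containing both kinds of NSI is copied into both lists.
        if SOUND.search(text):
            sound_rows.append((index, text))
    return speaker_rows, sound_rows


def to_html_table(rows, alphabetical=False):
    """Step 3: render rows as an HTML table, chronological by default."""
    if alphabetical:
        rows = sorted(rows, key=lambda row: row[1].lower())
    lines = ["<table>", "<tr><th>Caption #</th><th>Text</th></tr>"]
    for index, text in rows:
        # Hypothetical anchor pointing back to the caption in the full file.
        link = f'<a href="#caption-{index}">{index}</a>'
        lines.append(f"<tr><td>{link}</td><td>{html.escape(text)}</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    captions = ["ARTHUR: I'll be out in a minute.", "(LARRY SCREAMS)"]
    speakers, sounds = split_nsi(captions)
    print(to_html_table(sounds))
```

Keeping the caption index alongside each row is what makes both the chronological ordering and the link back to the original context possible.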

**SPEAKER IDs That Can Be Programmatically Determined**

| Movie | Sample Speaker ID |
| --- | --- |
| An Education | Oh, no, they’re not, really. DANNY: It’s true. |
| A Serious Man | ARTHUR: I’ll be out in a minute. |
| Avatar | Come on. That’s… GRACE: All right, knock it off, you two. |
| District 9 | WIKUS: This is the largest operation that MNU has ever undertaken, |

**NON-SPEECH SOUNDS That Can Be Programmatically Determined**

| Movie | Sample Non-Speech Sound |
| --- | --- |
| An Education | (JENNY LAUGHING) when she goes all speccy and spotty. |
| A Serious Man | (LARRY SCREAMS) |
| Avatar | (CREATURE COOING) |
| District 9 | [POWERS UP AND RUMBLES] |

While the sound descriptions in these examples are not stylistically identical (e.g. District 9 uses brackets instead of parentheses), they are all still programmatically identifiable as sound descriptions. Even when movie captions use lower case for NSI (e.g. New York, I Love You, Adventureland, etc.), it is still possible to automatically extract NSI by isolating everything in parentheses or brackets, since parenthetical information in a caption file is very likely to be a sound description. If a caption file uses a seemingly identical treatment for both speaker IDs and sound descriptions, differences may remain that can be exploited. In Shaun of the Dead, for example, both sound descriptions and speaker IDs are surrounded by brackets. But the speaker IDs are set in italics, which may be just enough of a difference to allow them to be extracted into their own table.
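The italics distinction could be exploited along these lines. This is a sketch under a loud assumption: it supposes that italics are encoded in the caption file as `<i>...</i>` tags wrapped around the brackets, which may not match how any particular file stores them. The sample caption is invented for illustration and is not quoted from the movie.

```python
import re

# Hypothetical markup: assumes italics appear as <i>[...]</i>.
# Real caption files may encode italics differently.
BRACKETED = re.compile(r"(<i>)?\[([^\[\]]+)\](</i>)?")


def classify_bracketed(caption):
    """Split bracketed text into (speaker_ids, sound_descriptions)."""
    speaker_ids, sounds = [], []
    for match in BRACKETED.finditer(caption):
        # Treat a bracketed item as a speaker ID only when it is italicized.
        is_italic = bool(match.group(1) and match.group(3))
        (speaker_ids if is_italic else sounds).append(match.group(2))
    return speaker_ids, sounds


if __name__ == "__main__":
    # Invented sample caption, not taken from a real file.
    print(classify_bracketed("<i>[Narrator]</i> Hello. [door slams]"))
```

Because the pattern ignores case, lower case sound descriptions in brackets are picked up the same way as upper case ones.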

Why is such a study warranted?

A large-scale study of sound descriptions will provide a detailed description of current NSI practices. We simply don’t know enough about the range and nature of sound descriptions. How many sound descriptions are used in the typical movie? What are the main types of sound description? Is there a relationship between the number of sound descriptions in a movie and its genre? A detailed study might help improve current practices by giving captioners a deeper understanding of this most inventive aspect of captioning. I’m currently focusing on the best movies of 2009 and hope very soon to have a collection of 50 DVD caption files with NSI neatly extracted from each for analysis.