I’d be interested in seeing the results (if any) of usability tests for NBC.com’s video player, which has built-in support for closed captioning on full episodes. Captions are displayed on the right side of the video player and automatically scroll either up or down. Rather than occupying a layer within (or on top of) the video display itself, the captions scroll on a separate canvas. The user can control the vertical direction of the captions, and each caption occupies its own text box.
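To make the layout concrete, here is a minimal sketch of the described behavior in ordinary web-page terms, not a reconstruction of NBC’s implementation: each caption gets its own text box, appended to a column that scrolls beside (not on top of) the video, with a user-controllable scroll direction. The element id, CSS class, and Cue shape are all hypothetical.

```typescript
// A sketch of the side-panel caption display described above. The
// element id, CSS class, and Cue shape are hypothetical; this is an
// illustration of the behavior, not NBC's actual implementation.

interface Cue {
  speaker?: string; // optional speaker identifier, e.g. "Liz"
  text: string;
}

const panel = document.getElementById("caption-panel")!; // column beside the video
let scrollDown = true; // user-controllable vertical direction

function appendCaption(cue: Cue): void {
  const box = document.createElement("div");
  box.className = "caption-box"; // each caption occupies its own text box
  box.textContent = cue.speaker ? `${cue.speaker}: ${cue.text}` : cue.text;

  // New captions push the stack up or down depending on the user's choice.
  if (scrollDown) {
    panel.append(box);
  } else {
    panel.prepend(box);
  }
  box.scrollIntoView({ block: "nearest" }); // keep the newest caption in view
}
```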
[Screenshot: NBC’s video player showing a frame from an episode of 30 Rock, with six captions displayed in a vertical stack to the right of the video content.]

NBC’s non-traditional approach to captioning raises some pretty obvious usability red flags:

  • Separation: Because captions have been removed from the video canvas, users are required to shuttle back and forth between caption and visual display. While every user of captions has to learn how to move seamlessly between caption and visual content (regardless of interface), the NBC interface requires users to travel visually off the video canvas in order to read captions, and then return again to the video. This process, when repeated countless times, can be disorienting and tiring. The user is required to integrate two separate displays as they evolve and change, sometimes rapidly. It’s easy to miss what’s happening on the video screen while tracking the scrolling captions. Typically, captions are integrated with video content because they float on top of — and become part of — the video display. The user is not required to turn her gaze from the viewing area onto another visual display in order to read, process, and map captions onto the visual field. With NBC’s player, however, captions and video content remain cut off from each other. (I suspect that separation also disrupts the practice of lip reading, which is an important tool for cognitively fitting captions and video together.)
  • Synchronization: NBC’s caption feed was intended to be used within the seamless environment of television, where captions can be perfectly synchronized with speech and other sounds. On the web, bandwidth constraints and other variables (e.g. processor speed) can interfere with that synchronization, especially when captions have been moved off the video canvas into a separate scrolling frame. In the screenshot above, for example, “Liz” is identified as the speaker of the fourth caption from the top, because she is speaking while the video shows another character on the phone. This technique works well (indeed, it is standard practice) when captions are perfectly synchronized. But when they are not (as is the case here), or when the user has to move back and forth repeatedly between video and caption scroll, the identifier (“Liz”) is little consolation to a user who may have to work much harder than usual to match out-of-sync captions with specific speakers. (A sketch after this list illustrates how the choice of timing clock produces this kind of drift.)
  • Wayfinding: When the user must negotiate a long, scrolling bank of captions (six in the screenshot above), it’s easy to lose one’s way, especially when the caption bank is moving rapidly, has been separated from the video content, and spans a height greater than that of the video frame itself. Rather than negotiate two or three lines of captions (as is standard), the user must try to follow a towering, bloated bank of moving captions, with few identifiers and less-than-perfect synchronization.

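The synchronization problem comes down to which clock drives the captions. Below is a minimal sketch, assuming an HTML5 video element and a hypothetical cue list: keying cues to the video’s own playback clock keeps them aligned with speech even when buffering stalls playback, while a free-running wall-clock timer (a rough stand-in for a caption scroll that advances independently of the video) drifts the moment the video pauses to rebuffer.

```typescript
// A minimal sketch of clock choice in caption synchronization, assuming
// an HTML5 <video> element. The Cue shape and element id are hypothetical.

interface Cue {
  start: number; // seconds into the video when the cue appears
  end: number;   // seconds into the video when it should be cleared
  text: string;
}

const video = document.querySelector("video")!;
const display = document.getElementById("caption-display")!;

// Synchronized: cues are keyed to the video's playback clock, so when
// the network stalls and currentTime stops advancing, captions stop too.
function syncToMediaClock(cues: Cue[]): void {
  video.addEventListener("timeupdate", () => {
    const t = video.currentTime;
    const active = cues.filter((c) => c.start <= t && t < c.end);
    display.textContent = active.map((c) => c.text).join("\n");
  });
}

// Drifts: a free-running wall-clock timer keeps advancing through
// buffering pauses, so captions run ahead of (or behind) the speech.
function driftingScroll(cues: Cue[], startedAt: number): void {
  setInterval(() => {
    const t = (Date.now() - startedAt) / 1000;
    const active = cues.filter((c) => c.start <= t && t < c.end);
    display.textContent = active.map((c) => c.text).join("\n");
  }, 250);
}
```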
Captions on the side may simply take some getting used to. But I’m highly skeptical. I’d be interested in finding out how well other users of captions negotiate this territory, especially very young users who are learning to read. My concern is that captions on the side overly complicate a cognitive process that should be intuitive, simple, and efficient. For young users of captions, the focus should be on the intuitive integration of video with caption, not on learning to negotiate a bloated multi-display interface.