Closed Captioning for Streaming Media

Though relatively few websites are required to provide closed captions for their videos, any website with significant video content should consider captioning. Not only does it provide access for deaf and hard-of-hearing viewers, but captions and the associated metadata can dramatically improve video search engine optimization. In this introduction to closed captions, you’ll learn about who needs to caption and who doesn’t (and why you may want to anyway), the available workflows for captioning live events and on-demand files, and a bit about web caption formats and how to marry them to your streaming files.

Let’s jump in.

What Are Closed Captions?

Closed captions enable deaf and hard-of-hearing individuals to access the audio components of the audio/video experience. Closed captions incorporate all elements of the audio experience, including background sounds, the identity of the speaker, the existence of background music, descriptions of how the speaker is talking, and similar information. They are also closed, meaning they can be disabled by viewers without hearing disabilities. In contrast, open captions are burned into the video stream and can’t be disabled.

Subtitles are typically implemented to allow viewers to watch videos produced in different languages. Technically, background sounds and other nonvocal audio don’t have to be incorporated into the text description, since subtitles are not designed for the deaf and hard-of-hearing, but these elements are often included. There are many closed caption standards, and several are discussed here. While each has a unique format and structure, the content of all closed caption files is similar, primarily consisting of the textual data and the time code information that dictates when it’s displayed.
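While the formats differ, the underlying data model is the same everywhere. As a purely illustrative sketch (my own, not tied to any particular caption standard), a single caption cue can be modeled in Python as text plus the timecodes that govern its display:

```python
from dataclasses import dataclass

@dataclass
class CaptionCue:
    """One caption entry: display `text` from `start` to `end` (in seconds)."""
    start: float
    end: float
    text: str

    def timecode(self, seconds: float) -> str:
        # Most caption formats express time as some variant of HH:MM:SS.mmm.
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"
```

A real caption file is essentially a list of such cues serialized in format-specific syntax.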

Who Has to Use Closed Captions?

Two classes of websites caption: those required by law and those that caption voluntarily. Let’s start with those legally required to caption.

Section 508 of the Rehabilitation Act

Four laws create the obligation to caption. Starting with federal agencies, Section 508 1194.24(c) of the Rehabilitation Act (29 U.S.C. 794d) states that “All training and informational video and multimedia productions which support the agency’s mission, regardless of format, that contain speech or other audio information necessary for the comprehension of the content, shall be open or closed captioned.” Beyond these federal requirements, note that states that receive federal funds under the Assistive Technology Act of 1998 must also comply with Section 508 to some degree.

Interestingly, several agencies meet this requirement via YouTube’s captioning. For example, the surgeon general videos available from the Department of Health and Human Services website use YouTube (Figure 1), as do videos from my home state of Virginia. More on how to use YouTube follows.

Figure 1. Videos from the Department of Health and Human Services website use YouTube’s captioning system to meet the Section 508 requirements.

Twenty-First Century Communications and Video Accessibility Act of 2010

The next class of producers who must add closed captions to their streaming videos are broadcasters, but only with regard to content that has been previously played on TV with closed captions. Specifically, under powers flowing from the Twenty-First Century Communications and Video Accessibility Act of 2010, the Federal Communications Commission issued regulations called the IP Closed Captioning rules in August 2012, which went into effect for some classes of video on Sept. 30, 2012.

That is, prerecorded programming that was not edited for internet distribution must be captioned for the web if it was shown on television on or after Sept. 30, 2012. If content was edited for internet distribution, the deadline is pushed back a year, to Sept. 30, 2013. There are several other classes of covered content, including live content published with captions on TV, which must be captioned on the internet by March 30, 2013, and older content that predates the act.

The FCC regulations and interpretations thereof make several points clear. If the content is never shown on television with captions, there is no requirement to caption it for streaming. The rules also relate only to full-length programming, not clips or highlights of that programming. This last point explains why ESPN’s full-length shows such as Mike & Mike in the Morning (Figure 2) have captioning, while none of the highlights that I watched do.

Figure 2. ESPN captions programs that were displayed in their entirety on TV.

For more information on the regulations, check out “FCC’s New Closed Captioning Rules Kick Into Gear” on the FCC Law Blog.

Other Closed Captioning Provisions

Several additional federal laws may impose captioning requirements on varying classes of publishers. For example, some authorities opine that Title II of the Americans with Disabilities Act (“Title II”) and Section 504 of the Rehabilitation Act of 1973 (“Section 504”) impose captioning requirements on public schools and universities, as well as on towns and other municipal bodies. There are conflicting opinions here, though, and I bring it up not to take a position either way but to advise that the obligation may exist. Don’t contact me; contact your attorney.

Large web-only broadcasters should also be concerned about a recent ruling in the National Association of the Deaf’s (NAD) lawsuit against Netflix, in which NAD asserted that the Americans with Disabilities Act imposed an obligation for Netflix to caption its “watch instantly” site. In rejecting Netflix’s motion to dismiss, the court ruled that the internet is a “place of public accommodation” and therefore is subject to the act. Netflix later settled the lawsuit, agreeing to caption all of its content over a 3-year schedule and paying around $800,000 for NAD’s legal fees and other costs. In the blog post referenced previously, the attorney stated: “[P]roviding captioning earlier, rather than later, should be a goal of any video programming owner since it will likely be a delivery requirement of most distributors and, in any event, may serve to avoid potential ADA claims.” This sounds like good advice for any significant publisher of web-based video.

Voluntary Captioners

Beyond those who must caption, there is a growing group of organizations that caption for the benefit of their hearing-impaired viewers, to help monetize their video, or both. One company big into caption-related monetization is Boston-based RAMP. I spoke briefly with Nate Treloar, the company’s VP of strategic partnerships.

In a nutshell, Treloar related that captions provide metadata that can feed SEO-enhancing publishing processes, such as topic pages and microsites, that would be impossible without captions. RAMP was originally created to produce such metadata for its clients; it branched into captioning only when it became clear that many of its TV clients, which include CNBC, FOX Business, Comcast SportsNet, and the Golf Channel, would soon have to caption on the web.

Figure 3. RAMP’s cloud-based captioning and metadata creation workflow

As shown in Figure 3, RAMP’s technology can input closed captions from text or convert them from audio, with patented algorithms that prioritize accuracy for the nouns that drive most text searches. This content is processed and converted into timecoded transcripts and tags for use in SEO with dynamic thumbnailing for applications like video search. The transcripts can then be edited for use in web captioning.

RAMP’s prices depend upon volume commitment and the precise services performed, but Treloar stated that video processing ranges from pennies to several dollars per minute. With customers such as FOX News reporting 129% growth in SEO traffic, the investment seems worthwhile for sites where video makes up a substantial portion of their overall content.

Now that we know who’s captioning for the web, let’s explore how it’s done.

Creating Closed Captions for Streaming Video

There are two elements to creating captions: The first involves creating the text itself; the second involves matching the text to the video. Before we get into the specifics of both, let’s review how captions work for broadcast TV.

In the U.S., TV captioning became required under the Television Decoder Circuitry Act of 1990, which prompted the Electronic Industries Association to create EIA-608, now called CEA-608 for the Consumer Electronics Association that succeeded the EIA. Briefly, 608 captions are limited to 60 characters per second, with a single Latin-based character set that can be used for only a limited set of languages. These tracks are embedded into the line 21 data area of the analog broadcast (also called the vertical blanking interval), so they are retrieved and decoded along with the audio/video content.

Where CEA-608 is for analog broadcasts, CEA-708 (formerly EIA-708) is for digital broadcasts. The CEA-708 specification is much more flexible and can support Unicode characters with multiple fonts, sizes, colors, and styles. CEA-708 captions are embedded as a text track into the transport stream carrying the video, which is typically MPEG-2 or MPEG-4.

A couple of key points here. First, if you’re converting analog or digital TV broadcasts, it’s likely that the text track is already contained therein, so the caption creation task is done. Most enterprise encoding tools can pass through that text track when transcoding video for distribution to OTT or other devices that can recognize and play embedded text tracks.

Unfortunately, though QuickTime and iOS players can recognize and play these embedded text tracks, other formats, such as Flash, Smooth Streaming, and HTML5, can’t. So to create captioning for these other formats, you’ll need to extract the caption data from the broadcast feed and convert it into one of the caption formats discussed in more detail later. Not all enterprise encoders can do this today, though it’s a critical feature that most products should support in the near future.

Captioning Your Live Event

If you’re not starting with broadcast input, you’ll have to create the captions yourself. For live events, Step 1 is to find a professional captioning company such as CaptionMax, which has provided captioning services for live and on-demand presentations since 1993. I spoke with COO Gerald Freda, who described this live event workflow.

With CaptionMax, you contract for a live stenographer (aka real-time captioner) who is typically off-site and who receives an audio feed via telephone or streaming connection. The steno machine has 22 keys representing phonetic parts of words and phrases, rather than the 60-plus keys on a typical computer keyboard. The input feeds through software, which converts it to text. This text is typically sent to an IP address, where it’s formatted as necessary for the broadcast and transmitted in real time. The text is linked programmatically with the video player and presented either in a sidecar window or, preferably, atop the streaming video just like on TV.

Unlike captions for on-demand videos, there’s no attempt to synchronize the text with the spoken word—the text is presented as soon as available. If you’ve ever watched captioning of a live broadcast, you’ll notice that this is how it’s done on television, and there’s usually a lag of 2–4 seconds between the spoken word and the caption.

According to Phil McLaughlin, president of EEG Enterprises, Inc., seemingly small variations in how streaming text is presented could impact whether the captioning meets the requirements of the various statutes that dictate its use. By way of background, EEG was one of the original manufacturers of “encoders” that multiplex analog and digital broadcasts with the closed caption text; it currently manufactures a range of web and broadcast-related caption products. The company also has a web-based service for captioning live events.

Here are some of the potential issues McLaughlin was referring to. For context, note that language in the hearings related to the FCC regulations that mandated captioning for TV broadcasters discussed providing a “captioning experience … equivalent to … [the] television captioning experience.” McLaughlin doubts that presenting the captions in a sidecar meets this equivalency requirement because the viewer has to shift between the sidecar and the video, which is much harder than watching the text over the video. At NAB 2012, McLaughlin’s company demonstrated how to deliver live captions over the Flash Player using industry-standard components, so sidecars may be history by the time broadcasters have to caption live events in March 2013.

Interestingly, McLaughlin also questions whether the use of the SMPTE-TT (Society of Motion Picture and Television Engineers Timed Text format) allowed in the FCC regulations provides the equivalency the statute is seeking for live captioned content. Specifically, McLaughlin noted that SMPTE-TT lacked the ability to smooth scroll during live playback, as TV captions do. Instead, the captions jump up from line to line, which is harder to read. You can avoid this problem by tunneling 608 data with the SMPTE-TT spec, but not all web producers are using this technique.

McLaughlin feels that using the embedded captioning available in MP4 and MPEG-2 transport streams, as iOS devices can, is the simplest approach and provides the best captioning experience. Note that neither the sidecar nor the smooth-scrolling issue presents a problem for on-demand broadcasts. With on-demand files, the captions are synchronized with the spoken word and predominantly presented via pop-up captions over the video, which are much easier to follow when they match what’s happening on screen.

While this won’t meet the FCC requirements, another option for private organizations seeking to provide a feed for deaf and hard-of-hearing viewers is the New York City Mayor Bloomberg approach of supplying a real-time American Sign Language interpretation of the live feed. This was the approach originally used by Lockheed Martin Corp. for its live events. Ultimately, the company found using real-time captioning to be more effective and less expensive. You can see a presentation on this topic by Thomas Aquilone, enterprise technology programs manager for Lockheed Martin, from Streaming Media East.

Captioning On-Demand Files

As you would suspect, captioning on-demand files is simpler and cheaper than captioning live events. There are many service providers such as CaptionMax, where you can upload your video files (or low resolution proxies) and download caption files in any and all required formats. You can also buy software such as MacCaption from CPC Computer Prompting & Captioning Co. to create and synchronize your captions (Figure 4).

Figure 4. Using MacCaption to create captions for this short video clip

For low-volume producers, there are several freebie guerrilla approaches that you can use to create and format captions. For a 64-second test clip, I used the speech-to-text feature in Adobe Creative Suite to create a rough transcript, which I then cleaned up in a text editor. Next, I uploaded the source video file to YouTube, and then uploaded the transcript. Using proprietary technology, YouTube synchronized the actual speech with the text, which you can edit online as shown in Figure 5.

Figure 5. Correcting captions in YouTube

From there, you can download an .sbv file from YouTube, which you can convert to the necessary captioning format using one of several online or open source tools. I used the online Captions Format Converter from 3Play Media for my tests. Note that YouTube has multiple suggestions for creating its transcripts, many summarized in a recent article. If you’re going to use YouTube for captioning, you should check this out.
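The .sbv format itself is simple enough that low-volume producers can script the conversion: cues are blank-line-separated blocks consisting of a start,end timecode line followed by the caption text. Here’s a rough Python sketch (my own illustration, not a 3Play Media or YouTube tool) that converts .sbv text to WebVTT:

```python
import re

# SBV timecodes look like 0:00:01.000; WebVTT wants two-digit hours
# and " --> " between the start and end times.
SBV_TIMES = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)

def sbv_to_vtt(sbv_text: str) -> str:
    """Convert YouTube .sbv caption text to a WebVTT document."""
    out = ["WEBVTT", ""]
    for block in sbv_text.strip().split("\n\n"):
        lines = block.splitlines()
        match = SBV_TIMES.match(lines[0].strip())
        if not match:
            continue  # skip malformed cues
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        out.append(f"{int(h1):02d}:{m1}:{s1}.{ms1} --> {int(h2):02d}:{m2}:{s2}.{ms2}")
        out.extend(lines[1:])  # the caption text, possibly multiple lines
        out.append("")
    return "\n".join(out)
```

This handles only the basic cue syntax; positioning and styling, if present, would need more work.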

Which approach is best for your captioning requirements? Remember that there are multiple qualitative aspects to captioning, and messing up any of them is a very visible faux pas for your deaf and hard-of-hearing viewers. For example, to duplicate the actual audio experience, you have to add descriptive comments about other audio in the file (applause, rock music). With multiple speakers, you may need to position the text in different parts of the video frame so it’s obvious who’s talking or add names or titles to the text. There are also more basic rules about how to chunk the text for online viewing.
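As a rough illustration of those chunking rules: traditional CEA-608 captions fit at most 32 characters per row, so transcripts are typically broken into short lines at word boundaries. A hypothetical Python sketch:

```python
def chunk_caption(text: str, max_chars: int = 32) -> list[str]:
    """Break transcript text into caption lines without splitting words.

    The 32-character default mirrors the traditional CEA-608 row width;
    web formats vary, so treat the limit as a configurable assumption.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word  # a word longer than max_chars keeps its own line
    if current:
        lines.append(current)
    return lines
```

Professional captioners also break lines at natural phrase boundaries, which simple character counting can’t capture.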

Basically, if you don’t know what you’re doing and need the captioning to reflect well on your organization, you should hire someone to do it for you, at least until you learn the ropes. For infrequent use, transcription and caption formatting is very affordable, though few services publish their prices online. The lowest pricing I found was around $3 per minute, but this will vary greatly with turnaround requirements and volume commitments. Remember, again, that there is both an accuracy component and a qualitative component to captioning, so the least expensive provider is not always the best.

Captioning Your Streaming Video

Once you have the caption file, matching it to the streaming video file is done programmatically when creating the player, and all you need is the caption file in the proper format. For example, Flash uses the World Wide Web Consortium (W3C) Timed Text XML file format (TTML, formerly known as DFXP), which you can add via the FLVPlaybackCaptioning component. Brightcove, Inc., one of the OVPs used by Streaming Media, can accept either DFXP or the aforementioned SMPTE-TT format for presenting captions (Figure 6). Most other high-end OVPs, as well as open source players such as LongTail Video’s JW Player and Zencoder, Inc.’s Video.js, also support captions, with extensive documentation on their respective sites.

Figure 6. Captioning in Brightcove: The player is superimposed over the video properties control where you upload the caption file.
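Since TTML is just XML, it’s easy to see what such a caption file actually carries. The following is a hedged Python sketch (my own illustration, not Adobe’s or Brightcove’s tooling) that builds a bare-bones TTML document from (begin, end, text) tuples using the standard library; production TTML files typically add styling and layout metadata on top of this skeleton:

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"

def build_ttml(cues):
    """Build a minimal TTML (formerly DFXP) document from (begin, end, text) cues."""
    ET.register_namespace("", TTML_NS)  # serialize without a namespace prefix
    tt = ET.Element(f"{{{TTML_NS}}}tt")
    body = ET.SubElement(tt, f"{{{TTML_NS}}}body")
    div = ET.SubElement(body, f"{{{TTML_NS}}}div")
    for begin, end, text in cues:
        p = ET.SubElement(div, f"{{{TTML_NS}}}p", begin=begin, end=end)
        p.text = text  # each <p> is one timed caption
    return ET.tostring(tt, encoding="unicode")
```

Each `<p>` element pairs the caption text with the `begin`/`end` timecodes the player uses for display.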

In a way that seems almost unique to the streaming media market, the various players have evolved without a unified standard, adding confusion and complexity to a function that broadcast markets neatly standardized years ago. Examples abound. Windows Media captions need to be in the SAMI (Synchronized Accessible Media Interchange) format, while Smooth Streaming uses TTML.

As mentioned, iOS devices can decode embedded captions in the transport stream, eliminating the need for a separate caption file. With HTTP Live Streaming, however, Apple is moving toward the WebVTT spec proposed by the Web Hypertext Application Technology Working Group as the standard for HTML5 video closed captioning. Speaking of HTML5, there are two competing caption standards, TTML and WebVTT, though browser adoption of either is nascent. This lack of caption support is yet another reason that the large three-letter networks, which are forced to caption, can’t use HTML5 as their primary player on the desktop.

For a very useful description of the origin of most caption-related formats and standards, check out “The Zencoder Guide to Closed Captioning for Web, Mobile, and Connected TV.”

Trans-Platform Captioning

What about producers who publish video on multiple platforms, say Flash for desktops and HLS for mobile? We’re just starting to see transmuxing support in streaming server products that can input captions for one platform, such as HLS, and convert the captions for use in another platform, such as Flash.

For example, I spoke with Jeff Dicker from Denver-based RealEyes Media, a digital agency specializing in Flash-based rich internet applications. He reported that Adobe Media Server 5.0.1 can input captioned streams targeted for either HLS or RTMP and transmux them for deployment for any target supported by the server. For HLS, this means captions embedded in the MPEG-2 transport stream; for delivery to Flash, this means breaking out and delivering a separate caption file with the audio/video chunks.

At press time, Wowza announced Wowza Media Server 3.5, which has the ability to accept “caption data from a variety of in-stream and file-based sources before converting captions into the appropriate formats for live and on-demand video streaming using the Apple HLS, Adobe HDS and RTMP protocols.” So once you have your captions created and formatted for one platform, these and similar products will automatically convert them as needed for the other target platforms supported by the servers. (We’ll have an article on using Wowza Media Server for closed captioning early next week.)

About Jan Ozer

I help companies train new technical hires in streaming media-related positions; I also help companies optimize their codec selections and encoding stacks, and evaluate new encoders and codecs.
