The creativity behind immersive VR filmmaking is only matched by the software engineering creativity that makes this type of video possible.
I just started a consulting project relating to 360° VR video, and have reached some preliminary conclusions. I am only an egg, as Robert Heinlein might say, but I thought I would share them.
First, done right, VR can be an incredibly powerful medium, capable of a level of immersion that can’t be matched in the world of “flat video,” the pejorative designation for the 2D video we watch on phones, TVs, and computer screens. The key phrase is “done right.”
Second, VR is a challenging medium with little margin for error. Where producers discuss quality of experience for flat video with terms like abandonment rate, in the VR world it’s cybersickness or nausea, which presumably generates a much stronger antipathy toward your company or brand than does a pre-roll ad.
Third, resolution is critical for effective VR video, and the numbers work against you. YouTube delivers its top-quality VR video at 4K, which sounds great until you realize that this 4K represents the entire 360° view. Devices like the Oculus Rift or HTC Vive have fields of view of about 110 degrees, or roughly 30 percent of the full 360° frame (110/360), which means at any one time you see only about 30 percent of the horizontal pixels, or roughly 1,200 pixels from the original 4K. The Rift and Vive have a display resolution of 1080×1200 per eye, so it's a pretty good match, but if you deliver 2K video, the headset has to scale the visible portion to roughly double its resolution. That can cause pixelation and softness.
These are the numbers for mono video, where both eyes see the same image. If you shoot stereoscopic VR, or a video for each eye, you have to pack both videos into the 4K stream, halving the resolution of each.
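The arithmetic above can be sketched in a few lines. This is back-of-the-envelope math only; exact figures vary by headset, projection, and how the stereo frames are packed, and the side-by-side packing assumed here is just one common layout.

```python
FRAME_WIDTH_4K = 3840      # horizontal pixels in a 4K equirectangular frame
FOV_DEGREES = 110          # approximate Rift/Vive horizontal field of view
DISPLAY_WIDTH = 1080       # per-eye horizontal display resolution

def visible_pixels(frame_width: int, fov_deg: float, stereo: bool = False) -> float:
    """Horizontal source pixels that land inside the field of view."""
    if stereo:
        # Side-by-side packing gives each eye half the frame width.
        frame_width //= 2
    return frame_width * fov_deg / 360

mono = visible_pixels(FRAME_WIDTH_4K, FOV_DEGREES)          # ~1,173 pixels
stereo = visible_pixels(FRAME_WIDTH_4K, FOV_DEGREES, True)  # ~587 pixels

print(f"mono:   {mono:.0f} source pixels vs {DISPLAY_WIDTH} display pixels")
print(f"stereo: {stereo:.0f} source pixels vs {DISPLAY_WIDTH} display pixels")
```

Mono 4K roughly matches the per-eye display; stereo packing leaves each eye barely half the pixels it needs.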
While field of view is your enemy from a resolution perspective, it can also be your friend. That is, since the viewer only sees what's in the field of view at any one time, why not just send that field of view, with a lower-quality buffer around the edges to make sure there's something to watch if the viewer quickly turns his or her head?
To start, understand that the default frame for VR video is called an equirectangular layout, which is what you get when you map a sphere to a rectangle. As an example, a typical world map is an equirectangular image. There's lots of distortion at the poles since you have to stretch them horizontally to achieve the same width as the equator. The problem with the equirectangular image is that it spends the same quality on all parts of the image, even though the viewer can only see, at most, 30 percent of the image at a time.
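The mapping itself is simple, which is why it's the default. Here's a minimal sketch, assuming a 3840×1920 frame (a common 2:1 equirectangular size), that converts a viewing direction to pixel coordinates; it also shows why the poles get stretched.

```python
def equirect_pixel(yaw_deg: float, pitch_deg: float,
                   width: int = 3840, height: int = 1920) -> tuple[int, int]:
    """Map a viewing direction to (x, y) in an equirectangular frame.

    yaw: -180..180 degrees (longitude), pitch: -90..90 degrees (latitude).
    Every row of the frame spans the full 360 degrees of longitude, which
    is why pixels near the poles cover far less real-world area than
    pixels at the equator -- the horizontal stretching described above.
    """
    x = (yaw_deg + 180) / 360 * (width - 1)
    y = (90 - pitch_deg) / 180 * (height - 1)
    return round(x), round(y)

print(equirect_pixel(0, 0))    # straight ahead -> center of the frame
print(equirect_pixel(0, 90))   # zenith (north pole) -> top row
```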
In a blog post called "Next-Generation Video Encoding Techniques for 360° Video and VR," two Facebook employees discussed pyramid encoding, which transforms each uploaded video into 30 viewport-specific versions. Each version contains full resolution for its own viewport, plus a much lower resolution rendering of the rest of the original frame. Using 1-second segments, the player monitors the field of view and switches versions quickly as the head moves. If the viewer whips her head around, she'll see a low-resolution version for a moment, but quality should quickly improve. The authors claim that this approach reduced delivered file size by 80 percent.
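The player-side logic reduces to a nearest-viewport lookup. Here's a sketch of that selection step; the 30-viewport layout below (five pitch rings of six yaw angles) is my illustrative assumption, not Facebook's exact geometry.

```python
import math

VIEWPORTS = [(yaw, pitch)
             for pitch in (-60, -30, 0, 30, 60)
             for yaw in range(0, 360, 60)]          # 30 viewport centers

def angular_distance(a, b) -> float:
    """Great-circle angle in degrees between two (yaw, pitch) directions."""
    ya, pa, yb, pb = map(math.radians, (a[0], a[1], b[0], b[1]))
    cos_angle = (math.sin(pa) * math.sin(pb) +
                 math.cos(pa) * math.cos(pb) * math.cos(ya - yb))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def pick_viewport(head_yaw: float, head_pitch: float) -> tuple[int, int]:
    """Return the pre-encoded viewport center closest to the current gaze.
    Run once per segment (e.g., every second) as head tracking updates."""
    return min(VIEWPORTS,
               key=lambda vp: angular_distance(vp, (head_yaw, head_pitch)))

print(pick_viewport(10, 5))     # near straight ahead -> (0, 0)
print(pick_viewport(170, -50))  # looking back and down -> (180, -60)
```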
A different approach was described in a paper called "Viewport-Aware Adaptive 360° Video Streaming Using Tiles for Virtual Reality." Here, the authors divided the equirectangular frame into multiple tiles that were encoded separately at declining quality levels, like a traditional encoding ladder. The player retrieves the highest-quality rungs for tiles in the field of view, with declining quality for the rest. So the tiles directly behind the current field of view, and the poles, might be the lowest quality, with quality improving closer to the current viewport. This is all managed via a DASH manifest.
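The tile-selection side of that scheme can be sketched as a distance-to-quality mapping. The grid size, ladder rungs, and step-down rule below are illustrative assumptions on my part; a real player would read tile addresses and bitrates from the DASH manifest rather than hard-code them.

```python
TILES_X, TILES_Y = 8, 4                    # frame split into 32 tiles
RUNGS = ["1080p", "720p", "480p", "240p"]  # hypothetical encoding ladder

def tile_rungs(view_x: int, view_y: int) -> dict[tuple[int, int], str]:
    """Pick a ladder rung per tile: best quality at the viewport tile,
    stepping down with distance from it (wrapping horizontally, since
    the left and right edges of an equirectangular frame meet)."""
    plan = {}
    for ty in range(TILES_Y):
        for tx in range(TILES_X):
            dx = min(abs(tx - view_x), TILES_X - abs(tx - view_x))
            dy = abs(ty - view_y)
            rung = min(max(dx, dy), len(RUNGS) - 1)
            plan[(tx, ty)] = RUNGS[rung]
    return plan

plan = tile_rungs(view_x=0, view_y=1)  # viewer facing tile (0, 1)
print(plan[(0, 1)])   # viewport tile -> "1080p"
print(plan[(4, 1)])   # directly behind -> "240p"
```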
I’m working out what all of this means from a compression perspective, but it’s refreshing to see such creative solutions to bandwidth-related problems.