The Intersection of AI and Video Encoding
FIGURE 1. THE HIGH-LEVEL AI WORKFLOW (FROM ESCON INFO SYSTEMS)

The Intersection of AI and Video Encoding

The intersection of video processing and artificial intelligence (AI) delivers exciting new functionality, from real-time quality enhancement for video publishers to object detection and optical character recognition for security applications. Forward-thinking product designs incorporate both video processing (decode/scale/overlay/encode) and AI to provide the optimal platform for video-related AI applications like those discussed below.  

For example, I recently joined NETINT in a marketing role. One key feature in NETINT’s Quadra Video Processing Units is two onboard Neural Processing Units (NPUs). Combined with Quadra’s integrated decoding, scaling, and transcoding hardware, this creates an integrated AI and video processing architecture that requires minimal interaction from the host CPU. As you’ll learn in this post, this architecture makes Quadra the ideal platform for executing video-related AI applications.

This post introduces the reader to what AI is, how it works, and how you deploy AI applications on NETINT Quadra. Along the way, we’ll explore one Quadra-supported AI application, Region of Interest (ROI) encoding.

Let’s start by defining some terms and concepts. Artificial intelligence refers to a program that can sense, reason,  act, and adapt. One AI subset that’s a bit easier to grasp is called machine learning, which refers to algorithms whose performance improves as they are exposed to more data over time.

Machine learning involves the five steps shown in the figure below. Let’s assume we’re building an application that can identify dogs in a video stream. The first step is to prepare your data. You might start with 100 pictures of dogs and then extract features, or represent them mathematically, that identify them as dogs: four legs, whiskers, two ears, two eyes, and a tail. So far, so good.

 The Intersection of AI and Video Encoding
FIGURE 1. THE HIGH-LEVEL AI WORKFLOW (FROM ESCON INFO SYSTEMS)

To train the model, you apply your dog-finding algorithm to a picture database of 1,000 animals, only to find that rats, cats, possums, and small ponies are also identified as dogs. As you evaluate and further train the model, you extract new features from all the other animals that disqualify them from being a dog, along with more dog-like features that help identify true canines. This is the ”machine learning” that improves the algorithm.

As you train and evaluate your model, at some point it achieves the desired accuracy rate and it’s ready to deploy.

The NETINT AI Tool Chain

Then it’s time to run the model. Here, you export the model for deployment on an AI-capable hardware platform like the NETINT Quadra. What makes Quadra ideal for video-related AI applications is the power of the Neural Processing Units (NPU) and the proximity of the video to the NPUs. That is, since the video is entirely processed in Quadra, there are no transfers to a CPU or GPU, which minimizes latency and enables faster performance. More on this is below.

Figure 2 shows the NETINT AI Toolchain workflow for creating and running models on Quadra. On the left are third-party tools for creating and training AI-related models. Once these models are complete, you use the free NETINT AI Toolkit to input the models and translate, export, and run them on the Quadra NPUs – you’ll see an example of how that’s done in a moment. On the NPUs, they perform the functions for which they were created and trained, like identifying dogs in a video stream.

The Intersection of AI and Video Encoding
FIGURE 2. THE NETINT AI TOOL CHAIN.

Quadra Region of Interest (ROI) Filter

Let’s look at a real-world example. One AI function supplied with Quadra is an ROI filter, which analyzes the input video to detect faces and generate Region of Interest (ROI) data to improve the encoding quality of the faces. Specifically, when the AI Engine identifies a face, it draws a box around the face and sends the box’s coordinates to the encoder, with encoding instructions specific to the box.

Technically, Quadra identifies the face using what’s called a YOLOv4 object detection model. YOLO stands for You Only Look Once, which is a technology that requires only a single pass of the image (or one look) for object detection. By way of background, YOLO is a highly regarded family of “deep learning object detection models. The original versions of YOLO are implemented using the DARKNET framework, which you see as an input to the NETINT AI Toolkit in Figure 2.

Deep learning is different from the traditional machine learning described above in that it uses large datasets to create the model, rather than human intervention. To create the model deployed in the ROI filter, we trained the YOLOv4 model in DARKNET using hundreds of thousands of publicly available image data with labels (where the labels are bounding boxes on people’s faces). This produced a highly accurate model with minimum manual input, which is faster and cheaper than traditional machine learning. Obviously, where relevant training data is available, deep learning is a better alternative than traditional machine learning.

Using the ROI Function

Most users will access the ROI function via FFmpeg, where it’s presented as a video filter with the filter-specific command string shown below. To execute the function, you call the filter (ni_quadra_roi), enter the name and location of the model (yolov4_head.nb), and a QP value to adjust the quality within each box (qpoffset=-0.6). Negative values increase video quality, while positive values decrease it so that the command string would increase the quality of the faces by approximately 60% over other regions in the video.

-vf ‘ni_quadra_roi=nb=./yolov4_head.nb:qpoffset=-0.6’

Obviously, this video is highly compressed; in a surveillance video, the ROI filter could preserve facial quality for face detection; in a gambling or similar video compressed at a higher bitrate, it could ensure that the players’ or performers’ faces look their best.

FIGURE 3. THE REGION OF INTEREST FILTER AT WORK; ORIGINAL ON LEFT, ROI FILTER ON THE RIGHT. Click the image to see it at full resolution.

In terms of performance, a single Quadra unit can process about 200 frames per second or at least six 30fps streams. This would allow a single Quadra to detect faces and transcode streams from six security cameras or six player inputs in an interactive gambling application, along with other transcoding tasks performed without region of interest detection.

Figure 4 shows the processing workflow within the Quadra VPU. Here we see the face detection operating within Quadra’s NPUs, with the location and processing instructions passing directly from the NPU to the encoder. As mentioned, since all instructions are processed on Quadra, there are no memory transfers outside the unit, reducing latency to a minimum and improving overall throughput and performance. This architecture represents the ideal execution environment for any video-related AI application.

 The Intersection of AI and Video Encoding
FIGURE 4. QUADRA’S ON-BOARD AI AND ENCODING PROCESSING.

NETINT offers several other AI functions, including background removal and replacement, with others like optical character recognition, video enhancement, camera video quality detection, and voice-to-text on the long-term drawing board. Of course, via the NETINT Tool Chain, Quadra should be able to run most models created in any machine learning platform.

Here in late 2022, we’re only touching the surface of how AI can enhance video, whether by improving visual quality, extracting data, or any number of as-yet unimagined applications. Looking ahead, the NETINT AI Tool Chain should ensure that any AI model that you build will run on Quadra. Once deployed, Quadra’s integrated video processing/AI architecture should ensure highly efficient and extremely low-latency operation for that model.

About Jan Ozer

Avatar photo
I help companies train new technical hires in streaming media-related positions; I also help companies optimize their codec selections and encoding stacks and evaluate new encoders and codecs. I am a contributing editor to Streaming Media Magazine, writing about codecs and encoding tools. I have written multiple authoritative books on video encoding, including Video Encoding by the Numbers: Eliminate the Guesswork from your Streaming Video (https://amzn.to/3kV6R1j) and Learn to Produce Video with FFmpeg: In Thirty Minutes or Less (https://amzn.to/3ZJih7e). I have multiple courses relating to streaming media production, all available at https://bit.ly/slc_courses. I currently work as www.netint.com as a Senior Director in Marketing.

Check Also

Announcing Free Course on Controlling the AMD MA35D with FFmpeg

I’m pleased to announce a new free course, MA35D & FFmpeg Quick Start: Essential Skills …

Choosing the Best Preset for Live Transcoding

When choosing a preset for VOD transcoding, it almost always makes sense to use the …

There are no codec comparisons. There are only codec implementation comparisons.

I was reminded of this recently as I prepared for a talk on AV1 readiness …

Leave a Reply

Your email address will not be published. Required fields are marked *