If you’re optimizing x265 for speed, enabling Wavefront Parallel Processing (WPP) looks like a no-brainer. Table 1 shows a staggering 7.3x improvement in encoding time. A 3:15 encode with WPP turns into a painful 23:51 without it. The quality penalty? Negligible. VMAF drops just 0.19, with the low-frame VMAF off by only 0.77 (low-frame is the lowest VMAF score of any frame in the video, a predictor of transient quality issues).
Given the fabulous performance improvement and low quality penalty, WPP seems like a slam dunk. It may not be.
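As a quick sanity check, the 7.3x figure follows directly from the two encode times quoted above. The helper name `to_seconds` is ours, purely for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'mm:ss' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


with_wpp = to_seconds("3:15")       # encode time with WPP enabled
without_wpp = to_seconds("23:51")   # encode time with WPP disabled

speedup = without_wpp / with_wpp
print(f"{speedup:.1f}x")  # 7.3x
```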

What Wavefront Parallel Processing Does
WPP works by dividing each frame into rows of coding tree units (CTUs) and distributing those rows across multiple threads. That speeds up each individual encode, but those threads don’t come from thin air. Every WPP encode occupies more cores, so on a 32-core system you can run fewer encodes in parallel.
Figure 2 tells the story. WPP-enabled encodes consume significant CPU cycles. While this might be fine for a single-file test encode, it might not be optimal for multiple-instance production encodes.
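To make the toggle concrete, here is a minimal sketch of how the two variants might be launched from a script. It assumes the x265 command-line encoder, where `--wpp` and `--no-wpp` switch the feature and `--pools` limits the worker threads available to one instance. The article does not say how thread counts were set in these tests, so the use of `--pools`, the `medium` preset, and the file names are all illustrative assumptions.

```python
def build_x265_cmd(source, output, wpp, pool_threads=None, preset="medium"):
    """Build an x265 argument list with WPP switched on or off.

    pool_threads, preset, and the file names passed in are illustrative
    assumptions, not the settings used for the tests in this article.
    """
    cmd = ["x265", "--input", source, "--preset", preset]
    cmd.append("--wpp" if wpp else "--no-wpp")
    if pool_threads is not None:
        # Cap the worker threads this instance may use.
        cmd += ["--pools", str(pool_threads)]
    cmd += ["--output", output]
    return cmd


print(" ".join(build_x265_cmd("input.y4m", "wpp_on.hevc", wpp=True)))
print(" ".join(build_x265_cmd("input.y4m", "wpp_off.hevc", wpp=False, pool_threads=2)))
```

Returning a list rather than a shell string means the command can be passed straight to `subprocess` without quoting problems.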

Testing WPP Under Load
To test this, we ran batches of 1080p30 encodes with and without WPP at various thread counts on a 32-core system, measuring total throughput in frames per second. Figure 3 shows the results. The highest total throughput came from encoding without WPP. The advantage is only about 9%, but it comes with a slight increase in overall VMAF and a 0.75-point increase in the low-frame score.
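A load test like this can be approximated with a small harness that starts every encode in a batch at once and divides the total frames by the wall-clock time. This is a rough sketch, not the harness behind Figure 3. The function names are ours, and it assumes every instance encodes a file with the same known frame count.

```python
import subprocess
import time


def run_batch(commands):
    """Start every command at once, wait for all, return wall-clock seconds."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    exit_codes = [proc.wait() for proc in procs]
    elapsed = time.perf_counter() - start
    if any(exit_codes):
        raise RuntimeError(f"at least one encode failed: {exit_codes}")
    return elapsed


def aggregate_fps(frames_per_file, instances, elapsed_seconds):
    """Total frames encoded across all instances per second of wall-clock time."""
    return frames_per_file * instances / elapsed_seconds
```

Pass `run_batch` one argument list per instance, then feed the elapsed time into `aggregate_fps`. Comparing that number across configurations, rather than the time of any single encode, is what exposes the difference between single-file speed and production throughput.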

The point is simple. Don’t let single-file performance dictate your production settings. The fastest encode on a quiet machine may be the worst choice when the system is loaded.
Figure 4. The most efficient encoding occurs not when CPU utilization is pinned flat at 100% but when it runs at or near the ceiling.
Figure 4 tracks overall CPU utilization during the WPP=0 tests shown in Figure 3. Running 16 instances with two threads each kept overall CPU utilization at or very near 100% and delivered the best overall performance. Next best was 32 instances with one thread each, which flat-lined CPU utilization but delivered 23% lower throughput. The other configurations used much less CPU, with correspondingly lower throughput.
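If you want to run a similar sweep, one simple starting grid is every instance and thread combination that exactly fills the core count. The sketch below lists those splits for a 32-core machine. The article names only the 16x2 and 32x1 configurations, so treating the grid as "instances times threads equals cores" is our assumption, and `core_splits` is a hypothetical helper.

```python
def core_splits(total_cores):
    """Return (instances, threads_per_instance) pairs whose product is total_cores."""
    return [
        (total_cores // threads, threads)
        for threads in range(1, total_cores + 1)
        if total_cores % threads == 0
    ]


for instances, threads in core_splits(32):
    print(f"{instances:2d} instances x {threads:2d} threads each")
```

In practice you would also want to try grids that oversubscribe or undersubscribe the cores, since a thread requested is not the same as a core kept busy.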
The Real Takeaway
The bottom line? What seems like a great configuration option during single-file testing might not be the best choice in production. Your optimal production configuration will vary by resolution, CPU cores, the number of simultaneous instances, thread count, preset, and, as we just saw, whether WPP is enabled or disabled. You will probably achieve the best throughput with a configuration that pushes overall CPU utilization close to 100% but doesn’t flatline it.