Back in 2009, when HP shipped new workstations powered by the Nehalem line of CPUs, the performance boost was so significant that they instantly rendered obsolete workstations based upon previous architectures. Now that Intel has launched Ivy Bridge-based CPUs that triple the core count of early Nehalem-based workstations from four to 12, can we expect similar performance gains? That’s what I’ll explore in this article, which compares the performance of a 12-core (24 with HTT) HP Z800 against a 24/48 core HP Z820 in both editing and streaming encoding functions.
The Z800, which I reviewed in EventDV in 2010, incorporated two 3.33GHz X5680 Xeon processors, with 24GB of RAM running the 64-bit version of Windows 7. The graphics card was an NVIDIA Quadro FX 4800 with 1.5GB of dedicated memory and access to 3.5GB more system memory.
The Z820 (Figure 1, below) includes two 2.7 GHz E5-2697E CPUs, with 64 GB of RAM, also running Windows 7. Graphics is supplied by an NVIDIA Quadro K5000 with 4 GB of video RAM. By virtue of its updated architecture, the Z820 also enjoys a faster system bus than the Z800 (8 GT/sec compared to 6.4 GT/sec), faster memory (1866 MHz compared to 1333 MHz), and one additional memory channel (4 vs. 3), all contributing to a greater maximum memory throughput of 59.7 GB/sec (gigabytes per second) compared to 32 GB/sec for the Z800.
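For readers who want to see where the throughput figures come from, here is a minimal sketch of the arithmetic. It assumes each DDR3 memory channel moves 8 bytes (64 bits) per transfer and treats the quoted memory speeds as millions of transfers per second; the function name is illustrative only.

```python
def peak_bandwidth_gb_per_sec(mega_transfers_per_sec, channels, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in decimal GB/sec.

    Assumes a 64-bit (8-byte) channel, which is standard for DDR3.
    """
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

# Memory speed and channel counts quoted in the article
z800 = peak_bandwidth_gb_per_sec(1333, channels=3)
z820 = peak_bandwidth_gb_per_sec(1866, channels=4)

print(f"Z800: {z800:.1f} GB/sec")  # about 32 GB/sec
print(f"Z820: {z820:.1f} GB/sec")  # about 59.7 GB/sec
```

These are theoretical peaks; sustained throughput in real applications will be lower.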
Figure 1. The Z820 hasn’t changed much on the outside.
Though there are some minor hardware differences inside the box, HP did not retool the enclosure for its latest workstation generation. Put the Z800 next to the Z820 and obscure the product name, and only the most eagle-eyed observer will be able to tell them apart.
All of my tests were rendering tests, and the key question was how much faster the Z820 would perform than the Z800, which I measured as the percentage reduction in rendering time. If the Z800 took 10 minutes to render a project, and the Z820 took five minutes, the Z820 cut rendering time by 50% ((10 min - 5 min) / 10 min). How much of a performance boost is reasonable to expect?
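The percentage-reduction metric can be sketched in a few lines. The function name is my own; the 10-minute and five-minute figures are the example from the paragraph above.

```python
def percent_reduction(old_minutes, new_minutes):
    """Percentage reduction in rendering time relative to the older system.

    A negative result means the newer system was slower.
    """
    return (old_minutes - new_minutes) / old_minutes * 100

# The example above: Z800 takes 10 minutes, Z820 takes 5 minutes
print(percent_reduction(10, 5))  # 50.0
```

Note that a 50% reduction in rendering time means the job runs twice as fast, not 50% faster, which is why the metric is defined against the older system's time.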
Let’s start with simple theory. Since the Z820 has twice as many cores as the Z800, performing the same work in half the time sounds reasonable, making the 50% number seem attainable. However, since the CPUs on the Z800 were about 20% faster (3.33 GHz compared to 2.7 GHz), each Z800 core should operate about 20% faster than a Z820 core, cutting the 50% down to about 40%. On the other hand, on tasks that involve lots of data, like the RED and 4K projects, the greater memory bandwidth of the Z820 should also pay some performance dividends.
So, theory would suggest that the Z820 should cut rendering time by between 40% and 60%, depending upon the task. And this is actually a pretty decent starting point. However, keep in mind that just because there are 48 cores doesn’t mean that all tasks are efficiently split over those 48 cores. As an example, Figure 2 (below) shows the Performance tab of the Windows Task Manager on the Z820 while encoding a single file using the VP6 codec in the Adobe Media Encoder. You can see this view on any Windows computer via the three-finger salute (Ctrl-Alt-Delete), choosing Windows Task Manager, and clicking the Performance tab.
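As a rough back-of-the-envelope check on that estimate, the sketch below assumes perfect multicore scaling, with throughput proportional to core count times clock speed. That idealization and the function name are my assumptions; the core counts and clock speeds are those of the two systems.

```python
def expected_reduction(core_ratio, clock_ratio):
    """Idealized percentage reduction in render time.

    Assumes throughput scales perfectly with cores x clock speed,
    which real applications rarely achieve.
    """
    relative_throughput = core_ratio * clock_ratio
    return (1 - 1 / relative_throughput) * 100

# Twice the cores at the same clock speed
cores_only = expected_reduction(core_ratio=24 / 12, clock_ratio=1.0)

# Twice the cores, but each running at 2.7 GHz instead of 3.33 GHz
with_clock = expected_reduction(core_ratio=24 / 12, clock_ratio=2.7 / 3.33)

print(f"Cores only: {cores_only:.0f}%")         # 50%
print(f"With clock penalty: {with_clock:.0f}%")  # 38%
```

The 38% result lines up with the "about 40%" figure, before accounting for any benefit from memory bandwidth.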
Figure 2. This view of Windows Task Manager makes Intel engineers cry.
Why is CPU utilization so low? Because the VP6 codec is licensed from On2 (or what formerly was On2) and it’s always been highly inefficient from a multiprocessing perspective, meaning that it doesn’t make efficient use of additional CPU cores when available. That’s largely because VP6 was developed before multicore computers were widely available, and was put out to pasture before it made sense to update the code to take advantage of multiple cores.
None of my tests involved outputting to VP6. The high-level point is that multicore efficiency varies from program to program, and even task to task within a program. If a task is particularly inefficient from a multicore perspective, the faster clock speed on the Z800’s CPUs (3.33 GHz) would be a bigger advantage than the extra cores on the Z820 with the slower CPU speed (2.7 GHz). In addition, even when a program does effectively split operation over multiple cores, this involves some overhead and management, which poaches resources away from the rendering or other operation taking place.
For all these reasons, it’s not surprising when a particular program, or function within a program, doesn’t come close to harvesting the theoretical performance benefits the additional cores would seem to make available. This is particularly so with applications such as Premiere Pro, which uses a range of third-party codecs to work with DV, HDV, AVCHD, and the alphabet soup of other codecs presented by the various input formats. Since a program can never be faster than its slowest operation, if these codecs are inefficiently written, they can slow the entire operation.
OK, now that our expectations are set, let’s move on to our tests.
Editing and Rendering Tests
To test editing and rendering, I created 29 test sequences in Adobe Premiere Pro (version 7.0.1-105) from different camera formats, from DV to 4K RED, which I rendered into multiple formats, including MPEG-2 for DVD and Blu-ray, and H.264 for Blu-ray and for uploading to services like YouTube. In terms of workflow, I created the sequences in Premiere Pro and rendered them in the Adobe Media Encoder (AME), which is the typical editing and output workflow.
The tests include a range of real-world projects that I produced and synthetic tests that I pulled together from footage shot in formats that I don’t typically produce with. I also liberally used projects and content provided by Adobe with their Creative Suite product launch materials, since they typically use more advanced formats than I do, with a more indie movie-type feel than the event-type of productions that I typically produce. Though a couple of the Adobe projects are effects and layer-laden, for the most part the projects are relatively straightforward, real world-type projects, not strewn with extra effects and layers to maximize the potential impact of the new hardware capabilities.
My analysis involved two series of tests. I ran the first series of tests in dedicated mode, with no other operations running on the computer. Then I ran a second series of tests with Adobe Encore producing an H.264-encoded Blu-ray Disc in the background during the entire rendering cycle. Within these two test scenarios, I ran every test twice—once with hyperthreading (HTT) enabled, once with HTT disabled—and used the faster of the two results in my comparisons. In general, the Z820 performed faster with HTT disabled, while the Z800 performed faster with HTT enabled.
Table 1 (below) presents the results by format. The first column, No Blu-ray, represents the results of the comparison tests performed in dedicated mode; if you edit and render in dedicated mode, the first column is for you. The second column shows the results when a Blu-ray Disc project was rendering in the background; if you frequently edit with CPU-intensive tasks running in the background, the results in the second column will be more relevant.
Table 1. Results by format, with and without Blu-ray rendering in the background.
Using Excel’s conditional formatting feature, I color-coded the results to show a red background when the Z820 was slower than the Z800, yellow when the Z820 was less than 30% faster than the Z800 and green when the Z820 was 30% faster than the Z800 or higher. These numbers are obviously arbitrary, but the colors do tend to separate the winners from the losers.
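For anyone who wants to apply the same color bands outside Excel, the thresholds can be expressed as a small function. This is only a sketch: the function name is my own and the sample values are hypothetical, not results from Table 1.

```python
def color_for(reduction_pct):
    """Map a percentage reduction in rendering time to the table's color bands."""
    if reduction_pct < 0:
        return "red"      # Z820 slower than the Z800
    if reduction_pct < 30:
        return "yellow"   # Z820 less than 30% faster
    return "green"        # Z820 30% faster or more

# Hypothetical sample values, for illustration only
for value in (-5, 12, 30, 45):
    print(value, color_for(value))
```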
Time and space don’t allow for a comprehensive discussion of the projects and project types, but let me touch on the outliers. The MPEG-2 project is a real-world project with source video created by a TriCaster that I used to mix the live production. While this probably isn’t a significant use case for many producers, it’s interesting that the MPEG-2-based XDCAM EX codec was also a comparatively lackluster performer on the Z820. That said, HDV-based projects, which included three real-world projects, did seem to benefit from the extra capabilities of the Z820, as did XDCAM HD.
The DVCPRO test sequence came straight out of an Adobe Creative Suite sample project, and is of the indie film genre (short snippets of multiple cameras), rather than event (a simple multicamera mix). While there probably aren’t a lot of DVCPRO HD producers out there any longer, this format didn’t seem particularly efficient from a multicore perspective.
The DSLR projects comprised a ballet audition I produced while reviewing the Canon 7D, plus two synthetic projects created from Canon 5D videos included in Adobe’s Creative Suite demo projects. I would have expected to see more benefit with these projects, but when I ran Windows Task Manager to test CPU utilization on the Z820, it hovered around 40%, perhaps indicating that the decoder used by these formats was not that efficient.
The RED and other 4K formats not only benefited from the extra CPU cores (see Figure 3), but also from the enhanced capability for data I/O, particularly with other CPU-intensive tasks running in the background. As you move into 4K production, it appears that multiple cores could deliver some significant benefit.
Beyond these musings, what should you take from this diverse data set from disparate projects built using different source formats? First, don’t assume that extra cores on a new computer will deliver a by-the-numbers, proportional benefit. With some formats and projects they may; with others they might not.
Before buying a new computer to speed rendering and performance, run Windows Task Manager while rendering on your current computer and see how efficiently your primary applications are using your existing cores. If it looks like Figure 2, throwing more cores at the problem could actually slow things down. If it looks like Figure 3 (below), which reflects Adobe Media Encoder rendering a RED project, there’s a good chance that your investment in additional cores will be rewarded by significant performance improvement.
Figure 3. This view of Windows Task Manager makes Intel engineers jump for joy.
Encoding for Streaming
In my streaming encoding trials, I tested three programs: Adobe Media Encoder, ProMedia Carbon, and Sorenson Squeeze. You can see the results in Table 2 (below), which uses the same color-coding as Table 1. That is, green if the benefit is 30% or greater, yellow if between 0% and 30%, and red if less than 0%. As you can see, the Z820 delivered significant time savings in all programs.
Table 2. Comparative results from streaming encoding programs.
All tests involved rendering one or more files to one or more streaming formats. By design, the tests varied by program, using different quantities of source files and different output presets. So you should not draw any conclusions about comparative performance from these results, only how each program performs on a more capable computer.
Note that all three tests were many-to-many trials where I encoded multiple source files to multiple outputs. If you’re encoding a single file to a single output, don’t expect the same benefit; more on this in the sidebar.
Since the tests varied by program, I’ll describe them individually. With Adobe Media Encoder, I encoded six 3-4 minute 720p source files to six canned presets supplied by Adobe, which simplified producing the identical test on both computers. Perhaps one of the best-kept secrets about Adobe Media Encoder is that when you assign multiple H.264 presets to a single file, it encodes them in parallel, which obviously makes good use of the additional cores (Figure 4, below).
Figure 4. Adobe Media Encoder rendering six outputs from a single test input.
Let me take a moment to distinguish this set of tests from those reported above. In the editing-related output tests, I rendered sequences from various Premiere Pro projects in Adobe Media Encoder to a single-output preset. Accordingly, a substantial portion of the encoding time related to rendering the project, then transcoding to the selected preset. In these tests, I input multiple disk-based files, not projects, so all encoding time related solely to transcoding that file to the selected presets.
Our next test involves ProMedia Carbon, a $5,000 batch-encoding program from Harmonic that has proven itself one of the most efficient consumers of multiple cores that I’ve ever seen, essentially redlining CPU utilization on every computer that I’ve ever tested. In these tests, I encoded six 3-4 minute 720p files to 11 presets that bore no relation to those used in Adobe Media Encoder or Squeeze. Not surprisingly, Carbon efficiently leveraged the Z820 to a 42% reduction in encoding time. If you’re running Carbon on an older computer and need more throughput, don’t buy another copy of Carbon (sorry, Harmonic); buy a computer with additional cores.
Sorenson Squeeze is a $749 desktop encoding product from Sorenson Media, and to test Squeeze, I rendered six 3-4 minute 720p files to 9 different encoding presets. One little-known feature of Sorenson Squeeze is that you can open multiple instances of the program to improve encoding efficiency on multicore computers. On Windows computers, the procedure is simple: You start the program as you normally would, either via the Start menu or an icon on your desktop, and then do the same to open another instance. On the Mac it’s a bit more challenging, but you can learn about the procedure in an article on StreamingLearningCenter.com.
To produce the results shown in Table 2, I opened three instances of Squeeze, each encoding two source files to eight outputs. In this mode of operation, Squeeze definitely put the additional resources to good use.
How should these results affect your CPU selection? I’ll discuss that after the next section.
As with many workstations, you can configure the Z820 with CPUs of different speeds. Choosing the optimal CPU depends upon the type of activity and whether it’s performed standalone or with other CPU-intensive applications running in the background. To flesh this out, HP supplied three sets of CPUs for the Z820, as shown in Table 3 (below). Note that the clock speed of all three CPUs is roughly equivalent, so the comparisons really focus on the value of the additional cores.
Table 3. Three CPUs tested in the Z820
The tests that I performed for this analysis were similar to those performed above, but not identical, primarily because I couldn’t get Episode Engine running on the Z800. Figure 5 (below) shows the results for rendering Premiere Pro projects within the Adobe Media Encoder. The numbers show the cumulative rendering time for all projects in the trial, with dedicated rendering on the left, and rendering while producing a Blu-ray Disc in the background on the right.
Figure 5. Testing CPUs while rendering with Adobe Media Encoder
For those editing in dedicated mode (on the left), the 2680 V2 produced about 12% faster performance than the older 2680 V1, but the 2697 V2 delivered no further performance improvement at all. On the right in Figure 5, with a Blu-ray Disc being produced in the background, the 2680 V2 delivered about 14% better performance than the V1, and the 2697 V2 added another 5%. Given these results, in both types of applications, I would choose a faster CPU over one that delivered additional cores.
Figure 6 (below) shows the results achieved with the batch-encoding programs when encoding multiple files to multiple outputs. As you can see, Carbon was the clear star of the show, with performance improvements that were nearly proportional to the additional cores. For example, the 2680 V2-equipped system had 25% more cores than the 2680 V1 system, and Carbon delivered a 22% speed improvement. With 48 cores, the 2697 V2-equipped system had 50% more cores than the base 2680 V1 system; here Carbon delivered a 46% speed boost.
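One way to put a number on "nearly proportional" is to divide the speed improvement by the increase in core count. The sketch below does that with the Carbon figures quoted above; the function name and the idea of expressing this as a single efficiency ratio are my own illustration, not a metric used in the tests.

```python
def scaling_efficiency(extra_cores_pct, speedup_pct):
    """Fraction of the added core capacity that showed up as a speed improvement."""
    return speedup_pct / extra_cores_pct

# Carbon results quoted above, relative to the 2680 V1 system
v2_2680 = scaling_efficiency(extra_cores_pct=25, speedup_pct=22)
v2_2697 = scaling_efficiency(extra_cores_pct=50, speedup_pct=46)

print(f"2680 V2: {v2_2680:.0%}")  # 88%
print(f"2697 V2: {v2_2697:.0%}")  # 92%
```

A ratio near 100% suggests the encoder is keeping the extra cores busy; a ratio well below that points to clock speed as the better investment.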
Figure 6. Testing many-to-many encodes with the listed programs.
Though Squeeze and Episode Engine also benefited from the additional cores, the performance enhancement didn’t conform as proportionally to the increase in processor power as with ProMedia Carbon. However, note that all programs achieved much less benefit in one-to-many trials, where a single file was encoded to multiple outputs (Figure 7, below). Since only a single file was involved, it was much more difficult for all three programs to load balance and keep all CPU cores active.
Figure 7. Testing one-to-many encodes with the listed programs
In a server farm environment, which will be continuously encoding files throughout the day, I would prioritize the number of cores over clock speed, particularly with an encoder as efficient as ProMedia Carbon. On the other hand, for sporadic encoding, I would prioritize clock speed over the number of cores.