In 2009, HP launched its Nehalem-based Z line of workstations, which started with three models: the low-end Z400, the mid-range Z600, and the high-end Z800; later it was supplemented by the entry-level Z200. I had a look at the Z400, a single-CPU, quad-core machine, and the Z800, a dual-processor, quad-core system. Now HP is updating its workstation line to incorporate Intel’s new Westmere processor, which uses 32nm manufacturing technology to fit six cores on each CPU. HP sent me one of the first dual-processor, six-core Z800 systems off the line, and I had about 2 days to run it through its paces on various digital video-processing tasks.
My test system shipped with two 3.33GHz Xeon X5680 processors and 24GB of RAM, running the 64-bit version of Windows 7, which I like heaps better than Vista. The graphics card was an NVIDIA Quadro FX 4800 with 1.5GB of dedicated memory and access to 3.5GB more of system memory. With a 15K-rpm 150GB system drive and a 1TB video drive, the total system price was a little more than $12,000, though Z800 prices start at $1,799.
As you may know, Nehalem succeeded the Core line of CPUs and features an integrated memory controller that delivers faster access to system memory. It also features the QuickPath Interconnect (QPI), a much faster replacement for the front-side bus that moves data directly from CPU to CPU and from each CPU to the chipset that connects it with other system components. This means faster transfer to and from your hard drive, graphics card, and other components. The other addition was the return of Hyper-Threading Technology (HTT), which adds components of a second processor core to each physical core; as a result, Windows Task Manager shows 24 active cores on the 12-core system when HTT is enabled.
Beyond the two additional cores on each processor (four additional threads with HTT enabled), hardware changes to the Z800 are relatively minor. The most significant is that there are now 12 memory slots instead of eight, configured in two banks of six, one bank for each CPU. This matters because Westmere’s (and Nehalem’s) integrated memory controller has three channels and works most efficiently when RAM is evenly distributed across all of them, which you couldn’t do with eight memory slots. It also frees you to use cheaper memory, such as 2GB modules, to reach 24GB of total memory; with the previous design, you’d have to use 4GB modules, which are more expensive. The Z800 also now has FireWire on the motherboard, which saves you a slot.
The Z800 uses the same tool-less case that’s immortalized in my daughter Rose’s YouTube video (http://www.youtube.com/watch?v=VJ5KJVCc5C4). At this writing, the video had racked up 11,167 views, which is astounding when you consider that it shows an 8-year-old opening the computer; taking out the hard drive, memory fan, and power supply; and then reassembling the computer and turning it on (yes, it did boot). So watch the video, and you’ll know all you need to know about the case.
Which brings us to performance. To assess this, I ran a series of benchmarks against the outgoing Z800, which has only eight cores, though they run at the same 3.33GHz as the new unit’s; it was configured with 24GB of RAM and ran the 64-bit version of Windows Vista. The older Z800 was equipped with NVIDIA’s Quadro FX 3800, though I’m guessing that the graphics card had little to do with its overall performance.
It would have been nice to benchmark using the same operating system, but time didn’t permit an upgrade to Windows 7 on the old Z800. Note that I’ve run some comparative benchmarks between Vista and Windows 7 for Digital Content Producer that you can read at http://bit.ly/cFmlkq. Overall, between real-world projects and synthetic tests, Windows 7 was anywhere from 1% to 13% faster, so at least some of the speed boost shown in the accompanying table is attributable to the operating system as well as hardware.
Let’s set our expectations before looking at the benchmarks. In theory, adding four cores to an eight-core system should increase performance by 50%. However, that assumes that the task at hand can be efficiently split among all available cores. As the expression goes, even if you assign nine women to the task, they can’t have a baby together in a single month; some tasks just can’t be split. In addition, anytime you split a task over multiple cores, you inject some overhead into the system, since the program has to allocate the subtasks and track the results. So seldom, if ever, do you see a directly proportional drop in rendering time when you add cores to the problem.
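This intuition is the familiar Amdahl’s law (my framing, not something from the tests themselves): if only a fraction of an encoding job actually parallelizes, the benefit of moving from eight to 12 cores shrinks quickly. Here’s a minimal sketch in Python; the function names are mine:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: overall speedup on n cores when a fraction p
    of the work runs in parallel (the rest stays serial)."""
    return 1.0 / ((1.0 - p) + p / n)

def time_reduction(p, old_cores, new_cores):
    """Fractional drop in render time when moving from old_cores to new_cores."""
    return 1.0 - amdahl_speedup(p, old_cores) / amdahl_speedup(p, new_cores)

for p in (1.0, 0.9, 0.5):
    print(f"{p:.0%} parallel: render time drops {time_reduction(p, 8, 12):.0%}")
```

Note that even a perfectly parallel job (p = 1.0) cuts render time by only 33% when you go from eight to 12 cores, and a 90% parallel job by only about 18%. Keep those ceilings in mind when reading the results that follow.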
Results also vary by program, sometimes bizarrely so. For example, On2 Flix Pro, a very popular VP6 encoder, is a relentlessly single-threaded program and, assuming the same processor speed (i.e., GHz), will encode at the same speed on a single-core or a 12-core system. I tested with Adobe Premiere Pro and the Adobe Media Encoder CS4, which is relatively efficient from a hyper-threading perspective, but results will vary based upon input and output formats.
Why? Adobe licenses most codec-related functions from a third-party vendor, which means that Adobe Media Encoder “calls” a particular codec to handle a discrete encoding chore. When it’s time to encode into MPEG-2 format, for example, Adobe Media Encoder will hand off video data to the third-party module and say—in zeros and ones, of course—encode this. If that codec is efficient at multithreaded operation, the extra CPUs help; if not, they don’t.
In addition, regarding Hyper-Threading Technology: sometimes having those 12 extra logical cores speeds up the task, and sometimes it doesn’t. That’s why HP lets you turn HTT on or off in the system setup screen. For my tests, I ran all tasks with HTT enabled and disabled and used the fastest time for each system.
With this as a prologue, let’s begin.
When testing computers, I use a range of real-world and synthetic tests. Real-world tests, of course, are actual projects that I produced over the past few months, while synthetic tests are those created to test performance with a specific format or operating scenario. For example, I haven’t produced any projects with a RED or DVCPRO HD camera, so I used footage that I shot or downloaded to test performance with those formats.
My first real-world test was a simple HDV shoot and output to H.264 for upload to YouTube: my wife’s ballet company dancing to “Thriller” in downtown Galax, Va., on a rainy Halloween night. The project was 4:35 (minutes:seconds) long. As you can see in Table 1 (below), the eight-core system produced the necessary H.264 file in 6:15, while the 12-core system produced it in 4:54, a reduction of 22%.
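If you want to check my math against the tables, the savings percentages are simple to reproduce. Here’s a quick sketch (the times are from my results; the helper names are mine):

```python
def to_seconds(timecode):
    """Convert 'h:mm:ss' or 'mm:ss' to total seconds."""
    total = 0
    for part in timecode.split(":"):
        total = total * 60 + int(part)
    return total

def savings_pct(old, new):
    """Percent reduction in render time, rounded to the nearest whole percent."""
    old_s, new_s = to_seconds(old), to_seconds(new)
    return round(100 * (old_s - new_s) / old_s)

print(savings_pct("6:15", "4:54"))  # "Thriller" HDV project: prints 22
```

The same formula reproduces the other figures in this article, such as 48:05 down to 39:02 for a 19% savings.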
The second project was a single-camera DV shoot, also output to H.264 for upload to YouTube: an audition video for America’s Got Talent. The video was 7:28 long, and the 12-core system rendered the project 44% faster than the eight-core system.
The next project involved Camtasia-based screencam footage mixed with MPEG-2 video shot with the JVC HM700 and output to VP6 format. The finished video was 3:20 in duration; it took the eight-core system 3 minutes to render, while the 12-core system finished in 1:27, a savings of 52%.
Next up was the 2008 Nutcracker performance, a two-camera HDV shoot mixed down to SD DVD. I timed how long it took to produce the MPEG-2 and PCM files that I would then import into Encore. The video was an hour and 40 minutes long; the eight-core computer produced the files in 48:05, and the 12-core Z800 produced them in 39:02, a savings of 19%, similar to the 22% we saw with the “Thriller” project, which also used HDV source.
Let’s put some perspective on these numbers for those still working on pre-Nehalem systems. I’ve used this same Nutcracker project for testing for 2 years. When I first tested the eight-core Z800 back in April 2009, I compared it to my previous performance champ, a 3.3GHz dual-processor, quad-core HP workstation (64-bit Windows Vista, 16GB of RAM), which finished rendering in 122 minutes. The 3.2GHz Z800 system (running 64-bit Vista) brought encoding time down to 59:30, which was quite dramatic, and now we’re down to less than 40 minutes. To make a long story short, if you’re shooting HDV and haven’t stepped up to a Nehalem-based system, you should give the Z800 or another Z-family member some serious thought.
In addition, while working with this multicam project, I dropped into multicam mode to get a feel for playback smoothness. To explain, with many computers, I often achieve less than full-frame playback in multicam mode, where Premiere is retrieving and playing up to four videos simultaneously. Even with two- or three-camera shoots, I often seem to get about 5-10 fps playback in the preview windows, which is sufficient for editing but still distracting. With the two-camera Nutcracker project, multicam performance was clearly at full-frame rate, so I loaded my synthetic HDV test project, which has four camera angles. That was also silky smooth. Again, if you’ve been frustrated by the performance of Premiere’s multicamera window in the past, the 12-core Z800 is the perfect antidote.
The final project involved encoding a 1-hour concert into Blu-ray Disc-compatible H.264 format. The source material was MPEG-2 from the NewTek TriCaster system; this was the first set of a concert that I recorded with the TriCaster. Here, the 12-core system was fabulous, dropping rendering time from a heartbreaking 3:02 (hours:minutes) to a much more palatable 1:44, a savings of 43%. Clearly, this was the most CPU-intensive project of the group, and the 12-core system really shined.
One note before we move on to the synthetic projects. I shoot in AVCHD a lot, but with larger, multicamera projects, I capture in Final Cut Pro, which converts the AVCHD video to ProRes. I then either edit in Final Cut Pro or take the ProRes capture file and edit in Premiere Pro on either the Mac or Windows. I had two real-world project files ready to queue up and encode, but Premiere Pro kept crashing on the Windows 7 system, and QuickTime wouldn’t play the ProRes QuickTime movie. I hunted around for a fix, but it didn’t seem like anyone else was having this problem. The upshot: I have no real-world AVCHD projects to share.
Table 2 (below) shows the results of all the synthetic tests. The first synthetic project was 1 minute long, with three layers of DVCPRO HD source footage: one base layer, one rotating PiP, and one green screen processed in Adobe After Effects via Dynamic Link. Rendering for preview with the 12-core system was about 50% faster than with the eight-core system.
The second test involved a similar 1-minute project with RED footage: one base layer, one rotating PiP, and one green screen via After Effects. Here the time savings over the previous-generation Z800 was only 17%. But I checked back on the tests I had performed with the first-generation Z800 and saw that my eight-core xw8600 (a pre-Nehalem system) had rendered the same project in 29:15. A 17% saving may not be enough to make you upgrade from an eight-core Z800 to a 12-core one. But if you’re working with a pre-Nehalem system, now is the time.
Then came the AVCHD tests. With my 1-minute green-screen/PiP project, the 12-core system was 55% faster than the eight-core system. When rendering a 10-minute file to H.264 Blu-ray, it was 62% faster. These are almost certainly best-case results since I’m encoding to the most computationally intense format, but they certainly bode well for all AVCHD producers.
I ran a quick test with some footage from a Canon EOS 7D, which produced the only counterintuitive result in the overall review. Specifically, the 12-core Z800 took 8:29 to produce the file, while the eight-core system took 8:17. I checked CPU utilization while encoding and noticed that only one core out of the 8 or 12 (or 16 or 24 with HTT) was steadily working while this file was being produced, indicating that there were bottleneck processes that weren’t multithreaded. I’m not sure why the eight-core system was faster, but these results raise a caution flag if you’re an EOS 7D producer.
Overall, I was pleasantly surprised by how many test scenarios showed the extra cores on the Z800 really paying off, and that situation will only improve over time as more 12-core systems get into the hands of developers and the NLE manufacturers optimize their revisions to take full advantage of these systems’ capabilities. There have definitely been times in the past when buying a system with multiple cores meant betting on some future benefit rather than reaping an immediate reward. With the Z800, you can bet on the now: the 12-core system paid immediate dividends in nine out of 10 tests, with six of those tests showing performance improvements of 43% or higher.