Although the 45 nm quad-core Opteron was the best server CPU at launch, a few months later AMD’s success was washed away by a tsunami called “Nehalem”. The Nehalem architecture combined subtle tweaks to an already superior integer engine with brute force tactics such as a triple-channel integrated memory controller (IMC). The IMC delivered low latency and massive amounts of bandwidth thanks to the highest clocked DDR3 DIMMs. But that was not enough for the ambitious Intel engineers. They added Simultaneous MultiThreading (SMT), and this was the final blow to any competition left standing in the server market. SMT, or Hyper-Threading as Intel calls it, boosted performance by 30% and more in key applications such as SAP, Oracle and MS SQL Server. The end result is that the current Xeon outperforms AMD’s best CPUs by 60 to 85%! That is historic, as Intel has never had such a commanding lead since AMD entered the market with its Athlon MP.
One could start debating some of the details of these benchmarks, but that would mostly be splitting hairs. Yes, these scores were obtained with DDR3-1333, while the vast majority of X55xx servers are equipped with DDR3-1066. And yes, power consumption of the fastest Xeons is about 20W higher per CPU than that of the “Shanghai” Opterons. So to compare in the same power range, you should compare with the E5540 at 2.53 GHz. But even with DDR3-1066 and at 2.53 GHz, the latest Xeon would, roughly estimated, outperform AMD’s best quad-cores by 40 to 70%. The lead is even higher in bandwidth-intensive applications. Only in the pretty rare dense matrix applications, with Linpack being the most popular benchmark, could AMD still make a point: it can deliver the same number of gigaflops at lower power consumption and a lower price. Nice, but we are talking about 1% of the applications on the market. The other ray of hope for AMD was the competitive performance that the Opteron 2389 2.9 GHz delivered on ESX 3.5 in our virtualized benchmark vApus Mark I. But with ESX 4.0, the new Xeon “Nehalem” should widen the gap again thanks to better Hyper-Threading support and the fact that EPT is fully supported in the latest ESX hypervisor. AMD’s next generation CPU is scheduled to appear in 2012, so it looks like AMD will have to leave the high-end and midrange server CPU market to Intel. Unless…
Ever since the introduction of the 45 nm CPUs, AMD has been executing very well. So well, even, that it reminds us of the K75 times. You might remember how in October 1999, AMD introduced the “K75” in 180 nm and sped up the “x86-Alpha” to 1 GHz in March 2000, only 5 months later. It has indeed been 10 years since AMD executed this well. Only six months after the successful launch of its 45 nm quad-core, AMD is rolling out its hex-core “Istanbul” at 2.6 GHz, well ahead of schedule. It is basically a “Shanghai” Opteron with 2 extra cores and a slightly tweaked memory controller. What is more impressive, though, is that AMD is capable of launching a hex-core at 2.6 GHz today, a CPU that consumes only a few watts more than the six-month-older quad-core at 2.7 GHz. Well done, AMD. But should the IT professional care about AMD’s new six-core? In which applications does it make sense to consider an “Istanbul” based server? Are two extra cores enough to bring AMD’s Opteron back onto the spec sheet of your next high performance server?
Do Six Cores Make Sense?
The question is not theoretical. When Intel launched its hex-core “Dunnington”, quite a few applications did not make good use of it. The quad-socket “Istanbul”-based servers will face the same problems as “Dunnington”: some server applications prefer “2^n cores” (a power-of-two core count), a few will not scale above eight cores and many will not get past 16 very successfully. Yes, even in the server world, quite a few applications do not scale well beyond 8-16 cores. Mail servers, web servers and even some databases may be in that situation. If your database takes a lot of locks on the same data, lock contention will kill your performance once you get beyond a certain number of cores. Rendering applications are another group that starts to show diminishing returns with more than 8 cores. It is pretty likely that clustering dual-socket quad-cores makes more sense than adding more cores to the same machine.
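To make the diminishing returns concrete, here is a minimal sketch based on Amdahl's law. It is not a measurement: the 5% serial fraction is an assumption picked purely for illustration, and `amdahl_speedup` is a helper name of our own. Real lock contention is usually worse than this model suggests, since it can make throughput drop rather than merely flatten.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumed for illustration only: 5% of the work is serialized (e.g. by locks).
SERIAL_FRACTION = 0.05

for cores in (4, 6, 8, 12, 16, 24):
    speedup = amdahl_speedup(cores, SERIAL_FRACTION)
    efficiency = speedup / cores  # fraction of each core doing useful work
    print(f"{cores:2d} cores: {speedup:5.2f}x speedup, {efficiency:4.0%} efficiency")
```

Under this assumption, no number of cores can push the speedup past 20x (1 / 0.05), and each added core contributes less than the one before it.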
But the six-core “Istanbul” CPU has advantages too. The Nehalem Xeon offers 8 logical cores, but the two threads on each core have to share the 32 KB L1 and the tiny 256 KB L2. Istanbul can work with “only” 6 threads, but each thread gets a 64 KB L1 and, in comparison, a copious 512 KB of L2. In a nutshell, it is clear that the new AMD “Istanbul” Opteron targets a specific market: a few compute-intensive HPC applications, large databases and, most importantly, “heavy” virtualized workloads. The reason why we say “heavy” is that the six-core is a drop-in replacement for the current quad-core Opterons. That means that the memory capacity of the servers based on the new six-core will probably be the same. If you are consolidating lots of light loads together, you are likely to run into memory limits before you run into processing power limits.
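The cache-per-thread comparison above comes down to simple arithmetic, sketched below. The cache sizes and thread counts are the ones quoted in the text; the even split between the two SMT threads is a simplifying assumption (in practice the threads compete for the cache dynamically), and the `CoreCaches` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreCaches:
    name: str
    cores: int
    threads_per_core: int
    l1_kb: int  # L1 size per core, as quoted in the text
    l2_kb: int  # L2 size per core, as quoted in the text

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    def per_thread_kb(self, cache_kb: int) -> float:
        # Simplifying assumption: threads on one core split its cache evenly.
        return cache_kb / self.threads_per_core


nehalem = CoreCaches("Xeon (Nehalem)", cores=4, threads_per_core=2, l1_kb=32, l2_kb=256)
istanbul = CoreCaches("Opteron (Istanbul)", cores=6, threads_per_core=1, l1_kb=64, l2_kb=512)

for cpu in (nehalem, istanbul):
    print(
        f"{cpu.name}: {cpu.threads} threads, "
        f"{cpu.per_thread_kb(cpu.l1_kb):.0f} KB L1 and "
        f"{cpu.per_thread_kb(cpu.l2_kb):.0f} KB L2 per thread"
    )
```

With both SMT threads busy, each Nehalem thread effectively sees a quarter of the L1 and L2 capacity that an Istanbul thread has to itself, which is why cache-hungry workloads are the interesting test case.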
40 Comments
duploxxx - Wednesday, June 3, 2009 - link
ESX 4 should add IOMMU to the AMD Istanbul platform, not sure how far this is implemented in the beta esx4 builds. Are you using the paravirtualization scsi driver in the new esx4 platform? I would expect bigger differences between 3.5 and 4 and not just because EPT is included in esx4 together with enhanced HT.
for the rest very good thorough review.
The only thing I always miss in reviews is that although it is good to test the fastest out there, it is nowhere near the most deployed platform. You should rather look at the 5520-5530 against 2387 - 2431 as the mid range platform that will be deployed in a wide range of systems; this will have a much healthier performance/price/power balance than the top bin. Even the 5570 is not supported in all OEM platforms for the TDP range.
Adul - Monday, June 1, 2009 - link
I do not see oracle running on top of windows all that often. It is normally running on some *nix OS. How about running the same benchmark on say RHEL instead?
InternetGeek - Monday, June 1, 2009 - link
There's actually an odd bug on Oracle's DB that makes it run faster on Windows than on Linux. Search on the internet and you'll find info about it.
On the other hand, in my now 9 years in the IT industry I've only come across one Oracle DB running on HP-UX. Everything else (Sybase, MySQL, etc) runs on Windows.
LizVD - Friday, June 5, 2009 - link
Could you provide us with a link for that? I'd like to see if this "bug" corresponds with the behaviour we're seeing on our tests.
Nighteye2 - Monday, June 1, 2009 - link
You give a good description of how it works and how it has so much benefit, but then you benchmark only dual-socket servers?
It would be fairer to also test and compare octo-socket servers - to see the real impact of that HT assist feature.
phoenix79 - Monday, June 1, 2009 - link
Completely agreed (I was typing up a comment about this too when yours popped up)
I'd love to see some 4-way VMWare scores
ltcommanderdata - Monday, June 1, 2009 - link
Yes. Nehalem is in a great position in the DP market, but isn't yet available in MP. It'd be great to see six-core Dunnington and six-core Istanbul go head to head. Conveniently their highest models have similar clock speeds at 2.66GHz and 2.6GHz respectively although Dunnington would be a lot more power hungry and although I don't remember their prices, probably more expensive too.
JohanAnandtech - Tuesday, June 2, 2009 - link
Dunnington vs Istanbul coming up ... But we are going to take some time to address the shortcomings of this "deadline" article such as better power consumption readings.
solori - Monday, June 1, 2009 - link
"Notice that HT-assist is a performance killer in 2P configurations: you remove two times 1 MB of L3-cache, which is a bad idea with 8 VM’s hitting your two CPUs."
BIOS guidance suggests that HT Assist be disabled by default on 2P systems, and enabled only for specialized workloads. So that begs the question: Were vAPUS tests performed with or without HT Assist in the 2P configuration? It was not clear.
I assume AMD-V and RVI were enabled for ALL workloads in ESX 3.5 and 4.0 (forced for 32-bit workloads.) Is this accurate? Based on the number of ESX 3.5 installations out there, this probably should be clearly stated...
I do want to take issue with your memory sizing and estimates on vCPU loading. Let me put it this way: while Nehalem-EP has better memory bandwidth and SMT threads, Opteron has access to abundant memory. Therefore, it does not make sense - for example - to be OK with enabling SMT but then constrain the benchmark to 24GB due to a Xeon memory limitation.
I would urge you to look at 48GB configurations on Xeon and Istanbul for your comparison systems. By the way, in consolidation numbers, this makes a significant reduction in $/VM with only a minor increase in per-system CAPEX.
Another interesting issue you touched on is tuning and load balance. Great job here. These are "black magic" issues that - as you noted - can have serious effects on virtualization performance (ok, scheduling efficiency.) Knowing your platform's balance point(s) is critical to performance sensitive apps but not so critical for light-load virtualization (i.e. not performance sensitive.)
It sounds like you're learning - through experimentation with vAPUS - that virtualization testing does not predict similar results from "similarly configured machines" where performance testing is concerned. In fact, the "right balance" of VM's, memory and vCPU/CPU loading for one system may be on the wrong side of the inflection point for another.
All in all, a very good article.
JohanAnandtech - Tuesday, June 2, 2009 - link
"this probably should be clearly stated... "
Good suggestion. I adapted the article. RVI and EPT are always on if possible (so also 32 bit). HT-assist is always on "Auto" (so off) unless we indicate otherwise.
"Therefore, it does not make sense - for example - to be OK with enabling SMT but then constrain the benchmark to 24GB due to a Xeon memory limitation. "
1) You must know that vApus Mark I uses too much memory for the webportals. They can run without any performance loss in 2 GB, even 1 GB. So as we move up on the number of tiles we run, it is best to reclaim the wasted memory.
2) I agree that a price comparison should include a copious amount of memory (48 GB or so).
3) We don't have more than 24 GB DDR-3 available right now. It would be unfair to force the system to swap in a performance comparison.
"Opteron has access to abundant memory". What do you mean by this? Typical 2P Opterons have 64 GB, 2P Nehalems 72 GB as upper limit?
"In fact, the "right balance" of VM's, memory and vCPU/CPU loading for one system may be on the wrong side of the inflection point for another"
Great comment. Yes, that makes it even more complex to compare two systems. That is why we decided to show 2 datapoints for the 2 tile systems.
Collin, thanks for the excellent comments. It is very rewarding to notice that people take the time to dissect our hard work. Even if that means that you find wrinkles that we have to iron out. Great feedback.