Friday, September 09, 2005

Niagara: UltraSPARC's Viagra?

I like Sun's SPARC and UltraSPARC processors. There's something about them that feels inherently "right". But, let's be honest, their price/performance is lacklustre to say the least. All right, it sucks. There, I've said it.

Their scalability is excellent, as the success of Sun's 100+ CPU SF25K behemoths illustrates. And that's important because sometimes, for certain workloads, straight-line performance isn't everything. You wouldn't use a Formula 1 car in a rally, for example.

But many workloads are highly or even embarrassingly parallel. Take a web server: each "hit" is potentially a different thread. Super-fast CPUs like AMD's Opteron handle such workloads by brute force, being really quick but inefficient. That is, their raw speed masks the fact that today's CPUs are *much* faster than the memory systems to which they're connected, so even the fastest CPU spends a lot of time twiddling its electronic thumbs, waiting for memory (be it main RAM or cache). The trouble is that while these fast CPUs are stalled, waiting for data or instructions to be fetched from memory, they're doing nothing. Well, nothing useful: they still eat up electricity. Today's multi-core chips (e.g., UltraSPARC-IV and the latest Opterons) suffer from the same problem, although each stall is confined to one core (i.e., a stall in one core has no effect on the other).

While Intel was playing the GHz game, Sun's engineers recognised that a better route to higher performance--or rather, more throughput--was to tackle the problem differently. Rather than endlessly ramping up the clock speed for smaller and smaller gains, they asked: when the CPU stalls because it is waiting for memory, why not switch it to a different thread? Sun's research concluded that a CPU spends about 75% of its time waiting for memory, so a CPU with four threads could conceivably be kept busy much more of the time: while three threads are waiting for memory, the fourth is running. (I guess an analogy would be the difference between switching processes and switching threads: the former is expensive, the latter much cheaper.) To put it simply, a 4 GHz chip that is busy only 25% of the time (the rest is spent waiting for memory) has an effective speed of 1 GHz.
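The arithmetic above can be sketched in a few lines of Python. This is a toy model, not anything from Sun: the function names are made up for illustration, and the only inputs are the 75% stall figure and the thread count quoted above. The "ideal" case assumes the threads' busy periods interleave perfectly; the "independent" case is a rougher assumption of my own, in which each thread stalls independently and the core idles only when all of them are stalled at once.

```python
def effective_ghz(clock_ghz, stall_fraction):
    """Clock rate scaled by the fraction of time the CPU does useful work."""
    return clock_ghz * (1.0 - stall_fraction)


def ideal_utilisation(threads, stall_fraction):
    """Best case: busy periods interleave perfectly, capped at 100%."""
    return min(1.0, threads * (1.0 - stall_fraction))


def independent_utilisation(threads, stall_fraction):
    """Rough case (an assumption): threads stall independently, so the
    core is idle only when every thread is stalled at the same time."""
    return 1.0 - stall_fraction ** threads


if __name__ == "__main__":
    print(effective_ghz(4.0, 0.75))            # the 4 GHz example
    print(ideal_utilisation(4, 0.75))          # four threads, best case
    print(independent_utilisation(4, 0.75))    # four threads, rougher case
```

Either way, four threads keep the core far busier than one does, which is the whole point of the design.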

The next logical step after having a single-core CPU that can run four threads is a multi-core CPU where each core has four threads. Enter Niagara, the code name for Sun's first implementation of its CMT (Chip Multi Threading) architecture. Niagara has eight cores, each of which runs four threads. Yep, 32 threads on a single CPU, 8 of which run simultaneously (one on each core).

As reported at El Reg, the first machines using Niagara--which will probably have a product name of "UltraSPARC-T1"--are already in Beta testing (no, alas, I don't have access to one). Here's what psrinfo says on these systems:

$ ./psrinfo -vp
The physical processor has 8 cores and 32 virtual processors
The core 0 has 4 virtual processors (0, 1, 2, 3)
The core 1 has 4 virtual processors (4, 5, 6, 7)
The core 2 has 4 virtual processors (8, 9, 10, 11)
The core 3 has 4 virtual processors (12, 13, 14, 15)
The core 4 has 4 virtual processors (16, 17, 18, 19)
The core 5 has 4 virtual processors (20, 21, 22, 23)
The core 6 has 4 virtual processors (24, 25, 26, 27)
The core 7 has 4 virtual processors (28, 29, 30, 31)
UltraSPARC-T1 (clock 1080 MHz)
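If you wanted to pull the core-to-thread layout out of that output programmatically, something like the following Python sketch would do. It assumes the line format is exactly as shown above (which may well change before release), and `parse_cores` is just a name I made up.

```python
import re

# The psrinfo output quoted above, verbatim.
SAMPLE = """\
The physical processor has 8 cores and 32 virtual processors
The core 0 has 4 virtual processors (0, 1, 2, 3)
The core 1 has 4 virtual processors (4, 5, 6, 7)
The core 2 has 4 virtual processors (8, 9, 10, 11)
The core 3 has 4 virtual processors (12, 13, 14, 15)
The core 4 has 4 virtual processors (16, 17, 18, 19)
The core 5 has 4 virtual processors (20, 21, 22, 23)
The core 6 has 4 virtual processors (24, 25, 26, 27)
The core 7 has 4 virtual processors (28, 29, 30, 31)
UltraSPARC-T1 (clock 1080 MHz)
"""

# Matches only the per-core lines, e.g. "The core 0 has 4 virtual processors (0, 1, 2, 3)".
CORE_LINE = re.compile(r"The core (\d+) has \d+ virtual processors \(([\d, ]+)\)")


def parse_cores(text):
    """Return a dict mapping core id -> list of virtual processor ids."""
    cores = {}
    for match in CORE_LINE.finditer(text):
        core_id = int(match.group(1))
        cores[core_id] = [int(vp) for vp in match.group(2).split(",")]
    return cores


if __name__ == "__main__":
    cores = parse_cores(SAMPLE)
    total = sum(len(vps) for vps in cores.values())
    print(f"{len(cores)} cores, {total} virtual processors")
```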

So, we're looking at a ballpark clock speed of 1 GHz. That doesn't sound very impressive until you figure in the eight multi-threaded cores. In terms of throughput on a well-threaded workload, these puppies could do better than a single 8 GHz CPU that spends most of its time stalled!
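Here is the back-of-the-envelope version of that comparison in Python. It is only a sketch: "aggregate GHz" is a crude stand-in for throughput, the 8 GHz single-core chip is hypothetical, and the busy fractions are assumptions (every Niagara core kept fully busy by its four threads, the single-threaded chip busy 25% of the time, per the figure quoted earlier).

```python
def aggregate_ghz(cores, clock_ghz, busy_fraction):
    """Crude throughput proxy: cores x clock x fraction of time doing work."""
    return cores * clock_ghz * busy_fraction


# Assumption: four threads per core hide all memory stalls (best case).
niagara = aggregate_ghz(cores=8, clock_ghz=1.08, busy_fraction=1.0)

# Assumption: a hypothetical single-threaded 8 GHz chip, stalled 75% of the time.
single = aggregate_ghz(cores=1, clock_ghz=8.0, busy_fraction=0.25)

if __name__ == "__main__":
    print(f"Niagara (best case): {niagara:.2f} aggregate GHz")
    print(f"Single 8 GHz core:   {single:.2f} aggregate GHz")
```

Real workloads won't hit the best case, and this only applies when there are enough runnable threads to go around, but it shows why the modest clock speed is misleading.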

With all these threads (especially in multi-socket machines!), it's a good job that Solaris eats threads for lunch!

Comments:

At 9/9/05 10:56, Blogger Hering said...

Thanks, Rich, for sharing your insight. One question: Wouldn't a Niagara chip require the same memory bandwidth as a set of 32 single core CPUs? If so, wouldn't the memory subsystem have the same problem keeping up?

 
At 9/9/05 13:01, Anonymous Anonymous said...

No, it would not need 32 times the memory bandwidth. Remember that Niagara has no speculative execution of instructions and no speculative data fetch (which suck up more than 30% of the bandwidth in a Xeon server), and that its SMT support allows a much higher number of memory requests to be queued (which means more than three times better usage of the memory channels compared to a Pentium 4, which consumes tons of bandwidth for a moment and then lets the memory interface idle for a few hundred clock cycles). And Niagara has four DDR2 controllers...

 
At 9/9/05 13:39, Anonymous Anonymous said...

Rich, wanna Blog on http://www.theregister.co.uk/2005/09/09/niagara_many_cores/ , too? :-)

 
At 9/9/05 13:49, Anonymous Anonymous said...

hering, it would be pointless to have an imbalance such as that. it has four on-die dram controllers to keep those cores well-fed.

 
At 14/11/05 12:18, Anonymous Anonymous said...

4 DDR2 controllers eh? I think not.. Count the pins my friends http://blogs.sun.com/roller/resources/jonathan/niagara_chip_small_pic.jpg

Maybe 4 FB-DIMM channels... but not more than 1 DDR2 bus

 

 
