--
John Larkin Highland Technology, Inc
jlarkin att highlandtechnology dott com
Like megapixel count, the number of cores makes for easy headlines that attract airheads like moths to a flame. In practice it is very hard to utilise more than a dozen cores efficiently in most real-world problems.
Rendering engines and some types of brute-force search are obvious exceptions. Servers might well benefit from the power saving, though.
The effect of large numbers of cores in a chess or go game-tree search is merely to study more thoroughly the futile lines that will be culled by the alpha-beta pruning algorithm - IOW a total waste of time and effort.
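To illustrate the pruning point, here is a minimal negamax alpha-beta sketch (Python; the toy tree and leaf values are purely illustrative). Once a cutoff fires, the remaining siblings are never visited - exactly the lines extra cores would otherwise burn effort on:

```python
# Minimal negamax alpha-beta over an explicit game tree.
# Leaves are ints (static evaluations); internal nodes are lists of children.

def alphabeta(node, alpha, beta, visited):
    if isinstance(node, int):          # leaf: record and return its evaluation
        visited.append(node)
        return node
    best = float("-inf")
    for child in node:
        best = max(best, -alphabeta(child, -beta, -alpha, visited))
        alpha = max(alpha, best)
        if alpha >= beta:              # cutoff: remaining siblings are futile
            break
    return best

# Tiny tree: after the first subtree establishes a bound of 3, the
# second leaves of the other two subtrees (9 and 1) are never examined.
tree = [[3, 5], [2, 9], [0, 1]]
visited = []
score = alphabeta(tree, float("-inf"), float("inf"), visited)
```

A parallel search that speculatively expands those pruned siblings does real work that a serial alpha-beta would simply skip.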
-- Regards, Martin Brown
But you don't need to use them efficiently. Just use them as needed, one CPU per process. For something like a server or a transaction processor, lots of cores make sense. For a PC-type OS, it would if the OS were designed right.
-- John Larkin Highland Technology, Inc jlarkin att highlandtechnology dott com http://www.highlandtechnology.com
Chess programming has advanced somewhat since 1960. There are multithreaded tree search algorithms that can effectively leverage large numbers of cores.
Have you seen Tilera's offerings? They've been doing the oodles-of-cores thing for a while now:
Hi,
The more cores and RAM, the better for servers, e.g. for virtualization; it doesn't matter whether the code is multithreaded, since separate operating-system instances are distributed across the cores.
cheers, Jamie
I don't think these devices are oriented to playing games. The Cavium devices are targeted for server type uses. The part I am not clear on is how they get past the memory bottleneck. It doesn't matter how many cores you put on a chip if they are sitting around waiting for data and instructions to/from memory.
On the other hand, the primary restriction in servers is power consumption. So if you can process 20% more transactions using 20% less power it won't matter how many cores or how memory restricted they are. The proof of the pudding...
As others have pointed out, there are a number of high core count processors around. The question is how many sockets have they found?
-- Rick
As long as the processes are completely independent (AKA "embarrassingly parallel problems"), it works. If there is any inter-process communication needed, things get ugly fast.
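"Embarrassingly parallel" in practice just means fully independent work items with no cross-talk; a minimal sketch using Python's multiprocessing (the worker count, workload size, and `hash_chunk` function are arbitrary choices for illustration):

```python
from multiprocessing import Pool

def hash_chunk(n):
    # One independent work item: no shared state, no messages to siblings.
    total = 0
    for i in range(n):
        total = (total * 31 + i) % 1_000_003
    return total

if __name__ == "__main__":
    with Pool(4) as pool:                     # one worker per core, up to 4
        results = pool.map(hash_chunk, [10_000] * 8)
    # All chunks are identical here, so all results agree; the workers
    # never needed to talk to each other.
    assert len(set(results)) == 1
```

The moment the chunks need to exchange intermediate results, you are back to locks, queues, and the ugliness described above.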
But massive computation of a big math problem is a niche application. Serving up web pages and transactions, or running a huge database, is a good app for a lot of CPUs that share main memory.
And a desktop OS is, too.
Transistors are never going to have 1 nm features or run at 20 GHz, so the future is multicore. Seems like, 40 years after C and VMS and Unix, it's time for something new.
-- John Larkin Highland Technology, Inc jlarkin att highlandtechnology dott com http://www.highlandtechnology.com
Yes, "embarrassingly parallel".
Nope. Too much interprocess communication needed.
Not until the MP problems are solved, it's not.
Can anyone spell "bottleneck"? The data bus can be only so wide, which fixes the data bandwidth and the number of cores that can communicate simultaneously; the rest are a total waste.
I would agree with you for the most part, but once you cross over into a domain where the processor is not the expensive resource, it can make sense to have more processors than you can "efficiently" use. The GA144 is a 144-processor chip which has much more processing capability than can be supported by its I/O or memory. The intent is not to fully utilize all 144 processors, but simply to use them as best suits your algorithm. Just as some of the logic and most of the routing in an FPGA goes unused in nearly all designs, there is no reason to treat the processor as a precious resource once its cost (in various measures) drops significantly. The GA144's processor, the F18, costs less than $0.10.
We are very ingrained to think of optimizing the performance of the processor, but instead need to consider the performance of the system as a whole. That said, I believe the processor in the mentioned article may have crossed over to the point of the processor not being the focus of optimizing efficiency. If you only gain a 5% improvement for each of the processors after the 8th one, that is still a 200% performance gain.
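That 200% figure checks out under one plausible reading: a 48-core part (as the Cavium chip mentioned elsewhere in the thread) where the first 8 cores scale linearly and each core beyond the 8th adds only 5% of a single core's throughput. A back-of-envelope sketch (the knee point and marginal gain are hypothetical model parameters, not measured data):

```python
# Hypothetical diminishing-returns model: linear scaling up to `knee`
# cores, then each extra core adds only `marginal` of one core's worth.
def throughput(cores, knee=8, marginal=0.05):
    if cores <= knee:
        return float(cores)
    return knee + (cores - knee) * marginal

# Cores 9..48 contribute 40 * 0.05 = 2.0 single-core equivalents,
# i.e. a 200% gain over a single core, even at 5% efficiency each.
gain_beyond_8 = throughput(48) - throughput(8)
```

Whether that extra 200% is worth the silicon depends entirely on what the silicon costs, which is exactly the point about the processor no longer being the precious resource.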
-- Rick
Hi Robert
In some future, there will be very wide bottlenecks :-)
E.g. the hypercube computer will be reinvented/reimplemented (for all of us) when optical communication between CPUs becomes economically feasible.
Maybe later, (almost) fully optical meshing between processors will be possible. And the processors will be asynchronous/clockless (efficiency and a central clock are mutually exclusive) - see the ARM996HS.
From the previous millennium:
Recent Advances in Designing Clockless Digital Systems:
-
Achronix-ULTRA:
Achronix preps 2-GHz Asynchronous FPGA for sampling in 2007:
June 2, 2011, Qualcomm's Dual-core is asynchronous, demonstrated at Computex 2011:
Apr 2nd 2011, Qualcomm's 1.5GHz dual-core MSM8660 destroys the competition in majestic benchmark run:
Video:
The Power of the Snapdragon™ Processor: Asynchronous Processing:
MARE (Multicore Asynchronous Runtime Environment) Overview Video:
13 July 2011, Inside Manchester's million ARM electronic brain:
The chosen core, for which ARM has granted a licence to the University for the project, is the ARM968, ironically the first ARM not to have Furber's fingerprints on it.
"The ARM7 is still recognisably mine," he said. "The ARM9 has a five-stage pipeline and Harvard architecture. The ARM7 has a three-stage pipeline and von Neumann architecture. These are the two design sweet spots. Anything more complicated is less efficient, and the 968 is particularly energy efficient."
Stated consumption is 0.12-0.23mW/MHz on a 130nm process. ... Both are based on a delay-insensitive communication technology developed at the University of Manchester.
Furber is a fan of asynchronous communication and previously developed a series of clockless asynchronous ARM cores called Amulet. ..."
Asynchronous circuit:
-
7/5/2012, PCIe goes Clockless--Achieving independent spread-spectrum clocking without SSC isolation:-
PS:
A quantum computer's qubits are in some sense "fully meshed". That is why some problems can be solved polynomially or exponentially faster.
/Glenn
For small values of "effectively" - it wasn't until the mid 1980's that powerful parallel chess algorithms were developed for N>4. Here is Bob Hyatt's paper on parallel chess search strategy to illustrate my point:
The classic parallel PVS algorithm vs Bob Hyatt's DTS scores as follows in terms of actual performance multiple with N cores:
  N   PVS  EPVS   DTS
  1   1.0   1.0   1.0
  2   1.8   1.9   2.0
  4   3.0   3.4   3.7
  8   4.1   5.4   6.6
 16   4.6   6.0  11.1
The key to his performance improvement is in choosing the right nodes at which to fork the search, to keep all CPUs busy and not waste effort (and to avoid idle ones mithering busy ones with "can I help you" messages). It is a major step forward over earlier algorithms, which wasted more than half of all the work done on 12 cores or more.
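Converting those multiples to parallel efficiency (speedup divided by N) makes the gap plain; a quick sketch using the published numbers from the table above:

```python
# Published speedup multiples for three parallel chess-search algorithms,
# keyed by CPU count N (from Hyatt's DTS comparison quoted above).
speedup = {
    "PVS":  {1: 1.0, 2: 1.8, 4: 3.0, 8: 4.1, 16: 4.6},
    "EPVS": {1: 1.0, 2: 1.9, 4: 3.4, 8: 5.4, 16: 6.0},
    "DTS":  {1: 1.0, 2: 2.0, 4: 3.7, 8: 6.6, 16: 11.1},
}

# Parallel efficiency = speedup / N, i.e. the fraction of the machine
# doing useful (non-wasted) work.
efficiency = {alg: {n: s / n for n, s in col.items()}
              for alg, col in speedup.items()}
```

At 16 CPUs, PVS keeps under 29% of the machine usefully busy while DTS keeps roughly 69%, which matches the claim that the earlier algorithms wasted more than half the work.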
Ultimately it is the shared memory bandwidth that throttles back performance with N even if the algorithm is close to perfection.
-- Regards, Martin Brown
The bloviator certainly can't.
Transputer nets sort of got around that to some extent but fell out of favour. You quickly discover that the highest priority task is keeping the cores close to fully occupied without saturating memory bandwidth.
One reason that hyperthreading looks good on paper but fails to deliver except on well known benchmarks and a few special cases is that it runs out of memory bandwidth. You get processors designed for specsmanship!
SIMD is a lot easier to use effectively for the right problems.
-- Regards, Martin Brown
^^^^^^^^^ typo: 1990's. ISTR Hyatt (1994) was the DTS paper.
-- Regards, Martin Brown
Depends on the cache. 3D stacks with the processor chip on top of a bunch of cache memory help a lot. You do have the skyscraper elevator problem, though--eventually all your die area consists of TSVs.
Cheers
Phil Hobbs
-- Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC Optics, Electro-optics, Photonics, Analog Electronics 160 North State Road #203 Briarcliff Manor NY 10510 hobbs at electrooptical dot net http://electrooptical.net
Optics isn't a panacea. Its propagation speed is potentially a little faster than wire on-chip (about c/4 vs c/10), the bandwidth is better on long wires, and it might conceivably have some speed vs power advantages.
Chip-to-chip optics is easier. The main problem is going from 1550 nm on-chip optics in single-mode silicon waveguides to and from 850 nm multimode links on the board and between boards, without losing all the speed/power/bandwidth benefits in the process.
And the processors will be asynchronous/clockless ...
People have been talking about that for a looong time, and it hasn't amounted to much so far. The chip area that can be synchronously clocked isn't very big anyway, owing to propagation speed problems.
There are lots of SIMD machines out there, e.g. most Intel desktops.
Cheers
Phil Hobbs
-- Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC Optics, Electro-optics, Photonics, Analog Electronics 160 North State Road #203 Briarcliff Manor NY 10510 hobbs at electrooptical dot net http://electrooptical.net
Well, which PC? The Atom dual-core PC is twice the PC at 2W; so, it's about 1W per PC. The Cavium chip is 48-core at 100W if fully utilized. The core itself does not really matter much. More than half of the heat comes from the 16 Mbytes of cache per core.
OTOH, it might not make sense to have uniform cache size. Perhaps some with 32M, 64M and 128M, etc.