--
John Larkin Highland Technology, Inc
jlarkin att highlandtechnology dott com
Like megapixel count, the number of cores makes for easy headlines that attract airheads like moths to a flame. In practice it is very hard to utilise more than a dozen cores efficiently in most real-world problems.
Rendering engines and some types of brute-force search are obvious exceptions. Servers might well benefit from the power saving, though.
The effect of large numbers of cores in a chess or go game-tree search is merely to study more thoroughly the futile lines that will be culled by the alpha-beta pruning algorithm - IOW a total waste of time and effort.
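To illustrate the pruning point, here is a minimal negamax alpha-beta sketch (Python; the toy tree and leaf values are purely illustrative). Once a cutoff fires, the remaining siblings are never visited - exactly the lines extra cores would otherwise burn effort on:

```python
# Minimal negamax alpha-beta over an explicit game tree.
# Leaves are ints (static evaluations); internal nodes are lists of children.

def alphabeta(node, alpha, beta, visited):
    if isinstance(node, int):          # leaf: record and return its evaluation
        visited.append(node)
        return node
    best = float("-inf")
    for child in node:
        best = max(best, -alphabeta(child, -beta, -alpha, visited))
        alpha = max(alpha, best)
        if alpha >= beta:              # cutoff: remaining siblings are futile
            break
    return best

# Tiny tree: after the first subtree establishes a bound of 3, the
# second leaves of the other two subtrees (9 and 1) are never examined.
tree = [[3, 5], [2, 9], [0, 1]]
visited = []
score = alphabeta(tree, float("-inf"), float("inf"), visited)
```

A parallel search that speculatively expands those pruned siblings does real work that a serial alpha-beta would simply skip.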
-- Regards, Martin Brown
But you don't need to use them efficiently. Just use them as needed, one CPU per process. For something like a server or a transaction processor, lots of cores make sense. For a PC-type OS, it would if the OS were designed right.
-- John Larkin Highland Technology, Inc jlarkin att highlandtechnology dott com http://www.highlandtechnology.com
Chess programming has advanced somewhat since 1960. There are multithreaded tree search algorithms that can effectively leverage large numbers of cores.
Have you seen Tilera's offerings? They've been doing the oodles-of-cores thing for a while now:
Hi,
The more cores and RAM, the better for servers, e.g. for virtualization; it doesn't matter whether the code is multithreaded, since separate operating-system instances are distributed across the cores.
cheers, Jamie
I don't think these devices are oriented to playing games. The Cavium devices are targeted for server type uses. The part I am not clear on is how they get past the memory bottleneck. It doesn't matter how many cores you put on a chip if they are sitting around waiting for data and instructions to/from memory.
On the other hand, the primary restriction in servers is power consumption. So if you can process 20% more transactions using 20% less power it won't matter how many cores or how memory restricted they are. The proof of the pudding...
As others have pointed out, there are a number of high core count processors around. The question is how many sockets have they found?
-- Rick
As long as the processes are completely independent (AKA "embarrassingly parallel problems"), it works. If there is any inter-process communication needed, things get ugly fast.
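"Embarrassingly parallel" in practice just means fully independent work items with no cross-talk; a minimal sketch using Python's multiprocessing (the worker count, workload size, and `hash_chunk` function are arbitrary choices for illustration):

```python
from multiprocessing import Pool

def hash_chunk(n):
    # One independent work item: no shared state, no messages to siblings.
    total = 0
    for i in range(n):
        total = (total * 31 + i) % 1_000_003
    return total

if __name__ == "__main__":
    with Pool(4) as pool:                     # one worker per core, up to 4
        results = pool.map(hash_chunk, [10_000] * 8)
    # All chunks are identical here, so all results agree; the workers
    # never needed to talk to each other.
    assert len(set(results)) == 1
```

The moment the chunks need to exchange intermediate results, you are back to locks, queues, and the ugliness described above.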
But massive computation of a big math problem is a niche application. Serving up web pages and transactions, or running a huge database, is a good app for a lot of CPUs that share main memory.
And a desktop OS is, too.
Transistors are never going to have 1 nm features or run at 20 GHz, so the future is multicore. Seems like, 40 years after C and VMS and Unix, it's time for something new.
-- John Larkin Highland Technology, Inc jlarkin att highlandtechnology dott com http://www.highlandtechnology.com
Yes, "embarrassingly parallel".
Nope. Too much interprocess communication needed.
Not until the MP problems are solved, it's not.
Can anyone spell "bottleneck"? The data bus can be only so wide, which fixes the data bandwidth and the number of cores that can communicate simultaneously; the rest are a total waste.
I would agree with you for the most part, but once you cross over into a domain where the processor is not the expensive resource, it can make sense to have more processors than you can "efficiently" use. The GA144 is a 144-processor chip which has much more processing capability than can be supported by its I/O or memory. The intent is not to fully utilize all 144 processors, but simply to use them as best suits your algorithm. Just as some of the logic and most of the routing in an FPGA goes unused in nearly all designs, there is no reason to treat the processor as a precious resource once its cost (in various measures) drops significantly. The GA144's processor, the F18, costs less than $0.10.
We are very ingrained to think of optimizing the performance of the processor, but instead need to consider the performance of the system as a whole. That said, I believe the processor in the mentioned article may have crossed over to the point of the processor not being the focus of optimizing efficiency. If you only gain a 5% improvement for each of the processors after the 8th one, that is still a 200% performance gain.
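That 200% figure checks out under one plausible reading: a 48-core part (as the Cavium chip mentioned elsewhere in the thread) where the first 8 cores scale linearly and each core beyond the 8th adds only 5% of a single core's throughput. A back-of-envelope sketch (the knee point and marginal gain are hypothetical model parameters, not measured data):

```python
# Hypothetical diminishing-returns model: linear scaling up to `knee`
# cores, then each extra core adds only `marginal` of one core's worth.
def throughput(cores, knee=8, marginal=0.05):
    if cores <= knee:
        return float(cores)
    return knee + (cores - knee) * marginal

# Cores 9..48 contribute 40 * 0.05 = 2.0 single-core equivalents,
# i.e. a 200% gain over a single core, even at 5% efficiency each.
gain_beyond_8 = throughput(48) - throughput(8)
```

Whether that extra 200% is worth the silicon depends entirely on what the silicon costs, which is exactly the point about the processor no longer being the precious resource.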
-- Rick
Hi Robert
In some future, there will be very wide bottlenecks :-)
E.g. the hypercube computer will be reinvented/reimplemented (for all of us) when optical communication between CPUs becomes economically feasible.
Maybe later, (almost) fully optical meshing between processors will be possible. And the processors will be asynchronous/clockless (efficiency and a central clock are mutually exclusive) - see the ARM996HS.
From the previous millennium:
Recent Advances in Designing Clockless Digital Systems:
-
Achronix-ULTRA:
Achronix preps 2-GHz Asynchronous FPGA for sampling in 2007:
June 2, 2011, Qualcomm's Dual-core is asynchronous, demonstrated at Computex 2011:
Apr 2nd 2011, Qualcomm's 1.5GHz dual-core MSM8660 destroys the competition in majestic benchmark run:
Video:
The Power of the Snapdragon™ Processor: Asynchronous Processing:
MARE (Multicore Asynchronous Runtime Environment) Overview Video:
13 July 2011, Inside Manchester's million ARM electronic brain:
The chosen core, for which ARM has granted a licence to the University for the project, is the ARM968, ironically the first ARM not to have Furber's fingerprints on it.
"The ARM7 is still recognisably mine," he said. "The ARM9 has a five-stage pipeline and Harvard architecture. The ARM7 has a three-stage pipeline and von Neumann architecture. These are the two design sweet spots. Anything more complicated is less efficient, and the 968 is particularly energy efficient."
Stated consumption is 0.12-0.23mW/MHz on a 130nm process. ... Both are based on a delay-insensitive communication technology developed at the University of Manchester.
Furber is a fan of asynchronous communication and previously developed a series of clockless asynchronous ARM cores called Amulet. ..."
Asynchronous circuit:
-
7/5/2012, PCIe goes Clockless--Achieving independent spread-spectrum clocking without SSC isolation:-
PS:
A quantum computer's qubits are in some sense "fully meshed". That is why some problems can be solved polynomially or exponentially faster.
/Glenn
For small values of "effectively" - it wasn't until the mid 1980's that powerful parallel chess algorithms were developed for N>4. Here is Bob Hyatt's paper on parallel chess search strategy to illustrate my point:
The classic parallel PVS algorithm vs Bob Hyatt's DTS scores as follows in terms of actual performance multiple with N cores:
  N   PVS  EPVS   DTS
  1   1.0   1.0   1.0
  2   1.8   1.9   2.0
  4   3.0   3.4   3.7
  8   4.1   5.4   6.6
 16   4.6   6.0  11.1
The key to his performance improvement is in choosing the right nodes at which to fork the search, to keep all CPUs busy and not waste effort (and to avoid idle ones mithering busy ones with "can I help you" messages). It is a major step forward over earlier algorithms, which wasted more than half of all the work done on 12 cores or more.
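Converting those multiples to parallel efficiency (speedup divided by N) makes the gap plain; a quick sketch using the published numbers from the table above:

```python
# Published speedup multiples for three parallel chess-search algorithms,
# keyed by CPU count N (from Hyatt's DTS comparison quoted above).
speedup = {
    "PVS":  {1: 1.0, 2: 1.8, 4: 3.0, 8: 4.1, 16: 4.6},
    "EPVS": {1: 1.0, 2: 1.9, 4: 3.4, 8: 5.4, 16: 6.0},
    "DTS":  {1: 1.0, 2: 2.0, 4: 3.7, 8: 6.6, 16: 11.1},
}

# Parallel efficiency = speedup / N, i.e. the fraction of the machine
# doing useful (non-wasted) work.
efficiency = {alg: {n: s / n for n, s in col.items()}
              for alg, col in speedup.items()}
```

At 16 CPUs, PVS keeps under 29% of the machine usefully busy while DTS keeps roughly 69%, which matches the claim that the earlier algorithms wasted more than half the work.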
Ultimately it is the shared memory bandwidth that throttles back performance with N even if the algorithm is close to perfection.
-- Regards, Martin Brown
The bloviator certainly can't.
Transputer nets sort of got around that to some extent but fell out of favour. You quickly discover that the highest priority task is keeping the cores close to fully occupied without saturating memory bandwidth.
One reason that hyperthreading looks good on paper but fails to deliver except on well known benchmarks and a few special cases is that it runs out of memory bandwidth. You get processors designed for specsmanship!
SIMD is a lot easier to use effectively for the right problems.
-- Regards, Martin Brown
^^^^^^^^^ typo: 1990's. ISTR Hyatt (1994) was the DTS paper.
-- Regards, Martin Brown
Depends on the cache. 3D stacks with the processor chip on top of a bunch of cache memory help a lot. You do have the skyscraper elevator problem, though--eventually all your die area consists of TSVs.
Cheers
Phil Hobbs
-- Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC Optics, Electro-optics, Photonics, Analog Electronics 160 North State Road #203 Briarcliff Manor NY 10510 hobbs at electrooptical dot net http://electrooptical.net
Optics isn't a panacea. Its propagation speed is potentially a little faster than wire on-chip (about c/4 vs c/10), the bandwidth is better on long wires, and it might conceivably have some speed vs power advantages.
Chip-to-chip optics is easier. The main problem is going from 1550 nm on-chip optics in single-mode silicon waveguides to and from 850 nm multimode links on the board and between boards, without losing all the speed/power/bandwidth benefits in the process.
And the processors will be asynchronous/clockless ...
People have been talking about that for a looong time, and it hasn't amounted to much so far. The chip area that can be synchronously clocked isn't very big anyway, owing to propagation speed problems.
There are lots of SIMD machines out there, e.g. most Intel desktops.
Cheers
Phil Hobbs
-- Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC Optics, Electro-optics, Photonics, Analog Electronics 160 North State Road #203 Briarcliff Manor NY 10510 hobbs at electrooptical dot net http://electrooptical.net
Well, which PC? The Atom dual-core PC is twice the PC at 2W; so, it's about 1W per PC. The Cavium chip is 48-core at 100W if fully utilized. The core itself does not really matter much. More than half of the heat comes from the 16 Mbytes of cache per core.
OTOH, it might not make sense to have uniform cache size. Perhaps some with 32M, 64M and 128M, etc.