a dozen cpu's on a chip

In that third image down, those things towards the right sure look like floppy drives!

John

Reply to
John Larkin

Asymmetric multiprocessing makes the scheduler's life more complicated. Since the scheduler is part of the OS, and the OS is most often M$, this isn't a good idea, IMO. ;-) Hardware is cheap (so cheap that PowerPC is including decimal FPUs). Throw the FPU on every node, whether it's needed or not.

Which negates what you say above. Running a task, then getting an exception because you don't have an instruction you thought you had is expensive.

--
Keith
Reply to
krw

Why would you get an exception? If a device driver doesn't need fp opcodes, run it on one of the many cpu's that doesn't have floating point. And vice versa. It's not rocket science.

John

Reply to
John Larkin

You're making your scheduler's job more difficult and limiting flexibility. Computer architecture is rocket surgery.

--
Keith
Reply to
krw

A bunch of cpu's don't need scheduling like a single-processor os does; individual cpu's do their thing concurrently, set semaphores, and go idle when they finish whatever they're assigned to do. And besides, the task manager cpu doesn't have anything else to do. The scheduler will mostly set up things like memory management and privileges and assignments and turn them loose, rather than frantically swapping them in real time. When everything runs simultaneously, priorities become less important. It's a whole new way of thinking.
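To make it concrete, here's roughly the whole "OS" a worker cpu would need - just a toy sketch, with the mailbox layout and all the names invented for illustration:

#include <stdatomic.h>

/* One worker CPU: wait for an assignment, run it, raise the done
   semaphore, go idle.  Nothing here is "scheduled" in the usual sense. */
struct mailbox {
    atomic_int   has_work;       /* set by the task-manager CPU        */
    atomic_int   done;           /* set by this worker when finished   */
    void       (*entry)(void *); /* the one task this CPU was assigned */
    void        *arg;
};

void worker_cpu(struct mailbox *mb)
{
    for (;;) {
        while (!atomic_load(&mb->has_work))
            ;                           /* idle until the manager says go */
        mb->entry(mb->arg);             /* run our one assigned task      */
        atomic_store(&mb->has_work, 0);
        atomic_store(&mb->done, 1);     /* semaphore for the manager CPU  */
    }
}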

The IBM Cell chip is an architecture that trends in that direction.

Current hardware and software have been driven by Intel's silicon process skill (and their vicious lack of ethics) and by Microsoft's thousands of programmers (and their vicious lack of ethics), but not by any particularly intelligent planning. Most big software apps are spinning-out-of-control crapware, with gigabyte service packs just pushing the bugs around. It's time for a change, for the next generation of computing, and I think it will happen when there are so many processors on a chip that multitasking quits making sense.

A new language wouldn't hurt either.

John

Reply to
John Larkin

[....]

You are suggesting a hardware fix for a problem that doesn't need to exist. Since a very high performance machine by definition is not running anything M$, there is no need to build in hardware corrections for the errors in their products.

Since the FPU contains a great many transistors and those transistors could be used to make another CPU, it would make sense not to put in the FPU in favor of another full CPU.

But if you know which tasks never do floating-point operations, you can leave them on a CPU without an FPU. If you don't know, you only need to perform the experiment once and suffer the overhead of moving the task to a different CPU once.
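A rough sketch of that one-time experiment (the task struct and migrate_to_fpu_core() are invented here, just to show the shape of it): start the task on an FPU-less core, catch the first illegal-instruction trap, mark the task, and move it.

#include <signal.h>
#include <stdbool.h>

struct task {
    int  id;
    bool needs_fpu;          /* learned once, remembered from then on */
};

static struct task *current_task;                 /* set by the dispatcher */
extern void migrate_to_fpu_core(struct task *t);  /* hypothetical OS hook  */

static void illegal_op_handler(int sig)
{
    (void)sig;
    current_task->needs_fpu = true;     /* record the result of the test    */
    migrate_to_fpu_core(current_task);  /* pay the moving cost exactly once */
}

void install_fpu_probe(void)
{
    signal(SIGILL, illegal_op_handler);
}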

Reply to
MooseFET

Better yet, the megapixel race in mobile phones, where bad optics kills any added resolution anyway and anything above 1.3 MP is overkill.

Mark

Reply to
TheM

Chemistry is fun. If you throw it in for programming, you've got to really enjoy programming.

I wrote a fair bit of Fortran 4 when I was doing my Ph.D. work, and was one of the local gurus. I also wrote some 900 lines of Macro-8 - the PDP-8 assembly language - which worked fine. Nobody else was writing any Macro-8, so I wasn't actually a guru.

I have done some programming since I graduated, but electronics was even more enjoyable.

-- Bill Sloman, Nijmegen

Reply to
bill.sloman

Those specialized communications processors have been used in large systems for ages, and they're getting more important with time, as you suggest.

IBM has made 256-way SMPs for years, at varying levels of integration. SMPs cost much more than loosely-coupled machines, but there are good commercial reasons to use them. Keeping the illusion of symmetric shared memory really simplifies the programming model--a hugely important issue that non-programmers usually have no idea about. (If anyone figures out an efficient way to parallelize queries in large databases using loosely-coupled processors, I could be out of work. It isn't something I worry about much.)

Ever since Danny Hillis & Co. back in the 80s, people have been pushing one sort or another of massively parallel machine. They've been perfectly right all along, too--for certain classes of problems, massively parallel is the way to go. The problem has been that not too many problems of economic importance have fallen into that 'certain class'--which is why Hillis's Thinking Machines Inc. and many others have come and gone. Nobody knew how to do business apps efficiently on those machines then, and nobody knows now either, as far as I can tell.

One thing that I think has become clear is that huge interconnect bandwidth is the key to broadening the range of problems that run well on highly parallel machines. Maintaining the illusion of shared memory at the OS level requires cache coherency across the whole machine (or a reasonable facsimile). This leads to an interconnect bandwidth trend that goes as the square or the cube of Moore's law, and that is starting to dominate the power budget of large machines. The cost of maintaining that trend will become prohibitive, unless we come up with some really different approaches from the ones we've been using.

Cheers,

Phil Hobbs

Reply to
Phil Hobbs

You mean something like Parallel PowerBasic, I gather? ;)

The Cell is a SIMD processor, with a 1.5 Tb/s interconnect bus on chip. There's a Power CPU that controls 8 Synergistic Processors (the SIMD part). Some things run amazingly on SIMD machines, some don't. Crays and other vector processors were SIMD; massively parallel machines are MIMD. SIMD simplifies the architecture and the programming model, but restricts the range of problems that can be tackled efficiently.
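A toy contrast - nothing Cell-specific, just the shape of the two models:

/* SIMD flavour: one instruction stream marched across many data
   elements - the kind of loop a vector unit eats for breakfast. */
void scale_simd(float *x, unsigned n, float k)
{
    for (unsigned i = 0; i < n; i++)
        x[i] *= k;               /* same operation, element after element */
}

/* MIMD flavour: each processor runs its own code on its own data. */
typedef void (*job_fn)(void *);
struct job { job_fn fn; void *arg; };

void run_mimd(struct job *jobs, unsigned ncpus)
{
    for (unsigned p = 0; p < ncpus; p++)
        jobs[p].fn(jobs[p].arg);     /* in real life, one per processor */
}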

I think the idea of grouping cores with different specializations will grow in importance, because it simplifies the programming model.

Cheers,

Phil Hobbs

Reply to
Phil Hobbs

Isn't that what google does?

John

Reply to
John Larkin

Again, for the 111'th time, I don't advocate parallel processing as a way to get number-crunching performance. I advocate running a lot of cpu's on a chip, one per process, as a path to system simplicity, discipline, and reliability. It's high time we stopped worrying about wasting transistors. The limit of computing per chip is thermal; which transistors do it doesn't matter, so an architecture should optimize reliability, and to get reliability we need to force some draconian new rules onto programmers. They won't like it one bit. The resulting os and architectures would be so brute-force simple that the academics would hate it, too. Both groups have demonstrated their inability to manage complexity.

YES! Thanks for being the only person here who sorta agrees with me on this.

John

Reply to
John Larkin

Oh, my! Since I rarely see donkey's posts, that is not much of a threat. You allege that there is a rule; provide a cite. Otherwise I await the next version of "The Chicago Manual of Style".

Reply to
JosephKK


That is mostly income-stream salvage versus marginal yield. The real problem is memory bandwidth. Do you remember the recent improvements in FSB speeds for the Intel chips? Or the AMD introduction of HyperTransport?

Reply to
JosephKK

Gawd, almost all programmer training (CS degrees, etc.) is oriented that way. Maybe that is why we have a problem.

Reply to
JosephKK

Would you please investigate the thermal considerations in these cases? Also the typical use mixes by machine?

Reply to
JosephKK

Why experiment? Every task image would have a header that identifies what it is, what it does, what version it is, who wrote it and when, what resources it needs, and anything else anybody would want to know. And it would have a unique identifier that leads to a web page that explains its functions and history in detail, and includes source code.
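Something like this, say - the field names and sizes here are invented, it's the idea that matters:

#include <stdint.h>

struct task_header {
    char     name[32];         /* what it is                          */
    char     description[96];  /* what it does                        */
    uint32_t version;          /* what version it is                  */
    char     author[32];       /* who wrote it                        */
    uint64_t build_date;       /* ...and when                         */
    unsigned needs_fpu : 1;    /* resources it needs                  */
    unsigned needs_net : 1;
    uint32_t ram_kbytes;       /* memory to reserve                   */
    char     info_url[128];    /* unique ID: web page with the task's
                                  functions, history, and source code */
};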

John

Reply to
John Larkin

I was only defending against an objection that I assumed would be raised.

A header would be good when creating the software or the complete system. After the system, or a fraction thereof, is built, I think that a centralized table or group of tables of which tasks need what would be better. The controlling code needs to allocate the tasks to CPUs in a way that leads to the best performance.

There is an interesting "traveling salesman"-like problem that could appear here. If you have several tasks that are all listed as "floating point bound" and that are known to interact with each other, then you would slow the whole thing down if you allocate only one of them to a slow FPU. If you find you have several groups of tasks like that, then you would be best off putting only the tasks in one such group on the slower FPUs.
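A crude sketch of that allocation rule (everything here is invented, it just shows the idea): whole groups of interacting FP-bound tasks go to one FPU speed tier or the other, never split across both.

struct group { int id; unsigned ntasks; int fp_bound; };

/* Fill the fast-FPU pool with whole FP-bound groups; any group that
   doesn't fit goes, intact, to the slow-FPU pool. */
void assign_groups(struct group *g, unsigned ngroups,
                   int *tier,          /* out: 0 = fast FPUs, 1 = slow */
                   unsigned fast_cpus)
{
    unsigned used = 0;
    for (unsigned i = 0; i < ngroups; i++) {
        if (g[i].fp_bound && used + g[i].ntasks <= fast_cpus) {
            tier[i] = 0;
            used += g[i].ntasks;
        } else {
            tier[i] = 1;               /* the whole group stays together */
        }
    }
}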

The reason I see fast and slow FPUs is that the fastest floating-point circuits need a huge number of transistors. Moderately slow ones use a lot fewer. Tasks that do only a few floating-point operations don't need to be put onto CPUs with super-fast FPUs.

That last bit may make the amount of information unreasonably large. At least you should decrease it to "need to know".

Reply to
MooseFET

No, because they don't have to worry about locks and collisions.

Cheers,

Phil Hobbs

Reply to
Phil Hobbs

Basically the same problem way back in the time-sharing daze: one CPU trying to handle N customers; when N got too large (usually greater than 12 back then), there was not sufficient bus bandwidth, nor sufficient time, to handle any one of them in a decent time - so everyone got bogged down. Put N CPUs on a memory bus and guess what? Same problem. So, I say, support the insanity, make N as large as some idiot wants, sell the super-duper slicer-dicer, and pocket the money before the fleeced buyers get wise.
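Back-of-envelope version of the same arithmetic, with made-up numbers: the shared bus saturates once the CPUs' combined appetite exceeds it, and past that point everybody's share just shrinks.

#include <stdio.h>

int main(void)
{
    double bus_gb_s     = 10.0;  /* shared memory-bus bandwidth (invented) */
    double per_cpu_gb_s = 1.0;   /* what each CPU would like to have       */

    for (int n = 4; n <= 32; n *= 2) {
        double share = bus_gb_s / n;
        printf("N = %2d: each CPU actually gets %.2f GB/s\n",
               n, share > per_cpu_gb_s ? per_cpu_gb_s : share);
    }
    return 0;
}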
Reply to
Robert Baer
