48 cores

Yeah. About 45 years ago I paper-designed a clockless computer based on RTL, and preliminary testing of modules showed a speed increase that put it near the ECL of the time.

Reply to
Robert Baer

Well, once upon a time, a long time ago, I wrote multi-million-digit algorithms (you know, My Dear Aunt Sally) using the FFT as the basic method. In that venue a power of two becomes attractive, and thus a 16 / 32 / 64 / 128 core system would seem to be advantageous. Waste bottleneck time filling each core with its part of the "pie", kick them all to do the same thing (only on different data), and when done, waste bottleneck time transferring the results for the next step. Time saved in doing (say) 128 pieces vs. only one piece would be (core processing time)/(127 cores). The percentage gain is very small until one has a "decent" number of cores for this: say 1024 minimum. Now that "gain" is at the expense of putting ALL needed data into each core (you want NO time-wasting bus accesses). There is not much time difference between a SIN/COS FPU calculation and a table lookup, but the "local" space needed for the lookup becomes large. Also, the time to fill the cores eats the "gain", and the algorithm may need assembly-code tweaking to optimize use of cache lines. That means the SIN/COS FPU calculation is preferred.
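
A rough back-of-envelope model of that fill / compute / drain pattern, in Python; the fill, drain, and work figures below are invented purely for illustration, not measurements:

# Crude model of the scheme above: serially fill every core over the one bus,
# compute in parallel, then serially drain the results back out.
# All numbers are invented for illustration only.
def run_time(cores, work=1.0, fill=0.05, drain=0.05):
    return fill + work / cores + drain   # fill/drain do not shrink with more cores

for n in (1, 16, 128, 1024):
    print(f"{n:5d} cores: speedup {run_time(1) / run_time(n):.1f}x")

The point is simply that the serial fill and drain steps cap the achievable gain, no matter how many cores are added.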

Anything else would seem to be randumb, so the code for "loading" the cores cannot be anywhere near as streamlined. And that fact argues AGAINST a large number of (then useless) cores.

Reply to
Robert Baer

That last observation argues against a large number of cores. See my other posted comment covering FFT.

Reply to
Robert Baer

Yup!!!

Reply to
Robert Baer

I'm running Agent, Thunderbird, Firefox, Word, Crimson Editor, Dropbox, a PDF viewer, and a couple of disk-explorer windows. Not much interprocess stuff there. Each could run on its own CPU, fully sandboxed.

Of course they access file managers and printer drivers and the Winsock thing: common resources. Each of those could also have its own CPU.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

John, what have you got for DMA architectures that would facilitate bringing sensor data as directly as possible into long-term memory, like a SATA SSD? (Going through an OS is deprecated, unless you can show low latency.)

Reply to
haiticare2011

If you are going to store it on an SSD, the latency doesn't matter and the OS will not be an issue.

-Lasse

Reply to
Lasse Langwadt Christensen

How fast do you want to go?

I'm currently working on a waveform record/playback box, based on a MicroZed (FPGA plus dual ARM cores) running Linux, with gigabytes of waveform files on an SD card. Relatively slow, ballpark 1 MHz sample rates, and mostly SD-card-speed limited.

The Zynq's FPGA fabric could DMA waveform data into DRAM, but it still has to be saved to long-term storage.
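
For what it's worth, the save-to-long-term-storage step can be a plain chunked copy loop. A minimal Python sketch, where /dev/waveform_dma and the paths are made-up placeholders for however the DMA'd buffer is exposed:

# Stream captured samples out to a file on the SD card in large blocks.
# /dev/waveform_dma and the paths are hypothetical; CHUNK is arbitrary.
CHUNK = 1 << 20   # 1 MiB reads keep the SD card writing in big sequential blocks

with open("/dev/waveform_dma", "rb", buffering=0) as src, \
     open("/media/sd/waveform.bin", "ab") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:          # capture finished
            break
        dst.write(block)       # throughput is limited by the SD card here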

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

An SSD is just a faster SD card with a bigger RAM buffer, but eventually you will run out of the SSD's internal buffer. If you have lots of data to store, a hard disk will still be faster than an SSD on average.

Reply to
edward.ming.lee

Now try to do any of them faster.

Which wouldn't help a bit.

Reply to
krw

Run a browser or a text editor faster, on a 2 GHz ARM that's doing nothing else? Why? You think it would be faster to time-slice all of them on one 2 GHz ARM?

It would keep things like buffer-overflow exploits from taking over the entire system.

Oh well, I guess we'll have viruses and trojans everywhere 40 years from now; why change anything that's been sort of working for 40 years?

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

An optical WDM (Wavelength Division Multiplexing) system will have a huge throughput, but the question is how much chip area is needed for the wavelength combiners and splitters, so it does not make sense for very short connections.

Reply to
upsidedown

Does each have its own screen, keyboard, network card, disk, memory, etc.? These sorts of apps do interprocess "stuff" all the time.

Each process is already sandboxed to some extent - that's what MMUs are for. The only sensible additional sandboxing would be to limit the processes' access to OS resources (such as files or network ports). That's straightforward already - get yourself a Linux system and run the programs under different users. If you want even finer control, run them in chroot jails (or more modern Linux containers), or use SELinux and give them limited access.

Putting each on its own cpu would do absolutely /nothing/ for such isolation or sandboxing. Again, if you run on Linux you can do this already - you can pin the process to a particular cpu core and block other processes from that core. Nothing changes regarding security or accesses.
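
For concreteness, here is a minimal sketch of that kind of core pinning on Linux, using Python's wrapper around sched_setaffinity; the core number is arbitrary:

import os

# Pin the current process to core 3 only (Linux-specific call).
os.sched_setaffinity(0, {3})

# Confirm which cores the scheduler may now use for this process.
print(os.sched_getaffinity(0))   # -> {3}

As noted, this changes scheduling, not security.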

And if you look at graphs of cpu usage over time for such apps, you see that for the majority of the time they use very little cpu; then when they are busy they peak, taking as much cpu power as they can get from a single core. (They may use additional threads for background work, but these are rarely intensive.) And there is seldom an overlap between the peaks for the different apps - when you are only one user, you are only highly active in one app at a time.

All this suggests that a single shared cpu core with maximal single-core speed is the best for such apps. Because there are always a few other things going on, and sometimes multi-threading is efficient (such as when the app is working, and the windowing system is updating at the same time), there is quite a bit to be gained by having two cores. But for the vast majority of desktop usage, that's the limit - two cores is all you can use.

(Of course there are specific apps that can use multiple cores - video editing, some graphics work, big compilations, etc.)

So when Firefox opens a dozen network connections, would that mean a dozen cpus with different Winsock instances? Or one cpu with one Winsock? And would that be shared amongst other apps that also use Winsock? And how about the network drivers that Winsock talks to - would these have their own cpus to make them "sandboxed"?

My Linux desktop currently has some 250+ processes running - but the process number counter is up to about 9000. This is after a fairly recent restart - in more typical use there are probably a thousand or so processes. Does that mean I need 1000 cpu cores to use my PC? Or that if I try and start process number 1001 my PC should give up?

There are plenty of computing tasks that are highly parallelisable - desktop usage is not one of them. It could probably be improved somewhat from how it is done today, but not significantly - a user interface is an inherently serial task because the user is serial.

Reply to
David Brown

Lots. Also time delay for all the demuxes and SERDESes and so on. Point-to-point optics is a win though.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs

I suppose it all depends. Most of what I do is trying to squeeze the performance out of hardware that can just about do the job. Plenty of MCUs are underemployed as washing-machine controllers, mice, etc.

So long as you can power down unused cores I don't really see the harm.

What tends to happen is that the series looks more like

1 + 1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6

which will still diverge to infinity, but not at all quickly, and the law of diminishing returns sets in at around 12 CPUs on most of today's kit.
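
A quick sum of that series shows how slowly it grows (purely illustrative):

# The series above: core 1 gives 1, core 2 gives 1, core k gives 1/(k-1) thereafter.
speedup = 1.0
for k in range(2, 17):
    speedup += 1.0 / (k - 1)
    print(f"{k:2d} cores: ~{speedup:.2f}x")   # roughly 4.3x by 16 cores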

By 16 you are really having to think hard about memory bandwidth.

This is meaningless word salad and wrong at just about every fundamental level. It was known from way back how to split a big FFT across 2, 4, or even 8 CPU cores by exploiting the butterfly symmetry and combining the results with simple phase factors. 16 cores came later.
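
To make the butterfly split concrete, here is a minimal numpy sketch of one radix-2 decimation-in-time step; the two half-size transforms are independent, so each could run on its own core, and the "simple phase factors" are the twiddle factors:

import numpy as np

N = 1024
x = np.random.randn(N) + 1j * np.random.randn(N)

E = np.fft.fft(x[0::2])   # FFT of the even-indexed samples (could run on core 1)
O = np.fft.fft(x[1::2])   # FFT of the odd-indexed samples (could run on core 2)

k = np.arange(N // 2)
w = np.exp(-2j * np.pi * k / N)   # twiddle factors

X = np.empty(N, dtype=complex)
X[:N // 2] = E + w * O
X[N // 2:] = E - w * O

assert np.allclose(X, np.fft.fft(x))   # matches the full-size transform

Applying the same split recursively gives the 4- and 8-way versions.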

But don't take my word for it; here is a free-access paper from ~2007 on optimising FFTs on multicore CPUs - your attention is drawn to figures 8 through 11. Performance scaling with the number of cores is linear up to 16.

formatting link

Here is a more contemporaneous paper (free access) from 1987

formatting link

I worked on 2D FFTs equivalent to a 2^20-point 1D array, i.e. 1024x1024, although all of our work was on a single fast CPU with limited memory.

Way back, multiply was expensive and a lot of effort went into minimising the number of non-trivial multiplies. These days it is generally the fetch from main memory that hurts performance the most. Back then some of the big transform fetches were from disk, and a lot of thought went into minimising the number of reads and writes per transform. Later the same analysis was done to minimise VM page faults.

The usual trick was to cut the problem into local-cache-sized chunks.
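
As a minimal illustration of that cache-blocking idea, a blocked 2D transpose in Python; the block size is arbitrary and numpy is used only for brevity, so this is not a claim about the original codes:

import numpy as np

def blocked_transpose(a, block=64):
    # Move the data block by block so each working set fits in local cache.
    n = a.shape[0]
    out = np.empty_like(a)
    for i in range(0, n, block):
        for j in range(0, n, block):
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out

a = np.arange(1024 * 1024, dtype=np.float64).reshape(1024, 1024)
assert np.array_equal(blocked_transpose(a), a.T)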

The serious codes typically used recurrence relations to generate sin/cos back then. The cunningly accurate versions used tables of the true discretised roots of unity, so that (cos(x), sin(x))^N = (1, 0) holds as closely as possible. Twiddle factors were stored locally if possible.
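
A minimal sketch of that recurrence idea, generating twiddle factors by repeated complex multiplication instead of calling sin/cos for every index; it also shows the small rounding drift that the carefully tabulated roots were meant to avoid (the transform size is arbitrary):

import cmath, math

N = 4096
w1 = cmath.exp(-2j * math.pi / N)   # primitive N-th root of unity

w = 1.0 + 0.0j
twiddles = []
for k in range(N):                   # w_k = w_(k-1) * w1, no sin/cos per index
    twiddles.append(w)
    w *= w1

direct = cmath.exp(-2j * math.pi * (N - 1) / N)
print(abs(twiddles[-1] - direct))    # small but nonzero accumulated drift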

People managed to do plausible vectorised and parallel FFTs back in the 1980s. The poor man's Cray, aka the FPS-120 Array Processor, was quite good at it. The "faster" replacement model 164 after that was rather disappointing, at least for radio astronomy. I forget why.
--
Regards, 
Martin Brown
Reply to
Martin Brown

The hard thing to do well in highly multicore ("scale-out") systems is database updates.

Also, different schemes have different behaviour--fully cache-coherent systems with symmetric memory access (each core has the same access to all of the cache) run out of gas sooner than NUMA (nonuniform memory access) ones, and there are all sorts and kinds of NUMA.

It's much harder to do single I/O or memory-bound tasks across multiple cores, mostly on account of latency, but mixed tasks scale quite a bit better.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs

The expensive resource will be power, i.e. the heat-dissipation capacity of the silicon. If a cheap chip has 256 cores but can't run them all full blast at the same time, who cares? Maybe chip temperature becomes a major input into the OS scheduler.
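
If one wanted to experiment with that idea today, the thermal data is already exposed on Linux; a minimal sketch, where the zone path, thresholds, and core counts are assumptions for illustration, not any real scheduler policy:

# Read the die temperature Linux exposes (in millidegrees C) and decide how
# many cores a hypothetical scheduler would let run flat out.
def die_temp_c(zone="/sys/class/thermal/thermal_zone0/temp"):
    with open(zone) as f:
        return int(f.read().strip()) / 1000.0

def allowed_cores(temp_c, total=256):
    if temp_c < 60:
        return total          # cool: run everything
    if temp_c < 80:
        return total // 2     # warm: back off
    return total // 8         # hot: keep only a few cores busy

t = die_temp_c()
print(f"{t:.1f} C -> allow {allowed_cores(t)} of 256 cores")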

Some day, computer and OS architectures will be different. Lots of people don't want to think about that.

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

The latency issue is critical for request/response type transactions, so it does not make sense to transfer a single byte at a time. More realistic would be to transfer a whole cache line (e.g. 32 bytes) or a whole virtual-memory page (e.g. 4 KiB) at a time, for which the turn-around delays are not that devastating.

A 32-lane PCI Express link has quite decent throughput. Imagine a DWDM system with 32 "colors" running in a single fiber (or two fibers for bidirectional transfer).
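
Rough numbers, assuming PCIe Gen3 rates of 8 GT/s per lane with 128b/130b encoding (the generation is my assumption, not stated above):

# Back-of-envelope aggregate throughput for 32 lanes (or 32 wavelengths).
per_lane_gbps = 8.0 * (128 / 130)        # ~7.9 Gb/s of payload per lane
lanes = 32
total_GBps = per_lane_gbps * lanes / 8   # bits -> bytes
print(f"~{total_GBps:.1f} GB/s aggregate for {lanes} lanes")   # ~31.5 GB/s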

Reply to
upsidedown

PCIe-type things are useless for on-chip wiring.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs
