New Model for Processing/Computing. (And some questions about current multi-core inter-communication design)

The basic idea for a new model for processing/computing is as follows:

The memory latency bottleneck is eliminated by the following design:

  1. Millions of small processors, each capable of executing a basic instruction set and holding a small amount of memory.
  2. The processors communicate with each other directly.
  3. Instead of an instruction pointer there could be a "processor pointer".
  4. The data is passed from processor to processor.

So, for example, this means an algorithm is split up into as many phases as possible; perhaps each instruction of the algorithm is even executed on a different processor.
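To make this concrete, here is a minimal sketch in plain C (my own illustration, not any existing hardware; core_t and run_chain are invented names). Each "core" knows exactly one operation and which core comes next, so the next-core link plays the role of the processor pointer:

#include <stdio.h>

/* One tiny "core": it holds a single instruction plus a link to the
   next core. The link is the "processor pointer". */
typedef struct core {
    int (*op)(int);      /* the one instruction this core executes */
    struct core *next;   /* where the data travels next */
} core_t;

static int add_one(int x)  { return x + 1; }
static int triple(int x)   { return x * 3; }
static int mask_low(int x) { return x & 0xFF; }

/* Feed a datum into the chain; each core does its one step and forwards. */
static int run_chain(core_t *c, int data) {
    while (c) {
        data = c->op(data);
        c = c->next;         /* follow the processor pointer */
    }
    return data;
}

int main(void) {
    core_t c2 = { mask_low, NULL };
    core_t c1 = { triple,   &c2  };
    core_t c0 = { add_one,  &c1  };
    printf("result: %d\n", run_chain(&c0, 41));  /* ((41+1)*3) & 0xFF = 126 */
    return 0;
}

Of course this runs sequentially on one real CPU; the point is only the control structure: no instruction pointer walking a program, just data walking a chain of single-purpose processors.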

This has a couple of beneficial effects:

  1. Massive amounts of pipelining, which will cause a significant speed-up, close to parallel processing.
  2. Maximum efficiency of instruction caches and maximum efficiency of data caches.
  3. Branching would no longer be an issue: processors are allocated for each branch of instructions, so there are no more pipeline stalls and such.

There could be some drawbacks but these could be minor:

  1. For example, the first few processors might be doing work that is no longer necessary because later processors have already determined the algorithm is done; in such a case this unnecessary work could simply be thrown away.

I kinda like this idea.

I don't think current processors can work this way, but I am not sure so let's examine that and ask some questions:

First the goal of current processors and algorithm design:

  1. Split the algorithm up into as many phases as possible.
  2. Fit as many phases as possible into L1 instruction cache of first core.
  3. Fit as many remaining phases as possible into the L1 instruction cache of the second core, third core, and so forth.
  4. Read data in blocks of L1 data cache size, up to the total size of the L2 or L3 data cache.
  5. Process these blocks in a pipelined fashion:

time/pipeline step 1: Core 0 processes block 0 with phases 0,1,2 of the algorithm. Core 0 passes block 0 to core 1.

time/pipeline step 2: Core 1 processes block 0 with phases 3,4,5,6 of the algorithm. Core 1 passes block 0 to core 2. Core 0 processes block 1 with phases 0,1,2 of the algorithm. Core 0 passes block 1 to core 1.

time/pipeline step 3: Core 0 processes block 2 with phases 0,1,2 of the algorithm. Core 0 passes block 2 to core 1. Core 1 processes block 1 with phases 3,4,5,6 of the algorithm. Core 1 passes block 1 to core 2. Core 2 processes block 0 with phase 7 of the algorithm.
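A quick way to check the schedule above is to simulate it. This little C program (my own sketch, nothing vendor-specific) prints which core works on which block at each timestep, using the rule that core c picks up block t - c at step t:

#include <stdio.h>

#define CORES  3
#define BLOCKS 4

int main(void) {
    /* Phases per core, matching the example above:
       core 0 -> 0,1,2; core 1 -> 3,4,5,6; core 2 -> 7. */
    const char *phases[CORES] = { "0,1,2", "3,4,5,6", "7" };

    /* In a linear pipeline, core c processes block (t - c) at step t. */
    for (int t = 0; t < CORES + BLOCKS - 1; t++) {
        printf("time/pipeline step %d:\n", t + 1);
        for (int c = 0; c < CORES; c++) {
            int block = t - c;
            if (block >= 0 && block < BLOCKS)
                printf("  core %d processes block %d with phases %s\n",
                       c, block, phases[c]);
        }
    }
    return 0;
}

Its output for steps 1 through 3 reproduces the hand-written schedule above.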

Visualizing this would require a 3D visualization:

X-axis: cores
Y-axis: phases
Z-axis: timesteps/pipeline steps

Anyway, my questions, mostly about current x86 hardware, are these:

  1. Can processors intercommunicate with each other directly, without having to use caches or main memory?

  2. If not, and they have to use the caches, would this for example cause a bottleneck in the L2 caches? (A sketch of how cores pass data through shared memory today follows after this list.)

  3. In case the L2 cache is indeed a bottleneck, then hopefully this new design will prevent such a bottleneck.
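As far as I know the answer to question 1 on current x86 parts is no: cores communicate through the cache-coherency protocol, which moves cache lines between L1/L2/L3; there is no programmer-visible core-to-core wire. So today the closest thing to "core 0 passes data to core 1" is a shared-memory queue. Here is a minimal single-producer/single-consumer ring buffer sketch in C11 (the names and sizes are my own choices):

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 256   /* power of two so indices can be masked */

typedef struct {
    _Atomic uint32_t head;       /* advanced by the consumer */
    _Atomic uint32_t tail;       /* advanced by the producer */
    int slots[RING_SIZE];
} ring_t;

/* Producer side, e.g. running on core 0. Returns 0 if the ring is full. */
static int ring_push(ring_t *r, int v) {
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SIZE) return 0;            /* full */
    r->slots[t & (RING_SIZE - 1)] = v;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 1;
}

/* Consumer side, e.g. running on core 1. Returns 0 if the ring is empty. */
static int ring_pop(ring_t *r, int *out) {
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (t == h) return 0;                        /* empty */
    *out = r->slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}

Every push/pop turns into cache-line traffic between the two cores, so with many such links the shared L2/L3 and the coherency fabric are exactly where a bottleneck would show up.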

Perhaps groups of 4 or more processors could be interconnected so that some form of forward communication can take place.

I have seen some grid designs... though perhaps that's not the best design for this kind of processor.

Perhaps a simple serial form of communication would work nicely for example:

Core 0 -> Core 1 -> Core 2 -> Core 3 -> Core 4

For branching purposes it might be enough to connect an additional line as follows:

Core 200 -> Core 201 -> Core 202 -> Core 203 ... -> Core 299
   ^           ^           ^           ^               ^
Core 100 -> Core 101 -> Core 102 -> Core 103 ... -> Core 199
   ^           ^           ^           ^               ^
Core 000 -> Core 001 -> Core 002 -> Core 003 ... -> Core 099

Perhaps core 099 could go all the way back to core 100... to continue its pipeline execution.

At least this would give each core a chance to diverge its branch execution to a different row of cores, and give room for 100 instructions or fewer before it might become a problem at core 099.

Perhaps zig zagging the communication lines might be even better.

Core  8,  9, 10, 11
Core  7,  6,  5,  4
Core  0,  1,  2,  3

At least in the ideal pipelined communication situation this would keep the lines at a minimum distance:

from 0 to 1 to 2 to 3 to 4 to 5 to 6 to 7 to 8 to 9 to 10 to 11.

If a branch occurs from core 0 to core 4, however, the diagonal line would be long; that is the worst case (see the distance sketch below).
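Assuming a row width of 4 as in the 0..11 layout above, the core index maps to grid coordinates as in this sketch (my own boustrophedon mapping, nothing standard), which lets you compute the wire distance for any hop:

#include <stdio.h>
#include <stdlib.h>

#define WIDTH 4   /* cores per row, as in the 0..11 example */

/* Map a core index to (x, y) in the zig-zag layout:
   even rows run left-to-right, odd rows right-to-left. */
static void coord_of(int core, int *x, int *y) {
    *y = core / WIDTH;
    int col = core % WIDTH;
    *x = (*y % 2 == 0) ? col : (WIDTH - 1 - col);
}

/* Manhattan distance as a rough proxy for interconnect length. */
static int wire_dist(int a, int b) {
    int ax, ay, bx, by;
    coord_of(a, &ax, &ay);
    coord_of(b, &bx, &by);
    return abs(ax - bx) + abs(ay - by);
}

int main(void) {
    printf("0 -> 1: %d\n", wire_dist(0, 1));   /* neighbours:      1 */
    printf("3 -> 4: %d\n", wire_dist(3, 4));   /* zig-zag turn:    1 */
    printf("0 -> 4: %d\n", wire_dist(0, 4));   /* branch diagonal: 4 */
    return 0;
}

So the zig-zag keeps every pipeline hop at distance 1, while the 0 -> 4 branch costs distance 4.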

Perhaps there can even be lines from each bottom core to each top core for fast branch divergence.

This will depend on whether or not branch divergence is a bottleneck/cause of (signal) stalls.

Preferably the programmer would be given full control over the setup of the processor.

So for example it could be possible to instruct core 0 to pass its data/information along to core 11.

However there might be no interconnect for this.

So it might then be limited to a certain set of selections available to the programmer.

Advanced software could determine the best layout for the pieces of code and the data-passing design.

So the processor must also be able to tell the software how its hardware interconnects are arranged, and what the 2D or 3D layout of its cores is.

That way the programmer/software has some idea of the signal length/distance to travel. Alternatively the processor could specify core-to-core latency in a table; however, the number of possibilities could be quite large.

Or a test instruction could be embedded which allows software to test signal speed from core to core, given its interconnect limitations.
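Something close to that test already exists in software: pin two threads to two cores and ping-pong a flag between them. A rough Linux/pthreads sketch (the core numbers, round count and timing details are my own choices; compile with gcc -O2 -pthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 100000

static _Atomic int flag = 0;

static void pin_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Partner thread on core B: wait for a 1, answer with a 0. */
static void *partner(void *arg) {
    pin_to(*(int *)arg);
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load(&flag) != 1) ;  /* spin */
        atomic_store(&flag, 0);
    }
    return NULL;
}

int main(void) {
    int cpu_b = 1;
    pthread_t t;
    pin_to(0);                             /* main thread on core A */
    pthread_create(&t, NULL, partner, &cpu_b);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store(&flag, 1);
        while (atomic_load(&flag) != 0) ;  /* wait for the reply */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("core 0 <-> core 1 round trip: %.0f ns\n", ns / ROUNDS);
    return 0;
}

Run it over every core pair and you get exactly the latency table discussed above, measured instead of specified.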

Each core could also specify to which other cores it is directly connected.

However this might result in duplicate information.

One possible solution is a 2D or 3D map embedded in the processor which contains a 0 or 1 to indicate whether core (X1,Y1,Z1) can connect to core (X2,Y2,Z2).

2D example:

Interconnect path available (per From/To pair). Or each entry could be a value indicating speed: speed 0 could indicate no path available, speed 1 low, speed 2 medium, speed 3 high and so forth, depending on how many bits are available.

         To: Core  0   1   2   3   4   5   6   7   8   9  10  11
From: Core  0
      Core  1
      Core  2
      Core  3
      Core  4
      Core  5
      Core  6
      Core  7
      Core  8
      Core  9
      Core 10
      Core 11

However, this could waste a precious amount of transistors.

So instead a model number could be embedded into the processor, and these maps could be made available on an external storage device for consultation by the software, to determine how best to lay out the algorithm on the cores. This could be a dynamic process so that it can be optimized for different core designs.
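The consultation step itself is cheap once the map is loaded. A sketch (the file naming and map format here are invented for illustration; in this scheme the table lives on external storage keyed by the CPU's model ID rather than in silicon):

#include <stdint.h>
#include <stdio.h>

#define N_CORES 12

/* Speed map as described above: 0 = no path, 1 = low, 2 = medium, 3 = high. */
static uint8_t speed[N_CORES][N_CORES];

/* Load the map for a given model ID, e.g. from a file "cpu_1234.map". */
static int load_map(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t got = fread(speed, 1, sizeof(speed), f);
    fclose(f);
    return got == sizeof(speed);
}

/* Pick the fastest directly connected next core, or -1 if none. */
static int best_next_core(int from) {
    int best = -1;
    uint8_t best_speed = 0;
    for (int to = 0; to < N_CORES; to++) {
        if (to != from && speed[from][to] > best_speed) {
            best_speed = speed[from][to];
            best = to;
        }
    }
    return best;
}

Note the full table is N^2 entries and grows quadratically with core count, which is exactly the transistor cost the model-number trick keeps off the chip.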

For now I will leave it at that, you have much to consider/learn/go over.

Bye for now, Skybuck.

Reply to
Skybuck Flying

I gather the basic problem is that one can't split most real-life computing into many parallelizable processes. Of course they can be done in series, but when they are, the data to be worked on doesn't exist until it has been created by previous steps, so lots of CPUs merely adds handover steps.

NT

Reply to
tabbypurr

This is an absolutely fantastic idea.

How on earth has no one ever thought of this before.

Skybuck is right, we all have much to consider, learn and go over.

The possibilities are absolutely endless.

My mind is reeling over the fact that no one has thought of this before.

...... oh wait, they have, so many times that I can't even recall a particular example, they just all merge into a general haze.

I wonder whether Skybuck is

a) so arrogant that he didn't think to spend 5 seconds with Google.
b) 13 years old and totally naive.
c) retarded.

Reply to
colin_toogood

You're new here, aren't you?

Reply to
krw

d) smokes a lot of weed

Reply to
DemonicTubes

I suspect speed.

Reply to
tabbypurr

He's admitted being a high-functioning autistic. With various opinions on how "high".

Mark Zenier snipped-for-privacy@eskimo.com Googleproofaddress(account:mzenier provider:eskimo domain:com)

Reply to
Mark Zenier
