The basic idea for a new model for processing/computing is as follows:
The memory latency bottleneck is eliminated by the following design:
- Millions of small processors, each capable of executing a basic instruction set and each with a small amount of memory.
- The processors communicate with each other directly.
- Instead of an instruction pointer there could be a "processor pointer".
- The data is passed from processor to processor.
So, for example, this means an algorithm is split up into as many phases as possible; perhaps each instruction of an algorithm is executed on a different processor.
This has a couple of beneficial effects:
- Massive amounts of pipelining, which will cause a significant speed-up, close to parallel processing.
- Maximum efficiency of instruction caches and data caches.
- Branching could no longer be an issue: processors are allocated for each branch of instructions, so there are no more pipeline stalls and such.
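A minimal sketch of this dataflow style, assuming each tiny core holds exactly one instruction and forwards its result to the next core (all names here are hypothetical, purely to illustrate the idea):

```python
# Minimal sketch of the "processor pointer" idea: each tiny core holds one
# instruction and forwards its result to the next core, so data flows through
# the chain instead of instructions flowing through one core.
# All names here are made up; this only illustrates the dataflow.

def make_chain(instructions):
    """Each element of `instructions` is the single operation run by one core."""
    def run(value):
        for core_id, instruction in enumerate(instructions):
            value = instruction(value)   # core `core_id` executes, then passes on
        return value
    return run

# Example: a 3-core chain computing ((x + 1) * 2) - 3
chain = make_chain([
    lambda x: x + 1,   # core 0
    lambda x: x * 2,   # core 1
    lambda x: x - 3,   # core 2
])

print(chain(10))  # -> 19
```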
There could be some drawbacks, but these might be minor:
- For example, the first few processors might be doing work which is no longer necessary because later processors have already determined that the algorithm is done; in such a case this unnecessary work could simply be thrown away.
I kinda like this idea.
I don't think current processors can work this way, but I am not sure so let's examine that and ask some questions:
First the goal of current processors and algorithm design:
- Split the algorithm up into as many phases as possible.
- Fit as many phases as possible into the L1 instruction cache of the first core.
- Fit as many remaining phases as possible into the L1 instruction cache of the second core, third core, and so forth.
- Read data in blocks of L1-data-cache size, up to a total of L2 or L3 data cache size.
- Process these blocks in a pipelined fashion:
time/pipeline step 1:
  Core 0 processes block 0 with phases 0,1,2 of the algorithm. Core 0 passes block 0 to core 1.
time/pipeline step 2:
  Core 1 processes block 0 with phases 3,4,5,6 of the algorithm. Core 1 passes block 0 to core 2.
  Core 0 processes block 1 with phases 0,1,2 of the algorithm. Core 0 passes block 1 to core 1.
time/pipeline step 3:
  Core 0 processes block 2 with phases 0,1,2 of the algorithm. Core 0 passes block 2 to core 1.
  Core 1 processes block 1 with phases 3,4,5,6 of the algorithm. Core 1 passes block 1 to core 2.
  Core 2 processes block 0 with phase 7 of the algorithm.
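This schedule can be generated mechanically; a minimal sketch, assuming each block advances exactly one core per timestep (all names are made up for illustration):

```python
# Sketch of the pipeline schedule above, assuming each block advances exactly
# one core per timestep. The phase groups match the example: core 0 runs
# phases 0-2, core 1 runs phases 3-6, core 2 runs phase 7.

phase_groups = {0: [0, 1, 2], 1: [3, 4, 5, 6], 2: [7]}

def pipeline_schedule(num_blocks, num_cores):
    """Return (timestep, core, block) triples for an ideal pipeline."""
    steps = []
    for t in range(num_blocks + num_cores - 1):
        for core in range(num_cores):
            block = t - core              # block lags one step behind per core
            if 0 <= block < num_blocks:
                steps.append((t + 1, core, block))
    return steps

for t, core, block in pipeline_schedule(num_blocks=3, num_cores=3):
    print(f"time/pipeline step {t}: core {core} processes block {block} "
          f"with phases {phase_groups[core]}")
```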
Visualizing this would require a 3D visualization:
X-Axis: cores
Y-Axis: phases
Z-Axis: timesteps/pipeline steps
Anyway, my questions, mostly about current x86 hardware, are these:
- Can processors communicate with each other directly, without having to use caches or main memory?
- If not, and they would have to use the caches, would this for example cause a bottleneck in the L2 caches?
- In case the L2 cache is indeed a bottleneck, then hopefully this new design will prevent such a bottleneck.
Perhaps groups of 4 or more processors could be interconnected so that some form of forward communication can take place.
I have seen some grid designs... though perhaps that's not the best design for this kind of processor.
Perhaps a simple serial form of communication would work nicely for example:
Core 0 -> Core 1 -> Core 2 -> Core 3 -> Core 4
For branching purposes it might be enough to connect an additional line as follows:
Core 200 -> Core 201 -> Core 202 -> Core 203 ... -> Core 299
   ^           ^           ^           ^               ^
Core 100 -> Core 101 -> Core 102 -> Core 103 ... -> Core 199
   ^           ^           ^           ^               ^
Core 000 -> Core 001 -> Core 002 -> Core 003 ... -> Core 099
Perhaps core 099 could loop all the way back to core 100... to continue its pipeline execution.
At least this would give each core a chance to diverge its branch execution to a different set of cores, and it gives room for 100 instructions or less before it might become a problem at core 099.
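The connectivity of this 3x100 grid can be sketched as a small function; the wrap-around from the end of one row to the start of the next, and the upward branch links, are taken from the picture above (everything else is my assumption):

```python
# Sketch of the 3-row layout above: cores 000-099, 100-199, 200-299.
# Each core connects forward to its right neighbour and upward to the core
# directly above it (for branch divergence). The last core of a row wraps
# to the first core of the next row, as with core 099 -> core 100.

def neighbours(core):
    """Hypothetical forward/branch links of one core in the 3x100 grid."""
    row, col = divmod(core, 100)
    links = []
    if col < 99:
        links.append(core + 1)         # forward along the row
    elif row < 2:
        links.append((row + 1) * 100)  # e.g. core 099 wraps to core 100
    if row < 2:
        links.append(core + 100)       # upward branch line
    return links

print(neighbours(0))    # -> [1, 100]
print(neighbours(99))   # -> [100, 199]
print(neighbours(299))  # -> []  (last core, nowhere left to go)
```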
Perhaps zig zagging the communication lines might be even better.
Core  8,  9, 10, 11
Core  7,  6,  5,  4
Core  0,  1,  2,  3
At least in the ideal pipelined communication situation this would keep the lines at a minimum distance:
from 0 to 1 to 2 to 3 to 4 to 5 to 6 to 7 to 8 to 9 to 10 to 11.
If a branch occurs from core 0 to core 4, the diagonal line would be long; that is the worst case.
Perhaps there can even be lines from each bottom core to each top core for fast branch divergence.
This will depend on whether branch divergence is a bottleneck/cause of (signal) stalls or not.
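The wire-length reasoning above can be sketched with a small model of the zig-zag (serpentine) layout; the row width and the straight-line distance metric are my assumptions, purely for illustration:

```python
# Sketch of the zig-zag (serpentine) layout above: 4 cores per row, with
# every odd row reversed so consecutive core ids stay physically adjacent.
# Distances are in grid units; purely illustrative.

ROW_WIDTH = 4

def position(core):
    """(x, y) position of a core in the serpentine layout."""
    row, col = divmod(core, ROW_WIDTH)
    x = col if row % 2 == 0 else ROW_WIDTH - 1 - col  # odd rows run backwards
    return x, row

def wire_length(a, b):
    """Euclidean length of a direct line between two cores."""
    (ax, ay), (bx, by) = position(a), position(b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

print(wire_length(0, 1))  # chain neighbours: distance 1.0
print(wire_length(3, 4))  # end of row to start of next: still distance 1.0
print(wire_length(0, 4))  # branch from core 0 to core 4: the long diagonal
```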
Preferably the programmer would be given full control over the setup of the processor.
So, for example, it could be possible to instruct core 0 to pass its data/information along to core 11.
However there might be no interconnect for this.
So maybe it might then be limited to a certain set of selections available to the programmer.
Advanced software could determine the best layout for pieces of code and data passing design.
So the processor must also be able to tell the software how its hardware interconnects are arranged, and what the 2D or 3D layout of its cores is.
That way the programmer/software has some idea of the signal length/distance to travel. Alternatively, the processor could specify the latency in a core-to-core table; however, the number of possibilities could be quite large.
Or a test instruction could be provided which allows software to measure the signal speed from core to core, given the interconnect limitations.
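A sketch of what software could do with such a test instruction: probe each link once and cache the results in a table. The probe here is a deterministic stand-in, since no real hardware instruction exists for this; every name is hypothetical.

```python
# Sketch of the proposed test instruction: software probes each core pair
# once and builds a latency table it can consult when laying out code.
# `probe_link` is a stand-in for the hypothetical hardware probe; here it
# just produces a deterministic fake measurement in nanoseconds.

import random

def probe_link(src, dst):
    """Stand-in for a hardware latency probe (returns fake nanoseconds)."""
    random.seed(src * 1000 + dst)        # deterministic fake measurement
    return 1.0 + random.random() * 4.0

def build_latency_table(cores):
    """Measure every directed core pair once and cache the results."""
    return {(a, b): probe_link(a, b)
            for a in cores for b in cores if a != b}

table = build_latency_table(range(4))
best = min(table, key=table.get)         # fastest link found by probing
print(f"fastest link: core {best[0]} -> core {best[1]}")
```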
Each core could also specify to which other cores it is directly connected.
However this might result in duplicate information.
One possible solution is a 2D or 3D map embedded into the processor which contains a 0 or 1 to indicate whether core X1,Y1,Z1 can connect to core X2,Y2,Z2.
2D example (interconnect path available):
Or it could be a value indicating the link's speed: speed 0 could indicate no path available, speed 1 low, speed 2 medium, speed 3 high, and so forth, depending on how many bits are available.

             To core:  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11
From core  0:
From core  1:
From core  2:
From core  3:
From core  4:
From core  5:
From core  6:
From core  7:
From core  8:
From core  9:
From core 10:
From core 11:
However, this could waste a precious amount of transistors.
So instead, a model number could be embedded into the processor, and these maps could then be made available on an external storage device for consultation by the software, to determine how to best lay out the algorithm on these cores. This could be a dynamic process, so that it can be optimized for different core designs.
For now I will leave it at that, you have much to consider/learn/go over.
Bye for now, Skybuck.