Theory of co-divisional of electronics and math expression limit(s)

maybe PICTURE THIS
> Mr. Moore's type of 25 X model, HOWEVER,
> 1) expanded to a sixteen by sixteen array for A and B busses, (5x5 --> 16x16)
> 1a) both A and B are MULTIPLEXED into TWO buses, each with an ID
> multiplier of sixteen for inter and intra processor reg maps,
> ( SIXTEEN is for a MAXIMUM bandwidth! )
> 1b) both A and B have peek ahead, two register element stacks
> 1c) !!! TIMES FOUR FOR FAULT TOLERANT ( SUPER COOLED?) VERSION OPTION !!!
> 2) special C bus for local parallel memory ( Direct RamBus DRAM ?)
> 3) extra X and Y stacks ( along with the T/S/parameter stack )
> 4) that's it! this is my whole base model list!
> 5) iterate testing and recurse testing of my sixteen bit VLIW decode.
> maw
> ---
> Mr Moore's X18 homepage ( obsolete? )
> ---
> Mr Moore's 25 x ( obsolete? )
> ---
> Chuck Moore's X18 Forth Microcomputer Core
>
> Updated 2001 June
> X18 Microcomputer core
> High performance, low power Forth engine. Optimized for compute-bound
> portable applications. 18 bit address/data matches cache SRAM.
> Features
> 2400 Mips, sustained
> Asynchronous (no external clock)
> 2 16-deep push-down stacks
> 27 0-operand instructions
> 128 words ROM, 384 DRAM
> Watchdog timer
> 20 mW @ 1.8 V
> .2 sq mm
> Architecture
> The X18 is an evolution of the F21 and i21 microprocessors. With .18um
> transistors, it has 5x their speed and 1/5 their power. It has their
> 16-deep Return and Data stacks and 27 0-operand instructions, packed 3
> per word. A 100ms watchdog timer assures continued operation. Boots
> from on-chip ROM.
> Redesigned with new layout and simulation tools to be robust and to
> minimize power. The computer can be throttled by a factor of 1024 to
> provide 2.4 Mips using 20 uW. It may be stopped altogether, but will
> have to reboot.
> Multiply (125 Mops) and divide (40 Mops) have been improved.
> Internal memory is fast enough (1 ns) to sustain 2400 Mips. Data
> access, especially to external SRAM, will slow this.
> Code is loaded into on-chip DRAM for execution.
> CPU
> Forth code is highly factored into many small subroutines. An optimized
> processor requires an efficient call/return mechanism. This is best
> achieved with 2 push-down stacks. Each is implemented as a register
> feeding a 16x18-bit RAM with 8-transistor bit cells. The current entry
> is indicated by a 16-bit bidirectional, circular shift register.
> One stack is used to store subroutine return addresses. All
> processors have such a stack. The other is used to pass parameters to
> and from subroutines. Other processors use registers or stack frames
> for this purpose. However, all languages use an implicit stack to
> evaluate expressions. Forth makes it explicit.
> As if emphasizing their importance, the stacks require 2/3 of the
> CPU silicon area. It is difficult to achieve their 1-cycle access
> timing.
> The merits of stack vs. register designs have been argued for
> decades. A comprehensive book, Stack Computers, by Phil Koopman, has
> been published online. To quote
> Sec 6.2: "0-operand stack addressing ... makes stack machines superior
> to conventional machines in the areas of program size, processor
> complexity, system complexity, processor performance, and consistency
> of program execution."
> The Forth ALU operates on the top 1 or 2 items of the parameter
> stack, leaving the result there. This permits 0-operand instructions.
> Eliminating register addresses permits shorter instructions, in this
> case 5-bit. Several instructions are required to rearrange the stack.
> And it's convenient to move things to the return stack.
> An address register is useful to reduce stack manipulation. It also
> supports incrementing to address successive words in memory. Similar
> use of the top of the return stack provides 2 addresses for
> memory-memory moves.
> A demultiplexor allows the packing of up to 3 instructions per
> word. This increases the density of compiled code and reduces the
> interference between instruction and data memory access. It keeps the
> CPU busy while the next instruction is being fetched, providing a
> sustained execution speed of 2400 Mips.
> This is implemented by a 3-bit shift register. The current bit
> enables its slot into the instruction latch. A ready pulse from the
> memory manager latches the high-order 5 bits (slot 0). The pulse is
> delayed by a string of 14 inverters so that it repeats 2 ns later,
> latching the next slot. Slot 2 stops the process, as does a jump or
> fetch/store, until the next ready pulse.
> There are 27 simple instructions, exactly suited to Forth. This
> allows 1-1 compilation of Forth source to machine code. On other
> processors, each Forth primitive requires several instructions. The
> situation is reversed for other languages: several Forth instructions
> may be required for their primitives.
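A rough sketch of the slot packing the quoted page describes, in Python. The page only says that three 5-bit instructions go into one 18-bit word and that slot 0 is the high-order 5 bits. Putting slot 1 and slot 2 directly below slot 0 and leaving the low 3 bits unused is my assumption, and the names pack_slots and unpack_slots are made up for illustration.

```python
WORD_BITS = 18   # X18 word width (from the quoted page)
SLOT_BITS = 5    # width of one instruction (from the quoted page)
SLOTS = 3        # instructions packed per word (from the quoted page)


def slot_shift(slot):
    """Bit position of a slot's low bit. ASSUMED layout: slot 0 is bits 17..13."""
    return WORD_BITS - SLOT_BITS * (slot + 1)


def pack_slots(ops):
    """Pack three 5-bit opcodes into one 18-bit word (hypothetical helper)."""
    if len(ops) != SLOTS:
        raise ValueError("expected exactly 3 opcodes")
    word = 0
    for slot, op in enumerate(ops):
        if not 0 <= op < (1 << SLOT_BITS):
            raise ValueError("opcode does not fit in 5 bits")
        word |= op << slot_shift(slot)
    return word


def unpack_slots(word):
    """Return the three opcodes of a word, slot 0 first (hypothetical helper)."""
    mask = (1 << SLOT_BITS) - 1
    return [(word >> slot_shift(slot)) & mask for slot in range(SLOTS)]


# dup (1a), + (17), ; (6) taken from the quoted opcode table
word = pack_slots([0x1A, 0x17, 0x06])
print(f"{word:05x}", [f"{op:02x}" for op in unpack_slots(word)])
```

In the real chip the slots are latched one after another by the delayed ready pulse. The sketch only shows the bit layout, not the timing.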
>
> Register
>   T   Top of stack
>   S   2nd number on stack
>   R   Top of Return stack
>   A   Address
> Remember that fetch pushes the stack, store and binary operations
> pop it.
>
> Code  Op      Action
> 0     word ;  Jump to subroutine; tail recursion
> 1     if      Jump to 'then' if T0-T17 are zero
> 2     word    Call subroutine
> 3     -if     Jump to 'then' if T17 is one
> 6     ;       Return
> 8     @r      Fetch from address in R
> 9     @+      Fetch from address in A; increment A
> a     n       Fetch literal
> b     @       Fetch from address in A
> c     !r      Store into address in R
> d     !+      Store into address in A; increment A
> f     !       Store into address in A
> 10    -       Ones-complement T
> 11    2*      Shift T left 1 bit
> 12    2/      Shift T right 1 bit; preserve T17
> 13    +*      Add S to T if T0=1 (multiply step)
> 14    or      Exclusive-or S to T
> 15    and     And S to T
> 17    +       Add S to T
> 18    pop     Fetch R
> 19    a       Fetch A
> 1a    dup     Duplicate T
> 1b    over    Fetch S
> 1c    push    Store into R
> 1d    a!      Store into A
> 1e    nop     Do nothing
> 1f    drop    Store T nowhere
> nop
> Another advantage of the 5-bit instruction is ease of decoding. A
> tree of NAND and NOR gates leads from the instruction bus to the enable
> for each register. This is facilitated by the limit of 10 lines to be
> routed: each bit and its complement.
>
> ---
> Chuck Moore's 25x Forth Multicomputer Chip
>
> Updated 2001 June
> 25x Microcomputer
> An array of 25 microcomputers on a 7 sq mm die.
> Features
> .2 sq mm asynchronous microcomputer core
> 5 x 5 array of cores: 60,000 Mips
> 5 horizontal, 5 vertical parallel interconnect buses: 180 GHz
> bandwidth
> Specialized computers to interface off-chip.
> Max power 500 mW @ 1.8 V, with 25 computers running
> 100mAh battery life is 1 year, with 1 computer running throttled
> 64-pin SOIC: mirrored pin-out to 4ns cache SRAM
> Array chips on 2-sided PCB
> Description
> Availability of the tiny (.2 sq mm), asynchronous X18
> microcomputer core naturally suggested arraying it on a chip. Its
> extremely low power (20 mW) made that feasible.
> A 5x5 array was chosen
> to fit on a 7 sq mm die, the smallest available prototype, though
> larger arrays are possible. 25 computers running at 2400 Mips is a
> total of 60,000 Mips. An unlimited supply.
> Communication among the computers is provided by a network with 5
> horizontal and 5 vertical buses. Each computer has 2 bus registers to
> access a horizontal and a vertical bus. Each bus is 18-bits wide and
> can run at 1 GHz. All 10 buses can be active at once connecting a
> 20-computer subset. So total bandwidth is 180 GHz.
> Each computer can be customized. Registers are added to the 16
> processors at the edge of the array and connected to package pins. Each
> computer is responsible for a particular interface. Protocols are
> implemented with software.
> SRAM controller
> Flash controller
> 4 serial controllers
> USB controller
> D/A controller
> A/D controller
> After booting from ROM, the computers await code downloaded from one of
> these interfaces.
> Pinout
> Chosen to be the mirror image of an 18-bit cache memory chip. This is
> the fastest memory available, with 4 ns access. Its package is a
> 100-pin SOIC. The 18-bit Multicomputer thus has 256K words of external
> memory in 1 chip.
> Putting the Multicomputer chip on the top of a 2-sided PCB and the
> SRAM chip on the bottom gives a very small footprint. A decoupling
> capacitor is the only other component needed. An array of such pairs is
> a multicomputer board. Connecting Multicomputer to SRAM is trivial,
> with mm traces. Routing for power and a serial network is also easy.
> Computers load code from the network.
> A parallel computer with 60Gips nodes! Power is determined by the
> SRAM.
> Cost/Availability
> The chip is awaiting funding. If interested, contact
> snipped-for-privacy@mindspring.com
> A 7 sq mm die, packaged, will cost about $1 in quantity 1,000,000.
> Cost per Mip is 0.
> 25 prototypes can be obtained from MOSIS for $14,000 with 16 week
> turn-around. The TSMC .18um process has monthly submissions.
>
> ---
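A quick arithmetic check of the headline figures in the quoted pages, as a Python sketch. All input numbers come from the quote. The variable names are mine, and reading "180 GHz" as 10 buses x 18 bits x 1 GHz (so really 180 Gbit/s) is my interpretation of how the page arrives at that number.

```python
# Figures taken from the quoted X18 / 25x pages
CORES = 25               # 5 x 5 array
MIPS_PER_CORE = 2400     # sustained, per core
CORE_POWER_MW = 20       # per core at 1.8 V
THROTTLE = 1024          # throttle factor
BUSES = 5 + 5            # horizontal + vertical
BUS_WIDTH_BITS = 18
BUS_RATE_GHZ = 1

total_mips = CORES * MIPS_PER_CORE                    # array throughput
total_power_mw = CORES * CORE_POWER_MW                # all cores running
bus_gbits = BUSES * BUS_WIDTH_BITS * BUS_RATE_GHZ     # ASSUMED meaning of "180 GHz"
throttled_mips = MIPS_PER_CORE / THROTTLE             # page rounds this to 2.4
throttled_uw = CORE_POWER_MW * 1000 / THROTTLE        # page rounds this to 20

print(total_mips, total_power_mw, bus_gbits)
print(throttled_mips, throttled_uw)
```

So 60,000 Mips and 500 mW fall straight out of 25 cores, and the throttled numbers are 2.34375 Mips at about 19.5 uW, which the page rounds to 2.4 Mips and 20 uW.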

Maybe an important note.

ONLY the diagonal needs the X[], Y[] and *SPECIAL* C[] registers (each is unique for parallel RAM access, a 4 x 4 x MemWidth multiplex for maybe four Direct Rambus DRAMs).

The other nodes (200+) are used for programmable multiplexing.
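To put numbers on that, here is a small Python sketch of the node count for the sixteen by sixteen array. It assumes "the diagonal" means nodes whose row equals their column, and that the 16 diagonal nodes split evenly over four DRAM channels. Both are my reading of the note, not a worked-out design, and the names are only for illustration.

```python
N = 16            # sixteen by sixteen array, from the model list
DRAM_CHANNELS = 4 # "maybe four" Direct Rambus DRAMs, from the note

# ASSUMPTION: diagonal node = same row and column index
diagonal = [(i, i) for i in range(N)]
others = [(r, c) for r in range(N) for c in range(N) if r != c]

# ASSUMPTION: diagonal nodes are shared out evenly, 4 per DRAM channel
per_channel = N // DRAM_CHANNELS
channel_of = {node: node[0] // per_channel for node in diagonal}

print(len(diagonal), "diagonal nodes with X[], Y[], C[]")
print(len(others), "other nodes for programmable multiplexing")
print(per_channel, "diagonal nodes per DRAM channel")
```

That gives 16 diagonal nodes and 240 others, which is where the "200+" comes from, and 4 diagonal nodes per DRAM if the split really is 4 x 4.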

Night all,

maw