Theory of co-divisional of electronics and math expression limit(s)

maybe PICTURE THIS

> Mr. Moore's type of 25 X model, HOWEVER,
> 1) expanded to a sixteen by sixteen array for A and B busses,
>    (5x5 --> 16x16)
> 1a) both A and B are MULTIPLEXED into TWO buses, each with an ID
>     multiplier of sixteen for inter and intra processor reg maps,
>     ( SIXTEEN is for a MAXIMUM bandwidth! )
> 1b) both A and B have peek ahead, two register element stacks
> 1c) !!! TIMES FOUR FOR FAULT TOLERANT ( SUPER COOLED?) VERSION OPTION !!!
> 2) special C bus for local parallel memory ( Direct RamBus DRAM ?)
> 3) extra X and Y stacks ( along with the T/S/parameter stack )
> 4) thats it! this is my whole base model list!
> 5) iterate testing and recurse testing of my sixteen bit VLIW decode.
> maw
> ---
> Mr Moore's X18 homepage ( obsolete? )
> ---
> Mr Moore's 25 x ( obsolete? )
> ---
> Chuck Moore's X18 Forth Microcomputer Core
>
> Updated 2001 June
>
> X18 Microcomputer core
> High performance, low power Forth engine. Optimized for compute-bound
> portable applications. 18 bit address/data matches cache SRAM.
>
> Features
>   2400 Mips, sustained
>   Asynchronous (no external clock)
>   2 16-deep push-down stacks
>   27 0-operand instructions
>   128 words ROM, 384 DRAM
>   Watchdog timer
>   20 mW @ 1.8 V
>   .2 sq mm
>
> Architecture
> The X18 is an evolution of the F21 and i21 microprocessors. With .18um
> transistors, it has 5x their speed and 1/5 their power. It has their
> 16-deep Return and Data stacks and 27 0-operand instructions, packed 3
> per word. A 100ms watchdog timer assures continued operation. Boots
> from on-chip ROM.
>
> Redesigned with new layout and simulation tools to be robust and to
> minimize power. The computer can be throttled by a factor of 1024 to
> provide 2.4 Mips using 20 uW. It may be stopped altogether, but will
> have to reboot.
>
> Multiply (125 Mops) and divide (40 Mops) have been improved.
>
> Internal memory is fast enough (1 ns) to sustain 2400 Mips. Data
> access, especially to external SRAM, will slow this.
> Code is loaded into on-chip DRAM for execution.
>
> CPU
> Forth code is highly factored into many small subroutines. An optimized
> processor requires an efficient call/return mechanism. This is best
> achieved with 2 push-down stacks. Each is implemented as a register
> feeding a 16x18-bit RAM with 8-transistor bit cells. The current entry
> is indicated by a 16-bit bidirectional, circular shift register.
>
> One stack is used to store subroutine return addresses. All
> processors have such a stack. The other is used to pass parameters to
> and from subroutines. Other processors use registers or stack frames
> for this purpose. However, all languages use an implicit stack to
> evaluate expressions. Forth makes it explicit.
>
> As if emphasizing their importance, the stacks require 2/3 of the
> CPU silicon area. It is difficult to achieve their 1-cycle access
> timing.
>
> The merits of stack vs. register designs have been argued for
> decades. A comprehensive book, Stack Computers, by Phil Koopman, has
> been published online (.../stack_computers/index.html). To quote
> Sec 6.2: "0-operand stack addressing ... makes stack machines superior
> to conventional machines in the areas of program size, processor
> complexity, system complexity, processor performance, and consistency
> of program execution."
>
> The Forth ALU operates on the top 1 or 2 items of the parameter
> stack, leaving the result there. This permits 0-operand instructions.
> Eliminating register addresses permits shorter instructions, in this
> case 5-bit. Several instructions are required to rearrange the stack.
> And it's convenient to move things to the return stack.
>
> An address register is useful to reduce stack manipulation. It also
> supports incrementing to address successive words in memory. Similar
> use of the top of the return stack provides 2 addresses for
> memory-memory moves.
>
> A demultiplexor allows the packing of up to 3 instructions per
> word. This increases the density of compiled code and reduces the
> interference between instruction and data memory access. It keeps the
> CPU busy while the next instruction is being fetched, providing a
> sustained execution speed of 2400 Mips.
>
> This is implemented by a 3-bit shift register. The current bit
> enables its slot into the instruction latch. A ready pulse from the
> memory manager latches the high-order 5 bits (slot 0). The pulse is
> delayed by a string of 14 inverters so that it repeats 2 ns later,
> latching the next slot. Slot 2 stops the process, as does a jump or
> fetch/store, until the next ready pulse.
>
> There are 27 simple instructions, exactly suited to Forth. This
> allows 1-1 compilation of Forth source to machine code. On other
> processors, each Forth primitive requires several instructions. The
> situation is reversed for other languages: several Forth instructions
> may be required for their primitives.
>
> Register
>   T   Top of stack
>   S   2nd number on stack
>   R   Top of Return stack
>   A   Address
>
> Remember that fetch pushes the stack, store and binary operations
> pop it.
>
> Code  Op      Action
> 0     word ;  Jump to subroutine; tail recursion
> 1     if      Jump to 'then' if T0-T17 are zero
> 2     word    Call subroutine
> 3     -if     Jump to 'then' if T17 is one
> 6     ;       Return
> 8     @r      Fetch from address in R
> 9     @+      Fetch from address in A; increment A
> a     n       Fetch literal
> b     @       Fetch from address in A
> c     !r      Store into address in R
> d     !+      Store into address in A; increment A
> f     !       Store into address in A
> 10    -       Ones-complement T
> 11    2*      Shift T left 1 bit
> 12    2/      Shift T right 1 bit; preserve T17
> 13    +*      Add S to T if T0=1 (multiply step)
> 14    or      Exclusive-or S to T
> 15    and     And S to T
> 17    +       Add S to T
> 18    pop     Fetch R
> 19    a       Fetch A
> 1a    dup     Duplicate T
> 1b    over    Fetch S
> 1c    push    Store into R
> 1d    a!      Store into A
> 1e    nop     Do nothing
> 1f    drop    Store T nowhere
>
> Another advantage of the 5-bit instruction is ease of decoding. A
> tree of NAND and NOR gates leads from the instruction bus to the enable
> for each register. This is facilitated by the limit of 10 lines to be
> routed: each bit and its complement.
>
> ---
> Chuck Moore's 25x Forth Multicomputer Chip
>
> Updated 2001 June
>
> 25x Microcomputer
> An array of 25 microcomputers on a 7 sq mm die.
>
> Features
>   .2 sq mm asynchronous microcomputer core
>   5 x 5 array of cores: 60,000 Mips
>   5 horizontal, 5 vertical parallel interconnect buses: 180 GHz
>   bandwidth
>   Specialized computers to interface off-chip.
>   Max power 500 mW @ 1.8 V, with 25 computers running
>   100mAh battery life is 1 year, with 1 computer running throttled
>   64-pin SOIC: mirrored pin-out to 4ns cache SRAM
>   Array chips on 2-sided PCB
>
> Description
> Availability of the tiny (.2 sq mm), asynchronous X18
> microcomputer core naturally suggested arraying it on a chip. Its
> extremely low power (20 mW) made that feasible.
> A 5x5 array was chosen to fit on a 7 sq mm die, the smallest available
> prototype, though larger arrays are possible. 25 computers running at
> 2400 Mips is a total of 60,000 Mips. An unlimited supply.
>
> Communication among the computers is provided by a network with 5
> horizontal and 5 vertical buses. Each computer has 2 bus registers to
> access a horizontal and a vertical bus. Each bus is 18-bits wide and
> can run at 1 GHz. All 10 buses can be active at once connecting a
> 20-computer subset. So total bandwidth is 180 GHz.
>
> Each computer can be customized. Registers are added to the 16
> processors at the edge of the array and connected to package pins. Each
> computer is responsible for a particular interface. Protocols are
> implemented with software.
>   SRAM controller
>   Flash controller
>   4 serial controllers
>   USB controller
>   D/A controller
>   A/D controller
> After booting from ROM, the computers await code downloaded from one of
> these interfaces.
>
> Pinout
> Chosen to be the mirror image of an 18-bit cache memory chip. This is
> the fastest memory available, with 4 ns access. Its package is a
> 100-pin SOIC. The 18-bit Multicomputer thus has 256K words of external
> memory in 1 chip.
>
> Putting the Multicomputer chip on the top of a 2-sided PCB and the
> SRAM chip on the bottom gives a very small footprint. A decoupling
> capacitor is the only other component needed. An array of such pairs is
> a multicomputer board. Connecting Multicomputer to SRAM is trivial,
> with mm traces. Routing for power and a serial network is also easy.
> Computers load code from the network.
>
> A parallel computer with 60Gips nodes! Power is determined by the
> SRAM.
>
> Cost/Availability
> The chip is awaiting funding. If interested, contact
> snipped-for-privacy@mindspring.com
>
> A 7 sq mm die, packaged, will cost about $1 in quantity 1,000,000.
> Cost per Mip is 0.
>
> 25 prototypes can be obtained from MOSIS for $14,000 with 16 week
> turn-around. The TSMC .18um process has monthly submissions.
>
> ---
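
A quick sanity check of the headline numbers in the quoted pages, as a minimal Python sketch. The inputs are the figures quoted above; the variable names are mine, and reading "180 GHz bandwidth" as 10 buses x 18 bits x 1 GHz (that is, 180 Gbit/s) is my assumption about what the page means.

```python
# Figures taken from the quoted X18 / 25x pages; names are illustrative only.
CORE_MIPS = 2400        # sustained Mips per X18 core
CORE_POWER_MW = 20      # mW per core at 1.8 V
THROTTLE = 1024         # quoted throttle factor
CORES = 5 * 5           # 5 x 5 array
BUSES = 5 + 5           # 5 horizontal + 5 vertical
BUS_WIDTH_BITS = 18
BUS_CLOCK_GHZ = 1

total_mips = CORE_MIPS * CORES                        # array throughput
array_power_mw = CORE_POWER_MW * CORES                # all cores running
# Assumption: "180 GHz" means aggregate bit rate over all 10 buses.
bus_gbit_per_s = BUSES * BUS_WIDTH_BITS * BUS_CLOCK_GHZ
throttled_mips = CORE_MIPS / THROTTLE                 # one throttled core
throttled_uw = CORE_POWER_MW * 1000 / THROTTLE        # its power in uW

print(total_mips, array_power_mw, bus_gbit_per_s)
print(round(throttled_mips, 2), round(throttled_uw, 1))
```

The products agree with the quoted 60,000 Mips, 500 mW and 180 figures; the throttled values come out slightly under the rounded "2.4 Mips using 20 uW".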

Maybe an important note.

ONLY the diagonal needs the X[], Y[] and *SPECIAL* C[] registers (each is unique for parallel RAM access: a 4 x 4 x MemWidth multiplex for maybe four Direct Rambus DRAMs).

The other 200+ nodes are used for programmable multiplexing.
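
To put a number on "200+", here is a small Python sketch that counts diagonal and off-diagonal nodes. It assumes the array is a square 16 x 16 grid indexed by (row, column) and that "the diagonal" means row == column; the function name is made up for illustration.

```python
N = 16  # assumed side length of the proposed 16 x 16 array


def has_xyc_registers(row: int, col: int) -> bool:
    """Assumption: only nodes with row == col carry the X[], Y[], C[] registers."""
    return row == col


nodes = [(r, c) for r in range(N) for c in range(N)]
diagonal = [n for n in nodes if has_xyc_registers(*n)]
multiplex = [n for n in nodes if not has_xyc_registers(*n)]

print(len(nodes), len(diagonal), len(multiplex))
```

Under those assumptions there are 16 diagonal nodes and 240 left over for multiplexing, which fits the "200+" figure.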

Night all.

maw

--
> > > the other ( 200+ nodes are used for programmable multiplexing)

> > In my hypothetical super scalable parallel architecture,
> >
> > stack_machine_id[ A/B-select, A[0..15] or B[0 == self or ZERO ID, 1..15] ]
> >
> > in self mode the program may programmatically generate message-routing
> > machine code for a Direct Memory Access (DMA) like transfer of data:
> > INTRA-PROCESSOR vs INTER-PROCESSOR data transfer
> > ( at least three states: a[0..15], b[0..15] or self[0..15] ).
> > I am IBM.
>

'self' represents the /diagonal/ (of Mr. Moore's modified 25x model),
as within this posting's previously mentioned use of the term
/diagonal/.
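
One possible reading of the stack_machine_id field, sketched in Python. Everything about the bit layout is an assumption for illustration: a 5-bit field with bit 4 as the A/B select and bits 0..3 as the index, and B index 0 taken to mean 'self'. The posting does not fix an encoding, and the names Target and decode_id are hypothetical.

```python
from typing import NamedTuple


class Target(NamedTuple):
    kind: str   # "a", "b", or "self"
    index: int  # 0..15


def decode_id(field: int) -> Target:
    """Decode an assumed 5-bit id: bit 4 = A/B select, bits 0..3 = index."""
    if not 0 <= field < 32:
        raise ValueError("id field must fit in 5 bits")
    select = (field >> 4) & 1   # assumed: 0 selects the A bus, 1 selects B
    index = field & 0xF
    if select == 0:
        return Target("a", index)
    if index == 0:
        return Target("self", 0)  # assumed: B[0] is the 'self' / ZERO ID case
    return Target("b", index)


print(decode_id(0x03), decode_id(0x10), decode_id(0x1F))
```

A self[0..15] state, as listed in the quoted post, would need a wider field or a separate mode bit; this sketch only covers the B[0] == self case.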