maybe PICTURE THIS
> Mr. Moore's type of 25 X model, HOWEVER,
> 1) expanded to a sixteen by sixteen array for A and B busses, (5x5 -->
> 16x16)
> 1a) both A and B are MULTIPLEXED into TWO buses, each with an ID
> multiplier of sixteen for inter and intra processor reg maps,
> ( SIXTEEN is for a MAXIMUM bandwidth! )
> 1b) both A and B have peek ahead, two register element stacks
> 1c) !!! TIMES FOUR FOR FAULT TOLERANT ( SUPER COOLED?) VERSION OPTION
> !!!
> 2) special C bus for local parallel memory ( Direct RamBus DRAM ?)
> 3) extra X and Y stacks ( along with the T/S/parameter stack )
> 4) that's it! this is my whole base model list!
> 5) iterate testing and recurse testing of my sixteen bit VLIW decode.
> maw
> ---
> Mr Moore's X18 homepage ( obsolete? )
> ---
> Mr Moore's 25 x ( obsolete? )
> ---
> Chuck Moore's X18 Forth Microcomputer Core
>
>
>
> Updated 2001 June
> X18 Microcomputer core
> High performance, low power Forth engine. Optimized for compute-bound
> portable applications. 18 bit address/data matches cache SRAM.
> Features
> 2400 Mips, sustained
> Asynchronous (no external clock)
> 2 16-deep push-down stacks
> 27 0-operand instructions
> 128 words ROM, 384 DRAM
> Watchdog timer
> 20 mW @ 1.8 V
> .2 sq mm
> Architecture
> The X18 is an evolution of the F21 and i21 microprocessors. With .18um
> transistors, it has 5x their speed and 1/5 their power. It has their
> 16-deep Return and Data stacks and 27 0-operand instructions, packed 3
> per word. A 100ms watchdog timer assures continued operation. Boots
> from on-chip ROM.
> Redesigned with new layout and simulation tools to be robust and to
> minimize power. The computer can be throttled by a factor of 1024 to
> provide 2.4 Mips using 20 uW. It may be stopped altogether, but will
> have to reboot.
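A quick check of the throttling figures quoted above, as a sketch only. It assumes power scales linearly with instruction rate, which the page implies but does not state; the variable names are mine.

```python
# Figures quoted in the text above.
FULL_MIPS = 2400.0      # sustained Mips at full speed
FULL_POWER_W = 20e-3    # 20 mW at 1.8 V
THROTTLE = 1024         # throttle factor

throttled_mips = FULL_MIPS / THROTTLE
# Assumption: power drops by the same factor as the instruction rate.
throttled_power_w = FULL_POWER_W / THROTTLE

print(f"{throttled_mips:.2f} Mips")          # 2.34 Mips
print(f"{throttled_power_w * 1e6:.1f} uW")   # 19.5 uW
```

Both results round to the 2.4 Mips and 20 uW given in the text.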
> Multiply (125 Mops) and divide (40 Mops) have been improved.
> Internal memory is fast enough (1 ns) to sustain 2400 Mips. Data
> access, especially to external SRAM, will slow this. Code is loaded
> into on-chip DRAM for execution.
> CPU
> Forth code is highly factored into many small subroutines. An optimized
> processor requires an efficient call/return mechanism. This is best
> achieved with 2 push-down stacks. Each is implemented as a register
> feeding a 16x18-bit RAM with 8-transistor bit cells. The current entry
> is indicated by a 16-bit bidirectional, circular shift register.
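A minimal software sketch of the stack described above: a top register feeding a 16-entry RAM, with a wrapping pointer standing in for the one-hot circular shift register. The class and method names are hypothetical, and the silent wrap-around on overflow is my assumption, not something the page states.

```python
class CircularStack:
    """Top register plus a 16 x 18-bit RAM addressed by a wrapping pointer."""

    DEPTH = 16
    MASK = (1 << 18) - 1  # 18-bit cells

    def __init__(self):
        self.top = 0                  # the register feeding the RAM
        self.ram = [0] * self.DEPTH   # the 16 x 18-bit RAM
        self.ptr = 0                  # index form of the one-hot shift register

    def push(self, value):
        # Advance the pointer (wrapping), spill the old top into RAM.
        self.ptr = (self.ptr + 1) % self.DEPTH
        self.ram[self.ptr] = self.top
        self.top = value & self.MASK

    def pop(self):
        # Return the top, refill it from RAM, move the pointer back.
        value = self.top
        self.top = self.ram[self.ptr]
        self.ptr = (self.ptr - 1) % self.DEPTH
        return value


stack = CircularStack()
for v in (1, 2, 3):
    stack.push(v)
popped = [stack.pop(), stack.pop(), stack.pop()]
print(popped)  # [3, 2, 1]
```

Because the pointer only ever moves one position either way, a shift register can do the job of an adder-based stack pointer, which is presumably why the hardware uses one.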
> One stack is used to store subroutine return addresses. All
> processors have such a stack. The other is used to pass parameters to
> and from subroutines. Other processors use registers or stack frames
> for this purpose. However, all languages use an implicit stack to
> evaluate expressions. Forth makes it explicit.
> As if emphasizing their importance, the stacks require 2/3 of the
> CPU silicon area. It is difficult to achieve their 1-cycle access
> timing.
> The merits of stack vs. register designs have been argued for
> decades. A comprehensive book, Stack Computers, by Phil Koopman,
> has been published online. To quote
> Sec 6.2: "0-operand stack addressing ... makes stack machines superior
> to conventional machines in the areas of program size, processor
> complexity, system complexity, processor performance, and consistency
> of program execution."
> The Forth ALU operates on the top 1 or 2 items of the parameter
> stack, leaving the result there. This permits 0-operand instructions.
> Eliminating register addresses permits shorter instructions, in this
> case 5-bit. Several instructions are required to rearrange the stack.
> And it's convenient to move things to the return stack.
> An address register is useful to reduce stack manipulation. It also
> supports incrementing to address successive words in memory. Similar
> use of the top of the return stack provides 2 addresses for
> memory-memory moves.
> A demultiplexor allows the packing of up to 3 instructions per
> word. This increases the density of compiled code and reduces the
> interference between instruction and data memory access. It keeps the
> CPU busy while the next instruction is being fetched, providing a
> sustained execution speed of 2400 Mips.
> This is implemented by a 3-bit shift register. The current bit
> enables its slot into the instruction latch. A ready pulse from the
> memory manager latches the high-order 5 bits (slot 0). The pulse is
> delayed by a string of 14 inverters so that it repeats 2 ns later,
> latching the next slot. Slot 2 stops the process, as does a jump or
> fetch/store, until the next ready pulse.
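To make the slot packing concrete, here is a rough sketch of pulling three 5-bit opcodes out of an 18-bit word. The page only says that slot 0 is the high-order 5 bits. Placing slots 1 and 2 directly below it, and leaving the low 3 bits unused, is my assumption, and `pack`/`unpack` are made-up helper names.

```python
WORD_BITS = 18
SLOT_BITS = 5
SLOTS = 3


def slot_shift(slot):
    # Assumed layout: slot 0 at bits 17..13, slot 1 at 12..8, slot 2 at 7..3.
    return WORD_BITS - SLOT_BITS * (slot + 1)


def pack(ops):
    """Pack up to three 5-bit opcodes into one 18-bit word, slot 0 first."""
    word = 0
    for slot, op in enumerate(ops[:SLOTS]):
        word |= (op & 0x1F) << slot_shift(slot)
    return word


def unpack(word):
    """Return the three 5-bit opcodes of a word, slot 0 first."""
    return [(word >> slot_shift(slot)) & 0x1F for slot in range(SLOTS)]


word = pack([0x1A, 0x17, 0x06])          # dup  +  ;
print([hex(op) for op in unpack(word)])  # ['0x1a', '0x17', '0x6']
```

In hardware the 3-bit shift register plays the role of the `slot` loop variable, enabling one slot at a time into the instruction latch.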
> There are 27 simple instructions, exactly suited to Forth. This
> allows 1-1 compilation of Forth source to machine code. On other
> processors, each Forth primitive requires several instructions. The
> situation is reversed for other languages: several Forth instructions
> may be required for their primitives.
>
> Register
> T     Top of stack
> S     2nd number on stack
> R     Top of Return stack
> A     Address
> Remember that fetch pushes the stack, store and binary operations
> pop it.
> Code  Op      Action
> 0     word ;  Jump to subroutine; tail recursion
> 1     if      Jump to 'then' if T0-T17 are zero
> 2     word    Call subroutine
> 3     -if     Jump to 'then' if T17 is one
> 6     ;       Return
> 8     @r      Fetch from address in R
> 9     @+      Fetch from address in A; increment A
> a     n       Fetch literal
> b     @       Fetch from address in A
> c     !r      Store into address in R
> d     !+      Store into address in A; increment A
> f     !       Store into address in A
> 10    -       Ones-complement T
> 11    2*      Shift T left 1 bit
> 12    2/      Shift T right 1 bit; preserve T17
> 13    +*      Add S to T if T0=1 (multiply step)
> 14    or      Exclusive-or S to T
> 15    and     And S to T
> 17    +       Add S to T
> 18    pop     Fetch R
> 19    a       Fetch A
> 1a    dup     Duplicate T
> 1b    over    Fetch S
> 1c    push    Store into R
> 1d    a!      Store into A
> 1e    nop     Do nothing
> 1f    drop    Store T nowhere
> nop
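For anyone who wants to play with the table, a small sketch of a few of the ALU opcodes on 18-bit values. It only computes the new T and does not model the stack pop. Discarding the carry out of bit 17 is my assumption, since the page does not say what happens to it, and `step` is a hypothetical name.

```python
MASK = (1 << 18) - 1   # 18-bit data path
SIGN = 1 << 17         # bit T17


def step(op, t, s=0):
    """Return the new T after applying one ALU opcode to T (and S)."""
    if op == 0x10:                     # -    ones-complement T
        return ~t & MASK
    if op == 0x11:                     # 2*   shift T left 1 bit
        return (t << 1) & MASK
    if op == 0x12:                     # 2/   shift T right 1 bit, preserve T17
        return (t >> 1) | (t & SIGN)
    if op == 0x14:                     # or   exclusive-or S to T
        return (t ^ s) & MASK
    if op == 0x15:                     # and  and S to T
        return t & s & MASK
    if op == 0x17:                     # +    add S to T (carry discarded, assumed)
        return (t + s) & MASK
    raise ValueError(f"opcode {op:#x} not modelled in this sketch")


print(hex(step(0x10, 0)))             # 0x3ffff
print(hex(step(0x12, 0x20000)))       # 0x30000
print(step(0x14, 0b1100, 0b1010))     # 6
print(step(0x17, 0x3FFFF, 1))         # 0
```

Note that the op named "or" in the table is an exclusive-or, which the sketch follows.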
> Another advantage of the 5-bit instruction is ease of decoding. A
> tree of NAND and NOR gates leads from the instruction bus to the enable
> for each register. This is facilitated by the limit of 10 lines to be
> routed: each bit and its complement.
>
> ---
> Chuck Moore's 25x Forth Multicomputer Chip
>
>
>
> Updated 2001 June
> 25x Microcomputer
> An array of 25 microcomputers on a 7 sq mm die.
> Features
> .2 sq mm asynchronous microcomputer core
> 5 x 5 array of cores: 60,000 Mips
> 5 horizontal, 5 vertical parallel interconnect buses: 180 GHz
> bandwidth
> Specialized computers to interface off-chip.
> Max power 500 mW @ 1.8 V, with 25 computers running
> 100mAh battery life is 1 year, with 1 computer running throttled
> 64-pin SOIC: mirrored pin-out to 4ns cache SRAM
> Array chips on 2-sided PCB
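The battery-life figure in the list above can be sanity-checked against the throttled power. This is a back-of-envelope sketch assuming an ideal 100 mAh battery with no self-discharge and a constant 1.8 V supply; neither assumption comes from the page.

```python
BATTERY_MAH = 100.0
HOURS_PER_YEAR = 365 * 24   # 8760 hours
SUPPLY_V = 1.8

# Average current that drains the battery in exactly one year.
avg_current_ua = BATTERY_MAH / HOURS_PER_YEAR * 1000.0
# Corresponding average power at the supply voltage.
avg_power_uw = avg_current_ua * SUPPLY_V

print(f"{avg_current_ua:.1f} uA")   # 11.4 uA
print(f"{avg_power_uw:.1f} uW")     # 20.5 uW
```

That lands close to the 20 uW quoted for one throttled X18 core, so the two figures are at least consistent with each other.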
> Description
> Availability of the tiny (.2 sq mm), asynchronous X18
> microcomputer core naturally suggested arraying it on a chip. Its
> extremely low power (20 mW) made that feasible. A 5x5 array was chosen
> to fit on a 7 sq mm die, the smallest available prototype, though
> larger arrays are possible. 25 computers running at 2400 Mips is a
> total of 60,000 Mips. An unlimited supply.
> Communication among the computers is provided by a network with 5
> horizontal and 5 vertical buses. Each computer has 2 bus registers to
> access a horizontal and a vertical bus. Each bus is 18-bits wide and
> can run at 1 GHz. All 10 buses can be active at once connecting a
> 20-computer subset. So total bandwidth is 180 GHz.
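The totals in the description follow from simple multiplication, sketched below. The "180 GHz" figure corresponds to 10 buses x 18 bits x 1 GHz, which in more usual units is about 180 Gbit/s. The bus labels and the one-sender, one-receiver pairing per bus are my assumptions for illustration.

```python
N = 5             # 5 x 5 array
CORE_MIPS = 2400  # per computer
BUS_BITS = 18
BUS_GHZ = 1

total_mips = N * N * CORE_MIPS                   # all 25 computers running
buses = N + N                                    # 5 horizontal + 5 vertical
aggregate_gbit_s = buses * BUS_BITS * BUS_GHZ    # all 10 buses active
# Assumption: each active bus links one sender to one receiver.
computers_on_buses = buses * 2


def buses_for(row, col):
    """Hypothetical labels for the two buses a computer can reach."""
    return (f"H{row}", f"V{col}")


print(total_mips)          # 60000
print(aggregate_gbit_s)    # 180
print(computers_on_buses)  # 20
print(buses_for(2, 4))     # ('H2', 'V4')
```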
> Each computer can be customized. Registers are added to the 16
> processors at the edge of the array and connected to package pins. Each
> computer is responsible for a particular interface. Protocols are
> implemented with software.
> SRAM controller
> Flash controller
> 4 serial controllers
> USB controller
> D/A controller
> A/D controller
> After booting from ROM, the computers await code downloaded from one of
> these interfaces.
> Pinout
> Chosen to be the mirror image of an 18-bit cache memory chip. This is
> the fastest memory available, with 4 ns access. Its package is a
> 100-pin SOIC. The 18-bit Multicomputer thus has 256K words of external
> memory in 1 chip.
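The 256K figure is just what an 18-bit address reaches, as this short sketch shows. The byte count is my own conversion and assumes the 18-bit words are stored with no padding.

```python
ADDRESS_BITS = 18
WORD_BITS = 18

words = 1 << ADDRESS_BITS               # addressable words
kwords = words // 1024
total_kib = words * WORD_BITS // 8 // 1024

print(words, kwords)   # 262144 256
print(total_kib)       # 576
```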
> Putting the Multicomputer chip on the top of a 2-sided PCB and the
> SRAM chip on the bottom gives a very small footprint. A decoupling
> capacitor is the only other component needed. An array of such pairs is
> a multicomputer board. Connecting Multicomputer to SRAM is trivial,
> with mm traces. Routing for power and a serial network is also easy.
> Computers load code from the network.
> A parallel computer with 60Gips nodes! Power is determined by the
> SRAM.
> Cost/Availability
> The chip is awaiting funding. If interested, contact snipped-for-privacy@mindspring.com
> A 7 sq mm die, packaged, will cost about $1 in quantity 1,000,000.
> Cost per Mip is 0.
> 25 prototypes can be obtained from MOSIS for $14,000 with 16 week
> turn-around. The TSMC .18um process has monthly submissions.
>
> ---
Maybe an important note.
ONLY the diagonal needs the X[], Y[] and *SPECIAL* C[] registers (each is unique for parallel RAM access: a 4 x 4 x MemWidth multiplex for maybe four Direct Rambus DRAMs);
the other 200+ nodes are used for programmable multiplexing.
Night all'
maw