Forth-CPU design

- F
- Frank Buss
  
  Contact options for registered users
posted
17 years ago

Sat, Sep 2, 2006 10:30 PM

After thinking about how a "normal" CPU could look like, I've tried to design a Forth-like CPU:

formatting link

Maybe someone from comp.lang.forth could take a look at it, if something is missing for compiling normal high-level Forth programs to this instruction set, without too many complications.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 12:01 AM

Looks good - some brief comments : I think the opcode is 8 bits ? That's good for off-chip memories. You could optionally allow a 9-bit opcode, for where the memory is FPGA BRAM ? That could also allow an option for 32 bit registers ?

I think there is also opcode room to allow one byte SKIP form on the BRA... opcodes ? This could be assembler-automatic - where this is usefull, is in operating from serial flash memory.

-jg

- J
- Jecel
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 1:01 AM

If you are going to compile Forth then CALL is probably the first instruction you want to optimize and I didn't even see it or an equivalent in your list. Other than that this is probably a very reasonable way to encode Forth. Given that you have spare bits in several cases my guess would be that it isn't an optimal encoding (Chuck Moore's 5 bit instruction set and the variations people have made are probably closer to optimal).

Even with your current instruction set it is probably possible to write your network address swap code to be a little smaller:

push byte 5 :loop dup ; i i pop A ; i @A byte ; (i) i over ; i (i) i push byte 6 ; 6 i (i) i add ; 6+i (i) i pop A ; (i) i @A byte ; (6+i) (i) i over ; (i) (6+i) i !A byte ; (6+i) i over ; i (6+i) i pop A ; (6+i) i !A byte ; i dec ; i-1 bcc byte :loop drop

When you have an address register like in the MISC processors it is a good idea to avoid reloading it when possible. In this case having both

6+i accesses happen next to each other does that. And in general it isn't very good to allow the stack to get very deep with temporary values when programming stack machines.

By the way - this code fragment that you have been testing various architectures with has a problem in that in general the first word to be swapped would be some random one and not happen to be address zero. Rewritting it so that MAC addresses to be swapped are at 200 and 206, for example, would have no effect in some cases but increase the length of the code for other processors.

-- Jecel

- B
- Brad Eckert
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 4:28 AM

Since you have flag registers, you need a way to push and pop them if you have interrupts. Usually Forth chips use 0=, 0< and IF or IF and

-IF instead of flag registers, so there is nothing to save. This means that signed comparisons like < take longer but that doesn't seem to be a problem.

CALL is used in Forth much more often than in other languages so you should try to encode it compactly.

8-bit opcodes are kind of nice because serial flash is pretty fast. In a Spartan3, you could map the lower 2K bytes of program space to a BRAM block and the rest to external flash. You keep the fast (mostly kernel) words in the BRAM.

--brad

- F
- Frank Buss
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 5:10 AM

You are right. My idea was to use something like "pushr byte 3 +PC", "load PC word :my_function" for the call and "popr PC" for the return. But as Jim noticed, there is some room in the branch commands and "jump" is redundant, because it can be implemented with "loadpc word :label" (and even relative to PC with "loadpc word :label +PC", or for skipping the next instruction, the one byte instruction "loadpc # 1 +PC" is possible), so it could be added to the branch commands. But to make the pop/load commands more complete, I've added it as a special destination case and reorganized the branch commands, which now can branch on 8 conditions.

I've updated the documentation at

formatting link

It could be encoded more compact, but my idea was to avoid large demultiplexers for decoding every single command. Maybe I should start implementing the design to see if this works :-)

Thanks. Looks like most of the time an arithmetic operation is executed before "pop A", so I've used one bit for all instructions, which specifies that a "pop A" is executed after the command to compress the code even more:

push byte 5 ; i=5 :loop dup popa ; i @A byte ; (i) i over ; i (i) i push byte 6 ; 6 i (i) i add popa ; (i) i @A byte ; (6+i) (i) i over ; (i) (6+i) i !A byte ; (6+i) i over popa ; (6+i) i !A byte ; i dec ; i-1 bcc byte :loop drop ; empty stack

Now it is 16 bytes long, just 1 byte more than the shortest code so far, the 6502 code (it needed 15 bytes, not 13 as I wrote).

This was by intention to see if there are some tricks for zero based addressing.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

- F
- Frank Buss
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 5:16 AM

You can use "pop S" and "push S" or even "popr S" and "pushr S" with my instruction set.

Yes, I've added it as a possible destination to the "pop" command.

Thanks, mapping the flash into the address space is a good idea and maybe 2 BRAMs: one for the TCP/IP stack and one for fast programs and other data.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

- F
- Frank Buss
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 5:22 AM

There was one bit too much, so there are 4 conditions only.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 6:41 AM

Yes, but unless you have large amounts of 5 bit memory lying about :) there is more point in focusing on packing the 8 bit opcodes, to try and reduce the use of 16 bit ones. If you do plan to use serial flash, the keep in mind the (IIRC) 5 byte address reload time, so short (1 byte) skips up to 5 bytes ahead would be efficent. In a FPGA, another useful thing to cache could be constants ? - so you do not have to reload the Serial address pointer, just to access (smaller) constants.

-jg

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Sun, Sep 3, 2006 6:45 AM

I got around that in my Forth CPU by pushing the return address on the return stack and the PSW on the data stack at the same time, so there was no speed penalty. It did require a separate return from interrupt instruction as compared to the standard return.

Although the 9 bit wide block ram is pretty much a standard feature on FPGAs, I did not take advantage of it for the instruction word. I wanted to keep the instruction size minmal so that decoding would take less logic. I could have gone with a 5 bit opcode, but I decided to take advantage of the extra bits and encode a variable sized literal field for LIT, CALL and JUMP instructions. This should significantly reduce the code size for small programs.

- J
- Jecel
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 3:43 AM

That sounds like a very good idea.

I haven't actually written it down, but my impression is that in a 16 bit RISC processor like Jan Gray's gr0040 you could do it in 7 instructions (14 bytes).

-- Jecel

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 4:14 AM

I am curious, what exactly is this code segment you are using as a benchmark? I saw the code earlier in this thread, but I don't understand the notation enough to know what it is doing. Is there some other code or a description so that I can see how other Forth CPUs might compare?

- J
- jacko
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 4:35 AM

hi

not sure if looping is the best thing on a 16 bit word.

push byte 5 ; i=5 :loop dup popa ; i @A byte ; (i) i over ; i (i) i push byte 6 ; 6 i (i) i add popa ; (i) i @A byte ; (6+i) (i) i over ; (i) (6+i) i !A byte ; (6+i) i over popa ; (6+i) i !A byte ; i dec ; i-1 bcc byte :loop drop ; empty stack

a similar thing in indi assembly, swapping two sets of 3*16 bit locations with pointers on stack s would be

/.s=q /.q.r ; get 1st pointer to q /.q.r /.q.r ; copied 3 16 bit words to r /=q.r /.s=q ; save address for write over and get other address /.q.s /.q.s ; get other 3 16 bit words to s /.q.s /=q.s ; and save q on s /.r=q /.s.r ; restore 1st address /.s.q /.s.q ; overwrite 3 16 bit words /.s.q /.r.s ; restore 2nd pointer in q /.r.q /.r.q ; overwrite 2nd set /.r.q /=p=p ; and a nop

which is 20 bytes (19 if nop not counted)

cheers

formatting link

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 6:24 AM

Ok, I found the sample program on the web page describing the instruction set. The font was so small I did not see it at first.

Here is the code segment for the instruction set I came up with. I use a byte for each opcode, but

LIT 6 BFLG ( set memory ops to byte ) LIT 0 ( D: ) ( R: SIZE ADDR ) LOOP: RDUP ( D: ) ( R: SIZE ADDR ADDR ) FETCH ( D: (ADDR) ) ( R: SIZE ADDR ) RDUP ( D: (ADDR) ) ( R: SIZE ADDR ADDR ) LIT 6 ( D: (ADDR) ) ( R: SIZE ADDR ADDR 6 ) RADRS ( D: (ADDR) ) ( R: SIZE ADDR ADDR+6 ) RDUP ( D: (ADDR) ) ( R: SIZE ADDR ADDR+6 ADDR+6 ) FETCH ( D: (ADDR) (ADDR+6) ) ( R: SIZE ADDR ADDR+6 ) SWAP ( D: (ADDR+6) (ADDR) ) ( R: SIZE ADDR ADDR+6 ) STORE ( D: (ADDR+6) ) ( R: SIZE ADDR ) STORP ( D: ) ( R: SIZE ADDR++ ) RCMP ( D: ) ( R: SIZE ADDR++ ) LOOP JMPC ( D: ) ( R: SIZE ADDR++ ) RDROP RDROP BFLG ( set memory ops back to word )

19 bytes, with 6 of those outside the loop. So the loop would run 13 clock cycles in the current architecture. I have yet to port it to the newer type FPGAs which do not have an asynchronous read from the block RAMs. But I think it will still run with one clock per instruction including memory accesses.

Any idea how large the new processor will be in an FPGA? I want to say mine was around 600 LUTs, IIRC.

Any interest in comparing this to one of the proprietary cores such as microBlaze or NIOS2? I have wanted to see how efficient they would be implementing a Forth, but I have yet to work with one of them yet.

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 6:51 AM

I like the idea of mode-opcodes like this BFLG, that change the op-size. It is not common to mix opeand sizes in one function, but it is common to have to need BYTE and WORD fetches, for example. A mode change saves valuable opcode space.

Main issue of such opcodes, is the interrupt-ability - they should probably have a hardware stack dedicated to 'keep track', so code like that above, can be interrupted. In a FPGA device, with 9 bit memory, that's probably easy to do: there's room to stack some mode state with return address

Also, the opcode above appears to be a toggle ?, which could cause obscure bugs, if the routine has multiple exit points, and one forgets to toggle back, the next entry will be upside-down.....

-jg

- R
- rickman
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Mon, Sep 4, 2006 7:10 AM

Yes, it is a toggle rather than to have two instructions. But it toggles a bit in the PSW which can always be set directly using a read modify and write operations. It is 6 instructions that way. To have instructions to set and clear the Byte flag would require that I give back an instruction and I don't have anything I can spare, unless I make a significant change to the instruction set.

The interrupt is not an issue for memory I/O size as long as you set the PSW bit on entry. I guess I could make that automatic on any interrupt. I hadn't thought of that, thanks. The PSW is automatically saved on the data stack while the return address is saved on the return stack. RETI restores them both while RET only does the return.

I actually like my approach to the stacks. The way I handle literals and the CALL/JUMP instructions the return stack is more of an address register backed up by a stack while the data stack is pretty much just for data. It does Forth loops pretty efficiently. The overall design was a trade off between optimizing the instruction efficiency and keeping the processor size small, both to limit the resources used in an FPGA.

- J
- Jecel
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Sep 5, 2006 1:03 AM

This is an interesting instruction set and seems to make good use of the two stacks.

My impression is that neither Microblaze nor NIOS II would do a particularly good job of executing Forth since their memory instructions seem to be of the simple register+short immediate offset type. Contrast that with the ARM (or any CISC) which can use pre-decrement and post-increment (or pre-increment and post-decrement) to move things to and from the stack.

The native code for this particular task, however, would be reasonably small and fast for nearly any RISC including the two proprietary cores. Just two loads, two stores, an increment, a comparison and a branch (the last two are often a single instruction).

-- Jecel

- J
- jacko
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Sep 5, 2006 6:57 AM

hi

i'm on the final bit of the indi core, which is the sequencing unit. i have decided to go for a 4 clock bus cycle, and so a read takes 4 cycles etc.

there are 4 bus cycle types read, write, fetch and bushalt. so this gives 16 "microcode lines".

the question is a solution for minimum area would be best done as a 16 way decoder, or as a fourway 4 parallel decoder with a 4 way post mux??

any input on minimum area solutions??

cheers

- J
- jacko
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Sep 5, 2006 9:02 AM

hi

just wondering if grey code (1 bit change) addressed stack memory might be useful for cutting down carry chain logic for pre-post dec/inc addressing??

cheers

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Sep 5, 2006 10:54 AM

Not sure exactly what you are getting at here ? Gray code ctrs might only toggle one bit, but their creation is more complex than binary counters - often it is Binary ctrs + XOR's.

- ie they are not likely to be faster counters. Johnson counters can avoid carry chains, but they have lower densities (do not cover all binary states)

The best solution could vary with exact target device & mappings, and counter length, so I'd suggest create some of each candidate, and check the reported speeds.

-jg

- J
- J Thomas
  
  Contact options for registered users
Vote on answer
posted
17 years ago

Tue, Sep 5, 2006 11:59 AM

I wouldn't expect it to. Do you have a good algorithm for incrementing grey code? If you have a good method it might be better.