Zero Address CPU logic

C

Cecil Bayona 9 years ago

On 8/29/2016 7:55 PM, rickman wrote: > On 8/29/2016 4:30 PM, Cecil Bayona wrote: >> Nothing Fancy, that is why in my earlier post I mentioned that I don't >> have a lot of experience. I been working a 32 bit stack based CPU, but >> it's a work in progress, I'm still sorting it out, it taken less than >> 20% of the chip, but a stack CPU are rather simple compared to other >> CPU's, when finished it should be pretty nice, most instructions take >> one clock to execute, and it used packed instructions, 5 instructions to >> a word fetch. Originally it was on a Lattice Brevia2, I am now >> converting it to a Artix-7 board, but there is software involved too so >> it's going slow and I'm learning as I go. > > Just a comment on your stack processor. I've done some design work with > stack processors and read about a lot of designs. In my humble opinion, > if you have multiple cycle instructions, you are doing it wrong. I > don't want to steal the thread. If you care to discuss this we can > start another thread. > I'm not sure why you think that it uses multiple c*ck instruction, I mentioned that most occur in one clock, the exception is load immediate it takes the instruction fetch, then a second fetch for the 32 bit value to push on the stack, all others take one clock. Even that one can take place in one clock with extra hardware to fetch RAM with a buffer to hold two 32 bit words, an alternative is to use two clocks one to execute the other to fetch program instructions.

What is does have is multiple instructions on one program word, it's five instructions or one depending on what it is, jump, and call take the whole 32 bit word and is not packed, everything else is 5 instructions to a 32 bit program word so you have fewer memory fetches.

Cecil - k5nwa

Vote

R

rickman 9 years ago

Are you using external memory? The CPUs I've designed were 100% internal to the FPGA so there was no advantage to fetching multiple instructions. In fact, the multiplexer needed was just more delay.

I thought you design was not one clock per instruction because of what you said. Your design is using a single memory interface. It is common for stack machines to separate data and instruction space. But if you are working out of external memory that is not so easy.

I assume you have read the various similar work that has been done? The J1 is very interesting in that it was so durn simple. A true MISC design and effective. If you meander over to comp.lang.forth there are folks there who have designed some MISC CPUs which have proven their worth. Bernd Paysan designed the B16 which seems like an effective design. I've never tried to work with it, but it sounds similar to yours in they way it combines multiple instructions into a single instruction word.

Rick C

Vote

C

Cecil Bayona 9 years ago

I think the multiple instructions is to save on program space, 32 bits to an instruction will eat up the memory space ways too quickly. It is a Von Newman machine with a single address space, code and data is all in the same space.

It uses three kinds of instruction formats, short and long, and double word.

A short instruction is 6 bits so you can cram 5 instruction to a memory word, so there are 64 possible instructions but the bare machine uses 27 opcodes so there is room for additional instructions. The are things that require no addresses such as dup, swap, return, add, etc.

Long instructions take 30 bits, they are things like jump, call, including conditional versions, they include a 24 bit address as part of the instruction which can address 16MB of program space, the upper two bits are unused at present

There is a single two word instruction, which pushes the second 32 bit word into the stack, it itself is a 6 bit instruction and there can be more that one "LIT" instruction, each one has an additional word to use as a literal. Each 32 bit program word could have 6 LIT instructions with each one having an additional 32 bit word following the program code word so if you have six literal words packed into single 32 bit program word, it is followed by six 32 words containing the literal values, the program continues 7 words after the current 6 instruction word, of course one can have fewer literals.

The whole thing is like a packed instruction pipeline it minimizes the number of fetches but any control flow instruction cancels and dumps the current instruction queue and does a fetch at a different place since the IP has changed.

The job of packing the instructions into one word is handled by the Forth Compiler so the user does not see it, it just makes the program code smaller automatically. I did some test and 5 is too long a queue, when doing a 16 bit version it packs 3 instructions to a word and that version ends up with a shorter program area so it's more efficient, in part because Forth changes the Program Counter often so having a longer instruction queue waste instruction slots. on a 16 bit version adding some extra instructions would be nice so you end up with add with carry, subtract with borrow etc so it can handle 32 operations efficiently.

Overall its an efficient design but it could be improved as anything else could. One could add combined instructions where a return is combined with some regular instructions so it execute both in one clock.

Eventually I want to work with the J1, its a simpler 16 bit machine but since the instruction word contains multiple fields it can have instructions that do multiple operations at the same time naturally, with a more complex compiler packing multiple instructions it might save on program space. It does have very limited addressing space due to it's instructions limited to 16 bits.

Cecil - k5nwa

Vote

R

rickman 9 years ago

Again, that is a reflection of having a common address space for instructions and data. My CPUs use 8 or 9 bits for instructions while being data size independent. The instruction format does not imply any particular data bus size.

I've shied away from multicycle instructions because it means more bits (1 bit in this case) to indicate the cycle count which is more input(s) to the decoder. I wanted to try to keep the decoder as simple as possible.

The F18A does that. Learning how to pack instructions into the word is a bit tricky. Even harder is learning how to time execution, but that's because it is async and does not use a fixed frequency clock.

One of the things I have looked at briefly is breaking out the three "engines" so each one is separately controlled by fields in the instruction. This will require a larger instruction, but 16 bits should be enough. Then many types of instructions could be combined. I didn't pursue it because I didn't want to work on the assembler that would handle it.

I seem to recall discussing this with someone else not too long ago. I don't think there actually is much parallelism possible with the J1 design. You can combine a return instruction with arithmetic instructions, and there are fields to adjust the stack independently. So you might be able to combine say, 2DUP + in one instruction by using

with a DSTACK +1 instead of a -1. But the useful combos will be limited. The utility is also limited by the instruction frequency. Only 35% of the instructions in the app the J1 was designed for are ALU instructions which are the only ones that can be paralleled.

In my CPU design I had separate "engines" for the data stack (ALU), the return stack (destination for literals, addresses for all operations and looping control) and the instruction fetch. If each of these had separate fields in the instruction word there might be more opportunity for parallelism. But as I said, I didn't pursue this because the software support would be messy. I'd like to get back to that, but it won't happen any time soon.

Rick C

Vote

Zero Address CPU logic

Join the Discussion

Didn't find your answer?