Zero operand CPUs

Some of the Virtex 6 and Spartan 6 LUTs are configurable as SRL32, as in Virtex 5. There is no SRL64 mode, so you'll have to cascade two LUTs for that.

I seem to recall that one of the Xilinx employees explained to us a while back that the latch structure used in the LUT required either two latches per bit or some other magic to make it work properly as a shift register (e.g., to avoid a race where the input falls through multiple level-triggered latch bits), and that they've left that magic out of the newer slice designs.

Obviously there are some details we customers don't know about, since they manage to shift one bit at a time through all 64 latches in the 6-LUT as part of the configuration process.
Reply to
Eric Smith

I agree. From my experience, if, when writing the code, I write it assuming that I will not have access to a debugger, there's an excellent chance I won't actually need one when testing it. It might take me a bit of extra thinking at coding time but, considering that coding is typically 10% of the initial development effort and testing is typically 50%, the resulting savings in testing time more than compensate. For software destined to have a long life-cycle, the subsequent gains during the period of program maintenance are even more substantial.

Consequently, I haven't had to resort to using a debugger for embedded development yet. I'm convinced that, for me, not having to program in C is a significant factor - 95% of my errors are picked up at compile time. Typically most of the remainder are detected at runtime - as and when they happen, not 5 minutes later in some other unrelated part of the code!

--
Chris Burrows
CFB Software
Armaide: ARM Oberon Development System for Windows


Reply to
Chris Burrows

No. The term refers to the instructions, instruction format, and use of operand fields in the instruction format. Instructions don't reside on the stacks nor do the operands inside of the instructions.

The name 'zero operand' refers to the number of operand fields in the instruction set. On a register machine you need operand fields in the instructions that have to be decoded after the type of instruction is decoded.

An example of a three operand instruction would be: XOR register 01 with register 02 and write the result to register 03. This type of register operation uses three operands. Some have more, some have less. The path to the data operated on is designated by operand fields in the instructions.

The equivalent instruction on a stack machine is just XOR. There are zero operands in the *instruction* to be decoded. That's what the term means. The path to the data is hard wired: an exclusive OR of the top of the stack with the next item on the stack, with the result replacing these values as the new top of stack. Since all of that is implied, zero operand fields are needed in the instruction.
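
To make the contrast concrete, here is a rough C sketch of the two encodings. The opcode values and the 5-bit register fields are invented for illustration, not taken from any real instruction set:

    #include <stdint.h>
    #include <stdio.h>

    /* Three-operand register format: an opcode plus three 5-bit
       register fields that all have to be decoded. */
    uint32_t encode_xor3(unsigned rd, unsigned rs1, unsigned rs2)
    {
        const uint32_t OP_XOR = 0x26;          /* made-up opcode */
        return (OP_XOR << 15) | (rd << 10) | (rs1 << 5) | rs2;
    }

    /* Zero-operand format: the opcode is the whole instruction. */
    uint8_t encode_xor0(void)
    {
        return 0x06;                           /* made-up opcode */
    }

    int main(void)
    {
        printf("XOR r3,r1,r2 -> 0x%06x (15 bits of operand fields)\n",
               encode_xor3(3, 1, 2));
        printf("XOR          -> 0x%02x     (no operand fields)\n",
               (unsigned)encode_xor0());
        return 0;
    }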

You seem to completely misunderstand the term and seem to be claiming that it must mean NOP only. ;-)

Now, when the path to the data for each instruction is hard wired, there are no operand fields in the instructions.

Same instruction, but different operand fields to select the path to the data. Yes. On a zero operand design there is just ADD. There is no set of operands to specify the path for an ALU to use on the add. The path to the data is hard wired on zero operand designs. This is obvious because the "instruction" is just "ADD" and contains no operands: zero operands.

In the case of your register opcode example the operand bits may not be the same for all instructions. So first one has to decode that it is an instruction with two register operands. The operands can then be used to gate the ALU. Then, after that, the ALU operates on the data.

Only if all instructions in the instruction set have the same format. If there are any eight bit instructions, or larger instructions, or instructions that don't have exactly two operands at fixed positions, then there is a separate phase.

But I think you missed my point. I was saying that with a zero operand design the paths to operands are fixed. In the designs referred to indirectly in the original post of the thread, all opcodes begin execution at the same time that the decoding of an instruction begins. When the instruction decoding is complete, the instruction has completed any logical operation and is selected as the instruction to write its results to the system. So there is an instruction decode phase and a write result phase.
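
A toy C model of that, with a made-up three-opcode set: every hard-wired unit computes speculatively from the top two stack registers, and the completed decode merely selects which result is written back:

    #include <stdint.h>
    #include <stdio.h>

    enum op { OP_ADD, OP_XOR, OP_AND };          /* toy opcode set */

    uint16_t step(enum op opcode, uint16_t T, uint16_t S)
    {
        uint16_t add_result = (uint16_t)(S + T); /* all "units"     */
        uint16_t xor_result = S ^ T;             /* run in          */
        uint16_t and_result = S & T;             /* parallel        */
        switch (opcode) {                        /* decode = select */
        case OP_ADD: return add_result;
        case OP_XOR: return xor_result;
        default:     return and_result;
        }
    }

    int main(void)
    {
        printf("%u\n", step(OP_ADD, 5, 7));      /* prints 12 */
        return 0;
    }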

In the kind of register designs you are describing the paths to the ALU are selected only after the appropriate instruction format is known. So first the instruction is decoded, then knowing which operands mean what to this instruction the paths to the ALU are set up. Then and only then the ALU operates on this path and the results are written to the appropriate place which might be specified by an operand decoded after the instruction format is known.

The point you seem to have missed is that one way to describe that is to say that a phase is needed between instruction decoding and the ALU operation because operands are used. When operand decoding is needed the ALU can't perform the "ADD A,1" until after the instruction is decoded. On zero operand designs the "ADD" can begin execution at the same time that instruction decoding begins.

But that's not how real designs work. Every combination of registers does not have its own hard wired ALU. ;-)

An ALU requires operands to gate its input and output when it is shared. There is not a dedicated ALU for every possible path in these designs. In contrast, there may not be an ALU at all in a zero operand design: + has an addition circuit, XOR has an XOR circuit, etc. As a result, no operands are needed in the instruction to gate an ALU.

Again, the discussion here was about Chuck Moore's zero operand designs, which have register-based stacks, not stacks in memory. Stacks in memory are a whole other subject than the one that was introduced here.

The original poster stated that a certain design with uncached stacks in memory was slow because a simple operation like "+" would require four memory accesses: load the opcode, load a parameter from a stack in memory, load a parameter from a stack in memory, write the result back to the stack in memory. In contrast, we have been talking about stacks in registers and only one instruction memory access per four stack instructions. So the thread began by contrasting four memory accesses per stack operation to four stack operations per memory access.
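
Rough bookkeeping of that contrast in C, assuming (hypothetically) four packed opcodes per instruction word:

    #include <stdio.h>

    int main(void)
    {
        /* Stack kept in memory: each "+" costs an opcode fetch,
           two operand loads and one result store. */
        int mem_stack = 1 + 2 + 1;

        /* Stack kept in registers: operands stay on chip, and one
           instruction fetch brings in four packed opcodes. */
        double reg_stack = 1.0 / 4.0;

        printf("stack in memory:    %d accesses per operation\n",
               mem_stack);
        printf("stack in registers: %.2f accesses per operation\n",
               reg_stack);
        return 0;
    }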

Let's take your example "+" and note that in a zero operand design that's the instruction, just "+" because there are zero operands.

There are zero operands needed in the instruction. It is different than "ADD A,1" because no operand specifying an ALU path is needed.

In the zero operand version the arguments are in the T register and the S register; the result replaces T, and S is refilled by popping the stack registers below it. This doesn't have to be specified when programming the machine with operand fields in the instruction. The path for the add instruction is hard wired. No operand for this path is needed in the instruction set. It is a simple idea.

I doubt it. I think most people know that zero operand architecture means a stack machine where arguments are mostly on a data stack and need no operand fields in the instructions. I have never heard of anyone else ever assuming it means NOP only because no arguments are used. ;-)

I have never heard anyone claim that all one can do in Forth is NOP. I have seen people demonstrate that "they" can't write good code but that proves very little.


I hadn't heard that. Which cell phones have x86?

I agree. It is really terrible code. One can write really terrible code in any language. It doesn't prove much. It looks nothing like real Forth code. ;-)

It sounds like you haven't learned much about Forth.

I recall one programmer at SVFIG lamenting that, for him, the only way to make money with Forth was to keep it a secret. When he made the mistake of telling clients that he had written their application in Forth, they realized that the nice symbolic stuff was so clear that even a project manager with very little knowledge of Forth could update the code and make changes that worked, and they never needed to call the expert back in again for maintenance. He said that if he didn't tell them it was a Forth program, they were more likely to call him for updates.

I want them to have access to the nice symbolic abstractions indeed. I want them to never have to deal with useless code like you wrote. Twenty years ago I argued with Chuck Moore about compiler optimizations.

I have written compilers with a few hundred such optimizations. But after a few more years I switched to a style of writing much simpler code, where the compiler had no opportunity to optimize beyond a little inlining and tail recursion. Again, the original poster simply said that he now saw that Chuck Moore had the right idea about simplicity in design, but he didn't spell out what that meant in detail.

At one extreme, people want chips that they say they can't program without a very smart optimizing compiler. At the other extreme, people say that if you write smarter source code then the compiler doesn't have to be so smart. When asked about optimizing compilers for simple chips like those being discussed, Chuck Moore asked, "Why? You want to write non-optimal code?"

Chuck uses colorForth exclusively. It has a "brutally simple" compiler. His first generation of Forth hardware required a fairly complex optimizing native-code Forth compiler because of the irregular nature of the instruction packing. That compiler used 6K of RAM, and he felt that was way too complicated, so he simplified the compiler to just a couple of K.

If you are claiming that his tools are the size of others then you are not very well informed. It may run on big Pentium machines but the Pentium compiler is simple and small as is his target compiler for embedded chips.

Some of my own research has been into how much smaller than 1K an optimizing native-code compiler for one of these machines can be. I have written a lot of them. They are certainly not as big as everything else. ;-)

I like to acknowledge that some people have legitimate problems to solve that don't occur for other people. When you get an embedded programming job they might say, "We are building our widget using this processor and this language." If so, that implies a long list of assumptions about how the problem to be solved is constrained by processor and language features.

At another job an employee might be told "We used to make our widgets with this processor and program them in this language but now we are making a new product and want to examine all our options since we are not constrained by having already made decisions about the target platform and target tools."

Sometimes the object is low production cost because quantity is high and development cost is not so important. Time to market might matter. Sometimes low volume or one-off projects create problems where development cost or time are all that matter. You know the old saying, "Fast, cheap, soon, pick two." I like to extend it and say "Fast, cheap, soon, standard, pick two."

Best Wishes

Reply to
Jeff Fox

Certainly not. I was just trying to inform Helmar that the term zero operand architecture doesn't mean NOP only as he claims. ;-) And that led to my explaining other things that he seemed confused about.

That is true for most people since they know those tools.

I know a lot of people who have been programming in Forth for decades or designing soft-core processors and programming them for decades.

What is "easiest" is what you know.

I was a teaching assistant in a UC course on processor design. A small register based design was used to teach students one semester and a small stack based design was used in another semester, so the instructor could observe which was easier for the students to grasp.

Well, certainly, if you pick a first-cut hobby design, or a design by someone who can't write or hasn't written documentation, it is going to be much harder to use than something that has been debugged, optimized and documented. But that's not inherent in comparing register designs to stack designs.

But then I have had the opportunity to compare new students' reactions to nice simple tutorial designs for register based and stack based machines of similar quality from the same author. And I have watched them deal with debugging their hardware and software.

Yes. ;-)

I probably would. But that's a different discussion altogether.

That's why I made the videos of the stack machine design for FPGA course done for SVFIG ten years ago available to the public and why I have bothered to answer questions about it for a decade and helped a number of people with their designs.

But I have been clear that like Chuck Moore my interests moved from FPGA to full custom twenty years ago. But I have worked with some nice FPGA implementations along the way.

It is hard to get everything at once. Often you need to look a number of times to see the connections.

Best Wishes

Reply to
Jeff Fox

....

So to make it perfectly clear, the bottom line here is that the zero-operand approach can be simpler.

With the operands embedded in the instruction it's more flexible. You get to choose which registers to use. But that choice has to get decoded every time. Extra work for the processor, which has to be more complicated. It takes time to do that, and the larger instructions take a bigger bus or more space on a large bus.

If you can program so that what you need *will* be on the top of the stack then you can avoid that overhead. Sometimes you might have to juggle the stack, which adds overhead, but you only have to do that sometimes. The 1-, 2-, or 3-operand instructions have their overhead all the time, plus the extra gates on the chip are there all the time, etc.
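
A toy C trace of one such juggle: the two operands arrived on the stack in the wrong order for the subtraction we want, so one SWAP fixes it (using the usual convention that "-" subtracts the top of the stack from the item below it):

    #include <stdio.h>

    int main(void)
    {
        int stack[8], sp = 0;

        stack[sp++] = 3;        /* push b        stack: 3     */
        stack[sp++] = 10;       /* push a        stack: 10 3  */

        /* SWAP: exchange the top two cells.     stack: 3 10  */
        int t = stack[sp - 1];
        stack[sp - 1] = stack[sp - 2];
        stack[sp - 2] = t;

        /* "-": next-on-stack minus top.         stack: 7     */
        sp -= 1;
        stack[sp - 1] = stack[sp - 1] - stack[sp];

        printf("a - b = %d\n", stack[sp - 1]);   /* prints 7 */
        return 0;
    }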

Requiring the programmer to learn how to manipulate a data stack is an overhead. But it pays off. It's an overhead that happens not at execution time, or compile time, or edit time, but at *education* time. Learn how to do it and it pays off throughout your working lifetime, whenever you have the chance to use a zero-operand architecture.

I apologise for stating the obvious.

Reply to
Jonah Thomas

"Simpler" and "extra work" are relative terms. Not necessarily relative to other processors, but relative in terms of how you design the processor. Or maybe I should say, "it depends". You can create a very simple register based processor. If the opcodes use a fixed field for the registers selected, there is *no* register selection decoding required, so that is not more complicated and there is no extra work. In fact, I think a register based design can be simpler since it can reduce the mux required for input to the stack/register file.

I think the difference is in the opcodes. A register based processor requires the registers to be specified, at least in conventional usage. I expect there are unusual designs that "imply" register selection, but they are not used much in practice. The MISC model, which typically uses a stack, really only needs opcodes to specify the operations and not the operands, so it can be very small. Several MISC designs use a 5 bit opcode. This can reduce the decoding logic and the amount of program storage needed. But in both of these, the devil is in the details. Still, the potential is clearly there.
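
As a sketch of why small opcodes pay off, here is one hypothetical packing in C: three 5-bit opcodes per 16-bit word with a bit to spare (real MISC designs differ in the details):

    #include <stdint.h>
    #include <stdio.h>

    /* Pack three 5-bit opcodes into one 16-bit instruction word. */
    uint16_t pack3(unsigned a, unsigned b, unsigned c)
    {
        return (uint16_t)(((a & 0x1f) << 10) |
                          ((b & 0x1f) <<  5) |
                           (c & 0x1f));
    }

    /* Extract the opcode in the given slot (0, 1 or 2). */
    unsigned unpack(uint16_t word, int slot)
    {
        return (word >> (10 - 5 * slot)) & 0x1f;
    }

    int main(void)
    {
        uint16_t w = pack3(4, 17, 9);
        printf("%u %u %u\n", unpack(w, 0), unpack(w, 1), unpack(w, 2));
        return 0;
    }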

Different overhead. Adding a few gates to decode a different opcode, *if* it actually requires more gates, is not a big deal. The extra instructions needed to manipulate the contents of the stack take code space and execution time and may or may not result in a "simpler" processor.

I am finding that none of this is truly obvious and may not always be true. Real world testing and comparisons are in order, real tests that we can all see and understand...

Rick

Reply to
rickman

Back in the 70s, Xerox had the Mesa world running on Altos. It was a stack based architecture. The goal was to reduce code space. (That was back before people figured out that Moore's law was going to make code size not very interesting.)

Given the available technology of the time, it worked great.

In addition to the stack (I think it was 5 or 6 registers) there was a PC, a module pointer for global variables, and a frame pointer for this procedure context.

The opcodes were implemented in microcode rather than gates, so there was a lot of flexibility in assigning values.

Calls were fancy, but the simple case allocated a frame off the free list and set up the return link and such.

Most of the opcodes were loads. It was a 16 bit system, but there was a lot of support for 32 bit arithmetic and pointers.

Except on rare occasions when you were hacking on the system, we didn't care about the details of the architecture. Code is code. The basic ideas don't change because the architecture changes. You have loads, stores, loops, adds, muls... y = a*x+b turns into (handwave):

    load a
    load x
    mul
    load b
    add
    store y

Some people call "load" push. If a or b are constants, the load might be a load immediate...

It might be a little weird if you wanted to write assembly code. I think I'd get used to it if I had some good examples to learn from. (I've written quite a bit of microcode back in the old days and some Forth recently.) If you have a good compiler you never think about that stuff.

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

Silly mistake, sorry.

(Losing registers on conditional jumps would even prevent conditional expressions, to give an example. You can turn a jump into a conditional jump, but not the other way around.)

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

do

if(instructionRegister

Reply to
Jacko


And how wide is the instruction register?

Antti

Reply to
Antti.Lukats

Not entirely true. Code size is still an issue when a CPU is being built in a small FPGA or there are a number of them in most any FPGA. In an FPGA, memory is still a limited resource.

Why write in assembly then? I seem to recall that HP made some machines that were stack based. A friend had a job of digging through core dumps to figure out why a program crashed. I don't remember the details, but he didn't do that for long before he got promoted out of there.

Rick

Reply to
rickman


Ok, I'll bite, how wide is the instruction register and how does it get loaded?

Getting straight (and complete) answers out of you is torture.

Rick

Reply to
rickman


Hi

The instruction register width, like all register widths, is controlled by the generic parameter wide. If wide is 16 then the registers, datapaths, addressable element size, ALU and instruction width (in fact, all std_logic_vectors of relevance) are 16 bits.

So if the generic wide is set to 4096 then a 4096 bit microprocessor is rendered. Note the ALU will be slow until more of the generate logic is written in the VHDL.

As the program and data memory address size is n bits and each addressable element is n bits, the memory size is n * 2^n bits.
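
A quick C check of that arithmetic for wide=16:

    #include <stdio.h>

    int main(void)
    {
        /* n-bit addresses over n-bit words: 2^n elements of
           n bits each, so n * 2^n bits in total. */
        unsigned n = 16;
        unsigned long long bits = (unsigned long long)n << n;
        printf("wide=%u: %llu bits (%llu Kbit)\n",
               n, bits, bits >> 10);
        return 0;
    }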

The useful knowledge of dividing the memory into optimized sections, such as a microcode section only 4 bits wide, saves on bits and means the instruction unpack logic is absent (no delays due to it), yet the same density is achieved, or better!!

This packing of the core 'microcode' to 4 bits is the main reason for not having literal fetch as one might expect.

Then there is the next code layer, which is a threading list of subroutine addresses and possibly literal values. This can be compacted by using (m

Reply to
Jacko

Ok, I did the digging and found your instruction set doc as well as the HDL. I found that the IR is as wide as the rest of the machine. This means that each instruction is N bits wide. So on every opcode that is one of the 16 instructions that are not calls, the real opcode would be 0x000X in a 16 bit machine. That is pretty durn inefficient use of program memory. Your code size is going to suffer rather severely. Not only is each instruction large for a MISC machine, but because you only have 16 basic ops, you will need a lot of them.
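
A quick C tally of the waste under that reading of the doc, for a few machine widths:

    #include <stdio.h>

    int main(void)
    {
        /* Wasted bits per instruction word if only 4 opcode
           bits are significant. */
        for (int width = 8; width <= 32; width *= 2)
            printf("%2d-bit word: %2d wasted bits (%.1f%%)\n",
                   width, width - 4, 100.0 * (width - 4) / width);
        return 0;
    }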

So it would seem that this machine is slow (multiple cycles to execute one instruction and lots of instructions to do anything useful) as well as inefficient in its use of code space. Not what I would want to consider for an ASIC, although if the docs are good enough it could be worth it... ;^)

Rick

Reply to
rickman


oooo my god

so if you have 32 bit wide datapath then you use 4 bits as instruction and WASTE 28 bits of the instruction width?

this is soooo stupid, I could not take that option seriously; that is the reason why I asked how wide the instruction is!

Antti

Reply to
Antti.Lukats

Hi

0x000X in 4 bit memory is just 0xX

cheers jacko

Reply to
Jacko

hi

Chuck and the pope are in the design lab.

Chuck says to the pope, "Have you got a rubber? My designs are getting big." The pope says, "That's a bit RISCy!"

So if you had possibly 4 instructions to do stack init pointers and save both as well, what would you use?

cheers jacko

Reply to
Jacko

Do you ever read over your posts before submission? Do you have a spelling checker?

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

ROFLMAO!!! All the "stuff" this guy posts and you are commenting on his ~~~spelling~~~!!!

Rick

Reply to
rickman

Darwin, Mutation and the Death of a Lnguage via Stagnation
==========================================================

Oh yo evil looking mutated word, you is dead right, no sexy none. Me bee's full oxford smili life, gets me the beer token and enduf this wanton easo-speak.

cheers jacko

Reply to
Jacko
