Zero operand CPUs

I have been following the development of the ZPU, a zero-operand processor for FPGAs. The primary intent is to design a CPU that can span a range of implementations from very space-efficient to high-speed while remaining efficient at running C code. The original author has an open-source compiler producing code for it, which seems to be the part he is good at.

However, it has been running rather slowly in the benchmark they have been using, Dhrystone. I think I have figured out why. The ISA is zero operand, but the stack is maintained in memory; there are no on-chip stack registers. So every stack operation consists of reading the operands from memory, performing the operation, and writing the result back. I can see why performance is poor, even when pipelined.
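To make the cost concrete, here is a rough C model (illustrative only, not the actual ZPU implementation) of what a zero-operand ADD turns into when the stack lives entirely in main memory:

#include <stdint.h>

/* Toy model of a memory-resident stack machine (illustrative only,
 * not the actual ZPU microarchitecture).  mem[] stands for external
 * RAM; sp is the only piece of stack state held in the CPU.         */
static uint32_t mem[1024];
static uint32_t sp = 1024;                  /* stack grows downward  */

void push(uint32_t v) { mem[--sp] = v; }    /* one store to RAM      */
uint32_t pop(void)    { return mem[sp++]; } /* one load from RAM     */

/* A single zero-operand ADD: two loads and a store, every time,
 * because there are no top-of-stack registers to hold operands.     */
void op_add(void)
{
    uint32_t b = pop();                     /* load operand 1        */
    uint32_t a = pop();                     /* load operand 2        */
    push(a + b);                            /* store result          */
}

Even with pipelining, that is three memory accesses per arithmetic instruction.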

I believe the real issue is that the focus is on building a complex machine and then trying "techniques" to make it simple and fast. The more I look at things like this, the more I am convinced that the Moore philosophy is right: you can chase performance by adding more and more complexity, or you can simplify to the point of inherent speed.

But then my thinking is biased. I am a hardware guy and my programming has always been the sort of stuff that can fit in a .com file, you know, the ones with the 64 kbyte limit. I still think Bill Gates was right when he said that no one would ever need more than 640 kbytes ;^) I just know that my current multi-GHz machine is not really any faster than my old 12 MHz 286 in many respects... It certainly does not boot any faster and is *much* slower to turn off.

Rick

Reply to
rickman

Strange. I'm sure you too can see what to do about this - same as you had to do on the AMD29K which used some of its huge register set to cache the top of the stack. Implement the stack in on-chip RAM, using circular addressing. When the on-chip stack threatens to overflow, "spill" some of the oldest part of it to main memory, using a fast block write operation. CPU operations then continue to use the on-chip stack but the circular addressing no longer overflows. Similarly, when the stack threatens to underflow, "fill" from main memory. Way back then, you could get quite good performance if you were careful to align the spill/fill operations with a DRAM page. The 29K used software trap routines to do the spill/fill, but I'm sure you could do it at least partly in hardware without too much trouble. Spill/fill can then be done speculatively, in the background, when there is spare bandwidth on the memory interface.
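In rough C, a software model of that scheme might look like the following; the cache size, chunk size and thresholds are picked arbitrarily for illustration, not taken from the 29K or any real design.

#include <stdint.h>

#define CACHE_SIZE 16u                /* on-chip circular buffer, power of two */
#define CHUNK       4u                /* entries moved per spill/fill burst    */
#define MASK       (CACHE_SIZE - 1)

static uint32_t cache[CACHE_SIZE];    /* models the on-chip stack RAM          */
static uint32_t backing[4096];        /* models the stack region in main RAM   */
static unsigned depth;                /* total logical stack depth             */
static unsigned spilled;              /* entries currently resident in memory  */

/* Block-write the oldest cached entries to main memory.  In hardware this
 * would be a burst write, ideally DRAM-page aligned, and could be started
 * speculatively whenever the memory interface is otherwise idle.           */
void spill(void)
{
    for (unsigned i = 0; i < CHUNK; i++, spilled++)
        backing[spilled] = cache[spilled & MASK];
}

/* Block-read entries back from main memory before the cache underflows. */
void fill(void)
{
    for (unsigned i = 0; i < CHUNK; i++) {
        spilled--;
        cache[spilled & MASK] = backing[spilled];
    }
}

void push(uint32_t v)
{
    if (depth - spilled == CACHE_SIZE)    /* cache full: make room        */
        spill();
    cache[depth & MASK] = v;
    depth++;
}

uint32_t pop(void)                 /* (empty-stack underflow not checked) */
{
    if (depth == spilled && spilled > 0)  /* cache empty: bring some back */
        fill();
    depth--;
    return cache[depth & MASK];
}

In hardware the checks in push/pop become a counter and a couple of comparators, and the spill/fill loops become a small state machine sharing the memory port; the C is only there to show the bookkeeping.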

Unfortunately the stack cache trashes multi-threading performance, because there is so much context to swap. I guess the correct compromise these days would be very different, with the stack cache probably about 16 words. With only a small stack cache you can keep several processes' stacks in the on-chip memory (that's harder to plan, of course, but may still be helpful, particularly in a small system).

Always provided you have sufficiently smart compilers to convert complicated real-world code into a suitable stream of your simple instructions. But in general I think I agree. Compilers _are_ pretty smart these days.

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

You can simplify the hardware a lot by pushing the stack overflow problem back to the compiler. That is, the stack has a fixed size, and the compiler must not generate code that overflows that limit.
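The check the compiler needs is just a static walk over the code it generated, tracking worst-case depth. A toy version in C, for straight-line code and a made-up three-opcode ISA (not any real instruction set), looks like:

/* Hypothetical zero-operand opcodes, invented for this example. */
enum op { OP_PUSH_IMM, OP_ADD, OP_DROP };

#define STACK_LIMIT 16   /* hardware stack depth the compiler must respect */

/* Returns 0 if the straight-line sequence stays within STACK_LIMIT,
 * -1 on overflow or underflow.  A real compiler would also walk
 * branches and calls and take the worst case over all paths.        */
int check_stack_depth(const enum op *code, int n)
{
    int depth = 0, max_depth = 0;
    for (int i = 0; i < n; i++) {
        switch (code[i]) {
        case OP_PUSH_IMM: depth += 1; break;   /* pushes one cell      */
        case OP_ADD:      depth -= 1; break;   /* pops two, pushes one */
        case OP_DROP:     depth -= 1; break;   /* pops one             */
        }
        if (depth < 0)         return -1;      /* stack underflow      */
        if (depth > max_depth) max_depth = depth;
    }
    return (max_depth > STACK_LIMIT) ? -1 : 0;
}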

I don't understand that comment. Multi-threading requires separate stacks; that's just more RAM in the CPU. Perhaps the virtual CPU number becomes part of the RAM address, if that's what you mean by multi-threading.

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

OK.

Getting the compiler to limit the stack size is clearly possible (Transputer, anyone? It had a 3-register stack). But it is certain to cause more references to escape to main memory. It's a compromise, like everything else.

[...]

Yes, but if a large swath of CPU register space is used to cache the top-of-stack, then that cache must be saved and restored on a context switch. You can only provide a finite number of stack spaces in the CPU's on-chip RAM, so at some point a context switch is sure to entail a large penalty as some other thread's stack cache must be evicted to main memory.

A shallower on-chip stack cache means slower single-threaded performance, but a faster context switch and the opportunity to keep more threads' stacks on-chip. Compromises again.

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

As is usually done for x87 floating point.

The story is that the 8087 was designed such that software could detect the stack over/underflow and swap to/from memory. No-one tried writing the software until the hardware was done, and then it was found that it wasn't possible.

Presumably it could have been fixed in later processors, but as far as I know, it wasn't changed.

-- glen

Reply to
glen herrmannsfeldt

Hi

Three cheers for Chuck Moore, and his hidden friend Keep Less ;-)

Another zero operand CPU

formatting link

cheers jacko

Now available under a free licence of one core per ASIC/FPGA/CPLD, with two conditions:

  1. A K Ring Technologies Logo must be printed atop the chip or close by on the PCB at any resolution.
  2. Any documentation produced must acknowledge copyright and provide the URL.

This licence is for those folks who do not like the BSD derived-work restrictions.

Reply to
Jacko

The difference between ZPU and nibz is that ZPU is supported by a GCC toolchain, while there are no tools to generate any meaningful code for nibz.

Correct me if I am wrong.

Antti

Reply to
Antti.Lukats

Is that really the primary criterion? I think you are right. But I have a similar CPU design that I expect to use on a project shortly, and it will be programmed in assembly, though it will look a lot like Forth. I consider that to be close enough to a high-level language.

BTW, ZPU may have a GCC compiler, but without a debugger, is that really useful? There aren't many projects done in C that are debugged without an emulator.

Rick

Reply to
rickman

It's possible to do a lot of development without a debugger. I often do embedded development without one (though I prefer to have one available if possible). Until you've done debugging with only a single LED for signalling, you haven't really done embedded development. Bonus points if the microcontroller you're using only comes in an OTP version.

Even big projects can be done without a debugger:

So a compiler without a debugger is somewhat limited but still useful, but a debugger without a compiler is rather less useful!

Reply to
David Brown


Well, an assembler is GOOD if it exists :)

So any soft-core with an assembler is OK, but there is no assembler for nibz? So == useless. No C, no assembler == not possible to use :(

A non-working, partially adapted, not really tested sort of Forth does not count as a development tool.

A simple assembler would.

I personally don't like C, but unfortunately cannot fully avoid it either.

Antti

Reply to
Antti.Lukats

Hear, hear! You can dump an awful lot of information out of a single pin, even floating point numbers (don't ask).

Emulators are slightly useful for checking whether one's understanding of the datasheet is correct, but unless they're a cycle-by-cycle exact replica of the microcontroller core *and* all of its peripherals, it's not really emulating, more like approximating -- and that path can lead to trouble.

--
Rich Webb     Norfolk, VA
Reply to
Rich Webb

I think you're being a bit blinkered about this. Writing an assembler for a machine as simple as nibz is not much more than a day's work if you are sensible about your choice of tool (Tcl, Perl?) and limit your ambitions reasonably. No rocket science required.

Yes. gcc is not the only act in town for these very simple, small machines. A simple assembler gets you going, and a nice macro-generating assembler gets you productive, for very little investment in the tool chain. Of course a full C toolchain is way better; but targeting gcc to a new machine is not for the faint-hearted, I think (it's not for me at all, I would have no idea how to start).
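To give an idea of the scale, here is a bare-bones single-pass assembler sketched in C rather than Tcl or Perl. The mnemonics and encodings are invented for the example (they are certainly not the nibz encoding, which I can't divine anyway); it reads mnemonics from stdin and writes one object byte per instruction to stdout.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[128], mnem[16];
    int imm;

    while (fgets(line, sizeof line, stdin)) {
        int n = sscanf(line, "%15s %d", mnem, &imm);
        if (n < 1 || mnem[0] == ';')
            continue;                           /* blank or comment line */
        if (strcmp(mnem, "push") == 0) {
            if (n < 2) {
                fprintf(stderr, "push needs an operand\n");
                return EXIT_FAILURE;
            }
            putchar(0x80 | (imm & 0x7f));       /* push small literal    */
        } else if (strcmp(mnem, "add") == 0) {
            putchar(0x01);
        } else if (strcmp(mnem, "store") == 0) {
            putchar(0x02);
        } else {
            fprintf(stderr, "unknown mnemonic: %s\n", mnem);
            return EXIT_FAILURE;
        }
    }
    return 0;
}

Labels, a second pass for forward references and a macro layer follow naturally; none of it is more than an afternoon each.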

I'd consider throwing together an assembler for nibz myself, but like many others here I simply can't divine its specification from the published docs - and that's my real problem with it. If jacko wants it to become more widely accepted he must put in the effort to document it intelligibly (or find someone else who can do so). He's competing in a very crowded marketplace, and has erected very effective barriers to other people's understanding of his offering; not a smart move.

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

Right, Jonathan.

A simple assembler would suffice; a good macro assembler is better, or one can use the C preprocessor on the asm source, or some retargetable assembler.

But if the core specs themselves are really fuzzy, then it's hard to use the core, or to write an assembler for it.

Antti

Reply to
Antti.Lukats

I've written a debugger for my last b16 version, and it doesn't come with a compiler either (only an assembler), and I assure you, the debugger *is* useful. At least that's what the coworker who does the firmware development tells me.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Reply to
Bernd Paysan

The first-generation zero-operand Forth machine, Novix, used three memory busses so it could manipulate the data stack, the return stack, and main memory in a single cycle. The three memory busses with stack pointers made it easy to switch tasks very quickly.

The 29K description also describes the stack operation on Chuck Moore's second-generation Forth chip, Sh-Boom. Sh-Boom got 100 Forth mips back in 1988, when my Intel machines got only a few Forth mips.

There is a discussion of spill/fill in Koopman's book on stack machines, where he shows how often stacks spill based on how many cells are cached in registers. However, Moore rejected hardware spill/fill in favor of software spill/fill in his full-custom VLSI zero-operand designs.

That's a good point. Unless you have banks of registers, stack cells cached in registers reduce task-switching performance. However, the fourth-generation machine designs were designed for multiprocessing and not so much for multitasking. Switching tasks and servicing interrupts need memory cycles, which take a lot of time compared to the few picoseconds a dedicated processor needs to react to an event.

According to Koopman's research, caching eight cells results in a spill about 1% of the time. My experience as director of software at the iTV Corporation, developing Internet appliances, was that it was much less than that with well-designed code.

Since spill/fill happened so infrequently in this kind of software, the decision was made to design the code around it and handle that once-in-a-decade spill/fill in software.

The philosophy mentioned says that there are a dozen things you can do to simplify the design to reduce cost and power use and increase speed. To get 700 mips in .18u using only 20k transistors, without pipelining or memory caching, you have to have a simple design. To burn thirty times less energy on a given computation than an MSP430, to respond to events in a few nanoseconds, or to fit a hundred cores on a tiny low-power embedded chip requires a simple design. Keeping stacks in registers, packing multiple opcodes per word, and decoding opcodes while they execute are all examples of the techniques used.
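To illustrate the opcode-packing technique in C (with made-up field widths; the actual chips use their own encodings), four small opcode slots packed into one word mean a single fetch feeds several execute cycles:

#include <stdint.h>
#include <stdio.h>

#define SLOT_BITS 5
#define SLOTS     4
#define SLOT_MASK ((1u << SLOT_BITS) - 1)

/* Pack SLOTS small opcodes into one word, first-to-execute in the
 * most significant slot.  The widths here are invented for the
 * example, not taken from any particular chip.                    */
uint32_t pack(const uint8_t ops[SLOTS])
{
    uint32_t word = 0;
    for (int i = 0; i < SLOTS; i++)
        word = (word << SLOT_BITS) | (ops[i] & SLOT_MASK);
    return word;
}

/* The core shifts one opcode out per cycle; while slot n executes,
 * slot n+1 is already on hand without another instruction fetch.  */
void execute_word(uint32_t word)
{
    for (int i = 0; i < SLOTS; i++) {
        uint8_t op = (word >> (SLOT_BITS * (SLOTS - 1 - i))) & SLOT_MASK;
        printf("slot %d: opcode 0x%02x\n", i, op); /* stand-in for the ALU */
    }
}

int main(void)
{
    const uint8_t ops[SLOTS] = { 0x01, 0x02, 0x03, 0x1f };
    execute_word(pack(ops));
    return 0;
}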

Of course, one of the design points of the third-generation zero-operand Forth machines was that the design fell out of greatly simplifying the compiler. The first cross-compiler added only 300 bytes to a Forth system. The idea was to simplify both the hardware and the software.

The compilers the OP was talking about are very simple. The Forth compilers for the high-performance full-custom VLSI Forth chips are relatively simple. I will admit that with many cores executing streamed instructions, the instruction-stream packet builders are useful. But it is nothing like dealing with deep pipelines and multi-level memory caches on complicated processors.

There is a pretty simple, almost one-to-one correspondence between the Forth source code and the object code. What the OP mentioned is in contrast to the complex, smart compilers that are needed with complex pipelined and cached architectures.

The full-custom VLSI CAD software used to create Moore's zero-operand designs is a good example of his approach to keeping software simple. The compilers, OS, chip design, layout, simulation, and design-rule-check software sufficient for multi-megatransistor chip design, several chip designs, and the documentation all fit easily on one floppy disk. This kind of software is a natural fit for the kind of hardware being designed in this process.

Best Wishes

Reply to
Jeff Fox

That's about right.

Forth covers a wide range of hardware and software these days. The OP mentioned Moore's ideas on the subject. Mr. Moore's current ideas are about full-custom VLSI Forth chips designed with Forth CAD software. They have 700 MHz cores that use 1/30 the power of an MSP430 for a given computation, are small enough to fit a hundred or more on a tiny chip, and are programmed with the simplest Forth compilers yet, using libraries of code objects written by various teams of people. Other people have fit multiple tiny 200-mips code-compatible cores on small FPGAs. These are some of the things the OP referred to as Chuck Moore's ideas on this subject.

The OP also talked about a project to implement a simple processor with modest performance requirements and to write the code himself. He is not alone in being interested in that sort of thing, but rolling your own hardware/software or using gcc are certainly not the only options here.

The first page of the gcc documentation will explain why you would need to choose a different path if you want to implement a C compiler for these designs.

People are making knock-offs of 20+ year old Forth chips in modern FPGAs and are now getting 50 mips or more. People are making newer, smaller Forth chips with tens or hundreds of thousands of Forth mips at very low power and cost. Some people are making their own designs tuned for the performance they need, which might be just a small control processor whose code doesn't need to be very fast, or might be specialized for some purpose.

Choosing a model that offers a few mips and writing your own assembler for it are not your only options. However, it is how the inventor of Forth got started in hardware design in the early 80s, after thirty years' experience writing code. The first step was 4-10 mips, then 100, 220, 700, 18,000, 30,000, 100k+ and beyond.

It started with one person's work and expanded to include a hundred other people doing CAD work in Forth or writing tools and library code for target chips, so you don't have to redo it all yourself.

Best Wishes

Reply to
Jeff Fox

You can also use cpp in front of your assembler.

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

Sure, you can use GCC, as I already mentioned 4 posts ago :) but it is not always as good as a good macro assembler.

Antti

Reply to
Antti.Lukats

I'm happy without a debugger, at least as long as the edit/compile/run cycle is fast. Besides, a lot of the quirks I'm chasing are timing issues where you need a scope to see what's going on.

What do you use for a debugger when working on perl/python code?

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

3 is a small number.

If you have 8 or 16, in the same ballpark as the number of registers in a typical CPU, then most code never spills to memory.

It was a long time ago, so my memory is fuzzy. I think the Mesa/Cedar system had a rule of nothing on the stack at procedure calls except the arguments/results. I think the only time sane code spilled was making calls to compute the arguments of another call, things like: x = foo(a, baz(b));

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray
