MISC - Stack Based vs. Register Based

I have been working with stack based MISC designs in FPGAs for some years. All along I have been comparing my work to the work of others. These others were the conventional RISC type processors supplied by the FPGA vendors as well as the many processor designs done by individuals or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but more importantly to me, very well in terms of size and code density. My efforts have shown it hard to improve on code density by a significant degree while simultaneously minimizing the resources used by the design. Careful selection of the instruction set can both improve code density and minimize logic used if measured together, but there is always a tradeoff. One can always be improved at the expense of the other.

The last couple of days I was looking at some code I plan to use and realized that it could be a lot more efficient if I could find a way to use more parallelism inside the CPU and use fewer instructions. So I started looking at defining separate opcodes for the two primary function units in the design, the data stack and the return stack. Each has its own ALU. The data stack has a full complement of capabilities while the return stack can only add, subtract and compare. The return stack is actually intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel capabilities of these units, I realized that many operations were just stack manipulations. Then I read the thread about the relative "cost" of stack ops vs memory accesses and I realized these were what I needed to optimize. I needed to find a way to not use an instruction and a clock cycle for moving data around on the stack.

In the thread on stack ops it was pointed out repeatedly that very often the stack operands would be optimized to register operands, meaning they wouldn't need to do the stack ops at all really. So I took a look at a register based MISC design. Guess what, I don't see the disadvantage! I have pushed this around for a couple of days and although I haven't done a detailed design, I think I have looked at it enough to realize that I can design a register oriented MISC CPU that will run as fast, if not faster than my stack based design and it will use fewer instructions. I still need to add some features like support for a stack in memory, in other words, pre-increment/post-decrement (or the other way around...), but I don't see where this is a bad design. It may end up using *less* logic as well. My stack design provides access to the stack pointers which require logic for both the pointers and muxing them into the data stack for reading.

I guess looking at other people's designs (such as Chuck's) has changed my perspective over the years so that I am willing and able to do optimizations in ways I would not have wanted to in the past. But I am a bit surprised that there has been so much emphasis on stack oriented MISC machines when it may well be that register based MISC designs are also very efficient, at least if you aren't building them to service a C compiler or trying to match some ideal RISC model.

--

Rick
Reply to
rickman

You might also look at some VLIW designs. Not that the designs themselves will be useful, but maybe some of the ideas that were used would help.

Well, much of the idea of RISC is that code density isn't very important, and that many of the more complicated instructions made assembly language programming easier, but compilers didn't use them.

Do you mean the size of CPU (lines of verilog) or size of the programs that run on it?

Seems to me that much of the design of VAX was to improve code density when main memory was still a very significant part of the cost of a machine. The large number of addressing modes allowed for efficient use of bits. It also greatly complicated making an efficient pipelined processor. With little overlap and a microprogrammed CPU, it is easy to go sequentially through the instruction bytes and process them in order. Most of the complication is in the microcode itself.

But every level of logic adds delay. Using a wide bus to fast memory is more efficient than a complicated decoder. But sometimes RISC went too far. In early RISC, there was the idea of one cycle per instruction. They couldn't do that for multiply, so they added multiply-step, an instruction that you execute many times for each multiply operation. (And maybe no divide at all.)
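Roughly like this (my own sketch of the general shift-and-add idea, not the encoding or semantics of any particular machine's multiply-step); software executes the step once per multiplier bit:

  // One "multiply step" per clock. Load the operands with load=1, then
  // clock WIDTH more times; pq ends up holding m * q.
  module mulstep #(parameter WIDTH = 8) (
      input                    clk,
      input                    load,
      input  [WIDTH-1:0]       m,    // multiplicand
      input  [WIDTH-1:0]       q,    // multiplier
      output reg [2*WIDTH-1:0] pq    // {partial product, remaining multiplier bits}
  );
      reg [WIDTH-1:0] mreg;
      // conditionally add the multiplicand to the high half...
      wire [WIDTH:0] sum = pq[2*WIDTH-1:WIDTH] + (pq[0] ? mreg : {WIDTH{1'b0}});

      always @(posedge clk)
          if (load) begin
              mreg <= m;
              pq   <= {{WIDTH{1'b0}}, q};
          end else begin
              // ...then shift the whole register pair right by one bit
              pq <= {sum, pq[WIDTH-1:1]};
          end
  endmodule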

For VLIW, a very wide instruction word allows for specifying many different operations at the same time. It relies on complicated compilers to optimally pack the multiple operations into the instruction stream.

This kind of thought is what led to the branch delay slot on many RISC processors. There is work to be done for branches, especially for branch to or return from subroutine. Letting the processor know early allows for more overlap. So, on many early (maybe not now) RISC machines the instruction after a branch instruction is executed before the branch is taken. (It might be a NOOP, though.) Again, reduced code density (if it is a NOOP) in exchange for a faster instruction cycle.

Well, consider the x87 stack instructions. Some operations exist in forms that do and don't pop something off the stack. A small increase in the number of opcodes used allows for avoiding many POP instructions.

Well, actually the 8087 was designed to be either a register or stack machine, as many instructions can index into the stack. If you keep track of the current stack depth, you can find data in the stack like in registers. Well, it was supposed to be able to do that.

Well, the stack design has the advantage that you can use instruction bits either for a memory address or for the operation, allowing for much smaller instructions. But that only works as long as everything is in the right place on the stack.

Seems to me that there hasn't been much done in stack machines since the B5500, the first computer I ever used when I was about nine.

-- glen

Reply to
glen herrmannsfeldt

I once made a CPU design for an FPGA that had multiple stacks. There was a general purpose stack "A", two index stacks "X" and "Y", and a return stack "R". ALU operations worked between A and any other stack, so they only required 2 bits in the opcode. There was also a move instruction that could move data from a source to a destination stack.

Having access to multiple stacks means you spend less time shuffling data on the stack. There's no more need for swap, over, rot and similar stack manipulation instructions. The only primitive operations you need are push and pop.

For instance, I had a load instruction that could load from memory using the address in the X stack, and push the result on the A stack. The cool part is that the X stack itself isn't changed by this operation, so the same address can be used multiple times. So, you could do a

  LOAD (X)   ; load from (X) and push on A
  1          ; push literal on A
  ADD        ; add top two elements of A
  STORE (X)  ; pop A, and store in (X)

to increment a location in memory.

And if you wanted to increment X to access the next memory location, you'd do:

  1          ; push literal on A
  ADD X      ; pop X, pop A, add, and push result on A
  MOVE A, X  ; pop A, and push on X

It was an 8 bit architecture with 9 bit instructions (to match the FPGA block RAM + parity bit). Having 9 bit instructions allows an 8 bit literal push to be encoded in 1 instruction.

Feel free to e-mail if you want more details.

Reply to
Arlet Ottens

Are those your actual results or did you just reiterate what is on Wikipedia? Yes, that's a serious question. Read the MISC page:

formatting link

See ... ?!

Code density is a CISC concept. I don't see how it applies to your MISC project. Increasing code density for a MISC processor means implementing more powerful instructions, i.e., those that do more work, while minimizing bytes in the instruction opcode encoding. Even if you implement CISC-like instructions, you can't forgo the MISC instructions you already have in order to add the CISC-like instructions. So, to do that, you'll need to increase the size of the instruction set, as well as implement a more complicated instruction decoder. I.e., that means the processor will no longer be MISC, but MISC+minimal CISC hybrid, or pure CISC...

No offense, but you seem to be "reinventing the wheel" in terms of microprocessor design. You're coming to the same conclusions that were found in the 1980's, e.g., concluding a register based machine can perform better than a stack based machine, except you've applied it to MISC in an FPGA package... How is that a new conclusion?

Also, please read up on CISC and RISC:

formatting link
formatting link

You posted to two groups. Which group was the " ... thread about the relative 'cost' of stack ops vs memory accesses ..." posted on? comp.lang.forth? comp.arch.fpga? I can look it up, but I'd rather not.

Also, you cross-posted to comp.arch.fpga. While they'll likely be familiar with FPGAs, most there are not going to be familiar with the features of stack-based processors or Forth processors that you discuss indirectly within your post. They might not be familiar with ancient CISC concepts such as "code density" either, or understand why it was important at one point in time. E.g., I suspect this Forth related stuff from above won't be widely understood on c.a.f. without clarification:

"peoples[sic] designs (such as Chuck's)"

- various Forth processors by Charles Moore

"the data stack and the return stack."

- interpreted Forth machine model

Rod Pemberton

Reply to
Rod Pemberton

Let's see, this would be a hybrid really, between register based and stack based as it has multiple stacks which must be selected in the instruction set like registers, just not many.

There is the extra hardware required to implement multiple stacks. Why have quite so many? I use the return stack for addresses. That works pretty well. Maybe A, X and R?

But you aren't quite done yet, at least if you care about stack overflows. Chuck's designs don't care. If he has data on the bottom of the stack he can just leave it. Otherwise you need to drop the address on the X stack.

I looked at this when designing my MISC processor. I ended up with two fetches and two stores. One just does the fetch or store and pops the address. The other does a post increment and retains the address to be used in a loop. This one would have to be dropped at the end.
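In hardware terms it amounts to something like this (only a sketch of the address-stack update; the names and widths are made up for illustration):

  // How the two fetch/store flavors treat the address stack: the plain
  // one pops the address, the looping one keeps it, post-incremented.
  module addr_update #(parameter W = 16) (
      input              do_mem,    // a fetch or store executes this cycle
      input              keep_inc,  // 1: post-increment and retain, 0: pop
      input  [W-1:0]     r_tos,     // top of the address (return) stack
      input  [W-1:0]     r_next,    // next element down, from the stack RAM
      output [W-1:0]     mem_addr,
      output reg [W-1:0] r_tos_next
  );
      assign mem_addr = r_tos;      // both flavors address memory with the top

      always @(*)
          if (!do_mem)       r_tos_next = r_tos;
          else if (keep_inc) r_tos_next = r_tos + 1'b1;
          else               r_tos_next = r_next;      // pop: next element moves up
  endmodule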

Add an autoincrement to the index stacks and these three instructions go away.

I was all about 9 bit instructions and 18 bit address/data bus sizes until I started working with the iCE40 parts which only have 8 bit memory. I think the parity bit is a comms industry thing. That one bit makes a *big* difference in the instruction set capabilities.

Thanks.

--

Rick
Reply to
rickman

It sounds to me rickman is questioning the (unsupported) claims on wikipedia that stack based machines have an advantage in size and/or simplicity, not reiterating them.

It is perfectly possible to trade one type of MISC processor for another one. The choice between stack and register based is an obvious one. If you switch from stack to register based, there's no need to keep stack manipulation instructions around.

The design of simple and compact processors is of great interest to many FPGA engineers. Plenty of FPGA designs need some sort of control processor, and for cost reduction it's important to use minimal resources. Like rickman said, this involves a careful balance between implementation complexity, speed, and code density, while also considering how much work it is to write/maintain the software that's running on the processor.

Code density is still critically important. Fast memory is small, both on FPGA as well as general purpose processors.

Reply to
Arlet Ottens

Exactly. And as a hybrid, it offers some advantages from both kinds of designs.

I had all stacks implemented in the same block RAM, just using different sections of it. But you are right, in my implementation I had reserved room for the Y stack, but never really implemented it. Just using the X was sufficient for the application I needed the CPU for. Of course, for other applications, having an extra register/stack that you can use as a memory pointer could be useful, so I left the Y register in the instruction encoding. For 3 registers you need 2 bits anyway, so it makes sense to allow for 4.
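In outline it looked something like this (a from-memory sketch, not the actual source; depth, widths and the port structure are simplified):

  // Four stacks (A, X, Y, R) sharing one RAM, each in its own quarter
  // of the address space with its own pointer. Asynchronous read is
  // shown for brevity; block RAM inference normally wants a registered
  // read address. All names and widths are illustrative.
  module multistack #(parameter DEPTH_BITS = 6) (
      input              clk,
      input      [1:0]   sel,    // 0=A, 1=X, 2=Y, 3=R
      input              push,
      input              pop,
      input      [7:0]   wdata,
      output     [7:0]   rdata   // current top of the selected stack
  );
      reg [7:0]            ram [0:(4<<DEPTH_BITS)-1];
      reg [DEPTH_BITS-1:0] sp  [0:3];                 // one pointer per stack

      wire [DEPTH_BITS+1:0] top_addr  = {sel, sp[sel]};
      wire [DEPTH_BITS+1:0] push_addr = {sel, sp[sel] + 1'b1};

      assign rdata = ram[top_addr];

      always @(posedge clk)
          if (push) begin
              ram[push_addr] <= wdata;
              sp[sel] <= sp[sel] + 1'b1;
          end else if (pop)
              sp[sel] <= sp[sel] - 1'b1;
  endmodule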

Correct, if you no longer need the address, you need to drop it from the stack. On the other hand, if you let it drop automatically, and you need it twice, you would have to dup it. Intuitively, I would say that in inner loops it would be more common that you'd want to reuse an address (possibly with offset or autoinc/dec).

Agreed. As soon as you use the parity bits, you're tied to a certain architecture. On the other hand, if you've already chosen a certain architecture, and the parity bits are available for free, it can be very advantageous to use them.

Reply to
Arlet Ottens

(snip)

Well, yes, VLIW is just about like compiling from source directly to microcode. The VLIW processor currently in production is the Itanium. I believe it has 128 bit wide instructions, which specify many different operations that can happen at the same time. 128 general and 128 floating point registers, 128 bit wide data bus, but it uses a lot of power. Something like 100W, with Icc of 100A and Vcc about 1V.

I have an actual RX2600 dual Itanium box, but don't run it very often, mostly because of the power used. (snip)

The early SPARC didn't have a multiply instruction. (Since it couldn't be done in one cycle.) Instead, they did multiply, at least the Sun machines did, through software emulation, with a software trap.

Early when I started using a Sun4/110 I generated a whole set of fonts for TeX from Metafont source, which does a lot of multiply. Unix (SunOS) keeps track of user time (what your program does) and system time (what the OS does while executing your program). Multiply counted as system time, and could be a large fraction of the total time.

For Itanium, the different units do different things. There are instruction formats that divide up the bits in different ways to make optimal use of the bits. I used to have the manual nearby, but I don't see it right now. (snip)

Yes, that counts, but it gets much more interesting with pipelines like the Cray-1 that, after some cycles of latency, generate one result every clock cycle.

(snip)

For x87, they avoid the DUP, SWAP, and such by instructions being able to specify any register in the stack. You can, for example, add any stack register (there are eight) to the top of stack. I haven't thought about it for a while, but I believe either pushing the result, or replacing the previous top of the stack. (snip)

Seems to me that one possibility is to have a really functional stack operation instruction, such that, with the given number of bits, it allows for the most operations. Some combination of DUP, SWAP, and POP all at once. Though that isn't easy for the stack itself.

-- glen

Reply to
glen herrmannsfeldt

The original VAX series (780, then 730 and 750) and the uVAX I and uVAX II had no cache, so reducing rate of memory fetches was also important. The first cached machine I used was the uVAX III (KA650) and a good part of the performance increase was due to the cache.

Jon

Reply to
Jon Elson

Well, you snipped a lot of context. I wouldn't have reformatted all of it either.

Anyway, what I took from rickman's statements was that he had only managed to confirm what is already known about MISC according to Wikipedia. I.e., what was/is the point?

Yes, true. But, what does eliminating MISC stack instructions and replacing them with MISC register instructions have to do with the CISC concept of code density or with CISC instructions? ...

So, shouldn't he dump the entire MISC instruction set he has, and implement a CISC instruction set instead? That's the only way he's going to get the "critically important" code density, which I'll take it you rank well above MISC as being important. Of course, a CISC instruction set requires a more complicated instruction decoder ... So, it seems that either way he proceeds, he is being constrained by the "minimal resources" of his FPGA. That was what he stated:

"My efforts have shown it hard to improve on code density by a significant degree while simultaneously minimizing the resources used by the design."

I.e., if the FPGA he is attempting to use is insufficient to do what he wants or needs, then it's insufficient. Or, he needs some new techniques. He didn't explicitly ask for any though ...

Glen mentioned the numerous address modes of the VAX. The 6502 also had a lot of address modes and had instructions which used zero-page as a set of fast registers. I would think that early, compact designs like the 6502 and Z80 could be useful to rickman. They had low transistor counts.

  Z80    8500 transistors
  6502   9000 transistors
  8088  29000 transistors (for comparison...)

Rod Pemberton

Reply to
Rod Pemberton

(snip)

(snip, someone wrote)

Seems to me that one could still Huffman code the opcode, even within the MISC concept. That is, use fewer bits for more common operations, or where it otherwise simplifies the result.

As someone noted, you can have an N-1 bit load immediate instruction where N is the instruction size.

In one of Knuth's TAOCP books he describes a two instruction computer.

Seems like if one is interested in MISC, someone should build that one.

Also, maybe a MIX and MMIX machine, maybe the decimal version.

For those who don't read TAOCP, MIX is defined independent of the underlying base. Programs in MIXAL are supposed to assemble and run correctly on hosts that use any base within the range specified.

Instruction bytes have, if I remember, between 64 and 100 possible values, such that six bits or two decimal digits are possible representations.

I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

-- glen

Reply to
glen herrmannsfeldt

I don't think 'code density' is a CISC concept at all. Code density applies to any kind of instruction encoding.

Any kind of MISC architecture will have a certain code density (for a given application), and require a certain amount of FPGA resources. And if you store the code in local FPGA memory, and you know your application, you can even convert it all to FPGA resources.

Obviously, one MISC design will have better resource use than another, and one of the questions is whether we can make any kind of general statement about stack vs register implementation.

Exactly. If you want to get FPGA resources low, a MISC design seems most appropriate. But within the MISC concept, there are still plenty of design decisions left.

Really, the biggest problem with FPGA CPU design is that there are too many design decisions, and each decision influences everything else.

Low transistor counts do not necessarily translate to low FPGA resources. Early CPUs used dynamic storage, dual clock latches, pass logic and tri-state buses to create really small designs that don't necessarily map well to FPGAs. On the other hand, FPGAs have specific features (depending on the brand) that can be exploited to create really tight designs.

Also, these processors are slow, using multi cycle instructions, and 8 bit operations. That may not be acceptable. And even if low performance is acceptable, there are numerous other ways where you can trade speed for code density, so you'd have to consider these too. For instance, I can replace pipelining with multi-cycle execution, or use microcode.

Reply to
Arlet Ottens

(snip)

A 100 watt lightbulb uses about $10 a month if left on 24/7. So I wouldn't worry too much about the cost of running your Itanium. If you turn it off when you aren't using it I can't imagine it would really cost anything noticeable to run.

Yes, the array processor I worked on was coded from scratch, very laboriously. The Itanium is trying to run existing code as fast as possible. So they have a number of units to do similar things, but also different types, all working in parallel as much as possible. Also the parallelism in the array processor was all controlled by the programmer. In regular x86 processors the parallelism is controlled by the chip itself. I'm amazed sometimes at just how much they can get the chip to do, no wonder there are 100's of millions of transistors on the thing. I assume parallelism in the Itanium is back to being controlled by compiler smarts, since it needs to be coded into the VLIW instructions.

That is something I have not yet looked at, a hybrid approach with a stack, but also with stack top relative addressing. It is important to keep the hardware simple, but that might not be too hard to implement. Arlet talked about his hybrid processor design with multiple stacks. I would need to give this a bit of thought as the question becomes, what possible advantage would a hybrid processor have over the other two? Actually, the image in my mind is a bit like the C language stack frame model. You can work with addressing relative to the top of stack and when a sub is called it layers its variables on top of the existing stack. That would require a bit of entry and exit code, so there will be a tradeoff between simple register addressing and simple subroutine entry.

Like many tradeoffs, this can complicate the hardware. If you want to be able to combine an ADD with a SWAP or OVER, you need to be able to access more than two operands on the stack at once. In my designs that means pulling another register out of the proper stack. Rather than one register and a block of memory, each stack would need two registers and the block of memory along with the input multiplexers. This would need to be examined in context of common code to see how much it would help vs the expense of added hardware.
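For what it's worth, the sort of structure I mean looks roughly like this (a sketch only; widths and control signals are invented for illustration):

  // Data stack with the top two entries (TOS, NOS) held in registers and
  // the rest in block RAM. Keeping two entries in flip-flops is what lets
  // an ALU op see both operands in one cycle.
  module stack2 #(parameter W = 16, parameter DEPTH_BITS = 5) (
      input              clk,
      input              push,   // push din; old TOS becomes NOS, NOS spills to RAM
      input              pop,    // drop TOS; NOS becomes TOS, refill NOS from RAM
      input  [W-1:0]     din,
      output reg [W-1:0] tos,
      output reg [W-1:0] nos
  );
      reg [W-1:0]          ram [0:(1<<DEPTH_BITS)-1];
      reg [DEPTH_BITS-1:0] sp;                        // count of spilled entries

      always @(posedge clk)
          if (push) begin
              ram[sp] <= nos;
              sp      <= sp + 1'b1;
              nos     <= tos;
              tos     <= din;
          end else if (pop) begin
              tos     <= nos;
              nos     <= ram[sp - 1'b1];   // async read shown for brevity
              sp      <= sp - 1'b1;
          end
  endmodule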

I'm still not complete in my analysis of a register based MISC design, but at the moment I think the register approach gives better instruction size efficiency/faster execution as well as simpler hardware design.

--

Rick
Reply to
rickman

You clearly don't understand my post or the Wiki article or possibly both. I suggest you reread both and then ask questions if you still don't understand.

Yes, I get that you don't understand. Do you have a specific question?

Yes, that part you seem to understand.

Really? I can't drop the entire instruction set and start over? Who says so? Am I breaking a law?

Define "increase the size of the instruction set". I am using a 9 bit opcode for my stack design and am using a similar 9 bit opcode for the register design. In what way is the register design using a larger instruction set?

That was exactly the blinder I was wearing until now. I had read a lot about register CPU instruction sets where the intent was not in line with MISC goals. MicroBlaze is a good example. I think it uses well in excess of 1000 LUTs, maybe multiple 1000's. I need something that is much smaller and it hadn't occurred to me that perhaps the common goals of register designs (lots of registers, orthogonality, address mode flexibility, etc) could be limited or even tossed out the window. I don't need a machine that is easy for a C compiler to produce code for. My goals are for minimal hardware without losing any more performance than is essential. In particular I work in FPGAs, so the design needs to work well in that environment.

Nonsense. "Minimal Instruction Set Computer (MISC) is a processor architecture with a very small number of basic operations and corresponding opcodes." from Wikipedia. BTW, I don't think the term MISC is widely used and is not well defined. This is the only web page I found that even tries to define it.

Actually, if you consider only the opcodes and not the operand combinations, I think the register design may have fewer instructions than does the stack design. But the register design still is in work so I'm not done counting yet.

There are some interesting corners to be explored. For example a MOV rx,rx is essentially a NOP. There are eight of these. So instead, why not make them useful by clearing the register? So MOV rx,rx is a clear to be given the name CLR rx... unless the rx is r7 in which case it is indeed a MOV r7,r7 which is now a NOP to be coded as such. The CLR r7 is not needed because LIT 0 already does the same job. Even better the opcode for a NOP is 0x1FF or octal 777. That's very easy to remember and recognize. It feels good to find convenient features like this and makes me like the register MISC design.
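In decode terms it is nearly free, something like this (the signal names are placeholders, not my actual design):

  // MOV rx,rx re-used: all-ones (0x1FF, octal 777) stays a NOP, the other
  // seven "move a register to itself" encodings become CLR rx.
  module clr_decode (
      input  [8:0] instr,
      input        op_is_mov,   // MOV group decoded elsewhere (assumed)
      input  [2:0] rd, rs,      // register fields extracted elsewhere (assumed)
      output       is_nop,
      output       is_clr
  );
      assign is_nop = (instr == 9'o777);
      assign is_clr = op_is_mov && (rd == rs) && !is_nop;
  endmodule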

I don't see how you can say that. I don't know of other MISC designs that are very good. I think the picoBlaze is one, but I don't think it was designed to be very MISC really. It was designed to be small, period. It has a lot of fixed features and so can't be altered without tossing old code. Some of the MicroChip devices might be MISC. I have not worked with them. I might want to take a look actually. There may be useful ideas.

But "reinventing the wheel"? I don't think so.

--

Rick
Reply to
rickman

It is a dual processor box, plus all the rest of the systems in the box. Yes, I was considering running it all the time, but it is too expensive for that.

Seems to me that the big problem with the original Itanium was the need to also run x86 code. That delayed the release for some time, and in that time other processors had advanced. I believe that later versions run x86 code in software emulation, maybe with some hardware assist.

-- glen

Reply to
glen herrmannsfeldt

You really need to reread the wiki description of MISC. There seems to be a disconnect.

Weren't you the person who brought CISC into this discussion? Why are you asking this question about CISC?

I have "dumped" the stack related portion of the instruction set and replaced it with register references. Why would I do otherwise?

No one said anything about "insufficient". I am looking for an optimal design for a small CPU that can be used efficiently in an FPGA.

I'm not counting transistors. That is one of the fallacies of comparing chip designs to FPGA designs. When you have the flexibility of using transistors you can do things in ways that are efficient that are not efficient in the LUTs of an FPGA.

The two big constraints are resources used in the FPGA and opcode size. If an instruction would require addition of resources to the design, it needs to justify that addition. An instruction also needs to justify the portion of the opcode space it requires. Operations that specify more than one register become very expensive in terms of opcode bits. Operations that require additional data paths become expensive in terms of resources. Doing as much as possible with as little as possible is what I'm looking for.
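To put rough numbers on that (using the 9 bit opcode and eight registers from my design, and ignoring any bits reserved for immediates or other flags):

  two register fields:  2 x 3 = 6 bits, leaving 9 - 6 = 3 bits, so only 8 distinct two-register opcodes
  one register field:   3 bits, leaving 9 - 3 = 6 bits, so 64 distinct one-register opcodes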

BTW, the ZPU is a pretty good example of just how small a design can be. But it is very slow. That's the other side of the equation. Improve speed as much as possible while not using more resources than necessary. The optimal point depends on the coefficients used... in other words, my judgement. This is not entirely objective.

--

Rick
Reply to
rickman

Yup, I am still using the high bit to indicate an immediate operand. This requires an implied location, so this literal is always in R7, the addressing register. In my stack design it is always the return stack, also the addressing register.

Jumps and calls still contain a small literal to be used alone when the relative jump is within range or with the literal instruction when not.

I think this is very similar to Huffman encoding.
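As a decode sketch, that high bit costs almost nothing (the opcode/register split below is only an illustration, not my final encoding):

  // 9-bit instruction: MSB set selects an 8-bit immediate that goes to
  // the implied register R7; otherwise the word is a normal opcode.
  module imm_decode (
      input  [8:0] instr,
      output       is_imm,   // immediate operand, destined for R7
      output [7:0] imm,
      output [1:0] op,       // assumed 2-bit operation field
      output [2:0] rd, rs    // assumed 3-bit register fields
  );
      assign is_imm       = instr[8];
      assign imm          = instr[7:0];
      assign {op, rd, rs} = instr[7:0];   // 2 + 3 + 3 bits when is_imm is 0
  endmodule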

You can design a one instruction computer, but there is a balance between resources used and the effectiveness of the resulting design. The effectiveness of this sort of design is too low.

Doesn't sound like an especially practical computer. Has anyone ever built one?

--

Rick
Reply to
rickman

I think you have a very good handle on the nature of my goals.

--

Rick
Reply to
rickman

(snip, I wrote)

Well, it exists mostly to write the examples (and, I believe, homework problems) in the book. I believe that there have been software emulation, but maybe no hardware (FPGA) versions.

MIXAL programs should be base independent, but actual implementations are likely one base.

(The model number, 1009 or MIX in roman numerals, is supposed to be the average of the model numbers of some popular machines.)

There is also the DLX, a RISC machine used by Hennessy and Patterson in their book, which could also be in roman numerals.

-- glen

Reply to
glen herrmannsfeldt

x86 compatibility was not the "big" problem with the Itanium (though it didn't help). There were two far bigger problems. One is that the chip was targeted at maximising throughput with little regard for power efficiency, since it was for the server market - so all of the logic was running all of the time to avoid latencies, and it has massive caches that run as fast as possible. The result here is that the original devices had a power density exceeding that of the core of a nuclear reactor (it was probably someone from AMD who worked that out...).

The big problem, however, is that the idea with VLIW is that the compiler does all the work scheduling instructions in a way that lets them run in parallel. This works in some specialised cases - some DSP's have this sort of architecture, and some types of mathematical algorithms suit it well. But when Intel started work on the Itanium, compilers were not up to the task - Intel simply assumed they would work well enough by the time the chips were ready. Unfortunately for Intel, compiler technology never made it - and in fact, it will never work particularly well for general code. There are too many unpredictable branches and conditionals to predict parallelism at compile time. So most real-world Itanium code uses only about a quarter or so of the processing units in the cpu at any one time (though some types of code can work far better). Thus Itanium chips run at half the real-world speed of "normal" processors, while burning through at least twice the power.

Reply to
David Brown
