MISC - Stack Based vs. Register Based

Yes.

You mentioned code density. AISI, code density is purely a CISC concept. They go together and are effectively inseparable.

RISC was about effectively using all the processor clock cycles by using fast instructions. RISC wasn't concerned about the encoded size of instructions, how much memory a program consumed, the cost of memory, or how fast memory needed to be.

CISC was about reducing memory consumed per instruction. CISC reduced the average size of encoded instructions while also increasing the amount of work each instruction performs. CISC was typically little-endian to reduce the space needed for integer encodings. However, increasing the amount of work per instruction produces the highly specialized instructions that are characteristic of CISC. You only need to look at the x86 instruction set to find some, e.g., STOS, LODS, XLAT. They are also slow to decode and execute compared to RISC.

So, if memory is cheap and fast, there is no point in improving code density, i.e., use RISC. If memory is expensive or slow, use CISC.

Arlet mentioned changes to a processor that appeared to me to have nothing to do with increasing or decreasing code density. AISI, the changes he mentioned would only affect what was in the current set of MISC instructions, i.e., either a set of register-based MISC instructions or a set of stack-based MISC instructions. This was stated previously.

Rod Pemberton

Reply to
Rod Pemberton

...

It's just a logical outcome, AISI. The design criteria that you stated was that of producing MISC processors. MISC seems to be purely about minimizing the quantity of instructions. You've produced a MISC processor. So, if you now change your mind about MISC and add additional instructions to your processor's instruction set, especially non-MISC instructions, you're effectively going against your own stated design requirement: MISC, or reducing the quantity of instructions. So, it'd be you who says so, or not... It just seems that your current self is contradicting your past self by changing course now.

You'll have more instructions in your instruction set.

You haven't added any additional CISC-like instructions, yet. You just exchanged stack operations for register operations. So, none for now.

How exactly do fewer instructions contribute to increased code density for the remaining instructions? The eliminated instructions are no longer a measured component of code density, i.e., they no longer consume memory and therefore aren't measured.

I can see that you're attempting to minimize the quantity of implemented instructions. Although similar in nature, that's not the same as improving the code density. Are you conflating two different concepts? One of them reduces the encoded size of an instruction, while the other eliminates instructions ...

How are you going to attempt to increase the code density for your processor?

1) adding new, additional, more powerful instructions that you don't already have
2) merging existing instructions into fewer instructions
3) finding a more compact method of instruction encoding
4) using little-endian to reduce encoded sizes of integers
5) none or other

I'd think it should be #1 and #3 and #4, or #2 and #3 and #4, or "other" and #3 and #4 ...

Rod Pemberton

Reply to
Rod Pemberton

They do go together, but I am not so sure that they are inseparable.

CISC began when much coding was done in pure assembler, and anything that made that easier was useful. (One should figure out the relative costs, but at least it was in the right direction.)

That brought instructions like S/360 EDMK and VAX POLY. (Stories are that on most VAX models, POLY is slower than an explicit loop.) Now, there is no need to waste bits, and so instruction formats were defined to use the available bits.

S/360 (and successors) only have three different instruction lengths, and even then sometimes waste bits.

The VAX's huge number of different instruction lengths, and also IA32's, does seem to be for code size efficiency. VAX was also defined with a 512 byte page size, even after S/370 had 2K and 4K pages. Way too small, but maybe it seemed right at the time.

Yes, but that doesn't mean that CISC is concerned with the size of instructions.

Even if it is true (and it probably is) that CISC tends to make efficient use of the bits, that doesn't prove that is what CISC was about.

As above, CISC was about making coding easier for programmers, specifically assembly programmers. Now, complex instructions take less space than a series of simpler instructions, but then again one could use a subroutine call.

The PDP-10 has the indirect bit, allowing for nested indirection, which may or may not make efficient use of that bit. S/360 uses a sequence of L (load) instructions to do indirection.

Instruction usage statistics noting how often L (load) was executed in S/360 code may have been the beginning of RISC.

This I don't understand at all. They take the same amount of space. Little endian does make it slightly easier to do a multiword (usually byte-by-byte) add, and that may have helped for the 6502. It allows one to propagate the carry in the same order one reads bytes from memory.
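To make that concrete, a small C sketch (mine, not from any particular machine) of a byte-at-a-time add on little-endian operands:

#include <stdint.h>
#include <stddef.h>

/* Add two little-endian integers of n bytes each, one byte at a time,
   the way a 6502-style CPU would.  The carry moves from the low byte
   upward, which is also the order the bytes sit in memory. */
void add_le(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned sum = a[i] + b[i] + carry;
        dst[i] = (uint8_t)sum;
        carry = sum >> 8;
    }
}

With big-endian operands you would have to walk from the high address backwards (or hold the carry), which is the small convenience in question.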

But once you add multiply and divide, the advantage is pretty small.

Those are not very CISCy compared with some S/360 or VAX instructions. Now, compare to S/360 TR which will translate by looking up bytes in a lookup table for 1 to 256 byte long strings. (Unless I remember wrong, XLAT does one byte.)
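Rough C equivalents, just to show the difference in scope (these are illustrations, not the exact instruction semantics):

#include <stddef.h>
#include <stdint.h>

/* x86 XLAT: translate a single byte through a 256-entry table. */
uint8_t xlat_one(const uint8_t table[256], uint8_t b)
{
    return table[b];
}

/* S/360 TR: the same table lookup applied to a whole 1 to 256 byte
   string in one instruction. */
void tr_string(uint8_t *buf, size_t len, const uint8_t table[256])
{
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}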

Well, RISC is more toward using simpler instructions that compilers actually generate and executing them fast. Having one instruction size helps things go fast, and tends to be less efficient with bits.

Even so, I believe that you will find that RISC designers try to make efficient use of the bits available, within the single size instruction constraint.

Decoding multiple different instruction formats tends to require complicated demultiplexers which are especially hard to do in an FPGA. Even so, one can make efficient use of the bits and still be MISC.

-- glen

Reply to
glen herrmannsfeldt

Ok, so that's how you see it.

Don't know why you are even mentioning RISC.

I think CISC was not solely about reducing memory used per instruction. CISC was not an area of work. CISC was not even coined until long after many CISC machines were designed. Most CISC computers were designed with very different goals in mind. For example, the x86, a CISC processor, was initially designed to extend an existing instruction set to a 16 bit processor and then to a 32 bit processor. The goal was just to develop an instruction set that was backwards compatible with existing processors while adding capabilities that would make 32 bit processors marketable.

LOL, that is a pretty MINIMAL analysis of computers, so I guess it is MACA, Minimal Analysis of Computer Architectures.

This is so far out of context I can't comment.

--

Rick
Reply to
rickman

Your logic seems to be flawed on so many levels. I don't think I stated that producing a MISC processor was a "design criteria". It doesn't even make sense to have that as a "design criteria".

I never said I was "adding" instructions to some existing instruction set. In fact, I think I've said that the instruction set for the register based MISC processor is, so far, *smaller* than the instruction set for the stack based MISC processor as long as you don't consider each combination of X and Y in MOV rx,ry to be a separate instruction. If you feel each combination is a separate instruction then they both have approximately the same number of instructions, since they both have 9 bit instructions and so have 512 possible instructions.

Sorry, that isn't a good definition because you used part of the term you are defining in the definition.

Ok, now we are getting somewhere. In fact, if you read my other posts, you will find that I *won't* be adding any CISC instructions because one of my stated "design criteria" is that each instruction executes in one clock cycle. It's pretty hard to design a simple machine that can do "complex" instructions without executing them in multiple clock cycles.

Not sure what you mean here. Code density is how many instructions it takes to do a given amount of work. I measure this by writing code and counting the instructions it takes. Right now I have a section of code I am working on that performs the DDS calculations from a set of control inputs to the DDS. This is what I was working on when I realized that a register based design likely could do this without the stack ops, OVER mainly, but also nearly all the others that just work on the top two stack items.

So far it appears the register based instructions are significantly more compact than the stack based instructions. Just as important, the implementation appears to be simpler for the register based design. But that is just *so far*. I am still working on this. The devil is in the details and I may find some aspects of what I am doing that cause problems and can't be done in the instruction formats I am planning or something blows up the hardware to be much bigger than I am picturing at the moment.

You really don't seem to understand what I am doing. You continually misinterpret what I explain.

Uh, I am designing an instruction set that does as much as possible with as little hardware as possible. When you say, "new, additional" instructions, compared to what? When you say "more compact", again, compared to what exactly?

When you say "litte-endian" to reduce encoded integer size, what exactly is that? Are you referring to specifying an integer in small chunks so that sign extension allows the specification to be limited in length? Yes, that is done on both the stack and register based designs. Koopmans paper lists literals and calls as some of the most frequently used instructions, so optimizing literals optimizes the most frequently used instructions.

--

Rick
Reply to
rickman

Can you achieve as fast interrupt response times on a register-based machine as a stack machine? OK, shadow registers buy you one fast interrupt, but that's sort of a one-level 2D stack.

Even the venerable RTX2000 had an impressive (IIRC) 200ns interrupt response time.

Cheers

--
Syd
Reply to
Syd Rumpo

It depends on the implementation.

The easiest thing would be to not save anything at all before jumping to the interrupt handler. This would make the interrupt response really fast, but you'd have to save the registers manually before using them. It would benefit systems that don't need many (or any) registers in the interrupt handler. And even saving 4 registers at 100 MHz only takes an additional 40 ns.

If you have parallel access to the stack/program memory, you could do what the Cortex does, and save a few (e.g. 4) registers on the stack while you fetch the interrupt vector and refill the execution pipeline at the same time. This adds a considerable bit of complexity, though.

If you keep the register file in a large memory, like an internal block RAM, you can easily implement multiple sets of shadow registers.

Of course, an FPGA comes with flexible hardware such as large FIFOs, so you can generally avoid the need for super fast interrupt response. In fact, you may not even need interrupts at all.

Reply to
Arlet Ottens

But, of course, this is a fallacy. The same goal is accomplished by macros, and better. Code density is the only valid reason.

--
Albert van der Horst, UTRECHT,THE NETHERLANDS 
Economic growth -- being exponential -- ultimately falters. 
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

Speed is another valid reason.

Reply to
Arlet Ottens

Presumably some combination of ease of coding, speed, and also Brooks' "Second System Effect".

Paraphrasing from "Mythical Man Month" since I haven't read it recently, the ideas that designers couldn't implement in their first system that they designed, for cost/efficiency/whatever reasons, come out in the second system.

Brooks wrote that more for OS/360 (software) than for S/360 (hardware), but it might still have some effect on the hardware, and maybe also for VAX.

There are a number of VAX instructions that seem like a good idea but, as I understand it, ended up slower than doing the same thing without the special instructions.

As examples, both the VAX POLY and INDEX instructions. When VAX was new, compiled languages (Fortran for example) pretty much never did array bounds testing. It was just too slow. So VAX supplied INDEX, which in one instruction did the multiply/add needed for a subscript calculation (you do one INDEX for each subscript) and also checked that the subscript was in range. Nice idea, but it seems that even with INDEX it was still too slow.
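In C terms, each INDEX folded roughly this much work into one instruction per subscript (the operand names and exact operand order here are illustrative; the VAX architecture manual has the real definition):

#include <stdio.h>
#include <stdlib.h>

/* Range-check the subscript, then do the multiply/add that accumulates
   the flat array index across dimensions. */
long index_step(long indexin, long subscript, long low, long high, long size)
{
    if (subscript < low || subscript > high) {
        fprintf(stderr, "subscript out of range\n");   /* the hardware trapped */
        exit(1);
    }
    return (indexin + subscript) * size;
}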

Then POLY evaluates a whole polynomial, such as is used to approximate many mathematical functions, but again, as I understand it, too slow.
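The explicit loop it competes with is just Horner's rule, e.g. in C (coefficient ordering here is a choice for the example, not necessarily what POLY expected):

/* Evaluate a polynomial by Horner's rule: one multiply-add per
   coefficient, highest-order coefficient first. */
double poly_eval(double x, const double *coeff, int degree)
{
    double result = coeff[0];
    for (int i = 1; i <= degree; i++)
        result = result * x + coeff[i];
    return result;
}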

Both the PDP-10 and S/360 have the option for an index register on many instructions, where when register 0 is selected no indexing is done. VAX instead has indexed as a separate address mode selected by the address mode byte. Is that the most efficient use for those bits?

-- glen

Reply to
glen herrmannsfeldt

(snip)

If you disable interrupts so that another one doesn't come along before you can save enough state for the first one, yes.

S/360 does it with no stack. You have to have some place in the low (first 4K) address range to save at least one register. The hardware saves the old PSW at a fixed (for each type of interrupt) address, which you also have to move somewhere else before enabling more interrupts of the same type.

-- glen

Reply to
glen herrmannsfeldt

ARM7 is similar. PC and PSW are copied to registers, and further interrupts are disabled. The hardware does not touch the stack. If you want to make nested interrupts, the programmer is responsible for saving these registers.

ARM Cortex has changed that, and it saves registers on the stack. This allows interrupt handlers to be written as regular higher-level language functions, and also allows easy nested interrupts.

When dealing with back-to-back interrupts, the Cortex takes a shortcut, and does not pop/push the registers, but just leaves them on the stack.

Reply to
Arlet Ottens

The best interrupt implementation just jumps to the handler code. The implementation knows what registers it has to save and restore, which may be only one or two. Saving and restoring large register files takes cycles!

Interrupts are good. I don't know why people worry about them so!

Cheers, Elizabeth

--
================================================== 
Elizabeth D. Rather   (US & Canada)   800-55-FORTH 
FORTH Inc.                         +1 310.999.6784 
5959 West Century Blvd. Suite 700 
Los Angeles, CA 90045 
http://www.forth.com 

"Forth-based products and Services for real-time 
applications since 1973." 
==================================================
Reply to
Elizabeth D. Rather

Albert, do you have a reference about this?

I think you have just described the CISC instruction development concept. Build a new machine, add some new instructions. No big rationale, no "CISC" concept, just "let's make it better, why not add some instructions?"

I believe if you check you will find the term CISC was not even coined until after RISC was invented. So CISC really just means, "what we used to do".

--

Rick
Reply to
rickman

(snip, then I wrote)

Yes, but remember that there is competition and each has to have some reason why someone should buy their product. Adding new instructions was one way to do that.

Well, yes, but why did "we used to do that"? For S/360, a lot of software was still written in pure assembler, for one reason to make it faster, and for another to make it smaller. And people were just starting to learn that people (writing software) are more expensive than machines (hardware). Well, that is about the point at which it became true. For earlier machines you were lucky to get one compiler and enough system to run it.

And VAX was enough later and even more CISCy.

-- glen

Reply to
glen herrmannsfeldt


Let's take two commonly used S/360 opcodes as an example of CISC: some move operations. MVC (move 0 to 255 bytes) and MVCL (move 0 to 16M bytes). MVC does no padding or truncation. MVCL can pad and truncate, but unlike MVC will do nothing and report overflow if the operands overlap. MVC appears to other processors as a single indivisible operation; every processor (including IO processors) sees storage as either before the MVC or after it; it's not interruptible. MVCL is interruptible, and partial results can be observed by other processors. MVCL requires 4 registers and their contents are updated after completion of the operation; MVC requires 1 for variable length moves, 0 for fixed, and its contents are preserved. MVCL has a high code setup cost; MVC has none.

Writing a macro to do multiple MVCs and mimic the behaviour of MVCL? Why not? It's possible, if a little tricky. And by all accounts, MVC in a loop is faster than MVCL too. IBM even provided a macro: $MVCL.
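The shape of such a macro, written out in C rather than BAL (purely illustrative; this is not IBM's $MVCL):

#include <string.h>
#include <stddef.h>

/* Cover an arbitrary length with fixed-size moves, the way a loop of
   MVCs would (MVC handles at most 256 bytes per execution).  No padding
   or overlap handling, so it only mimics the simple MVCL case. */
void move_long(char *dst, const char *src, size_t len)
{
    while (len >= 256) {
        memcpy(dst, src, 256);      /* one MVC's worth */
        dst += 256;
        src += 256;
        len -= 256;
    }
    if (len)
        memcpy(dst, src, len);      /* final short move */
}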

But then, when you look at MVCL usage closely, there are a few defining characteristics that are very useful. It can zero memory, and the millicode (IBM's word for microcode) recognizes 4K boundaries for 4K lengths and optimises it; it's faster than 16 MVCs.

There's even a MVPG instruction for moving 4K aligned pages! What are those crazy instruction set designers thinking?

The answer's a bit more than just code density; it never really was about that. In all the years I wrote IBM BAL, I never gave code density a serious thought -- with one exception. That was the 4K base address limit; a base register could only span 4K, so for code bigger than that you had to have either fancy register footwork or waste registers on multiple bases.

It was more about giving assembler programmers choice and variety to get the best out of the box before the advent of optimising compilers; a way, if you like, of exposing the potential of the micro/millicode through the instruction set. "Here I want you to zero memory" meant an MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A knowledgeable assembler programmer could out-perform a compiler. (Nowadays quality compilers do a much better job of instruction selection than humans, especially for pipelined processors that stall.)

Hence CISC instruction sets (at least, IMHO and for IBM). They were there for people and performance, not for code density.

Reply to
Alex McDonald

That's an interesting question. The short answer is yes, but it requires that I provide circuitry to do two things. One is to save both the Processor Status Word (PSW) and the return address to the stack in one cycle. The stack computer has two stacks and I can save these two items in one clock cycle. Currently my register machine uses a stack in memory pointed to by a register, so it would require *two* cycles to save two words. But the memory is dual ported and I can use a tiny bit of extra logic to save both words at once and bump the pointer by two.

The other task is to save registers. The stack design doesn't really need to do that, the stack is available for new work and the interrupt routine just needs to finish with the stack in the same state as when it started. I've been thinking about how to handle this in the register machine.

The registers are really two registers and one bank of registers. R6 and R7 are "special" in that they have a separate incrementer to support the addressing modes. They need a separate write port so they can be updated in parallel with the other registers. I have considered "saving" the registers by just bumping the start address of the registers in the RAM, but that only saves R0-R5. I could use LUT RAM for R6 and R7 as well. This would provide two sets of registers for R0-R5 and up to 16 sets for R6 and R7. The imbalance isn't very useful, but at least there would be a set for the main program and a set for interrupts, with the caveat that nothing can be retained between interrupts. This also means interrupts can't be interrupted other than at specific points where the registers are not used for storage.
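As a software model of the "bump the start address" idea (the numbers here are illustrative, not the actual FPGA design):

#include <stdint.h>

/* The register file lives in a small RAM and the architectural register
   number is just an offset from a per-context base.  "Saving" registers
   on interrupt entry is then a base change, not a copy. */
#define REGS_PER_BANK 6      /* R0-R5 in this example */
#define NUM_BANKS     2      /* main program plus interrupt */

static uint16_t reg_ram[REGS_PER_BANK * NUM_BANKS];
static unsigned bank_base;

uint16_t reg_read(unsigned r)              { return reg_ram[bank_base + r]; }
void     reg_write(unsigned r, uint16_t v) { reg_ram[bank_base + r] = v; }

void enter_interrupt(void) { bank_base = REGS_PER_BANK; }  /* switch banks */
void exit_interrupt(void)  { bank_base = 0; }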

I'm also going to look at using a block RAM for the registers. With only two read and write ports this makes the multiply step cycle longer though.

Once that issue is resolved the interrupt response then becomes the same as the stack machine - 1 clock cycle or 20 ns.

--

Rick
Reply to
rickman

As long as you discount IBM mainframes. They are big endian. Or Burroughs/Unisys; they were big endian too. Or the Motorola 68K; it was big endian. For little-endian CISC, only the VAX and x86 come to mind. Of those only the x86 survives.

Reply to
Alex McDonald

(snip, someone wrote)

MVC moves 1 to 256 bytes, conveniently. (Unless you want 0.)

I haven't looked recently, but I didn't think it locked out I/O. Seems that one of the favorite tricks for S/360 was modifying channel programs while they are running. (Not to mention self-modifying channel programs.) Seems that MVC would be convenient for that. It might be that MVC interlocks on CCW fetch such that only whole CCWs are fetched, though.

As far as I understand, millicode isn't exactly like microcode, but does allow for more complicated new instructions to be more easily implemented.

Compared to VAX, S/360 is somewhat RISCy. Note only three different instruction lengths and, for much of the instruction set, only two address modes. If processors fast-path the more popular instructions, like L and even MVC, it isn't so far from RISC.

Though stories are that even the old Fortran H could come close to good assembly programmers, and likely did better than the average assembly programmer.

For many processors, MVC was much faster on appropriately aligned data, such as the 8 bytes from A to B. Then again, some might use LD and STD.

I noticed some time ago that the hex opcodes for add instructions end in A, and for divide in D. (That leaves B for subtract and C for multiply, but not so hard to remember.)

If they really wanted to reduce code size, they should have added a load indirect register instruction. (RR format.) A good fraction of L (load) instructions have both base and offset zero (or, equivalently, index and offset).

-- glen

Reply to
glen herrmannsfeldt

Sure, none of this stuff was done without some purpose. My point is that there was no *common* theme to the various CISC instruction sets. Everybody was doing their own thing until RISC came along with a basic philosophy. Someone felt the need to give a name to the previous way of doing things and CISC seemed appropriate. No special meaning in the name actually, just a contrast to the "Reduced" in RISC.

I don't think this is a very interesting topic really. It started in response to a comment by Rod.

--

Rick
Reply to
rickman
