MISC - Stack Based vs. Register Based

I'm pretty sure that conclusion is not correct. If you have an instruction that does two or three memory accesses in one instruction and you replace it with three instructions that do one memory access each, you end up with two extra memory accesses. How is this faster?
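The arithmetic behind that objection can be sketched with a toy count (a hypothetical instruction mix, purely for illustration):

```python
# Hypothetical memory-traffic count: one complex instruction doing three
# data accesses vs. three simple instructions doing one data access each.
# Each instruction must also be fetched from memory, which is where the
# extra traffic comes from.

def total_accesses(instruction_fetches, data_accesses):
    # Total memory traffic = instruction fetches + data accesses.
    return instruction_fetches + data_accesses

cisc = total_accesses(instruction_fetches=1, data_accesses=3)  # one big op
risc = total_accesses(instruction_fetches=3, data_accesses=3)  # three small ops
extra = risc - cisc  # the "two extra memory accesses"
```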

That is one of the reasons why I want to increase code density, in my machine it automatically improves execution time as well as reducing the amount of storage needed.

--

Rick
Reply to
rickman

My bad; the encoding is 0 to 255 but it's interpreted as +1.

Certainly on the S/360 whole CCWs, since it had a precise interrupt model and MVC wasn't (and still isn't) interruptible. The S/370 allowed interrupts on page faults for the target and source, but that is done before the instruction is executed.

IPL operates just like that; it issues a fixed CCW that reads in data that's a PSW and some CCWs, and away she goes...

A modern Z series has more instructions than the average Forth has words; it's in the high hundreds.

Fortran/H was a good compiler. The early PL/I was horrible, and there was a move to use it for systems programming work. I never did so due to its incredibly bad performance.

OK, 9 bytes. :-)

Agreed. And had they wanted to, they could have provided a single opcode for the standard prolog & epilog; for example:

STM   14,12,12(13)
LR    12,15
LA    15,SAVE
ST    15,8(13)
ST    13,4(15)
LR    13,15

could have been the single op

ENTRY SAVE

It was macros every time, which is in direct opposition to Albert's assertion.

Reply to
Alex McDonald

Not in a Wikipedia sense, where you're not allowed to mention original research and are only quoting what is in the books. It is more experience than school knowledge.

If you want to see what can be done with a good macro processor like m4, study the single source of the 16/32/64 bit ciforth x86 for Linux/Windows/Apple. See my site below.

The existence of an XLAT instruction (to name an example), OTOH, does virtually nothing to make the life of an assembly programmer better.

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS 
Economic growth -- being exponential -- ultimately falters. 
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

To me, the only one where little endian seems reasonable is the 6502. They did amazingly well with a small number of gates. Note, for one, that on subroutine call the 6502 doesn't push the address of the next instruction on the stack. That would have taken too much logic. It pushes the address minus one, as, it seems, that is what is in the register at the time. RTS adds one after the pop.
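That behaviour can be modelled in a few lines of Python (a toy sketch, not a cycle-accurate 6502; the 3-byte JSR length is the real instruction size):

```python
# Toy model of the 6502 call/return quirk: JSR pushes the return
# address minus one, and RTS adds one back after popping.

stack = []

def jsr(pc_of_jsr):
    # JSR is a 3-byte instruction, so the next instruction is at pc+3.
    return_addr = pc_of_jsr + 3
    stack.append(return_addr - 1)  # 6502 pushes address minus one

def rts():
    return stack.pop() + 1         # add one after the pop

jsr(0x8000)
resumed_at = rts()
```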

Two byte addition is slightly easier in little endian order, but only slightly. It doesn't help at all for multiply and divide.
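The little-endian advantage for a byte-serial add is just that the carry travels in the same direction the bytes are stored. A small sketch:

```python
# Byte-serial addition over little-endian byte arrays: index 0 is the
# least significant byte, so the carry propagates as we walk forward.

def add_le(a_bytes, b_bytes):
    result, carry = [], 0
    for a, b in zip(a_bytes, b_bytes):
        s = a + b + carry
        result.append(s & 0xFF)
        carry = s >> 8
    return result, carry

# 0x01FF + 0x0001 = 0x0200, stored low byte first
total, carry = add_le([0xFF, 0x01], [0x01, 0x00])
```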

VAX was little endian because the PDP-11 was, though I am not sure that there was a good reason for that.

-- glen

Reply to
glen herrmannsfeldt

On Apr 4, 9:10 pm, glen herrmannsfeldt wrote:

John Savard's take on it;

formatting link

Reply to
Alex McDonald


Exactly. The CISC moniker was simply applied to an entire *generation* of processors by the RISC guys. The term RISC was coined by David Patterson at the University of California between 1980 and 1984.

formatting link

*Until* that time, there was no need for the term "CISC", because there was no RISC concept that required a differentiation!

I always thought of the Z80 as a CISC 8-bitter; it has some useful memory move instructions (LDIR etc.).

Reply to
Mark Wills

I think you're on the right track. With FPGAs it's really quite simple to execute all instructions in a single cycle - with MPY and DIV being the exceptions. It's no big deal at all. In the 'Forth CPU world' even literals can be loaded in a single cycle. It then comes down to careful selection of your instruction set. With a small enough instruction set one can pack more than one instruction in a word - and there's your code density. If you can pack more than one instruction in a word, you can execute them in a single clock cycle. With added complexity, you may even be able to execute them in parallel rather than in sequence.

Reply to
Mark Wills

Multiple instructions per word sounds like a bad idea. It requires instructions that are so small that they can't do very much, so you need more of them. And if you need 2 or more small instructions to do whatever 1 big instruction does, it's better to use 1 big instruction since it makes instruction decoding more efficient and simpler.

Reply to
Arlet Ottens


If you're referring to general purpose CPUs I'm inclined to agree. When commenting earlier, I had in mind a Forth CPU, which executes Forth words as native CPU instructions. That is, the instruction set is Forth.

Since there is no need for things like addressing modes in (shall we say) classical Forth processors, you don't actually need things like bit-fields in which to encode registers and/or addressing modes*. All you're left with is the instructions themselves, and you don't need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit instruction slots per 16-bit word, and a 4-bit 'special' field for other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless Forth CPU, bit-fields can be useful. We're using a couple of bits to tell the ALU if a word pushes results or pops arguments, for example.
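A possible encoding of the word layout described above (the field widths come from the post; the bit positions are my assumption):

```python
# Two 6-bit instruction slots plus a 4-bit 'special' field in a
# 16-bit word. Putting slot 0 in the low bits is an assumed convention.

def pack(slot0, slot1, special):
    assert slot0 < 64 and slot1 < 64 and special < 16
    return (special << 12) | (slot1 << 6) | slot0

def unpack(word):
    return word & 0x3F, (word >> 6) & 0x3F, (word >> 12) & 0xF

word = pack(slot0=0b101010, slot1=0b000111, special=0b1001)
s0, s1, sp = unpack(word)
```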
Reply to
Mark Wills

Well, I was thinking about general purpose applications. A Forth CPU may map well to a Forth program, but you also have to take into account how well the problem you want to solve maps to a Forth program.

A minimal stack based CPU can be efficient if the values you need can be kept on top of the stack. But if you need 5 or 6 intermediate values, you'll need to store them in memory, resulting in expensive access to them. Even getting the address of a memory location where a value is stored can be expensive.

Compare that to a register based machine with 8 registers. You need 3 more opcode bits, but you get immediate access to a pool of 8 intermediate values. And, with some clever encoding (like rickman suggested), some operations can be restricted to a subset of the registers, reducing the number of encoding bits required.

It would be interesting to see a comparison using non-trivial applications, and see how much code is required for one of those minimal stack CPUs compared to a simple register based CPU.

Reply to
Arlet Ottens

(snip on MISC, RISC, and CISC)

Well, it depends on your definition of instruction.

The CDC 60 bit computers have multiple instructions per 60 bit word, but then the word is big enough.

Now, consider that on many processors an instruction can push or pop only one register to/from the stack.

The 6809 push and pop instructions have a bit mask that allows up to eight registers to be pushed or popped in one instruction. (It takes longer for more, but not as long as separate instructions would.)
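A toy model of that bit-mask scheme (register widths ignored; the bit-to-register assignment follows the 6809 postbyte, pushing from the high bit down):

```python
# 6809-style PSHS: each bit of the postbyte selects one register,
# so a single instruction can push up to eight of them.

REG_ORDER = ["CC", "A", "B", "DP", "X", "Y", "U", "PC"]  # bit 0 .. bit 7

def pshs(mask, regs, stack):
    for bit in reversed(range(8)):   # PC (bit 7) is pushed first
        if mask & (1 << bit):
            stack.append(regs[REG_ORDER[bit]])

regs = {"CC": 1, "A": 2, "B": 3, "DP": 4, "X": 5, "Y": 6, "U": 7, "PC": 8}
stack = []
pshs(0b10000011, regs, stack)  # selects PC, A and CC
```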

For x87, there are instructions that do some math function and then do, or don't, remove values from the stack. Some might count those as two or three instructions.

-- glen

Reply to
glen herrmannsfeldt

Ok, I'm a bit busy with a number of things to go into this area now, but I appreciate the info.

I have used macro assemblers in the past and got quite good at them. I even developed a microprogrammed board which used a macro assembler, and added new opcodes for my design (it was a company boilerplate design adapted to a new host) to facilitate some self-test functions. I sort of got yelled at because this meant you needed my assembler adaptations to assemble the code for the board. I wasn't ordered to take it out, so I didn't, and it remained. I left within a year and the company eventually folded. You can make a connection if you wish... lol

--

Rick
Reply to
rickman

I have looked at the multiple instruction in parallel thing and have not made any conclusions yet. To do that you need a bigger instruction word and smaller instruction opcodes. The opcodes essentially have to become specific for the execution units. My design has three, the data stack, the return stack and the instruction fetch. It is a lot of work to consider this because there are so many tradeoffs to analyze.

One issue that has always bugged me is that allocating some four or five bits for the instruction fetch instruction seems very wasteful when some 70-90% of the time the instruction is IP <- IP+1. Trying to Huffman encode this is a bit tricky: what do you do with the unused bits??? I gave up looking at this until after I master the Rubik's Cube. lol
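One way to see the Huffman-style trade-off is the expected-bits arithmetic. The field widths below are hypothetical; only the 70-90% figure comes from the post:

```python
# Spend 1 bit on the common "IP <- IP+1" case, and a longer prefixed
# field on everything else; compare against a flat 4-bit field.

P_SEQUENTIAL = 0.8        # within the 70-90% range mentioned above
SHORT_BITS = 1            # 1-bit code for the sequential-fetch case
LONG_BITS = 1 + 4         # prefix bit plus a 4-bit fetch opcode

expected_bits = P_SEQUENTIAL * SHORT_BITS + (1 - P_SEQUENTIAL) * LONG_BITS
flat_bits = 4
```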

It does clearly have potential, it's just a bear to unlock without adding a lot of data pathways to the design.

--

Rick
Reply to
rickman

Just to make a point, my original post wasn't really about Forth processors specifically. It was about MISC processors which may or may not be programmed in Forth.

There aren't many algorithms that can't be dealt with reasonably using the data and return stacks. The module I was coding that made me want to try a register approach has four input variables, two double precision. These are in memory and have to be read in because this is really an interrupt routine and there is no stack input. The main process would have access to these parameters to update them.

The stack routine uses up to 9 levels on the data stack currently. But that is because to optimize the execution time I was waiting until the end when I could save them off more efficiently all at once. But - in the process of analyzing the register processor I realized that I had an unused opcode that would allow me to save the parameters in the same way I am doing it in the register based code, reducing the stack usage to five items max.

I understand that many stack based machines only have an 8 level data stack, period. The GA144 is what, 10 words, 8 circular and two registers?

That's the whole enchilada with a register MISC, figuring out how to encode the registers in the opcodes. I think I've got a pretty good trade off currently, but I have not looked at running any Forth code on it yet. This may be much less efficient than a stack machine... duh!

I have been invited to give a presentation to the SVFIG on my design. I need to work up some details and will let you know when that will be. They are talking about doing a Google+ hangout, which I suppose is a video, since it would be replayed for the meeting. I'm not sure I can be ready in time for the next meeting, but I'll see. I don't think my design is all that novel in the grand scheme of MISC, so I will likely do this as a comparison between the two designs. I think I also need to bone up on some of the other designs out there, like the J1 and the B16.

--

Rick
Reply to
rickman

This morning I found one exception to the one cycle rule: block RAM. My original design was for the Altera ACEX part, which is quite old now but had async read on its block RAMs for memory and stacks. So the main memory and stacks could be accessed for reading and writing in the same clock cycle - read/modify/write. You can't do that with today's block RAMs; they are totally synchronous.

I was looking at putting the registers in a block RAM so I could get two read ports and two write ports. But this doesn't work the way an async read RAM will. So I may consider using a multiphase clock. The operations that really mess me up are the register indirect reads of memory, like stack accesses... or any memory accesses really, since a register is the only way to address memory. So the address has to be read from the register RAM, then the main memory RAM is read, then the result is written back to the register RAM. Wow! That is three clock edges I'll need.
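The three steps can be written out explicitly; this is only a behavioural sketch of why each one wants its own clock edge when both memories are synchronous:

```python
# Register-indirect load with registers and main memory both in
# (synchronous) block RAM: three dependent accesses, three edges.

regs = [0] * 16
mem = [0] * 256
regs[3] = 0x42          # register 3 holds a memory address
mem[0x42] = 1234

def reg_indirect_load(rd, rs):
    addr = regs[rs]     # edge 1: read the address from register RAM
    data = mem[addr]    # edge 2: read main memory at that address
    regs[rd] = data     # edge 3: write the result back to register RAM
    return data

value = reg_indirect_load(rd=5, rs=3)
```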

If I decide to go with using block RAM for registers it will give me N sets of regs so I have a big motivation. It also has potential for reducing the total amount of logic since a lot of the multiplexing ends up inside the RAM.

The multiphase clocking won't be as complex as using multiple machine cycles for more complex instructions. But it is more complex than the good old simple clock I have worked with. It will also require some tighter timing and more complex timing constraints which are always hard to get correct.

--

Rick
Reply to
rickman

I think multiple instructions per word is a good idea if you have a wide disparity in speed between instruction memory and the CPU clock rate. If not, why introduce the added complexity? Well, unless you plan to execute them simultaneously... I took a look at a 16 bit VLIW idea once and didn't care for the result. There are just too many control points, so a proper VLIW design would need more than just 16 bits, I think. At least in the stack design.

An 18 bit instruction word might be worth looking at in the register CPU. But getting more parallelism in the register design will require more datapaths and there goes the "minimal" part of the MISC.

So many options, so little time...

--

Rick
Reply to
rickman

That can be useful. The automatic pop of operands is sometimes expensive by requiring a DUP beforehand. In my design the fetch/store words have versions to drop the address from the return stack or increment and hold on to it. In looking at the register design I realized it would be useful to just make the plus and the drop both options so now there are three versions, fetch, fetch++ and fetchK (keep).
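A behavioural sketch of the three variants (the names and semantics are as described above; keeping the address on the return stack is the stated design):

```python
# Three fetch variants: fetch drops the address, fetch++ keeps it
# incremented, fetchK keeps it unchanged. The address lives on rstack.

mem = {100: 7, 101: 8}

def fetch(rstack, dstack):       # drop the address
    dstack.append(mem[rstack.pop()])

def fetch_inc(rstack, dstack):   # fetch++: keep the address, incremented
    dstack.append(mem[rstack[-1]])
    rstack[-1] += 1

def fetch_keep(rstack, dstack):  # fetchK: keep the address unchanged
    dstack.append(mem[rstack[-1]])

rstack, dstack = [100], []
fetch_inc(rstack, dstack)
fetch_keep(rstack, dstack)
fetch(rstack, dstack)
```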

--

Rick
Reply to
rickman

I had the same problem when I first moved my XC4000 based RISC over to the newer parts with registered Block RAM.

I ended up using opposite edge clocking, with a dual port BRAM, to get what appears to be single cycle access on the data and instruction ports.

As this approach uses the same clock, the constraints are painless; but you now have half a clock for the address -> BRAM setup, and half for the BRAM data -> core setup. The latter can cause some timing issues if the core is configured with a byte lane mux so as to support 8/16/32 bit {sign extending} loads.

-Brian

Reply to
Brian Davis

Yes, that was one way to solve the problem. The other one I considered was to separate the read and write onto the two ports. Then the read would be triggered from the address at the input to the address register... from the previous cycle. So the read would *always* be done and the data presented whether you used it or not. I'm not sure how much power this would waste, but the timing impact would be small.

I looked at making the register block RAM part of the main memory address space. This would require a minimum of three clock cycles in a machine cycle: read an address or data from a register, use the address to read or write data from/to memory, and then write data back to the register. If it helps timing, the memory write can be done at the same time as the register write. I'm not crazy about this approach, but I'm considering how useful it would be to have direct addressing capability of the multiple register banks.

Some of the comments about register vs. stacks and what I have seen of the J1 has made me think about a hybrid approach using stacks in memory, but with offset access, so items further down in the stack can be operands, not just TOS and NOS. This has potential for saving stack operations. The J1 has a two bit field controlling the stack pointer, I assume that is +1 to -2 or 1 push to 2 pop. The author claims this provides some ability to combine Forth functions into one instruction, but doesn't provide details. I guess the compiler code would have to be examined to find out what combinations would be useful.
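If the two-bit field really is a signed stack-pointer delta, decoding it as two's complement gives exactly the +1 to -2 range guessed above:

```python
# Decode a 2-bit field as a signed two's-complement stack delta.

def stack_delta(bits):
    return bits - 4 if bits & 0b10 else bits

deltas = [stack_delta(b) for b in range(4)]  # 0, +1, -2, -1
```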

The compiler end is not my strong suit, but I suppose I could figure out how to take advantage of features like this.

--

Rick
Reply to
rickman

This is the approach we took with the FIETS chip, around 1980, emulated on an Osborne CP/M computer, never built. The emulation could run a Forth, and it benefited from reaching 8 deep into both the return and the data stacks. It would still be interesting to build it using a modern FPGA.

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS 
Economic growth -- being exponential -- ultimately falters. 
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst
