Branch delay slot on MIPS32 processors

- Julia Groszark

Sat, Jul 5, 2003 9:27 PM

MIPS32 processors have "delayed" loads and branches. The MIPS32 manual says that the instruction immediately following a branch is always executed, regardless of whether the branch is taken or not. Optimizing compilers try to fill a branch delay slot with an appropriate instruction.

Are there any restrictions on the kind of instruction that can be placed in the branch delay slot? Is it possible to place a STW (store word) in the delay slot? Is it possible to fill the delay slot with a multiply-and-add (MADD) instruction, even if MADD needs more than one cycle to complete?

Thanks for your answers.

Julia.

- Girish

Sun, Jul 6, 2003 8:56 AM

I am not a compiler expert, just a user. I think putting SW in the delay slot won't be a problem. About MADD I am not sure. Also, if the pipeline (> R4K) has hazard detection, it will stall if it detects a hazard.


- jetmarc

Mon, Jul 7, 2003 7:27 PM

I'm sorry, I've never worked with MIPS32, so I can't answer your question.

But the concept of delayed branches sounds quite interesting to me. It not only solves the problem of prefetch stall, but also gives quite a number of exciting new possibilities. If you, for example, were to place another branch in the slot, you control the execution flow at the branch source, instead of the destination. That's not possible on traditional designs.

I think this feature could be quite useful on tiny (Harvard) 8-bit micros, where you squeeze the best out of a very limited code space.

Marc

- Terje Mathisen

Tue, Jul 8, 2003 6:00 AM

OUCH!

Branch delay slots have problems:

a) They expose, and thereby tie you to, a particular microarchitectural implementation.

b) How long should they be? On a 4-issue superscalar version of the architecture, you'd want 4, 8 or even 12 instructions if you want to avoid pipeline bubbles.

Terje

--
"almost all programming can be viewed as an exercise in caching"

- Michael Carstens-Behrens

Tue, Jul 8, 2003 2:03 PM

The delay slot instruction must be harmless to both paths. An STW instruction often takes several clock cycles. It is not harmless if the memory content is re-used soon, and it's not clear how the pipeline behaves in that case, e.g. whether or not you have a non-blocking load/store implementation. So don't use memory accesses, coprocessor instructions, ..., and take care with the registers you modify in the delay slot.

Typically the delay slot is used to update the stack pointer. Or you can use it to optimize algorithms in assembler.

Best regards, Mike...

- Kai Harrekilde-Petersen

Tue, Jul 8, 2003 10:10 PM

I believe that Dominic Sweetman's "See MIPS Run" and/or Gerry Kane's "MIPS RISC Architecture" should answer your questions.

Putting the addiu in the delay slot should work perfectly. I cannot say for sure; it has been too many years (7 and counting) since I last worked with a MIPS processor.

--Kai

- Jonathan Larmour

Wed, Jul 9, 2003 2:59 AM


There are some volumes of the MIPS32 spec at

formatting link

but the spec itself is quite vague. If you want chapter and verse guarantees, I'm not sure you'll get them.

You can get specs for specific implementations like 4Kc from of course.

But from my understanding of the MIPS pipeline, it makes sense that a SW in a branch delay slot should be fine. Read the docs above about the pipeline, and work out for yourself what happens at what stage if you like.

FWIW, playing with some code in GCC compiled with "mipsisa32-elf-gcc -mips32", it generated stuff with the STW in the branch delay slot.

Jifl

--
--[ "You can complain because roses have thorns, or you ]--
--[  can rejoice because thorns have roses." -Lincoln   ]-- Opinions==mine

- Michael Meissner

Wed, Jul 9, 2003 3:56 AM

You can certainly do the ADDIU in the delay slot. What you need to do is bias the end condition, since you are doing the subtract after the test.
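The biasing can be illustrated with a small Python model. This is only a sketch: the loop under discussion is not shown in the thread, so the countdown loop and the function names below are assumptions. Because the branch reads the counter before the delay-slot ADDIU decrements it, the comparison value moves from 0 to 1.

```python
def loop_decrement_then_test(n):
    """Reference shape: addiu n,n,-1 ; bne n,$zero,loop ; nop."""
    iterations = 0
    while True:
        iterations += 1       # loop body
        n -= 1                # ADDIU placed before the branch
        taken = n != 0        # branch sees the decremented counter
        if not taken:
            break
    return iterations, n


def loop_decrement_in_delay_slot(n):
    """Delay-slot shape: bne n,one,loop ; addiu n,n,-1 (in the slot)."""
    iterations = 0
    while True:
        iterations += 1       # loop body
        taken = n != 1        # branch sees the counter BEFORE the decrement,
                              # so the end condition is biased from 0 to 1
        n -= 1                # delay slot: runs whether or not the branch is taken
        if not taken:
            break
    return iterations, n


print(loop_decrement_then_test(5))
print(loop_decrement_in_delay_slot(5))
```

For any starting count of 1 or more, both versions run the body the same number of times and leave the counter at 0.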

Looking at the GCC sources, it allows stores in the delay slot, and I know I've done it with earlier generations of MIPS processors. On older MIPS (i.e., MIPS1 and MIPS2), loads, transfers, and move hi/lo weren't allowed since they had user-visible delays before their result could be used.

If you can, I would suggest unrolling the loop at least 1 time, so that you aren't as dependent on things being in the level 1 cache.

--
Michael Meissner
email: mrmnews@the-meissners.org
http://www.the-meissners.org

- Glen Herrmannsfeldt

Wed, Jul 9, 2003 6:39 PM

(snip)

I was reading a thread, at least farther down on my news reader, called "Simplified forwarding".

I was then remembering a discussion some years ago on branch delay slots, and the problem of how many cycles (or instructions) are needed. A thought I came up with was to add a field (yes, there are always not enough bits) to the branch instructions for how many delay slot instructions there should be. Maybe two bits for 0, 1, 2, 4, for example.

The compiler could determine how many instructions it could possibly execute in the delay slot and move them there. If the hardware only needed one, nothing would be lost.
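A rough Python sketch of how such a field might be used follows. Everything here is hypothetical: the 0/1/2/4 table comes from the example above, while the function names and the simple stall model (the hardware inserts bubbles only for slots the compiler could not fill) are assumptions of the sketch, not part of any real ISA.

```python
# Hypothetical 2-bit field: index is the encoded value, entry is the slot count.
DELAY_SLOT_COUNTS = (0, 1, 2, 4)


def encode_delay_field(movable):
    """Pick the largest encodable slot count not exceeding the number of
    instructions the compiler found it could move after the branch."""
    best = 0
    for field, count in enumerate(DELAY_SLOT_COUNTS):
        if count <= movable:
            best = field
    return best


def decode_delay_field(field):
    """Return how many delay-slot instructions follow the branch."""
    return DELAY_SLOT_COUNTS[field & 0b11]


def stall_cycles(field, hardware_needs):
    """Assumed model: bubbles are needed only for slots left unfilled."""
    return max(0, hardware_needs - decode_delay_field(field))


field = encode_delay_field(3)      # compiler could move 3 instructions
print(field, decode_delay_field(field), stall_cycles(field, 1), stall_cycles(field, 4))
```

With three movable instructions the compiler can only encode two slots, so a pipeline that needs one slot loses nothing, while one that needs four would still have two bubbles under this model.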

It would be interesting to know, for compiler generated code, how many instructions could possibly be moved into delay slots. Code containing small loops would not have so many instructions available to move.

-- glen

- Peter "Firefly" Lund

Wed, Jul 9, 2003 8:50 PM

Another idea: always make it eight instructions but include a version of NOP with an immediate field that says how many NOPs it really stands for.

(apologies to TMS320C6000)

-Peter

- Robin KAY

Wed, Jul 9, 2003 9:41 PM

Yet another idea: add a bit in the instruction encoding to signify the end of the delay slot. Compilers would generate delay slots as long as is possible and the processor can take advantage of them as needed.

--
Wishing you good fortune,
--Robin Kay-- (komadori)

- Stephen Sprunk

Wed, Jul 9, 2003 11:25 PM

At that point, you might consider that bit as explicit termination of a group of instructions that can be executed in parallel, so the processor doesn't have to check dependencies...

Even better, we could separate the branch decision from the jump point entirely, removing the need for speculative execution or delay slots...

Oh wait, that's already been done :)

S

--
Stephen Sprunk         "God does not play dice."  --Albert Einstein
CCIE #3723         "God is an inveterate gambler, and He throws the
K5SSS        dice at every possible opportunity." --Stephen Hawking

- Robin KAY

Wed, Jul 9, 2003 11:45 PM

Point taken ^_^', but perhaps arbitrarily long delay slots could be beneficial on their own and without some of the other 'features' you mention. Am I mistaken? I would not have thought scheduling my long delay slots significantly more complex than slots only a single instruction long.

--
Wishing you good fortune,
--Robin Kay-- (komadori)

- Allan Sandfeld Jensen

Thu, Jul 10, 2003 2:35 PM

I've always wanted to do a "Prepare branch/jump" and "Do branch" type of system. It feels like a logical extension to modern prefetch instructions.

`Allan

- Kevin D. Kissell

Fri, Jul 18, 2003 3:08 PM

The only instructions that absolutely must not be put into a MIPS32 branch delay slot are those which themselves have a delay slot, i.e. you cannot have a branch followed directly by a branch. Stores, adds, whatever, are fine. Note that the branch *likely* variants cause the delay slot instruction to be "squashed" (not executed) if the branch isn't taken. Otherwise the instruction in a branch delay slot can be thought of as being logically sequenced prior to the branch instruction. Indeed, MIPS assemblers typically have a default mode where one writes as if there were no delay slots, and the assembler re-orders things auto-magically and moves a branch-invariant instruction (on which the branch does not depend) down into the delay slot.
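
The rules above can be modeled with a toy interpreter in Python. This is a minimal sketch, not a MIPS emulator: the tuple instruction format, the `run` and `countdown` names, and the choice to raise an error for a branch in a delay slot are all assumptions made for illustration. Only `addiu`, `bne`, and a branch-likely variant `bnel` are modeled.

```python
BRANCHES = {"bne", "bnel"}  # bnel models a branch-likely variant


def exec_addiu(ins, regs):
    """addiu rd, rs, imm: rd = rs + imm (writes to "zero" are ignored)."""
    _, rd, rs, imm = ins
    if rd != "zero":
        regs[rd] = regs.get(rs, 0) + imm


def run(program, max_steps=1000):
    """Run a list of instruction tuples and return the final registers."""
    regs = {"zero": 0}
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return regs
        ins = program[pc]
        if ins[0] not in BRANCHES:
            exec_addiu(ins, regs)
            pc += 1
            continue
        op, rs, rt, target = ins
        # The branch reads its operands before the delay slot runs.
        taken = regs.get(rs, 0) != regs.get(rt, 0)
        slot = program[pc + 1] if pc + 1 < len(program) else None
        if slot is not None:
            if slot[0] in BRANCHES:
                raise ValueError("branch in a branch delay slot")
            # Plain branch: the slot always executes.
            # Branch likely: the slot is squashed when the branch is not taken.
            if taken or op == "bne":
                exec_addiu(slot, regs)
        pc = target if taken else pc + 2
    raise RuntimeError("step limit exceeded")


def countdown(branch_op):
    return [
        ("addiu", "t0", "zero", 3),    # 0: counter = 3
        ("addiu", "t0", "t0", -1),     # 1: loop: counter -= 1
        (branch_op, "t0", "zero", 1),  # 2: branch back while counter != 0
        ("addiu", "t1", "t1", 10),     # 3: delay slot
    ]


print(run(countdown("bne"))["t1"])   # slot runs on all three passes: 30
print(run(countdown("bnel"))["t1"])  # slot squashed on the final pass: 20
```

The only difference between the two runs is the final, not-taken pass through the branch: the plain branch still executes the slot, while the branch-likely variant squashes it.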