Branch prediction using counters.

As an aside, the Regulus architecture, which was the follow-on to the Power 6 from Computer Consoles, had a cute feature. If the value of a < b could be determined in advance, a bit was set, and when execution reached that point only the appropriate branch made it to the pipeline. Multiflow tried executing both paths in parallel, selecting the right branch when the truth value was known, but I think it became a bit unwieldy at that time (ca. 1984).
Reply to
Tom Linden

US Patent 5,949,995.

You can use an instruction that references the branch or its prediction instead of a branch that references the prediction datum. The advantage of the former is that it doesn't require any additional information in branch instructions and the prediction instructions are a lot like stores.

-andy

Reply to
Andy Freeman

Yes, but speculative execution doesn't usually include following both paths. Going both ways after branches gets ugly on the second and subsequent branches. ;-). The PPC-970 can have ~200 instructions in flight (half in the prefetch/decode stages and half between issue and completion). To keep the pipe full, it will speculatively fetch/execute up to sixteen branches deep, but will speculatively execute only one path of the 65K possible. There are optimization rules for branches (branches backwards are usually taken, forwards usually not, etc.) and the programmer can drop hints to override these defaults.

--
  Keith
Reply to
Keith Williams

In article , Keith Williams writes:

|> Yes, but speculative execution doesn't usually include following both
|> paths. Going both ways after branches gets ugly on the second and
|> subsequent branches. ;-). The PPC-970 can have ~200 instructions in
|> flight (half in the prefetch/decode stages and half between issue and
|> completion). To keep the pipe full, it will speculatively fetch/execute
|> up to sixteen branches deep, but will speculatively execute only one
|> path of the 65K possible. There are optimization rules for branches
|> (branches backwards are usually taken, forwards usually not, etc.) and
|> the programmer can drop hints to override these defaults.

Which is fine for the simplest codes, but useless for ones which are hard to predict. An old rule is that 20% of instructions are branches, but that might drop to 10% in some ISAs. Anyway, unless you can predict close to 95% correctly, speculatively executing 16 branches ahead isn't going to help.
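The arithmetic behind that claim is easy to check: a 16-deep speculated chain only pays off if every branch in it is predicted correctly, and that probability falls off as the 16th power of the per-branch accuracy. A minimal sketch:

```python
# Back-of-envelope check: the chance that a chain of 16 speculated
# branches is *all* predicted correctly is p**16 for per-branch
# accuracy p, so deep speculation pays off only when p is very high.
def chain_survival(p, depth=16):
    """Probability that `depth` consecutive predictions are all correct."""
    return p ** depth

# At 95% per-branch accuracy, under half of 16-deep chains survive
# intact; at 80%, almost none do.
print(round(chain_survival(0.95), 3))  # 0.44
print(round(chain_survival(0.80), 3))  # 0.028
```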

The thing that CAN be done both ways is cache preloading, though it can increase the memory bandwidth unacceptably. Something that could be done is a priority based cache preloader, fed with data from the speculative execution. But even that doesn't help all that much for difficult codes.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , Skybuck Flying wrote: [...]

Take a look at what is suggested be done and you should be able to see that, when coded for speed, other than the loop jumps, the jumps in the first few passes through the code are worse than useless for predicting the jumps later.

For the loop instructions, the branch condition is known well in advance. In the fastest RISC machines, this allows the loop jumps to happen with no lost cycles.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

In article , snipped-for-privacy@green.rahul.net (Ken Smith) writes:
|> In article ,
|> Skybuck Flying wrote:
|> [...]
|> >For god's sakes what does booth's algorithm have to do with jump
|> >instructions.
|>
|> Take a look at what is suggested be done and you should be able to see
|> that, when coded for speed, other than the loop jumps, the jumps in the
|> first few passes through the code are worse than useless for predicting
|> the jumps later.

The same applies to much of the code in many graph theoretic algorithms (VERY common and often dominating the time in some fields) and the "object oriented" programming paradigm favoured by many C++ and GUI programmers. In the latter case, the correct key for prediction is the contents of a register and not what the branch did last time.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , Nick Maclaren wrote: [... me ...]

Yes, there are lots of other examples, but I think the Booth's case is the simplest real case to see the problem with.

BTW: I disagree with the suggestion that jumps in OO programs are by their nature harder to predict. Virtual method dispatch is a hard type of branch to predict, but in non-OO code the virtual method call would be replaced with a long switch statement. In good compilers, long switches are implemented with jump tables, and "goto table[i]" is very hard to predict.
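A sketch of why the jump-table compilation has the same problem: the "goto table[i]" becomes one indirect jump through a table of handlers, and its target is data-dependent. The handler names below are made up for illustration.

```python
# A long switch compiled to a jump table: an indirect call through a
# list of handlers, indexed by the switch value. Handler names are
# purely illustrative.
def handle_add(x):    return x + 1
def handle_double(x): return x * 2
def handle_negate(x): return -x

table = [handle_add, handle_double, handle_negate]

def dispatch(i, x):
    # One indirect jump whose target is data-dependent -- exactly the
    # pattern a last-direction predictor cannot learn when i varies.
    return table[i](x)

print(dispatch(1, 10))  # 20
```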

--
--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

In article , snipped-for-privacy@green.rahul.net (Ken Smith) writes:
|>
|> Yes, there are lots of other examples, but I think the Booth's case is
|> the simplest real case to see the problem with.

It does, however, give the impression that the problem applies only to very esoteric codes. It doesn't.

|> BTW: I disagree with the suggestion that jumps in OO programs are by
|> their nature harder to predict. Virtual method dispatching is a hard type
|> of branch to predict but in non-OO code, the virtual method call would be
|> replaced with a long switch statement. In good compilers, long switches
|> are implemented with jump tables. "goto table[i]" is very hard to
|> predict.

Neither is hard to predict, if you do it right. Few, if any, modern systems do.

What is needed is a compiler-generated instruction that sets up a datum to be used for branch prediction, which is then quoted in the branch, and used by the hardware. The instruction history alone is useless for such codes, but the object address is (obviously) excellent. This has the advantage that information from a previous branch in the same function could be used.

For example, consider the following instructions:

Set predictor. It takes up to 2 registers, and sets one of a few (4-16) predictors based on their contents (in an unspecified fashion).

Predict branch. It takes a predictor and also specifies a preferred direction, a resulting direction or none. It must immediately precede a branch.

This would allow a compiler to take the following:

IF a > b THEN ... FI;
. . .
IF b < a THEN ... FI;

and issue:

Set_predictor 1 a b
. . .
Predict_branch 1 +1   [ indicating a resulting direction ]
. . .
Predict_branch 1 -2   [ indicating a preferred direction ]
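A toy software model of that instruction pair may make the idea concrete. The post leaves the mapping from register contents to a prediction unspecified, so the choice below (record the sign of a - b) is my own guess, and the names are of course hypothetical.

```python
# Toy model of the proposed Set_predictor / Predict_branch pair: a small
# file of predictors is set from register contents, and later branches
# quote a predictor number instead of relying on branch history.
NUM_PREDICTORS = 8
predictors = [0] * NUM_PREDICTORS

def set_predictor(n, a, b=0):
    # "Sets one of a few predictors based on their contents" -- here we
    # simply record whether a > b, enough to resolve a > b / b < a.
    predictors[n] = 1 if a > b else 0

def predict_branch(n):
    # A "resulting direction" use reads the predictor directly; a
    # "preferred direction" use would merely bias a dynamic predictor.
    return bool(predictors[n])

a, b = 7, 3
set_predictor(1, a, b)
# Both later branches (a > b and b < a) resolve the same way, so one
# predictor datum serves both sites.
print(predict_branch(1))  # True
```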

I have little idea how effective this would be, but my gut feeling is that it would be better than current approaches for such codes. As far as I know, I have rendered this unpatentable by posting :-)

Regards, Nick Maclaren.

Reply to
Nick Maclaren

On Mon, 8 Aug 2005 10:32:18 +0200, "Skybuck Flying" wrote, in part:

Actually, it's standard practice now to use a counter. But the counter is inside the chip, not in RAM or in a higher-level language. It is usually a saturating two-bit-wide counter.
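The two-bit saturating counter is simple enough to sketch in a few lines: four states, predict taken in the upper half, and saturate at the ends so a single anomaly doesn't flip a strongly-biased branch.

```python
# The standard two-bit saturating counter: states 0-3, predict taken
# when the counter is in the upper half, saturate at both ends.
class TwoBitCounter:
    def __init__(self):
        self.state = 0  # 0/1 = (strongly/weakly) not-taken, 2/3 = taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
for t in [True, True, False, True]:   # a mostly-taken branch
    c.update(t)
print(c.predict())  # True: one not-taken among takens doesn't flip it
```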

And usually the counters are used in conjunction with a branch history table (not to be confused with butylated hydroxytoluene).
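A minimal branch history table in that spirit: an array of two-bit counters indexed by the low bits of the branch address, so each (hashed) branch site trains its own counter. The table size is illustrative only.

```python
# A minimal branch history table: two-bit counters indexed by the low
# bits of the branch address. Real BHTs vary in size and hashing; the
# 1024-entry direct-indexed table here is just for illustration.
TABLE_BITS = 10
bht = [0] * (1 << TABLE_BITS)   # each entry is a two-bit counter, 0..3

def bht_index(pc):
    return pc & ((1 << TABLE_BITS) - 1)

def bht_predict(pc):
    return bht[bht_index(pc)] >= 2

def bht_update(pc, taken):
    i = bht_index(pc)
    bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

# Two branches that index different entries are trained independently.
bht_update(0x4004, True)
bht_update(0x4004, True)
bht_update(0x4008, False)
print(bht_predict(0x4004), bht_predict(0x4008))  # True False
```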

John Savard

Reply to
John Savard

In article , "Peter \"Firefly\" Lund" writes:
|> On Tue, 9 Aug 2005, Ken Smith wrote:
|>
|> > their nature harder to predict. Virtual method dispatching is a hard type
|> > of branch to predict but in non-OO code, the virtual method call would be
|>
|> No, virtual methods are not that expensive (*). It turns out that each
|> call site usually has very few targets at a time, often only one.
|> You can use the call site address to predict which of a number of saved
|> target addresses (or even saved target instructions) to go to.

Oh, really? That doesn't apply to non-trivial GUI codes.

A very large number of component widgets are very complex, with a lot of object-dependent branching internally, and are used in very different ways from higher-level widgets. Now, this is NOT how to do it, but it is the programming paradigm that is most common in that area.

|> You can also use software methods to achieve the same thing: your VM (or
|> whatever) arranges a compare instruction that checks that the target is
|> what you think it is + a conditional jump + a call with a hardcoded target
|> address. If the target is different from what you thought, then the
|> conditional jump will jump to code that handles the new target. Usually,
|> the jump will be predicted to be a fall-through so this is a cheap way of
|> handling typical OO calls.

Regrettably not, in the case I mention above. Each component widget may have a dozen major attributes, which interact in truly horrible ways. That would increase the code size by a large factor.

I agree that nobody in his right mind would ever design code like that, but that isn't the point. That is how it has been designed.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

In article , "Andy Freeman" writes:
|> Nick Maclaren wrote:
|> > What is needed is a compiler-generated instruction that sets up a
|> > datum to be used for branch prediction, which is then quoted in
|> > the branch, and used by the hardware. The instruction history
|> > alone is useless for such codes, but the object address is
|> > (obviously) excellent. This has the advantage that information
|> > from a previous branch in the same function could be used.
|> ...
|> > As far as I know, I have rendered this unpatentable by posting :-)
|>
|> US Patent 5,949,995.

Thanks. I think that my posting includes an aspect that isn't in that, but I can't be bothered to check. It certainly covers the basic principle.

I did a search on Google groups to see if I had posted the method before the patent, but it seems not. It is pretty obvious, in my view.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

No, virtual methods are not that expensive (*). It turns out that each call site usually has very few targets at a time, often only one.

You can use the call site address to predict which of a number of saved target addresses (or even saved target instructions) to go to.

You can also use software methods to achieve the same thing: your VM (or whatever) arranges a compare instruction that checks that the target is what you think it is + a conditional jump + a call with a hardcoded target address. If the target is different from what you thought, then the conditional jump will jump to code that handles the new target. Usually, the jump will be predicted to be a fall-through so this is a cheap way of handling typical OO calls.

*) It depends on what resources the CPU designers had available and what kinds of code it was deemed necessary to make fast. The software method is nice in that it only requires simple branch prediction (which is relatively cheap) but it can be hard to make your compiler/linker/loader/VM work that way. The hardware method requires more resources for caching addresses/instructions but it will work with more naive toolchains/languages.

-Peter

Reply to
Peter "Firefly" Lund

This sort of thing is done at higher levels in various search algorithms. Far more advanced techniques than just counters are used.

Building this sort of thing into processor firmware or general-purpose library routines might impose an inordinate burden on execution time if it were employed for every branch. For cases where the branch execution probabilities don't change much, profiling the code and adjusting the design will get you just as much gain. In the few cases where a search can be optimized, the designer should be able to apply the correct heuristics based on knowledge of the problem domain.

--
Paul Hovnanian     mailto:Paul@Hovnanian.com
------------------------------------------------------------------
If you can't beat them, arrange to have them beaten.
                                -- George Carlin
Reply to
Paul Hovnanian P.E.

Are you suggesting that Booth's divide is "esoteric"? It seems to be a very common bit of code.

[...]

This could sort of work, but it adds an instruction to the processing of every branch. The added instruction time will eat away at any advantage gained.

As described, it also would only work if the same routine is called for the same object. You would be better off feeding the address of the Virtual Method Table into this added instruction. In most C++ programs, all objects of a given type share the same VMT, and hence any use of an object of a given type would give you the information needed to predict jumps.
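The point about shared VMTs can be illustrated with Python's type objects standing in for vtables: every instance of a class shares one, so keying a predictor on it gives one training entry per class instead of one per object.

```python
# Illustration of VMT sharing, with the Python type object standing in
# for the C++ vtable. Class and method names are illustrative.
class Button:
    def edit(self): return "button"

class Knob:
    def edit(self): return "knob"

a, b = Button(), Button()
# Distinct objects, one shared "VMT": the dispatch target is the same
# for every instance of the class.
print(type(a) is type(b))            # True
print(type(a).edit is type(b).edit)  # True: same dispatch target
print(type(a) is type(Knob()))       # False: a different class, a
                                     # different "VMT", a different target
```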

Remember that some of the RISC and DSP processors duck the problem by letting some small number of instructions after the branch happen before the branch takes effect. This only adds instructions to the code in cases were nothing can be done before the branch happens. I think this is a much better way to go if you want a lot of speed per transistor.

No, the object's address is not obviously "excellent", at least to me. Please explain what you mean.

[..long description removed..]

Maybe they'll name it after you. You'll be famous!

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

Thinking of some OO code I've done:

The typical case is that a given call could dispatch the, let's say, "edit" method for some 25 different types. Do you consider 25 a "small" number?

BTW: There are quite a few cases over 25 in the code I was thinking of.

Are you intending to save this information on the CPU chip? If so, couldn't the same number of transistors be put to much better use?

No, this is not a cheap way of doing VMTs. The Virtual Method Table is the fast way of doing it. There is no point in optimizing a slower method if you want speed. The only case where your method would be faster is if, the majority of times you come to the call, the same object type is in use. I can't think of a program I've written where that was true.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

You don't have to do it for all branches, just the ones where doing so is a good idea. (That's one argument against "quoting" an explicit prediction datum in the branch instruction.)

The cost of a branch mispredict varies with implementation, but 10 cycles is not unknown and more than 5 cycles is reasonably common in the high-performance world. Since each cycle can be 3-5 instructions, using a couple of instructions to avoid a mispredict is usually a good trade (as long as they don't stall).
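The trade-off is just arithmetic: compare the expected cycles saved against the cost of a couple of extra non-stalling setup instructions. The figures below are the ones quoted above.

```python
# Expected benefit of spending extra instructions to avoid mispredicts:
# cycles saved (avoided mispredicts * penalty) minus cycles spent
# issuing the setup instructions on a multi-issue machine.
def expected_saving(mispredict_cycles, instrs_per_cycle,
                    extra_instrs, mispredict_rate_avoided):
    setup_cost = extra_instrs / instrs_per_cycle           # cycles spent
    saved = mispredict_rate_avoided * mispredict_cycles    # cycles saved
    return saved - setup_cost

# 10-cycle penalty, 4-wide issue, 2 extra instructions: worthwhile even
# if the setup only converts one mispredict in five into a correct one.
print(expected_saving(10, 4, 2, 0.2) > 0)  # True
```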

A small number of instructions isn't enough and it's usually very hard to find more than 1-2 (on average). You need to find 20-50 if you're actually trying to go fast. (Yes, you can pick an implementation where 1-2 is enough, but those implementations are slower in general.)

The object at a given address tends to have the same type for "a while".

-andy

Reply to
Andy Freeman

The mind boggles! I have never seen it used, but I can believe that it is. The most usual software divide is Newton-Raphson. Whatever the case, only a few systems lack hardware division for the standard types, very few applications use it for non-standard types, and the proportion of time spent in such code is minuscule!

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Up to 40 instructions, in the case of the TI C6000 series: five instruction issue slots in the branch shadow, up to eight instructions per slot. Makes loop pre-amble and termination of otherwise single-cycle loops amusing, but do-able. There are various multi-cycle NOP instructions to fiddle the other cases without taking up too much code space.

As well as good speed/transistor, this does help with predictability too.

Cheers,

--
Andrew
Reply to
Andrew Reilly

In article , "Skybuck Flying" writes:
|>
|> Dude if this stuff is encoded into the jump instruction then it falls under
|> my description of inserting counters into jump instructions and therefore I
|> should still be able to patent it and not you :P

Look, Bubba, no counters! Try rereading what I posted.

Regards, Nick Maclaren.

Reply to
Nick Maclaren

Dude if this stuff is encoded into the jump instruction then it falls under my description of inserting counters into jump instructions and therefore I should still be able to patent it and not you :P

Bye, Skybuck :)

Reply to
Skybuck Flying
