Branch prediction using counters.

Hi,

Some time ago I had the idea of inserting counters into if statements to predict which branch is most likely to be executed. At the lower levels this would mean inserting counters into jump instructions and recording which branch is executed the most, by incrementing (or maybe even decrementing) a counter every time the branch gets executed.

Then the execute-ahead logic would execute the branches with the highest counters. Also, an execute-ahead flag could be inserted into the jump instruction to indicate whether execute-ahead is allowed. For example, a compiler could use this flag to indicate that execute-ahead is not possible, etc.

Of course branch prediction is nothing new, and processors nowadays have it. So I was browsing through some patents. (Not all, because that would take way too much time... it would take a day, maybe even a few days, to browse through them all ;))

So far I have seen one patent which mentions using a "taken/not taken" bit which indicates if a branch was taken or not.

So I have a few questions about branch prediction.

  1. Is it common to use counters as described above to do branch prediction, or is this something novel? ;)

  2. Suppose it's not novel... then why use only 1 bit to do branch prediction?

Some reasons which I can think of:

  1. It requires less memory than bigger counters.
  2. Counters might take too long to change. For example, Branch A might be executed many times, then suddenly Branch B has to be executed many times. With large counters it might take too long for the prediction to catch up ;)
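The trade-off in reason 2 is worth seeing in action. Below is a toy Python simulation (my own sketch, not any particular CPU's scheme) comparing a 1-bit "taken/not taken" predictor with a 2-bit saturating counter, on a loop branch that is taken 9 times and then falls through once, repeated 100 times:

```python
# Toy comparison: 1-bit predictor vs 2-bit saturating counter on a
# loop branch. Illustrative sketch only, not a real CPU's predictor.

def simulate_1bit(outcomes):
    state = True  # remember only the last outcome; predict it repeats
    mispredicts = 0
    for taken in outcomes:
        if state != taken:
            mispredicts += 1
        state = taken  # 1 bit: always flips to the last outcome
    return mispredicts

def simulate_2bit(outcomes):
    counter = 3  # 0..1 predict not-taken, 2..3 predict taken
    mispredicts = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            mispredicts += 1
        # saturating increment/decrement
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return mispredicts

# 100 runs of a loop: branch taken 9 times, then not taken on exit
outcomes = ([True] * 9 + [False]) * 100

print(simulate_1bit(outcomes))  # 199: two mispredicts per loop exit
print(simulate_2bit(outcomes))  # 100: one mispredict per loop exit
```

The 1-bit scheme mispredicts twice per loop exit (once on the exit, once on re-entry), while the 2-bit counter only drops to "weakly taken" and mispredicts once; yet because it saturates at 3, it still adapts within two outcomes if the branch changes behaviour for good.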

What are your thoughts on this? ;)

Any references to patents or other information about branch prediction ?

Bye, Skybuck.

Reply to
Skybuck Flying

GCC has a profiling mode that will let you capture this information, for optimizing MIPS output (where it is quite useful, since MIPS has alternate opcodes depending on which branch is more likely). As I recall (I've never used it), you run your code for a while, then get a dump of the counters, which you feed back into GCC.

--
Regards,
  Bob Monsen

If a little knowledge is dangerous, where is the man who has
so much as to be out of danger?
                                  Thomas Henry Huxley, 1877
Reply to
Bob Monsen

[...]

No, it is not common; for that matter, I've never heard of the idea being used. There are problems with the counter: it should not be part of the instruction, because that would mean code space has to be written to as the program runs, which would slow things down.

Many processors have caches on them. The cache will contain the code that was executed on the last pass around the loop, and as a result, repeating a loop the same way is faster than taking some new path.

Many pipelined processors execute an instruction or two after the jump instruction even when the jump is taken. A good compiler will take advantage of this if it can.

Some processors have a conditional "skip the next instruction", and some have conditional instructions. These allow logic to be done without taking a branch. This increases the speed of things like "if X

Reply to
Ken Smith

You are missing an important point: there typically is no need to write to the code space at all. Instruction caches have no write-to-memory circuitry. To write the counters to memory you would have to add an additional path, and if you were going to write the updated value on every branch, you would add a huge amount of memory traffic (studies show that typically a branch occurs once every 6-10 instructions on average). If you are not going to write the counter back every time, when are you? And how are you going to associate the multiple counters you must then keep with their branches? There are a lot of problems you would have to solve for what is probably a small gain (hint: current branch predictors do pretty well already).
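To put rough numbers on that traffic argument, here is a back-of-envelope estimate; the instruction rate and counter size are illustrative assumptions of mine, not measured figures:

```python
# Rough estimate of the extra memory write traffic from writing an
# updated counter back on every branch. All numbers are illustrative
# assumptions, not measurements.
instructions_per_second = 1e9   # assume a 1-GIPS processor
branch_every_n = 8              # post cites one branch per 6-10 instructions
counter_bytes = 1               # assume a 1-byte counter per branch site

branches_per_second = instructions_per_second / branch_every_n
extra_write_bytes_per_second = branches_per_second * counter_bytes
print(extra_write_bytes_per_second / 1e6, "MB/s of added write traffic")
# -> 125.0 MB/s of writes that an instruction cache otherwise never makes
```

Even under these modest assumptions, the scheme adds a steady write stream on a path (instruction fetch) that is normally read-only.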

You really need to do some more basic study. You have dismissed the MIT on-line courses, calling them "for dummies". First of all, I doubt that any MIT course is "for dummies". Second, while you may not be a dummy in general, in this area you are, simply because you lack the basic knowledge. I admire your enthusiasm, but you need to channel it better, so as not to appear as foolish as you currently do to members of the group who have lots of experience.

--
 - Stephen Fuld
   e-mail address disguised to prevent spam
Reply to
Stephen Fuld

Actually this is sort of analogous to runtime profiling feedback to compilers, which is used on a number of systems.

The problem is that it is often impossible to predict an event until the outcome is observed (think quantum mechanics).

If we knew the outcome of each branch, we could simply eliminate the branch in most cases and still have a reasonable approximation.

Inserting counters into production code typically comes at a great price in terms of overhead.

We typically do this in development, in order to optimize the most costly sections of code for the most common outcomes; but trying to put some sort of low-level adaptive optimization metric into production code would consume more in overhead than the cost of doing the task to begin with.

Reply to
Colonel Forbin


If it slows down the system, then that would mean the bottleneck is in the bandwidth/speed of the bus (or whatever is used to read/write instructions from/to main memory).

I know games and stuff like that use tremendous amounts of bandwidth, but that goes via special buses like AGP or PCI Express.

How much memory bandwidth/speed is available to transfer instructions from/to main memory where program code is located? ;) And how much of that bandwidth is generally used by applications, or maybe even games? ;)

In other words... is bandwidth from main memory to the CPU and back truly the limiting factor? ;)

Bye, Skybuck

Reply to
Skybuck Flying

Thank you for that insightful contribution to our collective knowledge.

I am sure it will inspire many lurkers on these groups to come forth and bestow you with their knowledge and to spend many hours of independent research on their own time to answer your future questions.

The amusing thing is that we might have some useful conversations peripherally related to issues you raised.

AOL to you, too.

Reply to
Colonel Forbin

I could deprecate you as a dummy ;) for not seeing the obvious solutions to this idea :P

But I guess you didn't put much thought into it... but I have a little bit ;)

However, I do not see a need to share my wisdom to make dummies rich :)

In fact you didn't answer any of my questions ;)

Bye, Skybuck.

Reply to
Skybuck Flying

"Stephen Fuld" wrote in message news:q1LJe.78310$ snipped-for-privacy@bgtnsc05-news.ops.worldnet.att.net...

Do a search for his name in Google newsgroups to see some interesting threads ;)

Reply to
Jeroen

This is usually referred to as "speculative execution".

dk

Reply to
Dan Koren

When I am coding assembler, I find it's almost always obvious which way is more likely. Unfortunately, compilers don't usually understand what you are trying to do all that well, and thus often get it wrong. The profile-feedback scheme might be useful because the compiler can then do the proper optimization by itself. However, any time I care that much, I profile the code and rewrite any expensive chunks in assembler.

--
Regards,
  Bob Monsen

Reply to
Bob Monsen

I've used profilers; they use procedure calls to do their thing... and also write to the hard disk, making everything slllllooooowwwww.

Often the same branch is executed. So prediction can help.

No you cannot, you have to execute the rare case sometimes as well.

Define overhead ;)

Screw profiling, try it at lower levels and see what happens.

Instructions have to be loaded anyway... they sit in the cache, and the counters sit there too, so not much overhead.

Then after jumps, the counter increments and other logic can be done by additional gates, so no real overhead there.

Also, these updated instructions/jumps have to go back to main memory... which in my opinion isn't that much... but that's the part I don't know about... if the bandwidth is available, I would say use it! ;)

Of course there are other problems to solve, like compiler changes, and some things can't be pre-computed, etc... and in the end the counters might even turn out to be counter-productive ;)

Since the feedback on my questions is running low I have only one last thing to say to you people:

Screw you and f*ck you too lol, if that doesn't work lol...

Bye, Skybuck :D

Reply to
Skybuck Flying

I recall reading of a processor that does pipelining through branches (I don't remember which, but it might be a TI DSP) by using several execution units. One executes the instruction(s) where the branch is taken, the other executes the code where the branch is not taken. When the branch is decided, the results from the corresponding execution unit are taken and execution continues from there, and the other results are 'thrown away' unused. This method takes a lot more silicon space for the processor, but results in a speed increase that apparently can't be done otherwise.
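In software terms, the both-paths scheme Ben describes amounts to something like the following toy Python sketch (the function and names are illustrative, not from any real processor): both sides run, and the result of the losing side is simply discarded once the condition resolves.

```python
# Toy sketch of "both-paths" execution: run both sides of a branch,
# then keep the result of whichever path the condition actually
# selects and throw the other away. In hardware the two paths would
# run in parallel execution units; here they just run in sequence.
def eager_execute(condition, taken_path, not_taken_path):
    taken_result = taken_path()          # speculatively execute both paths
    not_taken_result = not_taken_path()
    if condition:                        # branch finally resolves
        return taken_result              # keep one result...
    return not_taken_result              # ...discard the other

print(eager_execute(3 > 2, lambda: "took branch", lambda: "fell through"))
# -> took branch
```

The silicon cost is visible even in the sketch: both paths are paid for on every branch, which is why this is only a win when the extra execution units would otherwise sit idle.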

-----

formatting link

Reply to
Ben Bradley

Ohhhhh that's really fancy pancy.

Reply to
Skybuck Flying

Bob Monsen wrote: (snip)

It isn't always so obvious in assembler.

The ESA/390 instructions BXH and BXLE, branch on index high, and branch on index low or equal, have opposite branch prediction logic.

There was a post once from someone who had coded a loop with both and found very different execution times. When it was suggested that branch prediction might be the cause, the two loops were coded with the opposite branch logic, and the timing difference reversed, as would be expected from branch prediction.

It is, though, not so obvious to an assembly programmer which way the branch prediction logic is set up.

-- glen

Reply to
glen herrmannsfeldt

Actually, what I meant is that it is usually obvious to an assembly coder which path through a branch is usually taken. Note that the MIPS architecture has different opcodes for both cases; for example, it has a BLTZAL (Branch on Less Than Zero And Link) and BLTZALL (Branch On Less Than Zero And Link Likely), which do exactly the same thing, but the second one is presumably somewhat faster if the branch is more likely.

I guess they took your example to heart when designing the instruction set.

--
Regards,
  Bob Monsen

Reply to
Bob Monsen

In article , Skybuck Flying wrote: [... me ..]

It may not be the bus itself that is the limiting factor. The memory device that holds the code may have a longer write time than read time; this may be a physical limitation of the storage technology. To give an extreme example, if the code is stored on a drum memory, this will reduce the speed to one instruction per rotation.

Also, a counter is by its nature a slow operation because the highest bit depends on all the bits below it. There are ways around this.

Never enough.

All of it.

No, it is but one limiting factor among many.

BTW: If the code in question is a Booth's divide routine, the branch prediction you suggest may actually slow the operation down instead of speeding it up.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

In article , Stephen Fuld wrote: [...]

You are jumping to the wild assumption that the machine has caches at all. The CPU could be tightly coupled to the system RAM. We could make life even more interesting for the OP by making it one of those old bit-slicy things where the code comes from a few dozen fast ROM chips and the data is in static RAM.

We could always make a special code store with a counter for each location built in. That way the extra transfer would not be needed, at the mere cost of making it about 10 times as expensive.

[...]

... and CPUs without them are fairly fast anyway.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

In article , Bob Monsen wrote: [...]

Profiling can be useful on nearly any modern processor if you can't see by inspection which way the conditionals land. If you are hand-tuning some ASM code for maximum speed, knowing which jumps are likely helps to minimize the out-of-line jumps.

BTW: You can pulse a port bit high if you jump and another if you don't and use a frequency counter to find the odds. This works well on microcontrollers.

--
kensmith@rahul.net   forging knowledge
Reply to
Ken Smith

For god's sake, what does Booth's algorithm have to do with jump instructions?

formatting link

See ya.

Reply to
Skybuck Flying
