Executing both branches in advance?


It seems CPUs nowadays have branch prediction logic to try and predict which branch will be taken, and prepare/execute those instructions ahead of time (pipelining etc.); if the branch was mispredicted, performance suffers.

Why not execute both branches in two separate logic units etc... and once the real outcome is known, continue with one of the two prepared branches?

Bye, Skybuck.

Reply to
Skybuck Flying

This is not uncommon; it's termed "speculative execution" and has appeared in the occasional desktop PC (the old DEC/Compaq Alpha CPU probably being the most common -- I don't think Pentiums do it).

It's not ubiquitous because -- with branch prediction already as good as it is -- you're adding significant extra hardware to execute both paths simultaneously and then having to cancel the path that turns out to be "wrong." Additionally, with pipelines as deep as they are now, it probably wouldn't be uncommon to hit another branch before you even knew the correct result of the *first* branch, and now you're stuck either needing 4 parallel execution units or stalling anyway. So... most people seem to think additional hardware resources are better spent on avoiding stalls through, e.g., scheduling multiple *threads* simultaneously (e.g., Intel's Hyperthreading).

The comp.arch guys could give you much better answers than I can; I'm just recalling what I learned back in school from Patterson/Hennessy's book. (You'll definitely want to check out a copy some time... it's a pretty simple book -- anyone with a high-school education can understand it -- but they do a very good job of explaining how you'd go about building your own RISC CPU, as well as why a lot of ideas that sound good on paper end up not being worth the transistors in actuality.)

@#@#$% -- I'm noticing you already posted this to comp.arch... well, ignore me -- as I say, they know much more about it than I do!


Reply to
Joel Kolstad

The idea of executing down both paths (called, imaginatively enough, "dual-path execution") is actually pretty complex. Firstly, you'd only want to do this for branches that you suspect you will get wrong, so you need a confidence predictor (not hard to build, mind you, and some branch predictors effectively provide one). Then comes the hard part: you need to be able to start both "threads" and kill the wrong one later. This gets tricky to do well -- killing a somewhat arbitrary set of in-flight instructions without corrupting machine state is the hard bit.

I just had a student group do a class project where they implemented a very simple version of this, and they really struggled with it. The group had already built a functional out-of-order processor (in synthesizable Verilog), but the dual-path part was a huge amount of work and didn't really help performance enough to offset the clock-period effects. I'm not saying it can't be done (I suspect it can be done with a performance gain), but this was a pretty good group of folks and they found it hard to get right.

So the idea's been around. AFAIK, no commercial processor has done this. And with power as the leading constraint I'm not sure anyone will in the near-term.

Mark Brehob, Lecturer

Reply to
Mark Brehob

Mark Brehob wrote: (top posting fixed)

The decision should always be based on the answer to the question "now that I've used x transistors and y watts to gain a performance gain of z -- could I have done better spending power and real estate somewhere else?"

Tim Wescott
Wescott Design Services
Reply to
Tim Wescott

That could be doable to a certain extent, but it may only make sense if there is absolutely no other work that can be done for other threads or processes. On modern OSes there's usually something else that can be done with the extra CPU resources. Realize that executing other threads is 100% use of the extra resources, while executing code down both speculative paths is at best 50% utilization.

Reply to

No, speculative execution is predicting the branch (or something else), and following only that prediction. Following both sides of each branch has been called "eager execution".

No Alpha has ever performed eager execution. However, IIRC the first Power implementation did at least fetch both sides of a conditional branch.

Followups to comp.arch.

- anton

M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
Reply to
Anton Ertl

IBM System 360/85, in 1969. When it came to a branch it would prefetch both paths, up to 16 bytes (or up to the next branch). When the branch was resolved it would use the result from the "right" path; it took 1 cycle to switch if the branch went to the non-current prefetch buffer. It was also one of the first machines to have cache memory.

Reply to

Hi all,

While many of the arguments here are correct (power, complexity, etc.), the idea is also flawed at the theoretical level. It "may" give some performance, but for a non-trivial reason.

All the people who propose it make the mistake of comparing two different machines for the cases with and without eager execution (that is what we called it in the past). In the post above, we have the term "separate" logic unit. If you are willing to add a logic unit, why not use it for the standard code as well? (I will come back to this later, so don't kill me yet.)

So, the merits of this idea must be compared with the machine back end held fixed. Make it twice as big if you want, but then use the bigger back end in both cases. The two cases, for low-confidence branches, are:

1) I predict the branch and run the predicted flow alone. It runs at the full CPU speed.
2) I run both flows. Each has half the resources, so each runs at half speed (yes, I know, just wait).

Now, if my probability teacher was right, I expect case one to run at full speed times the branch prediction probability. If the confidence predictor tells me a low-confidence prediction is right ~70% of the time (this is usually the case, I was told), then the expected performance is ~70% of full speed. For the second case, I always run at 50% of full speed. Hence, eager execution is a statistical loss for any branch with more than 50% confidence.

Can we use it only for branches with less than 50% confidence? Well, if you think the confidence is less than 50%, just flip the prediction.

Ok, but I made two assumptions:
1) I can make a bigger machine (double) and get twice the speed for one flow.
2) Each flow can use 100% of what it's given, so when I run two flows, each gets 50%.
In fact, these two assumptions mean that the amount of ILP in the CPU is the same in the two cases. But the two flows are by definition independent, so if I use eager execution, I raise the instruction-level parallelism. This is in fact the principle behind SMT (see Intel Hyperthreading). If, for example, the two flows each run at 60% of full speed, then the aggregate performance is 120%. But this is a weak argument, since the extra parallelism doesn't give much on real machines, and the predictors have a pretty good level of confidence, even for weird or unknown branches.


Disclaimer: I work for Intel, but this has NOTHING to do with it, and NOTHING must be inferred from it.

Reply to

This is _totally_ bogus!

You are completely disregarding the cost of restarting the pipeline each time you've mispredicted a branch. Let's take a simple example using a current Intel cpu, the Core 2:

A branch miss costs about 15 cycles, so if we have a branch that goes 80% of the time in one direction and 20% the other, those 20% will mispredict, right?

Just to take a small example, the core operation when decoding HD-TV/HD-DVD/Blu-Ray is a context-adaptive binary arithmetic decoder (CABAC), where the least probable symbol occurs something like 20-30% of the time.

The reference C code looks like this:

    bit = state & 1;
    if (offset < range) {
        state = transMPS[state];
    } else {
        offset -= range;
        range = range_lps;
        bit ^= 1;
        state = transLPS[state];
    }

I.e. the short path is the most likely, but the other path is only three instructions longer.

Taking a branch miss every 5 iterations corresponds to a cost of 3-5 cycles/iteration, so if we can run both halves in parallel and merge the results in less than this time, then it will be a clear win.

According to the docs of the VLC media player, the latest ffmpeg decoder does use conditional move operations to remove the branch from the code above, simply because doing so is a win on nearly all current cpus.

Having hardware available to do the same thing would be a much clearer win, since hw could presumably handle the cancellation of the non-taken path in not more than a cycle or two, right?

With 5-6 execution units in a Core 2 cpu, including one load unit, you could run all the operations in the long path in a single cycle, so even if it is four times as long, it has the same latency as the short path!

It is also important to note that this code has a lot of sequentially dependent operations, so most of the time nearly all those execution units are idle. :-(

See above, making that code branchless actually does help. :-)

Yes, eager execution is a net loss for most branches in real code, but when you need it, you need it bad. :-)


"almost all programming can be viewed as an exercise in caching"
Reply to
Terje Mathisen

Those two assumptions just don't happen in the real world. It's a challenge to keep current machines full. One of the reasons the industry is going multi-core is that it's hard to keep the current number of logic units full. Then, as Terje has pointed out, you've also got the case where the branch misprediction costs more than it would've cost to execute both branches.

Reply to
Nicholas King

Leaving issues of power and transistor count aside, it's a net loss in time only if the branch that should have been executed takes less time than the branch that shouldn't have been executed. Even that disadvantage could be eliminated if execution were allowed to proceed on both branches until it is known that one branch should be killed.

It's a many-universes style of execution: at each branch, another universe is created until someone opens the box and finds out if the cat is dead or alive. The parallel universes don't affect one another in any way. What am I missing?


Reply to
Robert Myers

Not so. It is still 100% utilization; you just have uncommitted results floating around. Besides, there is a serious overhead in thread switching.

Start with the Tomasulo algorithm (I think you already know about pipelined execution). Speculative execution has been implemented on several processors, including some versions of SPARC and MIPS. It does bring with it an associated cost in reworking the compiler and all the libraries to work optimally with it.

 Gegen Dummheit kämpfen die Götter selbst, vergebens.
Reply to
