Async Processors

Further to an earlier thread on ASYNC design and cores, this is in the news:

formatting link

and a little more is here

formatting link

with some plots here:

formatting link

They don't mention the Vcc of the compared 968E-S, but the Joules and EMC look compelling, as does the 'self tracking' aspect of Async.

They also have an Async 8051

formatting link

-jg

Reply to
Jim Granville

Hi, having always dealt with clock cycles, I am really surprised to learn how a clockless CPU works. Amazing!

Weng

Reply to
wtxwtx

formatting link

I seem to recall participating in a discussion of async processors a while back and came to the conclusion that they had few advantages in the real world. The claim of improved speed is a red herring. The clock cycle of a clocked processor is set by the longest delay path, evaluated at lowest voltage and highest temperature. The same is true of the async processor, except that the place where you have to deal with the variability is at the system level, not the clock cycle. So at room temperature you may find that the async processor runs faster, but under worst case conditions you still have to get X amount of computation done in Y amount of time. The two processors will likely be the same speed, or the async processor may even be slower.

With a clocked processor you can calculate exactly how fast each path will be, and margin is added to the clock cycle to cover worst case wafer processing. The async processor has a data path and a handshake path, with the handshake designed for a longer delay. That delay delta also has to have margin, and likely more than the clocked processor, since there are two paths to account for. This may make the async processor slower under worst case conditions.
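
To illustrate that data-path/handshake pairing, here is a hypothetical, simulation-only Verilog sketch of one bundled-data stage. The module name, the #4/#6 delays, and the omitted acknowledge/return path are my own simplifications, not anything from the parts under discussion; the point is only that the request runs through a delay chosen longer than the worst-case data path, and the receiver captures when that delayed request arrives.

  module bundled_stage (
    input            req_in,
    input      [7:0] d_in,
    output reg [7:0] q,
    output           req_out
  );
    wire [7:0] d_comb;
    assign #4 d_comb  = d_in + 8'd1;  // data path: worst case ~4 time units
    assign #6 req_out = req_in;       // matched delay: data worst case + margin

    always @(posedge req_out)         // capture once the delayed request arrives
      q <= d_comb;
  endmodule

The gap between the #6 and the #4 is exactly the margin being argued about: it has to cover how well the two paths track each other over process, voltage and temperature.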

Since your system timing must work under all cases, you can't really use the extra computations that are available when not running under worst case conditions, unless you can do SETI calculations or something that is not required to get done.

I can't say for sure that the async processor does not use less power than a clocked processor, but I can't see why that would be true. Both are clocked. The async processor is clocked locally and dedicates lots of logic to generating and propagating the clock. A clocked chip just has to distribute the clock. The rest of the logic is the same between the two.

I suppose that the async processor does have an advantage in the area of noise. As SOC designs add more and more analog and even RF onto the same die, this will become more important. But if EMI with the outside world is the consideration, there are techniques to spread the spectrum of the clock that reduce the generated EMI. This won't help on-chip because each clock edge generates large transients which upset analog signals.

I can't comment on the data provided by the manufacturer. I expect that you can achieve similar results with very aggressive clock management. I don't recall the name of the company, but I remember recently reading about one that has cut CPU power significantly that way. I think they were building a processor to power a desktop computer and got Pentium 4 processing speeds at just 25 Watts, compared to 80+ Watts for the Pentium 4. That may not carry over well to the embedded world, where there is less parallelism. So I am not a convert to async processing as yet.

Reply to
rickman

Yes, but systems commonly spend a LOT of time waiting on external, or time, events.

The two processors will likely be the same

Why? In the clocked case, you have to spec to cover process spread, and also Vcc and temperature. That's three spreads. The async design self-tracks all three, and the margin is there by ratio.

This may make the async processor slower in the

You did look at their Joule plots ?

Their gate count comparisons suggest this cost is not as great as one would first think.

... and that involves massive clock trees, and amps of clock driver spikes, in some devices... (not to mention electromigration issues...)

yes. [and probably makes some code-cracking much harder...]

As SOC designs add more and more analog and even RF onto the

Perhaps, in the limiting case, yes - but you have two problems: a) That is a LOT of NEW system overhead, to manage all that aggressive clock management... b) The async core does this 'clock management' for free - it is part of the design.

I don't recall the name of the company, but I remember

Intel are now talking of Multiple/Split Vccs on a die, including some mention of magnetic layers, and inductors, but that is horizon stuff, not their current volume die. I am sure they have an impressive road map, as that is one thing that swung Apple... :)

That may not convey well to the

I'd like to see a more complete data sheet, and some real silicon, but the EMC plot of the HT80C51 running identical code is certainly an eye opener (if it is a true comparison).

It is nice to see (pico) Joules / Opcode quoted, and that is the right units to be thinking in.
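
For a sense of scale (numbers invented for illustration, not taken from their data sheets): a core averaging 100 pJ per opcode while retiring 100 million opcodes per second dissipates about 100e-12 * 100e6 = 10 mW in the core, however those opcodes happen to be spread out in time.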

-jg

Reply to
Jim Granville

There are a lot of different async technologies, and not all of them suffer from this. Dual rail with an active ack does not rely on the handshake having a longer delay to envelope the data path's worst case. Phased Logic designs are one example.

Using dual rail with ack, there is no worst case design consideration internal to the logic ... it's just functionally correct by design at any speed. So, if the chip is running fast, so does the logic, up until it must synchronize with the outside world.
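
For anyone who hasn't met the encoding, here is a minimal toy Verilog sketch of a dual-rail AND function with completion detect (my own illustration, and simplified: a truly delay-insensitive implementation would use C-elements and a return-to-zero spacer protocol rather than plain gates). Each signal travels as a (true, false) rail pair, (0,0) means "no data yet", and the completion signal comes from the data itself rather than from a matched delay:

  module dr_and (
    input  a_t, a_f,   // dual-rail input a
    input  b_t, b_f,   // dual-rail input b
    output y_t, y_f,   // dual-rail output y = a AND b
    output y_done      // completion: fires only once y is known
  );
    assign y_t = a_t & b_t;                // true rail: both inputs true
    assign y_f = (a_f & (b_t | b_f)) |     // false rail: at least one input false,
                 (b_f & (a_t | a_f));      //   and both inputs have arrived
    assign y_done = y_t | y_f;             // ack source for the previous stage
  endmodule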

For fine-grained async, there is very little cascaded logic, and as such very little transitional glitching compared to the relatively deep combinatorial paths found in clocked designs. That transitional glitching after each clock edge consumes more power than the idealized behavioral picture of clean, single transitions on every signal at the clock edge, with no prop or routing delays, would suggest.
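
To make the glitching mechanism concrete, here is a toy, simulation-only Verilog fragment (delays and names are made up for illustration): y = a ^ a is logically a constant 0, but because the two copies of 'a' arrive at different times, y pulses on every input edge, and each pulse costs real dynamic power in anything it drives.

  module glitch_demo (input a, output y);
    wire a_slow;
    assign #3 a_slow = a;      // one copy of 'a' takes a longer route
    assign #1 y = a ^ a_slow;  // y should stay 0, but pulses high for ~3
                               // time units after every edge of 'a'
  endmodule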

For coarse-grained async, the advantage obviously goes away.

By design, clocked logic creates a distribution of additive current spikes following clock edges, even with spread spectrum. This is simply less of a problem, if it is one at all, in async designs. Async has a much better chance of creating a larger DC component in the power demand by time-spreading transitions, so that the on-chip capacitance can filter the smaller transition spikes, instead of the high AC content with many frequency components that you get with clocked designs.

In the whole discussion about the current at the center of the ball array and DC currents, this was the point that was missed. If you slow the clock down enough, the current will go from zero, to a peak shortly after a clock edge, and back to zero, with any clocked design. To get the current profile to maintain a significant DC level for dynamic currents requires carefully balancing multiple clock domains and using deeper than one level of LUTs with long routing to time-spread the clock currents. Very, very regular designs, with short routing and a single LUT depth, will generate a dynamic current spike 1-3 LUT delays after the clock transition. On small chips which do not have a huge clock net skew, this means most of the dynamic current will occur in a two or three LUT delay window following clock transitions. Larger designs, with a wide distribution of logic levels and routing delays, flatten this distribution out.

Dual rail with ack designs just completely avoid this problem.

Reply to
fpga_toys

Yes, and if power consumption is important the processor can slow or even stop the clock. That is often done when power consumption is critical. That's all the async processor does: it stops its own clock.
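
For reference, "stopping its own clock" in a synchronous design usually comes down to something like the following latch-based clock-gating cell. This is a minimal sketch; in practice you would instantiate the library's integrated clock-gating cell rather than hand-build one.

  module clk_gate (
    input  clk,
    input  enable,     // from idle-detect / power-management logic
    output gclk        // gated clock for the block that can sleep
  );
    reg en_lat;
    always @(clk or enable)
      if (!clk)
        en_lat <= enable;  // latch is transparent only while clk is low,
                           // so a late 'enable' can't chop the gated clock
    assign gclk = clk & en_lat;
  endmodule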

BTW, how does the async processor stop to wait for IO? The ARM processor doesn't have a "wait for IO" instruction. So it has to set an interrupt on an IO pin change or a register bit change and then stop the CPU, just like the clocked processor. No free lunch here!

Yes, the async processor will run faster when conditions are good, but what can you do with those extra instruction cycles? You still have to design your application to execute M instructions in N amount of time under WORST CASE conditions. The extra speed is wasted unless, like I said, you want to do some SETI calcs or something that does not need to be done. The async processor just moves the synchronization to the system level where you sit and wait instead of at the gate level at every clock cycle.

Yes, but there are too many unknowns to tell if they are comparing apples to oranges. Did the application calculate the fibonacci series, or do IO with waits? Did the clocked processor use clock gating to disable unused sections or did every section run full tilt at all times? I have no idea how real the comparison is. Considering how the processor works I don't see where there should be a difference. Dig below the surface and consider how many gate outputs are toggling and you will see the only real difference is in the clocking itself; compare the clock tree to the handshake paths.

But the gate count is higher in the async processor.

You can wave your hands and cry out "massive clock trees", but you still have to distribute the clock everywhere in the async part; it is just done differently, with lots of logic in the clock path, and they call it a handshake. Instead of trying to minimize the clock delay, they lengthen it to exceed the logic delay.

It is "free" the same way in any design. The clock management in a clocked part would not be software, it would be in the hardware.

I found the article in Electronic Products, Feb 2006, "High-performance 64-bit processor promises tenfold cut in power", pp. 24-26. It sounds like a real hot rod with dual 2 GHz processors, dual high speed memory interfaces, octal PCI Express, gigabit Ethernet and lots of other stuff. 5 to 13 Watts typical and 25 Watts max.

So you can do some amazing stuff with power without going to async clocking.

Reply to
rickman

Can you explain? I don't see how you can async clock logic without having a delay path that exceeds the worst path delay in the logic. There is no way to tell when combinatorial logic has settled other than to model the delay.

I found some links with Google, but I didn't gain much enlightenment from the nickel tour. What I did find seems to indicate that the complexity goes way up, since each signal becomes two signals carrying value and timing combined, called LEDR encoding. I don't see how this is an improvement.

That is the point. Why run fast when you can't make use of the extra speed? Your app must be designed for the worst case speed and anything faster is lost.

I think you are talking about a pretty small effect compared to the overall power consumption.

Care to explain how Dual rail with ack operates?

Reply to
rickman

Worst case sync design requires that the clock period be longer than the longest worst case combinatorial path ... ALWAYS ... even when the device is operating under best case conditions. Devices with best case fab operating under best case environmentals are forced to run just as slow as worst case fab devices under worst case environmentals.

The tradeoff with async is to accept that, under worst case fab and worst case environmentals, the design will run a little slower because of the ack path.

However, under typical conditions, and certainly under best case fab and best case environmentals, the expectation is that the ack path delay is a minor cost compared to the improvement gained. If the device has very small deviations in performance from best case to worst case, and the ack costs are high, then there clearly isn't any gain to be had. Other devices, however, do offer this gain for certain designs.

Likewise, many designs might be clock constrained by an exception path that is rarely exercised, but the worst case delay for that rare path will constrain the clock rate for the entire design. With async, that problem goes away, as the design can operate with timings for the normal path without worrying about the slowest worst case paths.

Depends greatly on the design and logic depth. For your design it might not make a difference, as you suggest. For a multiplier it can be significant, as every transition, including the glitches, costs the same dynamic power.

Reply to
fpga_toys

Yes, that has to be one of the keys. Done properly, JNB Flag,$ should spin only that opcode's logic, and activate only the small cache doing it.

That's the coarse-grain way, the implementation above can drop to tiny power anywhere.

Nothing, the point is you save energy, by finishing earlier.

Not in the 8051 example. In the ARM case, it is 89:88, pretty much even.

The thing to do now, is wait for some real devices, and better data.

-jg

Reply to
Jim Granville

You have ignored the real issue. The issue is not whether the async design can run faster under typical conditions; we all know it can. The issue is how you make use of that faster speed. The system design has to work under worst case conditions, so you can only count on the performance available under worst case conditions.

You can do the same thing with a clocked design. Measure the temperature and run the clock faster when the temperature is cooler. It just is not worth the effort since you can't do anything useful with the extra instructions.

The glitching happens in any design. Inputs change and create changes on the gate outputs which feed other gates, etc until you reach the outputs. But the different paths will have different delays and the outputs as well as the signals in the path can jump multiple times before they settle. The micro-glitching you are talking about will likely cause little additional glitching relative to what already happens. Of course, YMMV.

Reply to
rickman

No, it should not spin since that still requires clocking of the fetch, decode, execute logic. You can do better by just stopping until you get an interrupt.

I disagree. Stopping the CPU can drop the power to static levels. How can you get lower than that?

How did you save energy? You are thinking of a clocked design where the energy is a function of time because the cycles are fixed in duration. In the async design, energy is not directly a function of time but rather a function of the processing. In this case the processing takes the same amount of energy, it just gets done faster. Then you wait until the next external trigger that you need to synchronize to. No processing or energy gain, just a longer wait time.

In the real world, of two equivalent designs, the async one will take more gates. You need all the same gates as in the clocked design; you subtract the clock tree and add back in the async clocking. I expect this would be nearly a wash in any design. The only way the async design can be smaller is if they make other changes.

I say that clock management will be easier to use and implement and give the same results as async clocking. In a large sense, the "async" clocking is just a way to gate the clock to each logic block on a cycle by cycle basis. It is really just a matter of what you claim to get from this. Speed is not one of these gains.

Reply to
rickman

Again, depends on the application. If it's a packet routing/switching engine running below wire speed, then it means that the device will route/switch more packets per second without overrunning when not worst case.

But that only allows derating for temp based on worst case assessment of the process data. It doesn't allow for automatic adjustments for process variation or other device specific variances.

Sure you can take every device and characterize it across all the environmental factors which impact performance, and write a custom clock table per device ... but get realistic ...

It's all about tradeoffs ... designs and the target implementation hardware.

Not true; there have always been ways to design without glitches, using choices like Gray-coded counters for state machines, one-hot state machines, and covering the logic terms so they are glitch-free by design, which most good engineers will purposely do when practical and necessary, as should good tools. It's just a design decision to ensure that every term is deterministic, without static or dynamic hazards. Maybe they don't teach this in school any more now that everyone does VHDL/Verilog.

formatting link
formatting link
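
As a tiny generic example of the Gray-coded counter idea mentioned above (my own sketch, not taken from either link): only one output bit changes per count, so terms decoded from it cannot glitch from multiple inputs switching at once.

  module gray_counter #(parameter W = 4) (
    input              clk,
    input              rst,
    output reg [W-1:0] gray
  );
    reg  [W-1:0] bin;
    wire [W-1:0] bin_next = bin + 1'b1;
    always @(posedge clk)
      if (rst) begin
        bin  <= {W{1'b0}};
        gray <= {W{1'b0}};
      end else begin
        bin  <= bin_next;
        gray <= bin_next ^ (bin_next >> 1);  // binary-to-Gray: one bit flips per count
      end
  endmodule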

In the best async designs, extremely fine grained dual rail, it shouldn't happen at all with good gate design.

Reply to
fpga_toys

Well, it depends what you take from their published info.

To me, it makes sense, and I look forward to seeing real devices :)

The ultimate proof, is not what someone thinks may, or may not matter, but how the actual chip performs.

-jg

Reply to
Jim Granville

Do they design the equipment to drop packets when it gets hot or when the PSU is near the low margin or when the chip was just not the fastest of the lot? That is my point. You design equipment to a level of acceptable performance. The equipment spec does not typically derate for uncontrolled variables such as temperature, voltage and process.

If your VOIP started dropping packets so that your phone calls were garbled and the provider said, "of course, we had a hot day, do you expect to see the same speeds all the time?", would you find that acceptable?

Actually they do. That is called speed grading and nearly all makers of expensive CPU chips do it. You could do that with any chip if you wanted to. But there is no point unless you want to run your Palm 10% faster because it is a toy to you.

Yes, I agree, you have to be realistic. There is no significant advantage to having a processor run at different speeds based on uncontrolled variables.

You are talking about stuff that no one uses because there is very little advantage and it does not outweigh the cost. My point is that none of this is justified at this time.

Besides, for the most part, the things you mention do not prevent the glitches. One-hot state machines use random logic with many input variables. Only one FF is a 1 at a time, but that means two FFs change state at each transition and potentially feed into the logic for many state FFs. This means each rank of logic can see transitions from the two FFs that change, and potentially significant power is used even in the ranks that do not change their output.

I have never heard anyone suggest that you should design to avoid the intermediate transients of logic. Of course you can, but there are very few designs indeed that need to be concerned about the last few % of power consumption this would save.

Great, you have identified an advantage of async designs. They can be done with extremely fine grained dual rail logic that can avoid transients in intermediate logic. But then you can do that in sync designs if you decide you want to, right?

Reply to
rickman

The gain is very simple ... every time a sync circuit clocks, a zillion transistors switch, fewer if the circuit is partitioned correctly. But by definition, in an async circuit, only the one path that is actually doing something switches. The clock distribution in an advanced processor is a significant proportion of the overall clock budget. There is also the capacitive effect, charge and discharge: if the current never falls to zero, then the minimum power is slightly above zero also. Static CMOS FETs, of course, draw no current.
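
For reference, the usual back-of-the-envelope form of that capacitive term is P_dyn ~ a * C * Vdd^2 * f, where a is the fraction of the total node capacitance C actually switched each cycle; both the clocked and async approaches are really just attacking 'a' and the clock network's share of C.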

Simon

Reply to
Simon Peacock

You are talking about a circuit with NO clock gating. A sync clocked circuit can have the clocks dynamically switched on and off to save power. I already posted an example of a high end CPU which is doing that and getting a 3x power savings, just like the async chip compared to the sync chip listed earlier in the thread.

Actually power dissipation is more complex than just saying "this path is not used". If the inputs to a block of logic change, it does not matter if the outputs are used or not. The inputs must be held constant or the combinatorial logic switches drawing current.

Clock distribution is not a significant issue in the timing budget. The delays in the clock tree are balanced. If you need to minimize the effective delay to match IO clocking the delay can be compensated by a DLL. The clock tree does use some power, but so does the handshaking used in async logic.

Everyone seems to minimize their analysis of the situation rather than to think it through. This reminds me a bit of the way fuzzy logic was claimed to be such an advance, but when you looked at it hard you would find little or no real advantage. Do you see many fuzzy logic projects around anymore?

Reply to
rickman

Yes, nearly every communications device with a fast pipe will discard packets when overrun. Cisco routers of all sizes, DSL modems, wireless radios ... it's just an everyday fact of life.

Faster CPUs cost more money ... if you want a Cisco router that doesn't drop packets at higher loads, spend more money. The primary difference between whole families is simply processor speed.

IT HAPPENS!!! Reality Check ... IT HAPPENS EVERY DAY.

No, that is a completely different issue ... not dynamic fit of processing power to current on chip delays.

No ... wrong.

Sorry ... that is true only in your mind for your designs. It does not apply broadly to all designs for all real world problems. Real engineers do this because it really does matter for THEIR designs.

So, I've already stated clearly one everyday application where the customer benefits by having routers drop packets only when the hardware isn't capable of going faster, rather than derating the whole design to reduced worst case performance levels.

"no one uses because" ... sorry, but clearly you haven't been keeping up with your reading and professional skills training as you certainly don't know everyone.

You really need to read a lot more of the C.A.F. to get a better grounding in what people actually do these days. For starters, read from the end of page 3 about data path reordering and glitch power reduction:

formatting link

Get the point ... people concerned about low power do actually design to remove glitching ... by serious engineering design. Keep on reading about what "no one uses" to get a real understanding of the power engineering these supposed nobodies do, in section 5, Architecture Optimization:

formatting link

Note the lead in to the topic ... glitches can consume a large amount of power. Now clearly some engineers have never had to worry about battery life or waste heat from excess power. But for the real power sensitive engineers, the truth is that nobody can ignore these factors.

The reality is that the faster the logic gets, the more you have to worry about these timing mismatch effects. Three generations back, a 1 ns glitch was absorbed into the switching times. At 90 nm, glitches as short as a few dozen ps will cause two unwanted transitions and power loss. The whole problem with glitches is this extra double state flip, where there should have been none, that robs power ... and that is amplified by all the logic behind the glitch also flipping once or twice as well ... greatly amplifying the cost of the initial failure to avoid glitches by design. At 90 nm there are a whole lot more sources of glitches that require attention to design details that didn't even matter two or three generations back. So while you may think that no one actually attempts glitch-free design practices, by using formal logic tools to stop them dead, you clearly do not know everyone, to make that statement so firmly.

If you still think that no one decides to design formally correct, glitch-free circuits, keep reading what leading engineers from Actel and Xilinx say:

formatting link
formatting link

Note the end of section 5.2, where it discusses the power consumed due to glitches in several of the design's sections: 9-18%. Note also that aggressively pipelining with the additional "free" registers in FPGAs is a clear win. Other ASIC studies by Shen on CMOS combinatorial logic have stated that as much as 20-70% of a device's power can be consumed by glitches, which is a strong reminder to use the FPGA registers and pipeline wisely.
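
As a toy illustration of that "use the free registers" point (my own example, not from either paper): registering the operands in front of a multiplier means its inputs change only once per clock, and the output register stops whatever internal glitching remains from propagating any further.

  module piped_mult #(parameter W = 8) (
    input                 clk,
    input      [W-1:0]    a, b,
    output reg [2*W-1:0]  p
  );
    reg [W-1:0] a_r, b_r;
    always @(posedge clk) begin
      a_r <= a;          // stage 1: register the operands so the
      b_r <= b;          //          multiplier sees clean edges
      p   <= a_r * b_r;  // stage 2: any glitching inside the multiplier
                         //          dies at this register
    end
  endmodule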

So, from my perspective, "no one" concerned about power can possibly be doing their job if they are unaware of glitching power issues ... a stark contrast to your enlightening broad statements to the contrary.

I think you have now, and it's a lot more than a few percent for some designs.

yep ... with worst case limited performance too.

Reply to
fpga_toys

There's overhead and delays associated with starting the clock up again; not significant for power, but may make it impractical for high-rate I/O.

The other option is to memory-map and use a wait. The clocked processor will stall, and power consumption will drop, since the outputs of the clocked elements aren't changing. However, the clock nets are still charging and discharging.

For the async processor, though, you should be able to get down to leakage currents.

Reply to
Paul Johnson

I think there's an issue here with the definition of "worst-case conditions". It's not just process/voltage/temperature corners, and the tool would have to build in a safety margin even if it were. When you're designing a static timing analyser, you also have to take into account random localised on-die variations, and you have to build in more safety margin just in case. The end result is that when doing synchronous design your tool gives you a conservative estimate, and you're stuck with it. If you've got a bad-process async design and a bad-process sync design sitting next to each other in a hot room with low voltages, then the async design should presumably run faster.

You can't do that because, I think, you can't get the tools to give you a graph of max frequency vs. temperature for worst-case process and voltage. You just get the corner cases. With an async design it doesn't matter - it just runs as fast as it can. Brings to mind the gingerbread man.

Reply to
Paul Johnson

"Local" variations can also affect the async processor. That is why the delta between the data and control path delays must be larger than zero.

In the end all these effects must be accounted for whether at the chip level or at the system level.

I don't need tools, silicon speed vs. temp and voltage is a well known quantity. Besides, there are little or no tools commercially available for doing async design. I assumed we were not talking about the practicality with today's tools, but were extrapolating to a "perfect" world.

But the real issue is what do you do with the excess speed of the async design at room temp, etc? Your design has to meet specific goals over all variables of temp, voltage and process.

Ok, FPGA identified one application where it might be acceptable to not meet your timing goals as the box warms up. Personally I don't believe that, since even Cisco designs to requirements, and I seriously doubt there is room for uncontrolled variables limiting the performance of their equipment. "Yes, our product will operate at XXX packets per second (as long as you keep it very cool and the voltage regulator is at the high end of its spec and the chip is at the fast end of its spec)." Do they spec equipment that way?

Doesn't this make sense? What do you do with the extra MIPs you get *sometimes*?

Reply to
rickman
