Combinatorial loops and false paths

I am creating an FPGA implementation of an old DEC PDP-10 (KS-10, specifically) mainframe computer. Why? Because I always wanted one...

The KS-10 was microcoded and used 10x am2901 4-bit slices in the ALU. At this stage, most of the instruction set diagnostics simulate correctly.

When I synthesize this design using Xilinx ISE I get warnings about combinatorial loops involving the ALU - and an associated "Minimum period: 656.595ns (Maximum Frequency: 1.523MHz)" message...

My understanding is that if combinatorial loops really existed then the simulation wouldn't stabilize. I can't really add pipelining or registers to the design without affecting the microcode - and I don't want to do that.

Most of the information that I've read about "false paths" assumes two clocked processes, not a combinatorial loop.

Anyway. I'm not sure how to resolve this. I can mark the path as a false path but I think that it will ignore /all/ the timing (even the desired timing) through that path.

What should I do?

Rob.

Reply to
Rob Doyle

Combinatorial loops _with delay_ will simulate correctly. Otherwise you couldn't simulate a ring oscillator.
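A minimal sketch of why that works - a three-inverter ring in VHDL (entity and signal names are mine). The explicit "after" delays are what let the simulator advance time; with zero delay it would spin forever in delta cycles at time zero:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical three-inverter ring oscillator. It simulates only
-- because of the explicit "after" delays on each assignment.
entity ring_osc is
  port (q : out std_logic);
end entity ring_osc;

architecture sim of ring_osc is
  signal a, b, c : std_logic := '0';
begin
  a <= not c after 1 ns;   -- odd number of inversions: oscillates
  b <= not a after 1 ns;
  c <= not b after 1 ns;
  q <= c;
end architecture sim;
```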

Not necessarily true. False paths have a FROM and a TO specification, and would not affect other paths that don't start at the FROM or don't end at the TO timing group. This allows you for example to say that you don't care how long a control register bit takes to get through some logic, but you want the streaming data to get through in the standard PERIOD time.
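For example, in an ISE UCF file that might look like the fragment below (all net, instance, and group names here are invented). The TIG applies only to paths from the FROM group to the TO group; everything else is still checked against the PERIOD constraint:

```text
# Clock and default period constraint
NET "clk" TNM_NET = "clk_grp";
TIMESPEC "TS_clk" = PERIOD "clk_grp" 20 ns HIGH 50%;

# Group the control-register FFs and the downstream FFs
INST "ctrl_reg*"   TNM = "ctrl_ffs";
INST "status_reg*" TNM = "stat_ffs";

# Ignore timing only on paths from ctrl_ffs to stat_ffs
TIMESPEC "TS_ctrl_false" = FROM "ctrl_ffs" TO "stat_ffs" TIG;
```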

You could always run your machine at 1.5 MHz. After all, how fast was the PDP-10? Other than that, we'd probably need to analyze this path to give any useful advice.

Reply to
GaborSzakacs

Do you know why the tool is complaining? Did you write the code describing the ALU? Off the top of my head, I can't think of why an ALU would have a combinatorial loop. It should have two input data busses, a number of control inputs, an output data bus and some output status signals.

I don't recall the details of the 2901 bit slice and my data books are not handy. That's the problem with paper books, you can't just shove them on your hard drive... Does this part have an internal register file? Even so, that means the part would have a clock and not a combinatorial loop. Maybe this is because of some internal bus that is shared in a way that looks like a loop even though it would never be used that way?

I may have to find my old AMD data book. That could be an archeological dig!

Rick

Reply to
rickman

Look at the critical path reported by synthesis. It sounds like a VHDL coding error; that delay would equate to a chain of 300-400 LUTs between FFs, which strongly suggests a mistake somewhere.

- Brian

Reply to
Brian Drummond

I took a look at the block diagram and I don't see any combinatorial loops. However, they use latches for the RAM outputs, which are combinatorial if implemented that way. Latches are typically used for the speed advantage: data flows through while the latch is transparent before being held, whereas D-type registers don't change their outputs until the clock edge.

Are the RAM and output latches in the path being reported as too long? If so, I would recommend changing the latches to rising-edge registers. This should cut these loops. The RAM is level sensitive, which may be a problem in an FPGA. I think all of the sequential elements are edge sensitive these days. I suppose you could make it out of latches; it's not that many elements.
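In VHDL terms the change looks something like this (a sketch with invented entity and signal names):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity latch_vs_reg is
  port (
    clk, en, d     : in  std_logic;
    q_latch, q_reg : out std_logic);
end entity latch_vs_reg;

architecture rtl of latch_vs_reg is
begin
  -- Transparent latch: output follows d while en is high, so the
  -- path through it is combinatorial and can close a loop.
  latch_p : process (en, d)
  begin
    if en = '1' then
      q_latch <= d;
    end if;
  end process;

  -- Rising-edge register: output only changes at the clock edge,
  -- which cuts any combinatorial loop through this element.
  reg_p : process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        q_reg <= d;
      end if;
    end if;
  end process;
end architecture rtl;
```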

The clock runs down the center of the block diagram cutting the data paths and preventing any internal loops I can see. Of course, there could be combinatorial loops created by the way it is used. The only one I see is created if you loop the ALU output Y back to the ALU input D. This could happen if you try to put these busses on a tri-state bus. Since tri-states aren't used in an FPGA, you are better off using multiple separate busses to drive all the various inputs hanging on the tri-state bus.
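A sketch of that mux-based replacement (names and encodings are invented; 36 bits to match the PDP-10 word): each destination gets its own mux instead of hanging on one shared tri-state bus, so no loop can form through a shared wire.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical replacement for a shared tri-state bus: one
-- dedicated mux drives this destination.
entity dbus_mux is
  port (
    bus_sel : in  std_logic_vector(1 downto 0);
    alu_y   : in  std_logic_vector(35 downto 0);
    mem_q   : in  std_logic_vector(35 downto 0);
    io_data : in  std_logic_vector(35 downto 0);
    dbus    : out std_logic_vector(35 downto 0));
end entity dbus_mux;

architecture rtl of dbus_mux is
begin
  with bus_sel select
    dbus <= alu_y   when "00",
            mem_q   when "01",
            io_data when others;
end architecture rtl;
```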

I think it would be very interesting to implement this in a low power FPGA and see just how efficient it can become. What target were you thinking of? I am currently working with the iCE40 family from Lattice and it has very impressive power consumption. You likely could run a design at some 10's of MHz while drawing the power level of an LED... a small LED.

Any interest in making this a group project? Or are you keeping all the fun to yourself?

BTW, where did you get documentation on the PDP-10 sufficient to design an equivalent? Just from the instruction manual? Or do you have more details?

Rick

Reply to
rickman

Loops with an odd number of inversions won't stabilize, but with an even number they should be fine.

(snip)

The tools should be good enough to figure out latches.

As I wrote above, though, be sure that there is no (odd number) of inverters in the loop.

The BRAMs on most FPGAs are synchronous (clocked). That might not match what you need for some older designs. If it isn't too big, and you really need asynchronous RAM, you have to make it out of CLB logic.

As I understand it, the Xilinx tools, at least, know how to convert tristate logic to MUX logic. I suppose in some cases that might generate unexpected, or even false, loops.

I believe that the KA-10 was done in asynchronous (non-clocked) logic. That might make an interesting FPGA project.

-- glen

Reply to
glen herrmannsfeldt

You might be able to get around the async/sync ram issues by using the other edge of your clock (and if necessary, a dual-port bram could run off different edges for read & write).
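A sketch of the falling-edge read trick (entity, widths, and names are assumptions; 16 x 36 happens to match the 2901 register file): data is valid half a cycle before the rising edge that clocks the rest of the design, approximating an asynchronous read at the cost of half a period of timing margin.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity neg_edge_ram is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  std_logic_vector(3 downto 0);
    raddr : in  std_logic_vector(3 downto 0);
    din   : in  std_logic_vector(35 downto 0);
    dout  : out std_logic_vector(35 downto 0));
end entity neg_edge_ram;

architecture rtl of neg_edge_ram is
  type ram_t is array (0 to 15) of std_logic_vector(35 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- Write on the rising edge, like the rest of the design.
  write_p : process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(waddr))) <= din;
      end if;
    end if;
  end process;

  -- Read on the falling edge: dout is ready before the next
  -- rising edge, so downstream logic sees "async-like" data.
  read_p : process (clk)
  begin
    if falling_edge(clk) then
      dout <= ram(to_integer(unsigned(raddr)));
    end if;
  end process;
end architecture rtl;
```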

There are also ways to build a DDR register out of two registers and 3 XOR (or XNOR) gates, without gating the clock. Google "flancter circuit". It is STA-friendly too. That might be another trick you could use.

If the loops were not stable, it would show up even in RTL sim (assuming the conditions needed to make it unstable were met). Since it works with the loops in there (in simulation), I assume it is at least not always unstable.

Andy

Reply to
Andy

Figure out in what context? The tool can't know when the latch is enabled or disabled. When enabled it is transparent and so is just combinational logic. I'm not sure what your point is. For timing, a latch is combinational logic and has to be figured into the timing paths. In fact, that is usually why latches are used: they improve timing.

Yes, not only are the block RAMs synchronous, the LUT RAMs (distributed) are also synchronous. That is why I say you have to make async RAM out of latches.

That should not create loops unless the loop is already there in the connected logic. Translating tristate busses creates multiple sets of multiplexors which are all distinct, preventing loops... unless the rest of the logic connects input to output.

Doing async logic in an FPGA is not so easy. You need timing info that is hard to get.

Rick

Reply to
rickman

On 1/15/2013 7:21 PM, rickman wrote:
> On 1/15/2013 12:54 AM, Rob Doyle wrote:
>>
>> I creating an FPGA implementation of a old DEC PDP-10 (KS-10,
>> specifically) Mainframe Computer. Why? Because I always wanted
>> one...
>>
>> The KS-10 was microcoded and used 10x am2901 4-bit slices in the
>> ALU. At this stage, most of the instruction set diagnostics
>> simulate correctly.
>>
>> When I synthesize this design using Xilinx ISE I get warnings
>> about combinatorial loops involving the ALU - and an associated
>> "Minimum period: 656.595ns (Maximum Frequency: 1.523MHz)"
>> message...
>>
>> My understanding is that if combination loops really existed then
>> the simulation wouldn't stabilize. I can't really add pipelining
>> or registers to the design without affecting the microcode - and I
>> don't want to do that.
>>
>> Most of the information that I've read about "false paths" assume
>> two clocked processes not a combinatorial loop.
>>
>> Anyway. I'm not sure how to resolve this. I can mark the path as a
>> false path but I think that it will ignore /all/ the timing (even
>> the desired timing) through that path.
>>
>> What should I do?
>
> Do you know why the tool is complaining? Did you write the code
> describing the ALU? Off the top of my head, I can't think of why an
> ALU would have a combinatorial loop. It should have two input data
> busses, a number of control inputs, an output data bus and some
> output status signals.

Oops. Sorry, I guess I replied to rickman instead of following up with the group. I'm resending...

I guess I'm using the terms ALU and am2901 interchangeably. I'll be more specific.

There is nothing wrong with the am2901 proper. It is what it is.

Exactly.

The problem is that the am2901 output goes to a bus that eventually routes back to the am2901 input for some (as best I can tell) unused configuration of the microcode. This all happens with no registers in the loop.

The am2901 does have an internal dual-ported register file. Register file writes from the ALU output are clocked. Register file reads to the ALU input are latched only. The am2901 control inputs and register file addresses all originate from the microcode which is registered.

The am2901 has a single input bus which is combinatorial through the ALU to the output bus. Therefore all am2901 ops require at least one register (or the constant zero) as an ALU source.

I think I know what to do. It looks like ISE supports a FROM-THRU-THRU-THRU-THRU-TO timing constraint - with an indefinite number of THRUs. I think I just want to very specifically exclude the paths that the tool is whining about and leave everything else.
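Something along these lines, perhaps (a hypothetical UCF sketch - every name here is invented; the real net names would come from the synthesis report). It excludes only the ALU -> DBM -> DBUS -> ALU loop, leaving every other path through those muxes covered by the normal PERIOD constraint:

```text
# Group the ALU flip-flops and tag the mux outputs as THRU points
INST "alu/*"     TNM = "alu_grp";
NET  "dbm_out*"  TPTHRU = "thru_dbm";
NET  "dbus_out*" TPTHRU = "thru_dbus";

# Ignore timing only on the specific looped path
TIMESPEC "TS_alu_loop" = FROM "alu_grp" THRU "thru_dbm"
                         THRU "thru_dbus" TO "alu_grp" TIG;
```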

I guess that it is just a design from another day - a whole lot less synchronous than anything I've done in an FPGA before.

I have enjoyed going back through that all. I even found my "Mick and Brick" book. I'll probably do a VAX 11/780 next which also used bit-sliced parts.

Rob.

Reply to
Rob Doyle

Folks have done a remarkable job of archiving design information, software, and hardware for these historic machines. Notably, Paul Allen (of Microsoft fame) sponsors a museum that maintains a few of these machines in working order. See

formatting link

The folks at bitsavers.org are frantically scanning documents/books and imaging magnetic media before it becomes unreadable.

AMD Info is at:

formatting link

All the KS10/PDP10 information is at:

formatting link

Microcode source/listings and processor diagnostics are available from:

formatting link

A webpage that describes my/our project is at:

formatting link

I put a block diagram of the KS10 CPU on the techtravels website. Referring to that block diagram, the false paths are from the ALU output through the DBM mux, through the DBUS mux, and back into the ALU. There is another false path through the SCAD. (The SCAD is a 12-bit mini-ALU built from 3x 74S181s that is used for managing floating-point exponents and loop constructs in the microcode).

This is definitely a group project. Right now, I'm doing all the FPGA work by myself -

If you are interested in participating, contact me off-list.

You used the term 'archeology'. It sure feels like that...

Rob.

Reply to
Rob Doyle

(snip)

I suppose, but it was designed way before the tools we use now. (snip)

(snip)

Years ago, maybe just about when it was new, I bought "Mick and Brick."

Then, about 20 years ago, it got lost in a move. A few weeks ago I bought a used one from half.com for a low price. (In case I decide to do some 2901 designs in FPGAs.)

The discussion on combinatorial loops reminds me of the wrap around carry on ones complement adders. If done the obvious way, it is a combinatorial loop, but hopefully one that, in actual use, resolves itself.

-- glen

Reply to
glen herrmannsfeldt

(snip, I wrote)

They are now? They didn't use to be. I am somewhat behind in the generations of FPGAs. (snip, I also wrote)

The whole idea behind asynchronous logic is that you don't need to know any timing information. Also known as self-timed logic, it has enough handshaking that every signal changes when it is ready, no sooner and no later. If you use dual-rail logic:

formatting link

then all timing just works.

If you mix synchronous and asynchronous logic, then things get more interesting.

-- glen

Reply to
glen herrmannsfeldt

Mick and Brick was not just about the 2901, it covered the basic concepts of designing a processor. One of the things that stuck with me was the critical path they described, which I believe was in a conditional branch calculating the next address (I guess it finally got away from me again). I found that to be true on every processor design I looked at, including the MISC designs I did in FPGAs on my own. These guys had some pretty good insight into processor design.

I had my own book too and will have to dig around for it. But I am pretty sure it is gone as I haven't seen it in other searches I've done for other books the last ten years or so. I think I got it free from AMD at one point. Now they are over $100 for one in good condition. I'm not sure what "adequate" means for a book condition. They say it is all legible, but I've seen some pretty rough books in "good" condition.

Rick

Reply to
rickman

Actually, I think Xilinx made their XC4000 series with clocks for writing the distributed RAM. They had too much trouble with poor designs trying to generate a write pulse with good timing and decided they were better off giving the user a clock. I used the ACEX from Altera in 2000 or so which had an async read block RAM. It made a processor easier to design saving a clock cycle on reads. Block RAMs have always been synchronous on the writes and now they are synchronous on reads as well... so many generations of FPGAs...

You need to read that section again... Nowhere does it say the timing "just works". It describes two ways to communicate a signal, one is to send one of two pulses for a 1 or a 0 and the other is to use a handshake signal which has a delay longer than the data it is clocking. In both cases you have to use timing to generate the control signal (or combined data and control in the first case). The advantage is that the timing issues are "localized" to the unit rather than being global.

The problem with doing this in an FPGA is that the tools are all designed for fully synchronous systems. This sort of local timing with an emphasis on relative delays rather than simple maximum delays is difficult to do using the standard tools.

All real time systems are at some point synchronous. They have deadlines to meet and often there are recurring events that have to be synced to a clock such as an interface or an ADC. In the end an async processor buys you very little other than saving power in the clock tree. Even this is just a strawman as the real question is the power it takes to get the job done, not how much power is used to distribute the clock.

The GA144 is an array of 144 fully async processors. I have looked at using the GA144 for real world designs twice. In each case the I/O had to be clocked which is awkwardly supported in the chip. In the one case the limitations made it very difficult to even analyze timing of a clocked interface, much less meet timing. In the other case low power was paramount and the GA144 could not match the power requirements while I am pretty sure I can do the job with a low power FPGA. Funny actually, the GA144 has an idle current of just 55 nA per processor or just 7 uA for the chip. The FPGA I am working with has an idle current of some 40 uA but including the processing the total should be under 100 uA. In the GA144 I calculated over 100 uA just driving the ADC not counting any real processing. Actually, most of the power is used in timing the ADC conversion. Without a high resolution clock the only way to time the ADC conversion is to put the processor in an idle loop...

There is many a slip 'twixt cup and lip.

Rick

Reply to
rickman

(snip)

Yes, but with 29xx for all the examples. I also have some books on microprogramming, independent of the processor. Well, maybe not completely independent.

half.com has $1.49 (plus shipping) for acceptable condition, $5.52 for good condition, and $8.88 for very good condition.

The one I got has the dust jacket a little worn and torn, and the spine might be a little weak, but plenty usable.

-- glen

Reply to
glen herrmannsfeldt

(snip)

I am not sure about writes now. BRAMs are synchronous read, but LUT RAM better not be, as the LUTs are the same as used for logic.

Some designs just won't work with a synchronous read RAM (which is sometimes a ROM). (snip on asynchronous logic)

I meant the one they call dual rail logic. There are two wires sending the signal, in one of three states: 0, 1, or none, and one wire coming back acknowledging it. To generate a signal, either the 0 or the 1 line goes active until the ack comes back, at which time the output signal is removed until the ack goes away. Full handshake both ways.

Yes. Besides all those useless FF's in each cell.

As I understand it, there are some current processors with asynchronous logic blocks, such as a multiplier. The operands are clocked in and, an unknown (data-dependent) number of cycles later, the result comes out, is latched, and sent on. So 0*0 might be very fast, where full-width operands might be much slower.

Sounds like an interesting design.

-- glen

Reply to
glen herrmannsfeldt

That's right. My processor design on the ACEX with async reads had to be modified to work in nearly any other part with fully sync block RAM. I could possibly clock the RAM on the negative edge with the rest of the design clocked on the positive. Or I could do a read on every clock using the address precursor which is available on the prior clock cycle at the input to the address register. Both methods reduce timing margins along with other tradeoffs.

I haven't seen the logic, but how do they generate the timing for the handshakes? I don't think it "just works". My understanding is that the handshakes are generated by a delay line that is designed to have a longer delay than the logic. This is hard to do with the timing tools designed for synchronous systems.

I don't follow. I think typical async logic still has FFs, they just don't use a global clock. I suppose if you have handshakes back and forth you are making latches in the combinatorial logic if nothing else.

I haven't heard of that. How would that benefit a sync processor? I can only think that would be useful if the design were compared to one with a very slow multiplier which required the processor to wait for many clock cycles. A multiplier with a "ready" flag could shorten the wait. But that can also be done for a fully sync multiplier. In fact, in the async GA144 there is no multiply instruction. Instead there is a multiply step instruction which can be used to do multiplies in a loop. The loop can be terminated when the multiply has detected the rest of the bits are all zero (or all ones maybe?). I haven't seen the code that does this, but this is reported in some of their white papers.

I still need to verify that the LVDS input will detect the still very low level signal from the antenna. Once I show that to work, I've got the rest covered. If it doesn't work, I'll either need to use a separate comparator or, if that won't work, I might be able to provide some feedback to keep the detector on its sensitive edge.

All the other parts have been analyzed well enough that I am very confident I'll meet my goal.

BTW, I am thinking of using a cheap analog battery driven clock as an output device. So I bought one for $4 and took it apart. It has the tiny circuit board for the clock chip and crystal and a very simple coil driving a gear that turns 180° each tick. The rest of the clock is the same as any analog clock except it is *all* plastic. Plastic gears, plastic pivot, plastic box. I guess once you do the timing with electronics there is no longer a need for the fancy stuff in the mechanism. Checking on Aliexpress I found the mechanisms for only $2! Sometimes technology is amazing in just how cheaply it can be produced.

Rick

Reply to
rickman

Yes, some PDP-10s were not rigidly clocked at all, so that when there were no carries from the ALU after a few ns, the operation was considered complete and the result stored. Really nasty way to design a machine!

No, not true. The 780 was all 74S chips (some LS on non-critical paths) but nothing LSI at all. The 730 and 750 used TI mask-programmed logic array parts. I actually read the print set of a 780 about 30 years ago and at one time knew the design pretty well.

Jon

Reply to
Jon Elson

(snip)

Story I knew from about 30 years ago was that the 730 was built from 2900 series parts. That was supposed to be related to H-float being included (no extra charge) when it wasn't for the earlier models. So, H-float on the 730 was faster than the software emulation on the 750. (But then again, I never tried.)

-- glen

Reply to
glen herrmannsfeldt
