Style for Highly-Pipelined State Machines

Kevin Neilson 18 years ago

I'm designing another FSM and I've run into the problem I always have when trying to pipeline them for high-speed designs. I'll show a simple example.

STATE2: begin if (condition) begin state

Mike Treseler 18 years ago

I don't know, but I use a step counter and a case of that step counter in a synchronous block.
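A minimal cycle-level sketch of that style, written in Python rather than HDL so it can be run directly. The class name and the per-step actions (`load`, `multiply`, `store`) are hypothetical and only for illustration; each call to `tick()` stands in for one rising clock edge of the synchronous block.

```python
class StepCounterFsm:
    """Behavioral model: one synchronous block driven by a step counter."""

    LAST_STEP = 2  # hypothetical sequence length: steps 0, 1, 2

    def __init__(self):
        self.step = 0   # the step counter register
        self.log = []   # records which action ran on each tick

    def tick(self):
        # Equivalent of a `case (step)` inside a clocked block.
        if self.step == 0:
            self.log.append("load")
        elif self.step == 1:
            self.log.append("multiply")
        else:
            self.log.append("store")
        # The counter wraps so the sequence repeats.
        self.step = 0 if self.step == self.LAST_STEP else self.step + 1


fsm = StepCounterFsm()
for _ in range(4):
    fsm.tick()
print(fsm.log)
```

Adding a pipeline stage in this style means adding a step, which is simple but ties the sequence length to the latency.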

-- Mike Treseler

KJ 18 years ago

As is usually the case...'it depends'. There are two basic approaches:

1. Add additional states to the FSM to match the latency of the newly pipelined operation.
2. Add a request/acknowledge handshake signal pair into and out of the FSM that controls movement to the next state.

#1 is fairly straightforward but, as you're finding out, it gets to be a pain if you use this method exclusively instead of just for relatively simple cases that won't change. This method also tends to produce cumbersome logic (i.e. the state machine logic can become the critical timing path).

For #2, with the request/acknowledge signals what you're doing is refactoring your logic and using signals to indicate when that process should wake up and when it has produced a result. Another way of looking at it is that you're 'subcontracting' or 'outsourcing' that hunk of logic. In the example you gave the multiply/add operation is the function that you're outsourcing. From the perspective of the state machine then it simply needs to tell this other hunk-o-logic when to start (i.e. request) and the hunk-o-logic in turn needs to tell the FSM when it has completed (i.e. acknowledge).

This basic approach scales very well and is generally tolerant of further changes in the actual latency. Using your multiply/add example, you might at first decide to have one stage of latency (registering the inputs) and then later decide to make it three (register inputs, multiply, and sum). The request/acknowledge signal *interface* between the hunk-o-logic and the FSM does not need to change, nor does the FSM itself (unlike approach #1). All that does need to change is that you add a couple more flops of delay to generate the acknowledge.
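A rough cycle-level sketch of this idea, in Python rather than HDL so it can be run as-is. All names here (`PipelinedMulAdd`, `Fsm`, the `START`/`WAIT`/`DONE` states, the `latency` parameter) are assumptions made for illustration, and the multiply/add is taken to be `a*b + c`. The acknowledge is just the request delayed by `latency` flops alongside the data.

```python
from collections import deque


class PipelinedMulAdd:
    """The 'hunk-o-logic': computes a*b + c, acks `latency` clocks after req."""

    def __init__(self, latency):
        assert latency >= 1
        # Delay line of (ack, result) pairs; the last stage is the output register.
        self.pipe = deque([(False, 0)] * (latency - 1))
        self.ack = False
        self.result = 0

    def tick(self, req, a=0, b=0, c=0):
        # One clock edge: the request and its result shift one stage along.
        self.pipe.append((req, (a * b + c) if req else 0))
        self.ack, self.result = self.pipe.popleft()


class Fsm:
    """Issues one request, then waits for ack. Knows nothing about latency."""

    def __init__(self, unit):
        self.unit = unit
        self.state = "START"
        self.y = None

    def tick(self, a, b, c):
        # Sample the unit's registered outputs as they were before this edge.
        ack, result = self.unit.ack, self.unit.result
        req = self.state == "START"
        if self.state == "START":
            self.state = "WAIT"
        elif self.state == "WAIT" and ack:
            self.y = result
            self.state = "DONE"
        self.unit.tick(req, a, b, c)


def run(latency, a=3, b=4, c=5):
    fsm = Fsm(PipelinedMulAdd(latency))
    ticks = 0
    while fsm.state != "DONE":
        fsm.tick(a, b, c)
        ticks += 1
    return fsm.y, ticks


print(run(1))  # one stage of latency
print(run(3))  # three stages, same Fsm class
```

Both runs return the same result; only the number of clocks spent in `WAIT` differs, and `Fsm` is untouched when `latency` changes.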

As you find other critical timing paths you'll still need to figure out exactly which sub-functions need to be segregated out for pipelining, but once you've done this, the approach is the same: simply create a request/acknowledge pair to control that sub-function and tie those signals into the state machine.

Put the source code for the logic needed to generate the acknowledge physically right by the hunk-o-logic function itself and it becomes easy to maintain as well. If the hunk-o-logic is complex enough, you might consider making it its own entity/architecture. If not, maybe put it in its own separate process (something I do to break up the actual source code text into manageable sized pieces so it is easy to see what that bit does). In any case, what you're basically doing is adding a bit of hierarchy to your design, whether it is formal (i.e. separate entities) or less formal (separate clocked process). Once you've physically segregated the stuff, you can probably also see that it wouldn't be that hard to parameterize it so that generics select the latency in some fashion...but that's a bit off topic.

The other thing to consider is whether the latency introduced by this outsourced logic needs to be 'compensated for' in some fashion or whether it is OK to simply wait for the acknowledge. In some instances, it is fine for the FSM to simply wait in a particular state until the acknowledge comes back. In others you need to be feeding new data into the hunk-o-logic on every clock cycle even though you haven't got the results from the first back. In that situation you still have the req/ack pair, but now the ack is simply saying that the request has been accepted for processing; the actual results will be coming out later. Now the hunk-o-logic needs an additional output to flag when the output is actually valid. This output data valid signal would typically feed into a separate FSM or some other logic (i.e. 'usually' not the first FSM). The first FSM controls feeding stuff in; the second FSM or other processing logic is in charge of taking up the outputs and doing something with them.
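A small sketch of that second situation, again as a runnable Python model rather than HDL. The names (`StreamingMulAdd`, `in_valid`, `out_valid`, `out_data`) and the `a*b + c` operation are assumptions for illustration, and this model never stalls, so every request is accepted immediately. The valid flag simply travels down the pipe next to the data.

```python
from collections import deque


class StreamingMulAdd:
    """Accepts a new operand set every clock; flags results with out_valid."""

    def __init__(self, latency):
        assert latency >= 1
        # Each stage carries (valid, data) so the flag stays aligned with its result.
        self.pipe = deque([(False, 0)] * (latency - 1))
        self.out_valid = False
        self.out_data = 0

    def tick(self, in_valid, a=0, b=0, c=0):
        """One clock edge. Returns ack, meaning 'accepted', not 'finished'."""
        self.pipe.append((in_valid, (a * b + c) if in_valid else 0))
        self.out_valid, self.out_data = self.pipe.popleft()
        return in_valid


def run(inputs, latency):
    unit = StreamingMulAdd(latency)
    feed = list(inputs)
    results = []
    # Clock long enough to feed every input and then drain the pipe.
    for cycle in range(len(feed) + latency - 1):
        # Producer side: present new operands on every cycle while any remain.
        if cycle < len(feed):
            unit.tick(True, *feed[cycle])
        else:
            unit.tick(False)
        # Consumer side: separate logic that only looks at out_valid.
        if unit.out_valid:
            results.append(unit.out_data)
    return results


print(run([(1, 2, 3), (4, 5, 6), (7, 8, 9)], latency=3))
```

The producer loop and the consumer check never refer to each other, which mirrors the split between the first FSM and the downstream logic.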

When you ponder this approach some more, you will come to realize that this signalling back and forth between various sub-processes boils down to simply managing data transfer. Once that realization has settled in, it is worthwhile to study existing data transfer techniques (I'd suggest Wishbone and Avalon; I prefer Avalon), decide for yourself which signalling scheme to use, and stick to it, using that specification's naming/logic conventions. Go from there and never look back at all the other possible ad-hoc ways of doing it.

Altera's Quartus has a state machine viewer whose output can be exported for documentation, if that's what you're looking for.

Kevin Jennings

Kevin Neilson 18 years ago

...

In this case I do indeed have to continue to keep the pipe full, so inserting wait states is not an option. And the latency of the "hunk of logic", aka concurrent process, is actually significant because I have to get the result and feed it back into the FSM. This example shows why:

STATE2: begin if (condition) begin state

Aiken 18 years ago

I think what you should do is put your "pipelined" operation into the FSM itself. The number of states you need to add == the number of cycles you need for (a*b+c)*d. That way you add the operation inside the states, not outside the FSM.

Mike Treseler 18 years ago

I still like the idea of a step counter.

On tick one, I do x

KJ 18 years ago

Well, just the fact that you're time sharing the DSP48 means that you're not processing something new on every clock cycle which just screams out to me that you'd want to implement this with a request/acknowledge type of framework. Consider having a black box that has two logical interfaces called 'inp' and 'out'. The 'inp' interface will be written to by some external thingy and provide 'a', 'b', 'c' and 'd' inputs. The black box will compute "y

Kevin Neilson 18 years ago

But I *do* have to process something on every cycle. Consider that I have to process these two equations:

y0

Kevin Neilson 18 years ago

That is essentially what I'm doing; I'm just trying to find a syntactically better way to design this pipelined stuff without having a bunch of interdependent concurrent FSMs (or a single FSM and a bunch of logic outside). -Kevin

KJ 18 years ago

You're not able to process a new set of 'a', 'b', 'c' and 'd' on every clock cycle since the DSP48 is time shared (by your choice) and that was my point. Time multiplexing the DSP48 to keep *it* busy on every clock cycle is not the same thing.

That's only true if the addition can't be done combinatorially. If it can then the calculation of 'y0' takes two clock cycles and the DSP48 is fully utilized. The answer pops out after two clock cycles of latency, the DSP48 hums along doing something useful on every tick.
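To make the two-clock case concrete, here is a rough Python model. It assumes the calculation is the (a*b+c)*d expression quoted earlier in the thread and an idealized multiply-add primitive that completes one `x*y + z` per clock; `SharedMac` and `run` are hypothetical names, and real DSP48 pipeline registers are ignored.

```python
class SharedMac:
    """Idealized multiply-add primitive; counts the clocks on which it is used."""

    def __init__(self):
        self.uses = 0

    def mac(self, x, y, z):
        self.uses += 1
        return x * y + z


def run(operand_sets):
    dsp = SharedMac()
    results = []
    clocks = 0
    for a, b, c, d in operand_sets:
        p = dsp.mac(a, b, c)   # clock 1: first pass, a*b + c
        clocks += 1
        y = dsp.mac(p, d, 0)   # clock 2: second pass, p*d
        clocks += 1
        results.append(y)
    return results, clocks, dsp.uses


results, clocks, uses = run([(1, 2, 3, 4), (2, 3, 4, 5)])
print(results, clocks, uses)
```

In this model the primitive is used on every clock (`uses == clocks`), yet a new set of 'a', 'b', 'c' and 'd' is only taken every second clock.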

And depending on just what the bottlenecks in the design are, one can do all kinds of things. But no matter what, you still need to interface *to* that thing, no matter what it does and no matter how wide of an input vector it takes (i.e. a0, b0, c0, d0, a1, b1, c1...if that's what it takes). In other words, a0, b0, c0, d0, a1, b1 and c1 all need to get in somehow; y0 and y1 both need to make it out and you need to flag when they are valid and that flagging is functionally the same thing as handshaking.

Kevin Jennings
