Small, fast, resource-rich processor

This is all useful info, I appreciate the insight.

Is IMU an Inertial Measurement Unit? GPS gives you position and velocity, but nothing about orientation. I assume the IMU gives you orientation, does it also give any of the others? I expect it actually gives you acceleration but may calculate the others?

These are not really the terms I am looking for, but knowing X is the state helps. What exactly is it the state of? Is the state just a mathematical feature to allow the KF to work?

Hmmm.. maybe it was others who said 64 bit floating point was required for the KF calculations? I'm a bit confused.

Ok.

So what is the part that would be hard to do in an FPGA?

--

Rick
Reply to
rickman

(snip, someone wrote)

When an existing processor is fast enough, that is probably the best choice. When a small number (cluster) is fast enough, that might also be the best choice.

When you need to be 1000's of times faster, and the algorithm does a lot of fixed point addition, maybe some multiplication, you can do much better in a medium sized FPGA.

For floating point algorithms, you might be able to use somewhat wider fixed point instead.

A systolic array can be a very efficient way to process a large amount of data if the dependency is right.

(snip)

(snip)

Well, hardware costs usually do come in, but often enough the FPGA is cheaper than 100 or 1000 of the currently popular microprocessors.

Over the years, there have been many tries at boards for general purpose hardware acceleration that interface to another system. Most have not been very successful commercially. As well as I know it, mostly it is the problem of getting data into and out of the FPGA that makes it hard, and reduces the usefulness of the solution.

-- glen

Reply to
glen herrmannsfeldt

(snip)

Usually you will want the multipliers to do actual multiplication, but if not, then yes, you can use them.

Fortunately many now have LUT6, which can do a 4 to 1 mux.

(snip)

-- glen

Reply to
glen herrmannsfeldt

(snip)

It isn't difficult, but it is big. The pre/post normalization for add/subtract is pretty big. Big enough that you don't get so many inside the FPGA, and so much of the advantage is lost. Maybe not too much, though.

But for enough problems you don't need the full exponent range of IEEE single or double. If you can live with less, it might not be so bad. Or use wider fixed point. 36 bit fixed point is likely smaller (better packing) than 32 bit (24 bit significand) floating point.

The popular floating point systems are designed to be usable over a wide range of applications. In the FPGA, you can design for the exact width you need. (Sometimes only even widths.)

-- glen

Reply to
glen herrmannsfeldt

(snip)

Usually the system will have an array of similar blocks, so you only need to design one. Once it works, the rest is easy.

In the popular families, there is a flip-flop (register) on the output of each LUT. That makes pipelined arrays very easy to do.

Yes, pipeline the whole thing in each step. The clock rate might be a lot less, but you might get 100 or more on each clock cycle. Maybe 1000 for simple fixed point operations. You can also array the FPGAs for even larger calculations.

-- glen

Reply to
glen herrmannsfeldt

(snip)

I probably don't disagree with that one.

This one I probably do disagree with.

Yes, but doing massively parallel operations on the usual processor also takes a lot of attention to detail. Consider doing it with MPI in C compared to the FPGA in verilog.

That might not be true if there were more logic designers and fewer software engineers.

OK, but if you come back next month and need it 1000 or 10000 times faster, then you might find the FPGA implementation better.

There is a problem that I know about that can use 1e15 operations of 8 bit addition per second. You could do that with a billion Pentium IVs, but probably not.

(snip)

-- glen

Reply to
glen herrmannsfeldt

These are all your opinions. I find it interesting that you use a multiplier of 1000 for your cut-over point. So at 100 you would use 100 CPUs rather than an FPGA? That is a rhetorical question...

You didn't respond to this question. Would you care to comment?

--

Rick
Reply to
rickman

Partially agree. It mainly depends on the algorithm and the constraints, particularly speed and parallelism. Horses for courses.

Disagree. Software can be extraordinarily resistant to change, particularly if it is pushing limits or no one person understands it all.

Think of programming an FPGA as being similar to large-scale enterprise software: there's a lot going on invisibly "under the hood" that the developer simply has to take on trust.

For FPGAs it is things like the design tools, place and route, timing and clock distribution.

For enterprise software it is things like the software frameworks (e.g. J2EE etc), individual components (e.g. beans) being correctly "wired up", distributed caches (hardware and software), ACID properties.

Changing an enterprise framework is equivalent to changing an FPGA family: you do it as often as you change house.

I won't bother to draw the analogy with hard real-time software since this group is probably aware of its characteristics.

Disagree. They are equivalent.

Agree strongly. Here "fast" means any of: latency, parallelism, raw number crunching.

Agree strongly. Embedded micros were initially marketed and sold as "programmable logic", but they later outgrew their boots.

The tradition continues; have a look at the XMOS processors (available from Digikey for a few bucks) that provide *hard* realtime timing guarantees and DSP performance even though they are programmed in C.

formatting link

They are encroaching into the FPGA design space.

Anyone that can't accept "horses for courses" is a mere fanatic: "A fanatic is one who can't change his mind and won't change the subject." Winston Churchill

Reply to
Tom Gardner

People often assume that pipelining is required for single cycle operation. That is not actually correct. Pipelining allows the clock rate to be faster. Anything in an FPGA can be done in one clock cycle... or no clock cycles at all. The logic is combinatorial; registers are optional.

--

Rick
Reply to
rickman

My point - and I am not alone here in thinking this, based on other people's posts - is that you have been making a lot of posts recently recommending FPGAs as a better/cheaper/faster/lower-power/easier solution to all sorts of problems that really are not FPGA problems at all. People are looking for screwdrivers, pliers, or spanners and you are insisting that your hammer is the best tool for all jobs.

I /know/ you are not /actually/ saying this, and I /know/ you don't believe this - I've read enough of your posts over the years to know you are a "right tool for the job" person. You don't always agree with others about what that right tool is, since you are biased towards the tools you know best - just like the rest of us.

However, you should be aware that this is the impression you are giving. Everybody - you, me, and everyone else - has a tendency to exaggerate and over-sell their opinions at times, and I think that has come across in these two threads. It is unfortunate that this impression of over-selling is shadowing the actual information and ideas you are trying to get across.

I hope you see this as constructive criticism here - at least, that is what I am trying to do. I would just like to see things going back to technical discussions - or informative and interesting off-topic discussions. If you don't agree with me here, then fair enough.

I understand your point that FPGAs have a bad reputation for complexity, and that this is not justified (or at least, not /always/ justified!). And I agree with it to a fair extent - there are more things that can be done sensibly with FPGAs than many people think. But I think it works the other way too - there are many things that FPGA fans think are a good idea in FPGAs that are actually better implemented in other ways.

FPGAs are really good at doing things in parallel, and less good at doing things serially. Processors are really good at doing things serially, and less good at doing them in parallel. This is a fundamental difference between the two types of computation. Obviously you /can/ do things serially in an FPGA (and you can at least simulate parallelism in a CPU), but an inherently serial algorithm with lots of branches, choices, loops, etc., is most naturally implemented in a serial processor.

Add to that the desire for double-precision floating point. Yes, an FPGA /can/ do these calculations - but it is much easier to do it in software. In software, you write "double a, b, c, d; a = b * c + d;" and that's it done - in an FPGA, you design and debug double-precision floating point addition and multiplication blocks.

I haven't done much - but I have done enough to be entirely confident in claiming that implementing a Kalman filter in an FPGA would be /much/ harder and more time consuming for an FPGA expert than implementing the same thing in software would be for a software expert (given equal knowledge of the maths, etc.).

I have done enough FPGA work to know that it is certainly /possible/ in an FPGA - and that with a good enough FPGA developer it would not be as bad as many might think. I think that tools such as MyHDL could make it more tractable than traditional Verilog or VHDL. Possibly the most efficient way would be to write the code in C, get it all working nicely, then use something like Altera's tools for converting C into FPGA hardware. But that leaves you with the question of why you should bother with the FPGA step at all.

You could make a processor card for a great deal less than $50 that will do the job - the OP is looking at SBC solutions as a way of minimising development costs, not for minimising unit costs.

I know that FPGA's have many uses other than making things run fast - but in this case, I don't see any potential benefits for calculating Kalman filters other than possibly high speed (for a given price, board space, power, etc.). Can /you/ give any other potential benefits? You can take it as a given that a microcontroller will be cheaper if unit costs are important, since the whole thing could be re-implemented in fixed point and run in an ARM for a couple of dollars.

I tried to do so - but you said I had no evidence for the points I made. You have no evidence for any counter-arguments, so I guess you either believe what people are telling you (and what you can see from web searches on Kalman on FPGAs and in software), or you can disagree, or you can try and implement them in software and FPGAs for comparison.

Reply to
David Brown

I don't agree at all. You can program an FPGA in an HDL with a high level of abstraction. Like using an HLL at a high level of abstraction you may not get the best efficiency, but it will run your code just fine. Or you can design with an HDL at a lower level trying to control the implementation for efficiency or speed. That is when HDL programming can be more time consuming... but that is due to the optimizations, which apply to *any* design methodology.

I don't agree at all. Here is an FPGA that was used because it was the best fit for the job. Actually it was a speed issue, but nothing like you are referring to. An MCU was excluded because none could provide the flexibility to implement the appropriate interfaces; one of them was a 30 MHz interface similar to SPI.

formatting link

The calculations were actually very slow even by MCU standards. An 8 kHz sample rate at the ADC, detection of the amplitude envelope, rate reduced to 1 kHz, bit width detected and down sampled to 100 Hz. Hardly overwhelming to even an 8051.

The FPGA also allowed for design upgrades for later work with none of the limitations of MCUs.

I think the XMOS devices are a tempest in a teapot. Can you give examples of how they have "encroached"? I would bet they don't have even 1% the market size of FPGAs.

Who has disagreed with "horses for courses"? This discussion started talking about FPGAs being a "nightmare" to work with and only suitable for the rare situation where they were absolutely required.

My point is that most people are working under impressions formed more than 10 years ago. FPGAs have changed since the 90's.

--

Rick
Reply to
rickman

You haven't read/understood the point to which I was replying.

You haven't read/understood the point to which I was replying.

I'm sure there are correct and valid anecdotes that support any position. You can use a hammer to put in a screw, and professionals frequently do so, except for the last quarter turn :)

I don't think you understand what "encroaching" does and doesn't imply. Market size is irrelevant to the point being made. As for examples, read their website.

The conversation has veered in several directions.

Do you feel the comment is specifically relevant to you alone?

No argument there. But so have MCUs.

In general can I suggest you read a little more slowly, and take a coffee break before replying.

Reply to
Tom Gardner

I realize that I am jumping in very late to this conversation. It was so long, I couldn't find the original post.

As I gather, the requirement is for a double precision floating point application and a battleground has erupted over FPGA versus DSP processor.

We make boards with both FPGAs and SHARC DSPs. In fact, I am working on one now.

From my perspective, FPGAs can do very fast processing but are usually much more difficult to program. Maybe if you are very skilled in Verilog or VHDL, your experience might be different. Our boards have been DSP/FPGA combos where the FPGA is usually configured to do a few simple operations very fast. For example, the front end of a software defined radio. The DSP tends to do baseband processing and all the housekeeping.

FPGAs almost always consume a lot more power than processors when performing the same task. This matters in some applications.

I don't know why this application needs double precision floating point. SHARCs normally operate with 32 bit IEEE floating point, but there is also a 40 bit mode, where the mantissa is 32 bits instead of 24. Perhaps this would meet the requirement. If this is the case, there are several solutions available, most less expensive than FPGAs and much easier to program. You can also do fixed point in double precision with a SHARC. Floating point emulation is always possible with fixed point processors but generally not efficient enough to make sense.

Al Clark

formatting link

Reply to
Al Clark

That's the usual experience, as far as I have ever heard - at least until you are pushing the limits of what you can do with the DSP.

Usually a DSP is significantly more difficult to work with than a microcontroller, which is what the OP (and others in c.a.e) usually work with. This is because many DSPs work with weird data formats (such as the 40-bit mode mentioned), often have a C "char" that is greater than 8-bit, often need a lot of extra work to get the best throughput (such as putting different data in different memory areas to maximize pipelining), and often have outdated, expensive or simply odd tools.

But of course DSPs are typically faster per MHz at central DSP operations. Thus they fall somewhere between microcontrollers and FPGAs in the tradeoff of speed vs. development time.

(This is, of course, a generalisation - there will be exceptions depending on the particulars of the problem, the people working on it, and the devices in question.)

The application does not specifically require double precision floating point. But it requires maths that can be expressed conveniently using floating point, and that require a higher accuracy (at some points) than standard single-precision floating point provides. A floating point format between single and double precision may do the job, as would fixed point with enough bits (64-bit has been mentioned for intermediary results, but 32-bit is perhaps enough for other parts - again, these sizes come from standard C sizes rather than theoretical ideal sizes).

Reply to
David Brown

The IMU does not give you orientation. It gives you acceleration and rotation rate. The Kalman filter has to deduce the vehicle orientation by comparing the evolution of the vehicle position over time with the acceleration from the IMU. It's all done indirectly via the covariance matrix, but it works.

I put that in somewhere. It's the state of your system model, which is hopefully representative of your actual system. It's far more than just a mathematical feature to allow the KF to work -- it's usually what you want to know about the system, plus what the KF has to figure out about the system in order to work correctly.

64 bit floating point isn't _required_, but it makes the math substantially easier, in a project-speeding sort of way.

Either Glen Herrmannsfeldt or Tom Gardner already answered this: the part where you're debugging the whole mess with the usual FPGA tools, instead of the tools that are available for debugging software.

Also, since there's a lot more people who understand both Kalman Filters and software than people who understand both Kalman Filters and FPGA work, the part where you go out and hire a guy to do the work.

--
Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

:)

For an example of a solution that was ruled by data transfer:

One of the projects that I worked on that involved some serious DSP vs. FPGA tradeoffs in the early design step was for video processing. We needed to do a correction to a pixel value that basically involved

px_cor = f(px_raw, px_sur, a, b, c);

where px_sur was the eight nearest-neighbors of a given pixel, and a, b, and c were each unique to a pixel. The function was rather simple, but in addition to a write, the algorithm required either 12 fetches per pixel if you were bone-headed about it, or buffering three lines of video on-chip and four fetches per pixel.

Even when we leveraged the DDR RAM burst mode transfers to the max, with the available memory at the time we still needed two memory buses to get the data into and out of the processor that was doing the actual work.

We ended up using FPGAs instead of DSPs not because of the limitations on core execution speeds, but because we couldn't find a DSP chip with a wide and fast enough memory interface that was cheaper than an FPGA talking to a pair (or perhaps three: I can't remember) sets of DDR chips.

--
Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

See, right there's one of the big problems trying to do serious amounts of floating point in FPGAs. Even in a V7, that's still forcing you to do 4 DSP slice cross products if you want to work in double precision. Go down to something smaller and you're up to having to use 6, with all the attendant slowdown of the followup additions, and then still manage the renormalization.

So all of a sudden you're chewing through DSP blocks fast and hard. If you start timesharing them you can keep the number used manageable, but the code's getting ugly and throughput is dropping. And the smallest V7 is what, 4 grand? That buys a whole lot of C writable, conventional, sequential processing.

I've done (single) floats in an FPGA before in some limited capacity. It's not the end of the world, but it's a poor fit for the technology. Big fat fixed point is resource-cheaper and generally introduces fewer exciting corner cases.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com 
Email address domain is currently out of order.  See above to fix.
Reply to
Rob Gaddi

That's OK. The original post has little to do with the FPGA vs. processor discussion.

I was looking for a small SBC with fast double-precision floating point so that I could take a working Kalman filter implementation, slap it in, and have it work.

I almost called you, until the customer and I came up with a better solution, involving a lightly loaded PC in the same box.

Not "need" so much as "wants real bad". I've simulated the Kalman filter with 64-bit fixed point, and it works just fine (which is not surprising when you think about it, because I was using double precision floating point as the underlying data type). It may even work with 48-bit. But doing so would obviate my primary goal: this is a small production volume project, for which a proven solution exists that runs just fine on a PC-class processor. The volume is small enough that we can spend a Whole Lot of Money on hardware before it's worthwhile for me to do the development work to port things over to just about anything else.

I thought there was a double-precision floating point SHARC?

That's a disappointment -- nearly any time that I choose floating point the choice is between 32-bit fixed point and double-precision floating point, because most of my control problems just don't cut it with single-precision floating point.

_Sometimes_ that's not the case, but with control loops, usually if your sampling rates are getting into DSP territory then you care enough about precision that your integrator depth is getting beyond 24 bits of precision.

--
Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

Hopefully I don't destroy this interesting discussion now. Kalman filter in FPGA, from Google: (some are PDF format)

formatting link
formatting link
formatting link
formatting link
formatting link
formatting link
formatting link
formatting link
formatting link

Video:

formatting link

--
Torfinn Ingolfsen, 
Norway
Reply to
Torfinn Ingolfsen

Come to think of it, if you were going to implement a Kalman filter in anything but an absolutely huge FPGA, you'd probably end up using most of the FPGA resources for a double-precision floating-point MAC machine, which you would then have to keep fed with a sequencer.

You'd end up with something that looked an awful lot like a really small, hard to program processor connected to a really impressive math co-processor.

--
Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott
