Small, fast, resource-rich processor

You can do a lot in HDLs - but not everything. First, let's get the easy points out of the way (I hope!), even though they don't apply in the particular case of the Kalman Filter. There are several "C to HDL" tools available these days (that these tools exist at all suggests it is faster and easier to write and debug code in C, then move it into FPGA hardware for speed). By looking at some of the limitations of such software, we can get an idea about the things that map well to FPGAs, and the things that do not. Now, I have not made an exhaustive search of all such tools - I've just had a quick google and read around. But these constructs are typically difficult or impossible to translate:

  • Dynamic memory (malloc, free, etc.)
  • General pointers (pointers are usually limited to within a specific array implemented in a memory block)
  • Recursion (unless it can be completely determined at compile time)

Algorithms that require a lot of dynamic or unpredictable behaviour, or that require complex data structures, are going to be hard to implement in the FPGA. Not impossible, of course, but hard.
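To make the recursion point concrete, here is a small illustration in C (hypothetical code, not from any particular tool's documentation). The first form is a problem for C-to-HDL tools because its depth depends on run-time data; the second has a bounded loop that a tool can unroll into hardware:

  /* Hard to translate: recursion of data-dependent depth. */
  unsigned fib(unsigned n)
  {
      return (n < 2) ? n : fib(n - 1) + fib(n - 2);
  }

  /* Easy to translate: a loop with a known maximum trip count
     (here we assume n <= 46, so the result fits in 32 bits). */
  unsigned fib_loop(unsigned n)
  {
      unsigned a = 0, b = 1;
      for (unsigned i = 0; i < n; i++) {
          unsigned t = a + b;
          a = b;
          b = t;
      }
      return a;
  }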

As a general point, if you have an algorithm that is expressed sequentially, and it does not involve such dynamic behaviour, then you can probably code it in a fairly straightforward way in an HDL. But what good does that do you? Your system is doing one step at a time, even though you are probably using lots of resources - you will end up with many blocks instantiated for arithmetic and other functions as they are used in the algorithm, even though each is only active for a few cycles out of each multi-cycle loop. You've just created a large, expensive and not particularly fast sequential system (albeit with very predictable timing). To make the whole thing worth the effort, you will have to work at re-arranging the code to make good use of the FPGA, with more of the blocks working more of the time.

My guess is that you don't see the issues here because you do this so often that you don't really think about it (just as you say about me below) - but that does not make them any less real.

It's true that there are balances and trade-offs in software too. But I think these sorts of decisions have far less impact in software than in FPGA design, in terms of the types of changes they need in the source code, the time taken to make the changes, and the effect of the changes. And I think the "obvious" implementation is more often going to be the optimal one. In the example of where to put your variables, the obvious answer is to make them automatic variables - the compiler will put them in registers, on the stack, or eliminate them entirely. You don't have to think about anything here. If you are talking about an array of data, you begin to have real choices - put the array on the stack, the heap, or in statically allocated memory. But the code and run-time impact of the choice is minimal in most cases.
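For concreteness, the three placements look like this in C (a trivial sketch; the names are made up):

  #include <stdlib.h>

  static int table[256];                      /* statically allocated      */

  void demo(void)
  {
      int scratch[256];                       /* on the stack              */
      int *buf = malloc(256 * sizeof *buf);   /* on the heap               */
      if (buf) {
          buf[0] = scratch[0] = table[0];     /* ... use them ...          */
          free(buf);
      }
  }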

But I'll accept that my experience makes me see these things as a bit more obvious than they would be to non-experts.

HDLs work at a much lower level than software, and the tools and simulators match that. This is both a benefit and a disadvantage - it can give you far more control and more precise timing - but it also makes it harder to see the wood for the trees. (This is not entirely different from the age-old assembly vs. HLL debates.)

It is /much/ simpler than that.

If a system has to read 6 input signals, generate 10 output signals, and handle telegrams on an RS-485 bus, then it can handle them one at a time sequentially - and as long as the timing is good enough, the result is the same as if everything were done in parallel.
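The classic "superloop" does exactly this. A minimal sketch, with hypothetical driver functions standing in for the real I/O layer:

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical drivers - stand-ins for the real hardware layer. */
  void    read_inputs(uint8_t *dst, int n);
  void    write_outputs(const uint8_t *src, int n);
  bool    uart_byte_ready(void);
  uint8_t uart_read_byte(void);
  void    rs485_handle_byte(uint8_t b);
  void    compute_outputs(const uint8_t *in, uint8_t *out);

  void main_loop(void)
  {
      uint8_t in[6], out[10];
      for (;;) {
          read_inputs(in, 6);            /* sample the 6 input signals  */
          while (uart_byte_ready())      /* feed the telegram parser    */
              rs485_handle_byte(uart_read_byte());
          compute_outputs(in, out);
          write_outputs(out, 10);        /* drive the 10 output signals */
      }
  }

As long as one trip round the loop is shorter than the tightest deadline, the result is indistinguishable from parallel handling.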

Sometimes an operating system of some sort (I presume that's what you mean by the "additional software") can make this task easier, but it is certainly not a requirement.

Of course, if timing is very tight or there are difficult synchronisation requirements, then the job might be done more easily using "real" parallelism - an FPGA.

I'm guessing that the IP blocks Altera sells (or re-sells, for the third-party blocks) are reasonably optimised for their devices. The number of LEs used varies according to the device used, so I assume that on the more advanced devices the DSP blocks can do more of the work.

And certainly there will always be tradeoffs in terms of the resources used and the time used, and there will always be scope for optimisation based on more precise usage of the blocks. I am just using specific examples, from what I assume is a realistic source, to get rough figures.

I agree that I am not proficient at FPGA work - and I have no doubts at all that efficient implementation of a Kalman Filter in an FPGA would be a lot more time-consuming for me than for experienced FPGA developers such as yourself (assuming equal understanding of the maths and the algorithm, of course). But I believe I have enough understanding and experience of FPGA work to have an idea of the challenges involved, the costs, and the time and effort required - as well as the benefits of FPGAs for some types of problem. I have not managed to convince you of this, and I don't think we will ever agree here. But Usenet discussions are about exchanging ideas - no one ever really expects other people to change their minds!

If I had the time and opportunity, I would love to do more FPGA development. Perhaps my opinions would change if I did - I will be that open-minded at least. But it has been a while since we've had a project for which an FPGA was ever a serious contender.

Better, cheaper, faster, easier, more flexible - is that too demanding?

It must be said that the devices and the tools for programmable logic have improved enormously over the years, making FPGAs suitable for a wider range of applications than they used to be (many years ago, I worked on a CPLD design that took 6 to 8 hours for place and route at each trial, and was debugged using a couple of flashing LEDs). On the other hand, microcontrollers have got enormously faster and cheaper too, and applications that used to require expensive FPGAs can now be done on cheap micros.

$3 will get you a Cortex-M4 (Freescale K10) at 72 MHz. You don't need hardware floating point when you use fixed point arithmetic (and to be fair, I know that converting the algorithm to fixed point would make an FPGA implementation much easier too). Although these devices don't have floating point (single-precision hardware floating point means a Cortex-M4F, which takes the cost to about $6+), they have MAC-type instructions. Obviously I have no details of the OP's project, but I expect such chips to be fast enough with a bit of tuning of the algorithm. (He didn't want to have to tune the algorithm or the implementation, as development time was more relevant than hardware costs.)
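To illustrate the fixed-point route (a sketch only - the Q15 format here is my assumption, nothing from the OP): signals become 16-bit signed fractions, and the inner loop becomes exactly the 16x16->32 multiply-accumulate that the M4's MAC instructions handle in a cycle:

  #include <stdint.h>

  typedef int16_t q15_t;  /* 1 sign bit, 15 fraction bits, range [-1, 1) */

  /* One MAC step: acc += coeff * sample.  The product of two Q15
     values is Q30; a 32-bit accumulator leaves only one guard bit,
     so long sums should scale down or widen to 64 bits. */
  static inline int32_t mac_q15(int32_t acc, q15_t coeff, q15_t sample)
  {
      return acc + (int32_t)coeff * sample;
  }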

You don't need any support circuitry for these chips other than a single power supply (1.7 - 3.6V, about 200 mA for full speed), a header for programming and/or debugging, and perhaps a couple of capacitors. You don't even need a crystal for many applications - the internal oscillator is accurate to well within 1% at room temperature. If your inputs and outputs are analogue, you have an ADC and DAC built in.

In comparison, you usually need a lot more support circuitry for an FPGA - typically you need multiple voltage rails with significantly more current and tighter tolerances, you need some sort of accurate clock source, and you need external flash. (I know there are some FPGAs with fewer requirements.) You may also need external RAM - you don't get nearly as much built-in RAM for your money with FPGAs.

I leave you with one more thought. When you google for Kalman Filter software, it's easy to find ready-made implementations in C to download and use. When you look for Kalman Filters in FPGAs, the results are mostly academic papers, mostly without any code. The impression is that a Kalman Filter in C is so easy (/if/ you understand the algorithm!) that it can be given away - but implementing one in an FPGA is an undertaking worthy of a major project at university, and anyone making them outside of academic circles considers the results too valuable to share.

Reply to
David Brown

What happens if you use a higher end board (Beaglebone etc.) with software floating point? Those things run around 1 GHz so they may be fast enough even with no floating point hardware.

Reply to
Paul Rubin

"enterprise stuff" is lots and lots and lots of disparately operated cruft. Each morning in the class I took, we watched while the instructor updated *everything*. Frequently, this broke things. there was no apparent packaging to cohere layers beyond marked releases of packages. *Everything( was a beta.

After lunch, we'd pick up where we left off...

It's amazing it works at all. And when pressed about the transaction rate on a box store level desktop, I was shocked to learn that they expect no more than a hundred transactions per second.

I dunno. They are all different. It's gotten so that the word "problem" means different things in different shops doing what appears to be the same thing.

--
Les Cargill
Reply to
Les Cargill

Doing MAC using multiprecision addition (i.e. use 2 GP regs as an accumulator) may well be efficient enough.
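In C this is just a 64-bit accumulator, which on a 32-bit CPU occupies two general-purpose registers exactly as described - the compiler emits the add/add-with-carry sequence for the accumulation. A sketch, assuming 32-bit input data:

  #include <stdint.h>

  int64_t dot32(const int32_t *a, const int32_t *b, int n)
  {
      int64_t acc = 0;                  /* lives in two 32-bit registers */
      for (int i = 0; i < n; i++)
          acc += (int64_t)a[i] * b[i];  /* 32x32 -> 64-bit product       */
      return acc;
  }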

Reply to
dp

It gets a lot of numerics right that previous designs and formats did not. As stated earlier, that's why Kahan got the Turing award, which is the most prestigious award in computer science. IEEE-754 cleaned up a lot of messes from the 1970s, when numerical programs got wrong answers all the time because people didn't understand these issues then. Some people of course still don't understand them.

Who's the "designer"? The hardware guy? Some code jockey like me? Unless they're qualified numerical analysts why should anyone listen to them if they're doing something serious and say 22 mantissa bits are enough? Double precision (53 mantissa bits) is not a magic gumball that can automatically save unstable algorithms, but it can absorb quite a lot of accumulated 1-ulp errors, unlike single precision.

Here's a famous incident where some "designer" decided 24 bit fixed point was enough, and the resulting accumulated error got a bunch of people killed:

formatting link
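(The link is gone, but the description matches the widely reported Patriot missile clock-drift failure, where 0.1 s was stored in a 24-bit fixed-point register.) A minimal sketch of the mechanism, with illustrative numbers rather than the system's exact format: 1/10 has no finite binary representation, so every tick adds a small truncation error that grows without bound:

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      /* 0.1 truncated to 24 fractional bits */
      uint32_t q = (uint32_t)(0.1 * (1 << 24));
      double tick24 = (double)q / (1 << 24);
      double err = 0.1 - tick24;              /* error per 0.1 s tick     */

      double ticks = 100.0 * 3600.0 * 10.0;   /* 100 hours of 10 Hz ticks */
      printf("error per tick:    %.3g s\n", err);
      printf("error after 100 h: %.3g s\n", err * ticks);
      return 0;
  }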

Kahan's advice is to do all intermediate computation with >= 2x the bits of the data and desired results, which would have prevented the above incident. He actually says "[t]o protect us from clever programmers who use floating-point occasionally without ever having endured a competent Numerical Analysis course, programming languages should be changed to use IEEE 754's quadruple precision by default for all scratch variables."[1]

It could be that the Kalman filter is self-stabilizing because of the frequent updates from real observations. Still, before believing any nontrivial numerical algorithm in single precision (or fixed point or whatever) I'd want to see it run in double precision with the same input data, and make sure that the results matched (this is in addition to investigating around singularities and so on). But once you have a double precision implementation to compare, you may as well just use it directly if you can.
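The cross-check described here is easy to skeleton in C. Here filter_step_f and filter_step_d are hypothetical stand-ins for single- and double-precision builds of the same algorithm:

  #include <math.h>
  #include <stdio.h>

  float  filter_step_f(float state, float z);   /* hypothetical */
  double filter_step_d(double state, double z); /* hypothetical */

  int precision_check(const double *z, int n, double tol)
  {
      float  sf = 0.0f;
      double sd = 0.0;
      for (int i = 0; i < n; i++) {
          sf = filter_step_f(sf, (float)z[i]);
          sd = filter_step_d(sd, z[i]);
          if (fabs((double)sf - sd) > tol) {
              printf("diverged at sample %d: %g vs %g\n", i, (double)sf, sd);
              return 0;                 /* single precision is suspect */
          }
      }
      return 1;                         /* matched within tolerance    */
  }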

[1]
formatting link
p. 11
Reply to
Paul Rubin

This makes sense for software emulations and maybe for FPGA's, but I think with hardware implementations, handling all those flags and traps can be done concurrently with the main calculation, with a handful of extra gates.

I wonder what kinds of verification techniques exist for this. It's above my pay grade, I guess.

Well put.

By the way here's an interview about IEEE 754 that I saw some years ago but just came across again:

formatting link

This stuff is not trivial. It's not just a "format".

Reply to
Paul Rubin

They are loss leaders to promote the product - not safe for production use, btw. I have seen that on many low-end chip vendor boards. I have even seen it on the low-end programming pods for FPGAs; they say "not for production use".

--

Rick
Reply to
rickman

Where do you get this stuff? Why do you continue to put words in my mouth? I said the language is complete. No one has indicated what is hard about coding a KF in an HDL.

"If it can be implemented" makes it sound like you can take any old HDL code and it will be good code. I never said that.

I won't argue with that at all. BTW, what are your timing constraints? The only number I've seen is 1 MFLOPS, which is a *very* generous timing constraint in any FPGA I've ever worked with. Is that your realistic goal, 1 MFLOPS? Then I would likely not worry too much about timing constraints... a one-liner would do the job:

clock Foo = 1 MHz

--

Rick
Reply to
rickman

(snip, I wrote)

(snip)

I suppose that with two data points, 1 and 1000, we know the dividing line is somewhere in between - maybe closer to 1, maybe to 1000.

The geometric mean of 1 and 1000 is about 30, so maybe about there.

But okay, there is a lot of NRE to be done for the FPGA design. A small cluster might be a good choice for a smaller number of processors. One could also build a special board with a bunch of your favorite CPU chip for a single board cluster.

Oops.

-- glen

Reply to
glen herrmannsfeldt

(snip, I wrote)

(I also wrote)

Yes you can get 1000 times, but it might take more than one FPGA chip.

The price of FPGAs is a very nonlinear function of the available logic, so an optimal priced system might have a larger number of smaller FPGAs. (Also include the board and box cost.)

CPUs are an amazingly inefficient way to use logic, but often a worthwhile tradeoff.

If you are doing a lot of 16 bit adds, your 1 billion transistor chip might do a few 16 bit adds per clock cycle. But a 16 bit adder only takes thousands of transistors to build.

You can build thousands of adders in an FPGA, but they might run at 1/10 the clock frequency. With 100 FPGA chips at about $50 each, you might get much more than 1000 times the processing of a high-end CPU chip.

-- glen

Reply to
glen herrmannsfeldt

That is /one/ of the arguments for avoiding IEEE. It is certainly an argument for using "loose" IEEE in software (such as with gcc's "-ffast-math" switch).

There are other requirements for IEEE beyond the logic to handle the assorted NaNs, denormals, etc. For example, IEEE imposes quite strict requirements for rounding, ordering and error margins, to make the results as repeatable as possible across different systems. This puts severe limits on the compiler's optimiser - it cannot change "(a + b) + c" into "a + (b + c)", or "(a / b) / c" into "a / (b * c)" or "a * (1 / (b * c))", even if the alternatives are faster.

Have a look at the gcc manual for the various "-ffast-math" flag details:

(Other compilers will normally have similar flags of some sort, though perhaps not in the same detail, if they allow users the choice of strict IEEE or faster floating point.)
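A two-line demonstration of why strict IEEE ties the optimiser's hands: floating-point addition is not associative, so the two groupings below give different answers, and the compiler may not swap one for the other without something like -ffast-math:

  #include <stdio.h>

  int main(void)
  {
      double a = 1e20, b = -1e20, c = 3.14;
      printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 3.14 */
      printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0    */
      return 0;
  }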

Actually, this is not true - especially for smaller FPUs with single precision only and just the main arithmetic functions (no transcendentals, but perhaps a square root function). Many hardware floating point units require significant software help to be fully IEEE compliant. You can set control flags for how the unit should handle things like NaNs - such as to ignore them (i.e., do the calculations as though they were normal floating point values - garbage in, garbage out), or to trap to a software exception for full handling.

It should not be "above your pay grade" - even if you are an amateur. It's called "make sure your program works with the correct data". How do you know your function won't generate NaNs and other nonsense? Write the code correctly, and give it valid data (or check the data if it might be invalid).

For most work - and certainly anything embedded - you will only ever hit exceptional floating point values if you've got bugs in your code or algorithms. So you treat these issues just like any other potential bugs.
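In C, that boundary check is a few lines - validate once at the edge, and the core maths never needs to cope with NaNs or infinities:

  #include <math.h>

  /* Returns 1 if all samples are finite (no NaN, no +/-Inf). */
  int samples_ok(const double *x, int n)
  {
      for (int i = 0; i < n; i++)
          if (!isfinite(x[i]))
              return 0;
      return 1;
  }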

Reply to
David Brown

You are extending the topic of discussion far off field. I never said you can easily implement *any* algorithm in an FPGA. We were talking about the Kalman Filter.

Code? Who said anything about starting with code? You are replying to a portion of this discussion where a claim was made, "you have to re-organise the algorithm into a form that the FPGA languages can handle." Reorganizing an algorithm is not the same as reorganizing code. You are assuming that the starting point is some HLL.

I just don't see the issues you raise. Resources are used if you specify them. I'm not saying you can code in an HDL exactly the same way you do in an HLL without any regard to the nature of hardware. I am saying there is no inherent limitation to using an HDL to express an algorithm.

Ok...

That is an oversimplification of HDLs. You have the ability to work at a low level, but you are not forced to. A perfect example is my software friend who wrote the "hello world" program. He didn't do anything low level and it worked just fine.

That is a *big* if, but yes, that is true. If the timing is relaxed enough and the tasks are simple enough you can do "parallel" tasks in a round robin manner and it will appear to be in parallel.

I would be happy to listen to any factual statements about the difficulty of implementing a KF in an FPGA. So far I have heard few "facts" and those were not accurate. Mostly the arguments are hand waving.

If I can convince myself that it will be worthwhile to do, I want to implement a radio controlled clock in an FPGA at very low power so that it won't need batteries. It will be powered by environmental power sources. This is not really a hobby project, I would only do it if I can convince myself that it will help me professionally. But that is not really germane to this discussion. An RCC is not the same as a KF.

I have discussed this crossover point between FPGAs and MCUs. My contention there is that it does not need to be a matter of using the FPGA *only* when an MCU can't do the job. But that is a separate discussion and I don't want to venture off the topic at hand, the "nightmare" issue.

I am slightly familiar with the CM4's DSP-like capability. The vendor described it in a way that required a lot of *ifs*. The result was that it *approaches* single-cycle MAC operation. That is certainly better than most MCUs, so it is interesting.

But you are way off topic. You seem to be discussing the general issue of FPGA vs MCU. That was not the topic at hand and I have not been discussing it.

I didn't look at the links, but someone posted some 10 or a dozen links. I think your conclusion is fallacious. You indicate some of the papers give code. There is the code... so how precious can it be?

--

Rick
Reply to
rickman

They seem to have just lowered the price:

formatting link

Reply to
Paul Rubin

Why do you need a numerics expert to do a KF in an FPGA, but not on a CPU?

--

Rick
Reply to
rickman

You haven't even shown how to code Gaussian elimination in an HDL. Once you've done that, we can talk about KF.
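(For contrast, the C baseline is short - a sketch of Gaussian elimination with partial pivoting, solving Ax = b for the small fixed sizes a Kalman filter needs:)

  #include <math.h>

  #define N 4

  int solve(double a[N][N], double b[N], double x[N])
  {
      for (int k = 0; k < N; k++) {
          int p = k;                          /* partial pivoting */
          for (int i = k + 1; i < N; i++)
              if (fabs(a[i][k]) > fabs(a[p][k]))
                  p = i;
          if (a[p][k] == 0.0)
              return 0;                       /* singular matrix */
          for (int j = 0; j < N; j++) {       /* swap rows k and p */
              double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
          }
          double t = b[k]; b[k] = b[p]; b[p] = t;
          for (int i = k + 1; i < N; i++) {   /* eliminate below the pivot */
              double m = a[i][k] / a[k][k];
              for (int j = k; j < N; j++)
                  a[i][j] -= m * a[k][j];
              b[i] -= m * b[k];
          }
      }
      for (int i = N - 1; i >= 0; i--) {      /* back substitution */
          double s = b[i];
          for (int j = i + 1; j < N; j++)
              s -= a[i][j] * x[j];
          x[i] = s / a[i][i];
      }
      return 1;
  }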

Reply to
Paul Rubin

I suppose, but I still think that denormals were a bad idea.

Each exponent bit doubles the range of representable values. Denormals increase the range only slightly - by much less than one bit's worth - at a large extra cost in logic.

Inf and NaN are nice, but not needed for a hardware array implementation. You can easily supply extra data lines to pass the needed information along with the numeric value. That takes much less logic than generating and decoding the Inf/NaN bit patterns.
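Modelled in C, the sideband idea is just explicit flags carried beside the raw number - a sketch of the concept, not of any particular hardware:

  #include <stdint.h>

  typedef struct {
      uint32_t     value;        /* sign/exponent/fraction, no specials */
      unsigned int is_inf : 1;   /* the extra "data lines"              */
      unsigned int is_nan : 1;
  } fp_lane_t;

  /* Flag propagation is then simple logic beside the datapath
     (real rules, e.g. Inf - Inf = NaN, would add a few more gates). */
  static inline fp_lane_t flags_through(fp_lane_t a, fp_lane_t b,
                                        uint32_t result)
  {
      fp_lane_t r = { result, a.is_inf | b.is_inf, a.is_nan | b.is_nan };
      return r;
  }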

-- glen

Reply to
glen herrmannsfeldt

I agree that CPUs are very inefficient logic.

I'm a little confused. Are you talking about FPGAs or CPUs? Intel CPUs are the biggest, most complex logic chips I've ever heard of. Surely this is the inefficiency you are talking about, no? Implementing a KF on an Intel CPU is an ***enormous*** waste of transistors. ;^)

--

Rick
Reply to
rickman

That will depend on the board in question, and the type of product. Typically, these cheap boards are not designed and produced to high quality standards - they will work fine at room temperature in a desktop environment, but they are not produced with the quality levels needed to guarantee operation over long term or in tough environments. However, if your "product" (or prototype product) is limited to room temperature and kind environments, and you don't expect any sort of warranty, then they could be good enough. They are then marked "not for production use" /because/ they are loss-leaders.

Reply to
David Brown

(snip, someone wrote)

(snip)

If you want to compare numbers that way, it isn't hard to do the conversion. You could, for example, convert before writing to disk, and convert back when reading them in. That way, the disk file could be sorted as you say.
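The conversion here is presumably the standard trick of remapping IEEE bit patterns so that plain unsigned integer ordering matches numeric ordering - a sketch for single precision:

  #include <stdint.h>
  #include <string.h>

  /* Map a float to a uint32_t whose unsigned ordering matches the
     float's numeric ordering.  NaNs end up sorting above +Inf. */
  uint32_t float_to_sortable(float f)
  {
      uint32_t u;
      memcpy(&u, &f, sizeof u);        /* grab the raw bit pattern  */
      if (u & 0x80000000u)
          return ~u;                   /* negative: flip all bits   */
      else
          return u | 0x80000000u;      /* non-negative: set top bit */
  }

  float sortable_to_float(uint32_t u)
  {
      float f;
      if (u & 0x80000000u)
          u &= 0x7FFFFFFFu;            /* undo: clear the top bit   */
      else
          u = ~u;                      /* undo: flip all bits       */
      memcpy(&f, &u, sizeof f);
      return f;
  }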

-- glen

Reply to
glen herrmannsfeldt

That is a good point. Who is qualified to say that IEEE single precision floating point is good enough? Who is qualified to say that double precision floating point is good enough?

I don't know that testing with data is a very good test. If you run with different data (which you surely will) you can get very different results. I would think an analysis would be the only reasonable way to verify implementation resolution in a critical app.

--

Rick
Reply to
rickman
