Small, fast, resource-rich processor

The number of computations in a Kalman filter is something on the order of 6 * n^3, where n is the number of states you're trying to estimate. For more than a few states (and I'm well over ten here), you'd need a Really Big FPGA to do the whole computation in parallel.

If you look in finer detail, it's a bunch of matrix multiplies and adds, and one division. Matrix multiplies are nothing but a collection of dot products, just like a FIR filter is one dot product, so that part is known.
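
For the dot-product part, a rough sketch in plain C (names mine, purely illustrative) -- each output element of the product is one FIR-style multiply-accumulate loop:

/* An n x n matrix multiply is just n*n dot products, each the
   same MAC loop as an n-tap FIR. */
void matmul(int n, const double A[], const double B[], double C[])
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;                 /* one dot product */
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];
            C[i*n + j] = acc;
        }
    }
}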

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

My understanding is that it's way more than a handful -- you'd need to ask someone who's actually done it, though.

What I was told, ages ago, was that in an IEEE-compliant floating-point coprocessor, there's more silicon area devoted to detecting and handling the exceptions than there is to doing an actual computation.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

I dunno, at 45 nm, a handful is a whole bunch!

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com
Reply to
Randy Yates

of course. in fact, that conversion of the mantissa from sign-magnitude format to 2's-complement is something you would have to do if implementing math functions (at least for addition) in some hardware context like an ASIC or FPGA. at least i don't understand very well how you could add two IEEE floats without doing it.

the issue for me is that it might have been nice if the IEEE guys had defined 754 to have a twos-complement rendering of negative values rather than having a sign-magnitude rendering. then no need to pull the sonuvabitch apart to do a compare with an integer machine.
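
For what it's worth, the standard workaround -- a C sketch, my names, assuming 32-bit floats and no NaNs -- is to remap the sign-magnitude bit pattern onto a monotonically ordered integer before comparing:

#include <stdint.h>
#include <string.h>

/* Map an IEEE-754 single's bits to an unsigned key such that
   float order == integer order: negatives get all bits flipped,
   positives just get the sign bit set. */
static uint32_t float_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);          /* raw bit pattern */
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

/* then: a < b as floats iff float_key(a) < float_key(b) */

which is exactly the "pull it apart" step that a two's-complement rendering would have made unnecessary.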

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."
Reply to
robert bristow-johnson

Exactly. Even neglecting the "cost" (time/silicon/etc.) involved.

Far more frequently overlooked is a critical analysis of the actual algorithm ("equations") used and how it behaves in the "dark corners". I.e., do you understand the implementation of the data type well enough to know what sorts of test cases to throw at it to *really* stress its performance in conditions that *may* well pop up in the Real World?

(This is akin to someone naively testing for equality among floats -- though much more nefarious!)
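
The canonical demonstration, as a quick C sketch (the helper name is mine, and the tolerance is application-dependent):

#include <math.h>

/* Naive: (0.1 + 0.2 == 0.3) evaluates to false in binary floating
   point. Usual workaround: compare against a relative tolerance. */
int nearly_equal(double a, double b, double tol)
{
    return fabs(a - b) <= tol * fmax(fabs(a), fabs(b));
}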

Reply to
Don Y

(snip on systolic array for FIR)

Yes, you don't do that. I believe that there is literature on systolic array matrix multiply. Most obvious is N processing units, so N**3 multiplies in O(N**2) clock cycles. For smaller N, some might do a 2D array. Easiest when N is constant (known at compile time) and the array is appropriately arranged.
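
A crude software model of that schedule (names mine): N PEs, PE i owning row i of the result, each "clock" every PE retiring one multiply-accumulate, so the N**3 multiplies finish in N**2 clocks:

/* Software model of a 1-D systolic arrangement: the inner loop
   over 'pe' stands in for n units working in parallel. */
void systolic_matmul_model(int n, const double A[], const double B[],
                           double C[])
{
    for (int jk = 0; jk < n * n; jk++) {   /* n*n clock cycles */
        int j = jk / n, k = jk % n;
        for (int pe = 0; pe < n; pe++) {   /* n PEs, in parallel */
            if (k == 0)
                C[pe*n + j] = 0.0;
            C[pe*n + j] += A[pe*n + k] * B[k*n + j];
        }
    }
}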

How big is N, and is it somewhat constant? (within a small range)

-- glen

Reply to
glen herrmannsfeldt

A favorite example over the years has been the quadratic formula.

It seems so easy, but with the appropriate values (say, b^2 much larger than 4ac) you get enough cancellation in -b + sqrt(b^2 - 4ac) that the results are very far off. Also, the values don't look so obviously exceptional.

(snip)

-- glen

Reply to
glen herrmannsfeldt

Yes. Any time you're mixing very big and very small. Solve one type of quadratic equation one way; another a *different* way -- based on a, b and c.

(The first time this happens to you, you stare at the results and the code and wonder what the hell went wrong! Everything *appears* correct -- except the "answer" :-/ Thereafter, you learn not to be so naive when approaching these problems! Regardless of single, double, extended, quad, etc. precision formats!! Sometimes, numbers aren't *just* numbers...)
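
The textbook remedy, as a C sketch (assuming real roots; helper name mine): compute the root whose terms do *not* cancel, then recover the other one from the product of the roots, since x1*x2 = c/a.

#include <math.h>

/* When b^2 >> 4ac, the quadratic formula's -b + sqrt(b^2 - 4ac)
   subtracts nearly equal numbers and loses most of its digits.
   So: compute the "safe" root first, get the other from c/q. */
void quadratic_roots(double a, double b, double c,
                     double *x1, double *x2)
{
    double disc = sqrt(b*b - 4.0*a*c);          /* assumes >= 0 */
    double q = -0.5 * (b + copysign(disc, b));  /* no cancellation */
    *x1 = q / a;
    *x2 = c / q;                                /* product of roots */
}
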
Reply to
Don Y

"Most things"??? What does that mean? How do you know if your app fits into the definition of "most things?

Ok

--

Rick
Reply to
rickman

I don't know how many times I have to explain this.

There are two kinds of numerical algorithms:

1) poorly behaved (unstable, working near singularities, etc.)
2) well-behaved (everything else)

It usually isn't that hard to tell which kind you're dealing with in a given application. Well-behaved algorithms are characterized by a slow buildup of errors rather than massive oscillation or whatever.

With double precision, you're potentially in bad shape with #1 and there's not much you can do, but you're generally in good shape with #2, for reasons already explained. You can tolerate a lot of 1-ulp error buildup before the imprecision reaches into the realm of physical measurement.

With single precision, you can get clobbered even in situation #2, like in that Patriot missile example. That's the difference.
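
The effect is easy to reproduce -- a minimal C sketch of Patriot-style drift (this accumulates binary 0.1, not the actual 24-bit fixed-point hardware, so the numbers differ, but the lesson is the same):

#include <stdio.h>

/* 0.1 has no exact binary representation; in single precision the
   per-tick error compounds quickly. Accumulate 0.1 s ticks for
   100 hours and compare. */
int main(void)
{
    float  tf = 0.0f;
    double td = 0.0;
    for (long i = 0; i < 100L * 3600L * 10L; i++) {
        tf += 0.1f;              /* drifts visibly off 360000 */
        td += 0.1;               /* error stays far below anything
                                    you could physically measure */
    }
    printf("float:  %.2f s\nexact:  360000.00 s\ndouble: %.2f s\n",
           tf, td);
    return 0;
}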

Also, situation #2 is more common than #1, because real-world numerics applications are pretty routine and use algorithms designed by numerical analysts to avoid bad behavior. Like including the step of finding the largest pivot and swapping rows in the Gaussian elimination article that I mentioned. That took some sophistication to figure out, but they teach it in math class now, so everyone knows about it unless they skipped class.
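
That step is small enough to show in a sketch (bare-bones C, no singularity handling -- a textbook partial-pivoting pass, not anyone's production code):

#include <math.h>

/* Before eliminating column k, find the row with the largest |pivot|
   and swap it into place. A tiny pivot would amplify roundoff in
   every row below it. */
void forward_eliminate(int n, double A[])   /* n x n, row-major */
{
    for (int k = 0; k < n - 1; k++) {
        int p = k;                          /* partial pivoting */
        for (int i = k + 1; i < n; i++)
            if (fabs(A[i*n + k]) > fabs(A[p*n + k]))
                p = i;
        if (p != k)
            for (int j = 0; j < n; j++) {   /* swap rows p and k */
                double t = A[k*n + j];
                A[k*n + j] = A[p*n + j];
                A[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {   /* eliminate below */
            double m = A[i*n + k] / A[k*n + k];
            for (int j = k; j < n; j++)
                A[i*n + j] -= m * A[k*n + j];
        }
    }
}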

Of course, if you've got someone who skipped math class AND is designing numerical algorithms anyway AND chooses low precision for those badly designed algorithms to save 5 cents worth of transistors, well, all bets are off.

Reply to
Paul Rubin

+1. I had a colleague who would periodically "re-bug" our floating point library. All in the hope of eking out a tiny bit more performance or trimming a few bytes out of the code. :<

Well, its biggest advantage is that it presents *a* "standard". Previously, the capabilities of the "floating point library" would vary from project to project, environment to environment, etc. 754 can be overkill for lots of applications (esp. if you know you don't need NaNs, denormals, etc.) but it's often easier just to implement it all and forget about it.

OTOH, amusingly, Limbo was created with a single "real" data type -- doubles. Yet, some concessions were made to efficiency -- no support for denormals/gradual underflow, a different set of traps, etc.

So, it's interesting to see how folks juggle requirements and capabilities over time!

Reply to
Don Y

That was the subject of a very long debate mentioned in the interview I linked in another post, but consensus finally emerged at the time, and appears to have held up in retrospect, that denormals (I think this is what they mean by gradual underflow) were the right thing.

Really, they are used in calculations. You can run a calculation without a lot of intermediate tests because you can check at the end whether a NaN came out. Similarly, in cases where you can get real answers despite the appearance of Inf in some intermediate result (e.g. since 1/Inf = 0), you can rely on it working. That isn't someone abusing the standard in some way that's too smart for their own good. The standard was designed to make that type of calculation work in the determinate cases and give NaN in the indeterminate cases.
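
A minimal sketch of that pattern in C (assuming default IEEE-754 semantics, i.e. no trapping and no -ffast-math):

#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;  /* keep the compiler honest */
    double x = 1.0 / zero;       /* +Inf, not a fault */
    double y = 1.0 / x;          /* 1/Inf == 0: a real answer */
    double z = zero / zero;      /* indeterminate: NaN */
    double w = y + z * 42.0;     /* NaN propagates through */

    /* one test at the end instead of a test at every step */
    if (isnan(w))
        printf("indeterminate somewhere upstream\n");
    else
        printf("w = %g\n", w);
    printf("y = %g\n", y);       /* prints 0 */
    return 0;
}
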
Reply to
Paul Rubin

(snip)

I think you snipped out too much.

In an FPGA implementation, it is easier to just run a separate line saying that the value is Inf or NaN. That is faster and easier than coding it into 64 bits, and then decoding it again just a little later.

It is the bit representation that isn't needed, not the concept.

(I suppose that was confusing, since I was also suggesting that the concept of denormals wasn't needed.)
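
Something like this, as a C-ish sketch of the datapath record (layout and names mine, purely illustrative):

/* Inside the pipeline, carry the specials as separate flag bits
   ("wires") next to an ordinary sign/exponent/mantissa; pack them
   back into the IEEE-754 encoding only at the boundary. */
typedef struct {
    unsigned sign   : 1;
    unsigned is_inf : 1;         /* checked directly, no decode */
    unsigned is_nan : 1;
    unsigned exp    : 11;
    unsigned long long mant;     /* 52-bit fraction, room to spare */
} pipeline_float;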

-- glen

Reply to
glen herrmannsfeldt

I'd have to say that is a misconception. It's not for repeatability: IEEE imposes those requirements in order to help numerical algorithms give the right answers. I.e., bypassing the requirements leads to wrong answers. Mathematics is not modern art or literary criticism, where everything is subjective and there's no right or wrong answer. Numerical problems actually do have wrong answers. IEEE-754 was designed by mathematicians with a huge amount of experience of previous systems; they were tired of getting wrong answers, and they decided to (somewhat) fix the situation.

If your application can withstand wrong answers, then sure. It's something you have to be judicious about rather than generalizing.

Verification (in the sense of certifying that the program does the right thing for ALL POSSIBLE inputs, not just for your test vectors) is a very complicated subject. I know a little about how it's done for integer math, but floating point adds another level of complexity, and I don't have any clue how the high-assurance community deals with it. It goes way beyond the topic at hand, though.

There is nothing nonsensical about NaN's. They are part of the standard and algorithms intended to run on standard-conformant hardware can and do use NaN and Inf on purpose, expecting them to propagate through calculations the way the standard says they should. If your floating point implementation requires avoiding them, then you're imposing restrictions on the user's algorithm choices.

Reply to
Paul Rubin

I see, yeah, that makes some sense, as long as the algorithm can make use of the features. I'm still pretty unclear about how a conventionally presented algorithm is supposed to be translated into FPGA form.

Reply to
Paul Rubin

I highly recommend reading this set of notes:
- lightly and amusingly written
- information dense
- a university course in "how to avoid being bitten by computer arithmetic"
- theoretical: why features are there and how features interact
- practical: how various languages get it right/wrong
- written by somebody who has been on the sharp end of diagnosing corner-case HPC "issues" over the last 40 years

Even a cursory inspection will cut through some of the arrogant guff that has appeared in this thread.

formatting link
formatting link

Reply to
Tom Gardner

Everybody on this thread should, at the very least, speed read

formatting link
formatting link

It is amusingly written by someone that's been on the sharp end of diagnosing numerical problems with HPC since the 60s.


Reply to
Tom Gardner

Read, learn and inwardly digest the (amusingly written) contents of

formatting link
formatting link
Amusingly written by someone who has been on the sharp end of numerical problems since the 1960s.

Then you will begin to have an understanding of where the dragons lie.

Reply to
Tom Gardner

The denormal question only exists because of the hidden-bit normalization in a radix-2 system. Since a radix-8 or radix-16 floating-point representation does not have that hidden bit, were denormals ever a problem on those machines?

Reply to
upsidedown

Sounds like sweeping the problem under the carpet :-).

If you really get infinity at some intermediate stage, it really looks like the algorithm was faulty from the beginning, or not valid for the range of arguments.

Relying on 1/Inf=0 may hide much more fundamental problems.

Reply to
upsidedown
