Small, fast, resource-rich processor

Every single time I've seen an algorithm implemented on an FPGA -- and by some pretty damned smart FPGA people I might add -- it's taken about ten times more calendar time and engineering effort to get it going than to do the same damned thing with software. When they were done, every time a change was necessary it took ten times longer to implement the change than if the algorithm were implemented in software.

I've done a bit of FPGA work myself, too. While I can't claim to be an expert, what I've done backs up my impression that making things work on an FPGA demands attention to more details, and closer attention at that, than assembly language programming does.

So as far as I'm concerned, FPGAs are there for when there's not a suitable processor that's fast enough to haul the freight.

My point about PCs having processors instead of FPGAs is that if FPGAs were so easy and handy to use, that's what we'd be using. But we don't -- we use processors, unless we have to.

But let's put that aside: _you_ are the FPGA expert in this discussion, and _you_ are the one who is challenging my statement that it would be downright stupid to pour engineering resources and schedule months into a rathole just so that you could have a Kalman filter working about 100 times faster than necessary. Since you're such a big expert that you can get huffy with me even though I've had experience on projects that use combined processor and FPGA systems to get a job done, I figure that _you_ can be the expert to go to the effort to assess the engineering time necessary to make an extended Kalman filter work on an FPGA.

Your benchmark is 20 engineering hours, which is how long it took me to get the Kalman filter working acceptably on a PC.

Lest you think I'm being mean, that's an underestimate of the time, and it covers _just_ the time I spent making the core filter work -- it ignores the extra time that you or some software guy would have to spend actually getting a display device to _talk_ to your FPGA, vs. the ease of interfacing to a chunk of code that's executing on the PC you're using.

So get cracking, or back off.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

Gosh, Rick. It must be nice to be smarter than everyone else on the entire planet.

Has it ever occurred to you that folks on comp.dsp are, by and large, people who _do_ know things like that, and may have even implemented floating point algorithms, possibly even in hardware?

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

You did not add anything to the conversation. I asked the question because, as I stated, his question implies that he doesn't know how floating point is implemented in hardware. If he did, he would know that the same multipliers used for fixed point are used for floating point. Someone knowledgeable might also know that floating point addition is pretty much the same complexity as multiplication and requires the use of multiplier blocks for optimal implementation in an FPGA.

If Paul does understand how floating point is implemented then I would next be asking him why he asked the question he did.

Meanwhile you have added nothing to the conversation. If you want to discuss the issue I suggest that you not be so snarky about it.

--

Rick
Reply to
rickman

Well, I'm not sure "higher level of attention" is the right term. It's just a different engineering discipline.

E.g., when I design hardware, I worry about propagation delays, signal loading, EMI/RFI, power consumption, heat dissipation, etc.

OTOH, when I write code, I worry about synchronization, communication, boundary conditions, bogus data, error reporting/recovery, etc.

Typically, yes. It's also handy as a way of prototyping a full custom.

Processors tend to be more general purpose. You can find more people in the world who can write code than design hardware. There's also more of a market for software effort than hardware effort.

Software tends to see more reuse. There's more in common between software projects than hardware projects. You don't rewrite a floating point library from scratch for each project. OTOH, if you can get by with a 24b format, you'd be more likely to do so in hardware.

And, software folks tend to think more serially than hardware folks (witness how easily software projects get muddied by "multiprocessing or multitasking").

With hardware, it's only natural to think about what A is doing while B is doing something else. And, *exploiting* that (because it would be silly not to!)

E.g., in the early 80's, I designed a little piece of silicon that did six multiplies, two divides, two adds (or maybe it was four?) and a subtract in 1.6us (continuously). A software solution would have tripped all over itself (even if you redefined each operation to just *8* bits).

In the 90's I worked on a dedicated processor that could *easily* have been done in a (dog slow!) processor -- but would never have achieved the target price point (think in 10's of millions for quantities).

Reply to
Don Y

You, sir, are a regular John Larkin.

There is a big difference between taking more time to implement a given algorithm and being "a nightmare".

You put a lot of emotion into your reply. None of that is appropriate. "Huffy", "stupid", "rathole" are loaded words with tons of connotation and no real meaning to an engineering discussion. *That* is what I was objecting to.

To be honest, your post is so full of anger that much of the meaning is distorted. It sounds like you want me to design your filter for you in an FPGA. Is that right? Usual consulting rates?

Uh, back off from what exactly? Making statements about design work? You can be pretty childish when you get angry. Why not take a break and cool off. Then read what I wrote without assuming I intended to insult anyone.

--

Rick
Reply to
rickman

you use multiplier blocks for optimal implementation of barrel shifting in an FPGA? i didn't know that. (and i don't know diddley about FPGA programming, or about hooking them up. so there's a lot i don't know.)

boy, glad i don't have a dog in this tiff.

i too, from what little i know regarding FPGAs, was a little dubious of any inference that doing floating point in an FPGA is a paradigm shift. but it's gotta be a little messier. i wouldn't think that doing it in double is additionally messier, but it's gotta be *more*, of course.

about the statement: "The DSP blocks in FPGA's are usually fixed point and narrow." are you disputing that? are DSP blocks in FPGA's usually floating point and phat and pheature-rich? like more FPGA designs are being done which way?

i dunno, honestly, you tell me.

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."
Reply to
robert bristow-johnson

Please explain to me how spending 200 hours to implement something that can be done in 20, for a totally unnecessary increase in performance, is neither throwing money down a rathole, nor is stupid.

You want me to believe that implementing a KALMAN filter in an FPGA is trivial compared to implementing it in an embedded processor. I don't. I'm not an expert in FPGA design. You claim to be.

I suggested that you take your much self-vaunted expertise, actually look at the algorithm, and then -- as a presumably unbiased expert -- give me an estimate of what it would take. I didn't ask that you implement it -- just that you look at it as a professional, and estimate what it'd take to get it working.

The fact that you call it a "filter", as if it's just a FIR or something, indicates that you know as little about Kalman filters as you were accusing Paul Rubin of knowing about floating point, earlier.

First, back off from assuming that you're the only one in the group who has any idea of what it takes to implement algorithms in FPGAs.

Then, back off from the idea that you can make an unsupported claim and then demand that someone else do the work to support it.

Then, since you say you don't intend insult, back off and re-read your posts and think about what impression you're leaving.

--
Tim Wescott 
Control system and signal processing consulting 
www.wescottdesign.com
Reply to
Tim Wescott

and some of us would need about 20 engineering weeks to relearn all he did about estimation theory and the Kalman filter. (i hadn't needed to do one of those in audio, so i figger when i need one, i can just go to the DSP 'r US store and pick up one of those Kalman filters.)

so this runs on a PC target (like in micro$hit Visual Dungeon) and you want it on a stand-alone board?

when i was at Kurzweil Music Systems, i thought an early decision was made that had permanence prematurely attached: to do, essentially, all of the audio processing in ASIC (now they're FPGA, i'm sure). and exactly that observation was made by me, that everything now takes 10 times more effort/time/expense to do even simple things, in comparison to having just hooked up a few DSPs that cost more to manufacture in quantities of tens of thousands. but we were really talking less than 6 digits, regarding any single design. at the time i just could not figure out *why* they were so convinced that this is how it must be done.

i later came to the compromise conclusion that the "per note" processing might best be done in hardware (it could save quite a few dumb DSPs), but the "per channel" processing (this is right after the notes for a particular instrument are summed to a bus; each instrument has its own bus or "channel"), where algorithms like reverbs and such live, could have been more easily and certainly more flexibly done with a good DSP (in that era, it could be the SHARC or maybe an ARM of some variation).

as it was, with the wonderful hardware solution we had to use, all sample processing was single-sample (no blocking samples) and there wasn't even a conditional branch instruction. so the algorithm had to do the same qualitative process for each sample (instructions are the same, parameters might change). that leaves out any frame-based processing like a pitch shifter.

this is what happens when a couple of hardware engineers dominate the design process. the hardware guys decide over the software guys or the DSP guys and the company is only able to see the world through the hardware manager's glasses. and the Not Invented Here syndrome is just entrenched.

so if you can do it in hardware, just do it in hardware. it may take you an order-of-magnitude more effort to write, implement, and debug the thing, but you'll be so proud of the fruits of that effort that you'll wanna do it again.

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."
Reply to
robert bristow-johnson

What I have implemented is pretty much what's on Wikipedia's web page for a Kalman filter (a plain-old one, actually -- it's not extended very much; rather, it's time-varying).

Knowing what to _tell_ the Kalman filter -- that's where the estimation theory and KF expertise come in.

It runs great on a PC target running Linux, and I successfully convinced the customer that it can go into a DLL running under windows -- but at the time that I started the thread, the customer wanted to do it on a stand-alone board.

If you're doing a product design then one of the most important tradeoffs you have to make is to reduce the overall cost of the product as much as possible. This means both paying attention to the _whole_ cost of getting the product into the market and keeping it there, and keeping yourself from falling in love with any one particular way of doing things to the extent that you ignore cost/value tradeoffs in favor of choosing some particular implementation because it's sexy.

--
Tim Wescott 
Control system and signal processing consulting 
www.wescottdesign.com
Reply to
Tim Wescott

Not at the hardware level in serious detail, which is why I asked how you'd do it. I do know you have to do wide arithmetic and normalization on every operation, and in a CPU, this is all done with parallel circuitry. With an FPGA (i.e. narrow fixed point DSP blocks) you'd have to do the operation in a bunch of slices and juggle the intermediate results around (possibly involving multiple LUT delays) and it sounds messy. I also haven't looked at the function of DSP blocks enough to know if it's even possible to use several of them at the same time for an operation like this, without introducing more latency. I'd be interested to know if there are canned Verilog libraries for double precision IEEE floating point and whether they can do division, trig functions, etc.

Unless I'm mistaken, a current x86 can start a new double precision floating point MAC every cycle. Can you do that with an FPGA in a reasonable way?

Reply to
Paul Rubin

He is simply disputing the idea that anything could be implemented in software more easily than in an FPGA. I know FPGAs can be useful devices, and I know that experts can write code for them faster and more efficiently than many non-experts would imagine, but I for one am getting fed up with this "FPGAs are the best at everything - the fastest, cheapest, quickest development, lowest power, most efficient, longest living, best development tools, etc., etc." nonsense.

Please, Rick, we /know/ FPGAs are useful devices. But give it a rest? No sane person - not even a Xilinx or Altera salesman - would claim that implementing a Kalman filter is not orders of magnitude harder in an FPGA than in software on a processor with good double precision floating point support. The X and A salesmen will tell you that this is why they put hard ARM cores in their chips - so you can do the PWM, encoders, etc., in FPGA, and the complex maths in the cpu.

Reply to
David Brown

If I sounded rude I apologize. That was not my intent.

A floating point multiply requires the unpacked mantissas to be multiplied, the exponents to be added and then adjusted to match the normalized product. So you have a small adder for the exponents with a small correction input from the normalization, a multiplier and a shifter (not a full barrel shifter). Not so bad. It is easy to get a result in one clock cycle. This is easy to pipeline too, but that is irrelevant: pipelining just lets you run a faster clock. In addition, in an FPGA you are not so limited in how many multipliers you have.

Additions are actually harder. To add you have to first adjust the exponents to match. So one addend has to be shifted an arbitrary amount... enter the multiplier to be used as a barrel shifter. The mantissas are then added (or subtracted as the case may be) and the sum normalized... again requiring a shift by an arbitrary amount... another multiplier. You can do the shifts in the LUT fabric. For one or two it isn't so many, but they are slow by comparison. If you are pipelining you will especially want to use the multipliers.
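To make those steps concrete, here is a rough Python sketch of both data paths (an illustration only -- no rounding, no denormals, no NaN/Inf handling, and the adder assumes same-sign operands). The integer multiply and the two arbitrary-distance shifts are exactly where the FPGA's multiplier blocks and barrel shifters earn their keep:

import struct

def unpack(f):
    # split an IEEE-754 single into sign, biased exponent, 24-bit mantissa
    bits = struct.unpack('>I', struct.pack('>f', f))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    man = (bits & 0x7FFFFF) | 0x800000    # restore the hidden bit
    return sign, exp, man

def pack(sign, exp, man):
    bits = (sign << 31) | (exp << 23) | (man & 0x7FFFFF)
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def fp_mul(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    exp = ea + eb - 127            # add exponents, drop the extra bias
    man = ma * mb                  # 24x24 -> 48-bit integer multiply
    if man & (1 << 47):            # product in [2,4): normalize right by one
        man >>= 1
        exp += 1
    return pack(sa ^ sb, exp, man >> 23)   # truncate, no rounding

def fp_add(a, b):
    # same-sign add only, to keep the sketch short
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    if ea < eb:                    # make the first operand the larger one
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    mb >>= (ea - eb)               # align: the arbitrary-distance shift
    man = ma + mb
    if man & (1 << 24):            # carry out: normalize right by one
        man >>= 1
        ea += 1
    return pack(sa, ea, man)

assert fp_mul(1.5, 2.0) == 3.0 and fp_add(1.5, 2.0) == 3.5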

HDLs did not typically support synthesis of real-number types until fairly recently, but I believe this has become more widely supported. So it may be a lot easier to just code in reals and not worry about the *how* of floating point. But even if you do have to code your own FP multiplier, you do it once and use it as often as you like.

Far from a nightmare.

--

Rick
Reply to
rickman

I never said anything of the sort. Please stop with the nonsense yourself.

If you don't want to have a rational conversation, please don't reply.

If you want to explain why a Kalman filter is so hard in an FPGA, please do. I'm sure it is not hard to show what part of the algorithm is FPGA unfriendly.

As to the ARM cores, they are very recent if you look around. So it was never done until the ARM cores came out?

As to the complex maths statement, I wish Ray Andraka were here. Of course he never bothered with such silly conversations, but as to the math, well, he is pretty much the expert.

--

Rick
Reply to
rickman

Well, I think the idea is to implement a soft CPU running similar code to what you'd run in a hard CPU. I do see that there are more computational resources (lots of separate DSP blocks) available in the FPGA than on a CPU, but it's still unclear to me that they can be used to much advantage in this application.

It looks like TI has some double precision floating point DSP's (C67 series). I don't know if they'd fit your requirements. You probably know more about them than I do.

Reply to
Paul Rubin

Just for grins, if you are interested in how we did floating point before FPGAs and before the PC had floating point hardware, google the ST100 array processor.


I worked on that machine and that is where I learned how to design floating point in hardware. They used 1000-gate ECL gate arrays, lol. Your cell phone can likely run rings around one, and that hulk used a 220 volt power source!

--

Rick
Reply to
rickman

It's a good way to do it when you have multiplier blocks. Shift by n is multiply by 2^n of course. Otherwise a barrel shifter has to use n*(m-1) LUT4s where n is the bit width and m is the range of your shifting.
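In toy Python terms (W is 18 here just to match a typical multiplier block; the one-hot 2^n operand is the part the tools generate from your shift amount):

W = 18

def left_shift(x, n):
    # shift left by n == multiply by 2**n, done in one multiplier block
    return (x * (1 << n)) & ((1 << 2 * W) - 1)

def right_shift(x, n):
    # multiply by 2**(W-n) and keep the upper W bits of the 2W-bit product
    return (x * (1 << (W - n))) >> W

assert right_shift(0b101100, 2) == 0b101100 >> 2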

There is some irony in the fact that in an FPGA you use a 16:1 mux (the LUT memory address decode) to implement a 2:1 mux. In school we actually learned how to implement logic with muxes, but that was "pure" muxes. The LUT in most FPGAs just adds memory to the inputs of the mux, which makes it much more practical.

The upshot is that implementing muxes in the LUT fabric is "expensive" and slow. A barrel shifter is just a big mux.
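To make that concrete, a LUT4 is just a 16x1 ROM addressed by its four inputs. Here is a toy Python model of one (not any vendor's primitive) programmed as a 2:1 mux, with one input left unused:

def lut4(truth, i3, i2, i1, i0):
    # a LUT4 reads one bit out of a 16-entry truth table
    return (truth >> (i3 << 3 | i2 << 2 | i1 << 1 | i0)) & 1

# program the truth table for out = a if sel else b (i3 unused)
MUX = 0
for sel in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            out = a if sel else b
            MUX |= out << (sel << 2 | a << 1 | b)

assert lut4(MUX, 0, 1, 1, 0) == 1    # sel=1 selects a=1

Sixteen memory bits and a 16:1 mux spent to get one 2:1 mux -- that's the "expensive" part.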

As to the tone this has taken, hey, if I was being rude, I apologize to all. I was just trying to relate facts.

In FPGAs, multiplier blocks are often 18 bits wide and are simple, but registered. They can even be pipelined for speed as an option. I believe they usually produce 36 bits out. I say "often" and "usually" to cover the ones I have studied; there are others I haven't. So yes, I think you can call this "narrow" by any definition of floating point. Of course the multiplier width is not really hard to extend, just as you would in software. The difference is that software extension of multiplication size slows the algorithm significantly. In an FPGA you have the choice of letting it run more slowly (but without the software overhead) or using hardware to speed it up.
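The widening trick is just schoolbook partial products -- four 18x18 blocks make one 36x36 multiply. A quick Python check of the arithmetic (signs ignored for brevity; real DSP blocks are signed 18x18):

W = 18
MASK = (1 << W) - 1

def mul36(a, b):
    # split each 36-bit operand into 18-bit halves, form four
    # partial products, and recombine them with shifts and adds
    ah, al = a >> W, a & MASK
    bh, bl = b >> W, b & MASK
    return ((ah * bh) << (2 * W)) + ((ah * bl + al * bh) << W) + al * bl

assert mul36(123456789, 987654321) == 123456789 * 987654321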

Single precision IEEE FP has a 24 bit mantissa, and that includes the hidden bit, which still has to be handled in the hardware. Of course once you are in the FPGA domain there is often no need to be IEEE compatible, just "good enough". What do you wish the FPGA to do with "NaN"?

Then FPGAs have DSP blocks which I have not studied in much detail. They carry the integration concept a bit further and include full MACs and possibly other functionality useful in DSP. I can't say what size they are, but I know the accumulators are large, most likely larger than the multiply result so that many products can be accumulated without overflow.

I never said that using an FPGA for complex algorithms is as easy as using a scientific library on a PC. I was disputing the statement that it would be a "nightmare". An algorithm is an algorithm. I asked why a Kalman filter would be so hard to do on an FPGA and I was told to look up the algorithm and figure it out for myself. Ok... end of useful conversation...

I read this as the usual exaggeration of the difficulty in working with an HDL rather than an HLL. I don't find this to be true. But then I have come up the learning curve on VHDL, and on Verilog to a lesser extent. I also understand hardware at the gate level. Different perspective, I suppose.

--

Rick
Reply to
rickman

...

well, that's what *they* would tell *me*: that just because i think it's sexy to do the per channel processing in a general-purpose DSP or ARM -- because i have a branch instruction and can use it to group samples together and process in blocks, and do different tasks at different sample block times to distribute the load of processing a frame (so it's not all done at the end, making your worst case timing awful) -- doesn't make it the right engineering choice. as with chicks, sexy is in the eye of the beholder.

so they were saying that i was the one appealing to elegance and they just wanted to do the job with the resources that already existed in this ASIC. (they're "our chips", "our little babies we have invested so much in".) some intricate things could be done with the limited instruction set of this ASIC music processor, but many things (as i described) just could not be done with it at all. then cost-effectiveness must consider the cost of lost sales of a product because it's missing some esoteric thing (that processes audio in frames) that some competitor does implement. if your orientation is "we'll stick with our ASIC because we had to punch out 100000+ of them and need to recover that cost", then i would say that sticking to it spends 10 times more engineering man-hours and shuts you out of implementing any algorithm that can only reasonably be done by a general-purpose DSP or CPU -- and that speaks against the $2 or $3 per unit manufacturing cost of the 64-voice chip that was just spec'd.

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."
Reply to
robert bristow-johnson

oh, oh! DIVISION! i forgot about that one. that's a big one, Rick.

an algorithm requiring division is not the same as one that doesn't. an alg is not just an alg.

and if you doubt that, all's i have to say is "one rabbit stew coming up!"

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."
Reply to
robert bristow-johnson

a) double precision floating point arithmetic, as already mentioned; and b) as Tim says, it's a much more complicated algorithm than something like an FIR filter. You have to do a bunch of numerical linear algebra at every step. I think this includes a matrix pseudo-inverse, which conceptually involves division (like in the Gram-Schmidt process). I don't know if there is a way around the division in practice.

I've never actually used or programmed a Kalman filter so I'm basically talking out of my butt here. But my picture of how it works is something like:

1) You have a state vector s of observables in your system, but the observations are presumed to come from noisy measurements.

2) You have a linear estimate (I think this is what Tim called the H matrix) of how the state is changing. E.g. at state s, you predict the next state to be something like p = s + (Hs)dt.

3) Now you make a new (noisy) measurement q, and compare it to your prediction p. The residual error (q-p) comes partly from noise in the measurement and partly from inaccuracy of H.

4) You do a least squares fit to find a new H that would have predicted q from s. This is where the pseudo-inverse comes in. So you are constantly tuning your predictor. Maybe there are other adjustments you also make, in the case of a nonlinear system.

5) Your updated state vector is some kind of weighted average between p and q.

I probably have at least some parts of that wrong, maybe all of it. I read about it some years ago and thought it was amazingly clever, but I never tried to implement it or to really understand it in detail.

Reply to
Paul Rubin

If your measurement is a single point (as happens to be the case here) then the "matrix" to be inverted is 1x1. You still need to do a divide, though.

That's a bit incorrect. Here's my version, which is hopefully more clear in a "we don't need no steenking math" sort of way than the Wikipedia article:

0) You have a system model in state-space form: a state vector, a matrix that describes how the state evolves from one time step to the next, a matrix that describes how the input vector affects the state vector at any given time step, and an output matrix that lets you calculate the output from the state.

1) Both the state update and the measurement are corrupted by noise. In a "pure" Kalman filtering application, this noise is white, zero-mean Gaussian with known covariance properties.

2) You have a series of actual measurements and inputs.

3) You _don't_ know x -- you can only estimate it.

4) At each step of the filter, you take an actual measurement, the system model, and what you know about x (its mean and covariance) and you compute a new estimate for x (both mean and covariance).

5) Within that computation you compute a new value for x (both mean and covariance) that's extrapolated from the prior x, the system model, and the input. Then you calculate the new output, compare it with the measured output, and use that error to calculate a final, corrected value for x (both mean and covariance).

There are a ton of matrix multiplies involved in the computation. Yes, it could be done with an FPGA, but I don't see the utility of doing that unless you need to.
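For the curious, one pass of the predict/correct cycle looks like this in numpy (generic textbook form, not the code from my project; F is the state transition matrix, B the input matrix, H the output matrix, Q and R the process and measurement noise covariances):

import numpy as np

def kalman_step(x, P, u, z, F, B, H, Q, R):
    # predict: push the state mean and covariance through the model
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q                # note F P F' -- the "squaring"

    # correct: compare the predicted output against the measurement
    y = z - H @ x_pred                      # output error (innovation)
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # gain -- the matrix "divide"
                                            # (a scalar divide when z is 1x1)
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new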

Part of the computation of a "normal" Kalman filter involves, effectively, squaring a matrix, then doing math on it. Because of the squaring, the condition number of the matrix gets squared, too. You can dodge this, but only at the cost of using one of a family of so-called "square root" algorithms which, not surprisingly, require you to take a lot of square roots.
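In symbols, with the covariance factored as P = S S^T (the usual Cholesky form), cond(P) = cond(S)^2, so propagating S instead of P roughly halves the dynamic range the arithmetic has to carry. That's the entire sales pitch of the square-root forms.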

If you _were_ going to implement a Kalman filter in an FPGA, you'd probably also want to re-cast it to fixed-point math. You may need to use a huge precision (e.g. 64 bit) because of the squaring, but if you really needed the speed, it'd probably work.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott
