Small, fast, resource-rich processor

- Tim Wescott

Tue, Sep 17, 2013 4:48 PM

And the original request was for a suggestion for a board into which I could drop, without modification, existing, working, tested C++ code that runs on a PC...

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- rickman

Tue, Sep 17, 2013 5:22 PM

Guilty in your mind. That is exactly the point of contention: which tasks are better in an FPGA and which are not. So far we have not actually been able to discuss the technical merits of any specific case because nearly all of the responses were emotional rather than rational. Your comments above are not an exception.

Does it really matter to the facts whether I am in the majority or the minority? No. I never said "all", that is *your* word. Give a specific quote for a statement I made that says FPGAs are better for *all* jobs.

I always appreciate constructive criticism, but I can't really see it here. You are making claims about what I have said that *aren't* accurate. If you want to see a technical discussion, you need to address those who continue to make emotional statements.

Do you have examples? Otherwise this is just opinion and you are part of the problem you discuss in the previous paragraph.

I get tired of people making *unsupported* claims. This is a good example. What about an FPGA makes it "less good" at doing serial operations? Hmm? Or maybe I misunderstand what the comparison point is for the "less good" claim. Perhaps you mean FPGAs are less good at serial computations than FPGAs are at parallel computations. Even that I can't really see. FPGAs are agnostic about the method, they do serial or parallel equally well.

Other than multicore processors, they don't do things in parallel at *all*! Processors always execute code sequentially.

Can you explain "natural"? That sounds like a bias. I can't measure "natural" in any way I know of.

It is easier to do in software only if someone has done it for you or you are using a processor where DP FP is done in hardware. Once you construct your basic DP FP algorithm in an HDL you never need to think about it again in an FPGA. So no, I don't agree that it is "hard" in an FPGA. Again, an FPGA is data type agnostic.

Good, then please explain what aspect of the filter is hard to do in an FPGA...

Or why bother with the C step at all? Why is it hard to code a KF in an HDL? I keep repeating the question and no one ever answers it. Claims are made repeatedly, but with *NO* supporting evidence, just more opinion.

Really? The OP said he implemented the algorithm on a standard MCU, the type you can put on a I know that FPGA's have many uses other than making things run fast -

You have drifted off target. The original issue was the statement that using FPGAs is a "nightmare". Where did I ever say an FPGA is a better solution for this KF than using a GP CPU?

Again, the OP has said that an ARM processor he used was way too slow. You can get faster ones, but they *aren't* "a couple of dollars".

So someone makes a claim that I disagree with, "FPGAs are a nightmare to use". I ask him why they are a nightmare and you say his statement stands unless I can prove otherwise.... interesting.

Your points are equally unsupported that it is hard to implement a KF in an FPGA. I just want you to tell me what part of a KF is *hard* to implement on an FPGA, that's all. Just show me the logic that is hard to do in an FPGA... That should be easy, right?

My evidence is that I can design any hardware in an FPGA that exists in any other digital device. So clearly an FPGA can do anything other devices can do unless you bump up against some limitation such as memory size or power dissipation, etc. I don't know of anything about FPGAs that is inherently *hard* to use. But I guess I can't prove the absence of a fault.

Is there some reason you can't discuss the facts rather than just opinions?

--

Rick

- robert bristow-johnson

Tue, Sep 17, 2013 5:46 PM

...

in fact, sometimes with an apples-to-apples comparison (same word width), sometimes you get less mean square error with fixed. i have shown (at the 2008 AES) that comparing 32-bit IEEE float to 32-bit fixed, that if your required headroom is less than about 40 dB (and 40 dB headroom is an awful lotta headroom for audio, far more than necessary) that 32-bit fixed beats 32-bit float. since dB SNR + dB headroom add to a constant, it's even more pronounced if, say, only 12 dB headroom is needed (32-bit fixed point will have 28 dB better SNR than 32-bit float).

so sometimes it doesn't even have to be "wider".
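A rough way to check this kind of claim numerically: the sketch below quantizes a uniformly distributed test signal to IEEE single precision and to 32-bit fixed point, then measures the SNR for a given headroom. The uniform signal model, the function names, and the reading of "headroom" as "peak level below full scale" are assumptions made for illustration; none of this is taken from the AES paper.

```python
import math
import random
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fixed32(x):
    """Round to a signed 32-bit fixed-point value with full scale +/-1.0."""
    scale = 1 << 31
    n = max(-scale, min(scale - 1, round(x * scale)))
    return n / scale


def snr_db(quantize, headroom_db, n=20000, seed=0):
    """Measured SNR for a uniform signal peaking headroom_db below full scale."""
    rng = random.Random(seed)
    peak = 10.0 ** (-headroom_db / 20.0)
    sig = noise = 0.0
    for _ in range(n):
        x = rng.uniform(-peak, peak)
        err = quantize(x) - x          # quantization error against the double
        sig += x * x
        noise += err * err
    return 10.0 * math.log10(sig / noise)


if __name__ == "__main__":
    for hr in (12, 40, 60):
        print(hr, round(snr_db(to_fixed32, hr), 1), round(snr_db(to_float32, hr), 1))
```

Under these assumptions fixed point should come out ahead at small headroom and float ahead once the headroom is large; where exactly the two cross depends on the signal statistics.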

i think that Glen was just covering his butt. certainly if your CPU "solution" is 1000 times shy of the computational bandwidth needed, making the "ware" a bit harder might be indicated. i dunno if you could even get a 1000 times improvement in speed, but maybe hope to.

--

r b-j                  rbj@audioimagination.com 

"Imagination is more important than knowledge."

- rickman

Tue, Sep 17, 2013 7:40 PM

Ok, now we are at the crux of it finally. How is debugging software written in an HLL different from debugging bitware written in an HDL?

I won't call it hardware because 95% of the debugging is done in the simulator rather than on the bench in my experience. I find the HDL simulators much easier to use most of the time than HLL simulators or debuggers because they have great visualization and access. But then I will admit some bias here because my familiarity with HLL tools is some 10+ years old for the most part. HLL tools may have improved significantly too.

Old school FPGA design was done by making changes and downloading a new bit stream to test on the work bench, lather, rinse, repeat. This was very inefficient though because dragging out hardware test tools is a PITA. We have moved on...

If you are spending significant time debugging FPGAs in your system, you have a faulty development process. You can read nearly any book on the development process and they will tell you that every man hour spent debugging early in the process saves you 10 or more later in the process.

When I was working with FPGAs some 20 years ago we used schematic capture. The FPGA vendors started pushing HDLs as the tool of choice. Once they became more mature I realized just how powerful an HDL could be in terms of keeping the design process in the workstation and off the workbench. "Test fixtures" are very powerful tools to the point that I often have very little debug to do on the workbench, just verification.

I find work like DSP goes well if you have a designer for the specification of a function and a designer for the implementation of the function. They can be the same person, but don't need to be. In either case a specification needs to be written and that allows for separation of the two steps, specification and implementation.

Separating the two skill sets will help you find people to do the work. If this is your only problem in using FPGAs, please give me a call. I have had a lot of practice working remotely and it typically produces very good results.

--

Rick

- rickman

Tue, Sep 17, 2013 7:50 PM

Uh, give me a number. How many multipliers do you need? I'm sure I can find a device with that number of multipliers.

Are you suggesting that you can't find a suitable FPGA for less than $4,000? Really?

Still, this is pretty far off topic. The OP said KF development on an FPGA would be a "nightmare". Needing a lot of multiplier blocks is hardly a "nightmare".

Actually, the OP has indicated that the problem can be solved in 64 bit fixed point or possibly even 48 bit fixed point.

--

Rick

- rickman

Tue, Sep 17, 2013 8:00 PM

I read what you wrote and I acknowledge that I may not fully understand it. You used terms like, "large-scale enterprise software" which I am not familiar with. What is that supposed to mean?

"For FPGAs it is things like the design tools, place and route, timing and clock distribution." What does that mean? What is "it"?

BTW, in the part you responded to, I was only replying to the words of yours that I quoted. I thought you were saying HDL coding is equivalent to assembly language coding. Is that not correct?

Please explain.

I would be interested in hearing how MCU development has changed. Care to elaborate?

--

Rick

- Tom Gardner

Tue, Sep 17, 2013 8:30 PM

That's because you snipped it from my message. Hint: read the paragraph beginning "For enterprise software..."

Another example of you needing to read more slowly?

Read the context before that paragraph.

Another example of you needing to read more slowly?

No. That's why I quoted the context.

Another example of you needing to read more slowly?

- Mark Curry

Tue, Sep 17, 2013 9:03 PM

I agree totally - I think. Generic floating point in FPGAs is pointless. You're better off using a custom format, designed for the problem at hand. This can be a static, fixed point (with pretty significant number of bits). This can be "a little floating point" with n bit mantissas and a few bits of exponent. But full on IEEE 754 (single or double) - that's dumb on an FPGA. There's little reason for a single wire on an FPGA to have that kind of dynamic range.
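To make the "little floating point" idea concrete, here is a minimal sketch of rounding a value into a format with a chosen number of mantissa and exponent bits. The format itself (mantissa in [0.5, 1), signed exponent range, saturate on overflow, flush to zero on underflow, no NaN or infinity) and the name `quantize_custom` are hypothetical choices for illustration, not any vendor's or standard's definition.

```python
import math


def quantize_custom(x, mant_bits, exp_bits):
    """Round x to a toy float format with mant_bits of mantissa and exp_bits of exponent.

    Hypothetical format: mantissa in [0.5, 1), exponent in
    [-2**(exp_bits-1), 2**(exp_bits-1) - 1], saturating, no NaN/Inf.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mant_bits
    m = round(m * scale) / scale      # keep mant_bits fractional bits
    if abs(m) == 1.0:                 # rounding carried out of the mantissa
        m /= 2.0
        e += 1
    e_min = -(1 << (exp_bits - 1))
    e_max = (1 << (exp_bits - 1)) - 1
    if e > e_max:                     # too large: clamp to the biggest value
        return math.copysign(math.ldexp(1.0 - 1.0 / scale, e_max), x)
    if e < e_min:                     # too small: flush to zero
        return 0.0
    return math.ldexp(m, e)
```

With this rounding the relative error stays at or below 2**-mant_bits for values inside the exponent range, so the two widths can be sized separately: the mantissa for the precision the signal needs, the exponent for the dynamic range it actually covers.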

Haven't followed this whole thread in detail - not saying FPGAs are the best solution for a specific problem. But discounting FPGAs because they don't "do floating point" or "don't have enough precision" is wrong.

Regards,

Mark

- Randy Yates

Tue, Sep 17, 2013 9:05 PM

Ha ha ha ha ha ha!!!

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com

- Mark Curry

Tue, Sep 17, 2013 9:11 PM

That's almost always the case (at least with video it is). The kernel of some algorithm is fairly easy (both design, and documentation, and modeling). The algorithm can be done in hw, or sw, or xxx (even Matlab!, heh).

All the blood, sweat and tears are spent in getting all data to the kernel, and then back out in a timely, ordered fashion...

Regards,

Mark

- Mark Curry

Tue, Sep 17, 2013 9:16 PM

I really didn't intend that pun... Totally went over my head until you replied. And then couldn't figure out why you were laughing at me :)

--Mark

- dp

Tue, Sep 17, 2013 11:38 PM

But your point :) is something I just wanted to make, it is a valid one. For DSP-ing (doing lots of MAC) one uses floating point on processors simply because/when a 64-bit FP register is the only large enough accumulator. I see no reason why one would implement FP on an FPGA to do the MAC thing when just having a large enough accumulator should be much easier, data won't have to be converted (as they normally come from some ADC or something) to FP etc.
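A minimal sketch of the "large enough accumulator" idea, assuming signed integer samples and coefficients. The helper names are made up for illustration; the sizing rule is the usual one (product width plus one bit per doubling of the number of terms).

```python
def accumulator_bits(sample_bits, coeff_bits, n_terms):
    """Bits needed so a sum of n_terms signed products can never overflow."""
    growth = (n_terms - 1).bit_length()    # ceil(log2(n_terms)) for n_terms >= 1
    return sample_bits + coeff_bits + growth


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate in a two's-complement accumulator of acc_bits bits."""
    mask = (1 << acc_bits) - 1
    acc = 0
    for s, c in zip(samples, coeffs):
        acc = (acc + s * c) & mask         # wrap like a hardware register
    if acc >> (acc_bits - 1):              # reinterpret the top bit as the sign
        acc -= 1 << acc_bits
    return acc
```

For example, 1024 products of 16-bit samples and 16-bit coefficients need 16 + 16 + 10 = 42 bits by this rule. The result is exact, and nothing is converted to floating point along the way.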

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------

- Randy Yates

Wed, Sep 18, 2013 1:33 AM

I'm glad you got the (er) point...

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com

- Randy Yates

Wed, Sep 18, 2013 2:30 AM

Robert,

What do you mean by "headroom?" Do you mean the extra dynamic range required in the intermediate computations, such as is typically provided on fixed-point processors by the "guard" bits?

Also, could I please get a copy of your preprint?

--
Randy Yates 
Digital Signal Labs 
http://www.digitalsignallabs.com

- Paul Rubin

Wed, Sep 18, 2013 5:43 AM

That's why 64-bit float was invented ;-). Seriously, the idea of double precision isn't that you need an ultra-precise final result, but rather, that if you're doing a numerical algorithm with a lot of steps that's accumulating a small amount of roundoff error in each step, the accumulated errors won't reach physical significance unless the calculation is unusually long or the algorithm is especially unstable at the input data. For that reason all these suggestions of using non-IEEE floating point formats sound hacky unless they're coming from numerics experts. The era of ad hoc floating point formats was the 1970's and earlier. The world has moved on since then.
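The accumulation effect is easy to see in a small experiment. This sketch adds 0.1 to a running total many times, once with every intermediate result rounded to single precision and once in double. Single precision is emulated with `struct`, and the step value and iteration count are arbitrary choices for illustration.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest IEEE 754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def repeated_sum(step, n, single):
    """Add step to a running total n times; round every result to float32 if single."""
    total = 0.0
    inc = f32(step) if single else step
    for _ in range(n):
        total = total + inc
        if single:
            total = f32(total)
    return total


if __name__ == "__main__":
    n = 100_000
    print("float32 error:", repeated_sum(0.1, n, True) - 10000.0)
    print("float64 error:", repeated_sum(0.1, n, False) - 10000.0)
```

The per-step rounding error is tiny in both cases, but after 100,000 steps the single-precision total has drifted visibly while the double-precision one has not, which is the margin the paragraph above is describing.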

- Anssi Saari

Wed, Sep 18, 2013 7:48 AM

I seem to recall Altera mentioning that their new stuff improves on that and indeed Stratix V (and also Arria V and Cyclone V) has 27x27 multipliers and a 64-bit accumulator. But is that enough for double precision floating point?
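One way to look at the question: a double has a 53-bit significand, so a significand product can be split into four 27x27 partial products. The sketch below does that with plain integers. It assumes the multiplier blocks accept 27-bit unsigned operands, which is not checked here against the Altera datasheets, and it ignores exponent handling, normalization and rounding.

```python
LIMB = 27
MASK = (1 << LIMB) - 1


def mul53_with_27bit_blocks(a, b):
    """Multiply two 53-bit unsigned significands using only 27x27-bit products."""
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_lo, a_hi = a & MASK, a >> LIMB       # a_hi is at most 26 bits
    b_lo, b_hi = b & MASK, b >> LIMB
    # Four partial products, each small enough for one 27x27 multiplier.
    p_ll = a_lo * b_lo
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo
    p_hh = a_hi * b_hi
    # Shift-and-add; the full result can be up to 106 bits wide.
    return p_ll + ((p_lh + p_hl) << LIMB) + (p_hh << (2 * LIMB))
```

Each partial product fits in 54 bits, but the combined result can reach 106 bits, so a single 64-bit accumulator would not hold the full product without extra splitting or truncation logic.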

- Mark Curry

Wed, Sep 18, 2013 2:35 PM

That's not the case at all for FPGAs and probably ASICs too. In fact, there's new support in VHDL for the IEEE "variable precision" floating point, where the number of bits in the mantissa and exponent are explicitly set.

It's a very good idea for HW design. And you don't need to be a "numerics" expert. It's not difficult at all. I point FPGA folks to Randy's fixed point tutorial all the time.

Regards,

Mark

- David Brown

Wed, Sep 18, 2013 3:17 PM

If you can't see the issue with your recent posts and your attitude in them, then there is little more I can say here. Communication is more than the sum of the words you write, and the impression you give is from between the lines and general points more than anything specific. Yes, you are "guilty" in /my/ mind, and this is /my/ opinion - not the exact words you wrote. That's the point - this is the opinion I have formed from reading your posts. If I, and at least some others here, were not human then perhaps we would have seen nothing but the technical points you made. From your posts in the recent threads, I am left with the impression that you are an "all I've got is an FPGA hammer" evangelist - and I know that is not true, and I know that is not the impression you are trying to give. But it is the impression you /are/ giving - and I thought you should be told.

But now I will try my best to be a /little/ more technical.

Yes, this is /opinion/ - this is perfectly obvious from my wording. It is a summation of countless years in embedded development - mostly microcontroller-based, but a little FPGA (and CPLD before that), and summation of talking to and listening to developers of all sorts. What do you want me to do - dig through comp.arch.embedded archives looking for examples of over-enthusiastic FPGA proponents? Show you my own mistakes, when I have started looking at FPGA solutions only to find cheaper and better alternatives? Or perhaps you think that while everyone else is biased and promotes their favourite technology over others, FPGA fans are all precise and rational and would never consider suggesting one unless it were clearly the best idea?

Opinions are /good/. We need to be clear about what are opinions and what are facts, and we need to understand our own biases and those of others. But with that in mind, "opinion" is formed from our direct and indirect experiences. Your customers do not come to you because you can develop for FPGAs - they come to you for your experience. They come for your /opinion/.

Obviously when we are asked for a professional opinion, we do more research and more justification than in a Usenet post. If you want hard evidence and justification for my claims about Kalman being much harder to do on an FPGA than in software, then I could certainly give you it - if you are willing to pay for my time. I could give you anything from bullet points, through web research, and up to full implementations in software and FPGA with the hours it took and the costs involved. But unless you are paying, Usenet opinion is all you get.

First off, are we all happy that FPGAs are really good at doing things in parallel?

Secondly, are we all happy that processors are really good at doing things serially? They step through a sequence of instructions, doing (usually) one step at a time.

The point of contention is how good FPGAs are at doing things serially.

When the algorithm is a series of similar steps (such as "do a MAC on this data, then a MAC on that data, etc."), FPGAs are fine - you can write a state machine that handles the steps, along with some logic for end cases.

But when you have complicated branching, jumping, subroutines, etc., in your algorithm, then FPGA logic is hard. It is hard to implement, as you have to re-organise the algorithm into a form that the FPGA languages can handle. And all the time you are faced with balances - do you dedicate resources such as DSP blocks to particular tasks, or do you multiplex them across a range of uses? When you need to hold a number of 64-bit variables, do you use logic cells for these? If your algorithm needs a dozen such variables, you quickly lose a percent or two of your total logic elements here. Or do you put them in a memory block - saving space, but requiring complex logic to address them? When you need complex sequential work, do you write it all out, using all the combinational and sequential logic, spending your resource-limited LUTs on multiplexers, counters, decoders, etc.? Do you implement some sort of state machine interpreter with the steps held in a ROM table?
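As a software model of the last option, here is a minimal sketch of a "state machine interpreter with the steps held in a ROM table": one step per clock, with a single multiplier shared by every step that needs it. The instruction format, register names and example program are all hypothetical; a real design would be written in an HDL and would also need branching, which is left out here.

```python
# Each ROM word: (operation, destination register, source A, source B).
# The program below is a hypothetical example: r4 = r0*r1 + r2*r3.
ROM = [
    ("mul", "t0", "r0", "r1"),
    ("mul", "t1", "r2", "r3"),
    ("add", "r4", "t0", "t1"),
    ("halt", None, None, None),
]


def run(rom, regs):
    """Step through the ROM one word per clock; return (registers, clock count)."""
    regs = dict(regs)
    pc = 0          # program counter: the state register of the sequencer
    clocks = 0
    while True:
        op, dst, a, b = rom[pc]
        clocks += 1
        if op == "halt":
            return regs, clocks
        if op == "mul":                 # the single shared multiplier
            regs[dst] = regs[a] * regs[b]
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        else:
            raise ValueError(f"unknown op {op!r}")
        pc += 1
```

The trade-off shows up directly: the two multiplies take two clocks because they share one multiplier, where a fully unrolled design would spend two multipliers and finish sooner.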

And how do you test and debug all this? An FPGA simulator is great for some things, but not this - you want to be able to single-step the system, read out whatever variables you want, put breakpoints in the code, print out values to a file, etc. When you want to change the code, the programmer changes the code and quickly re-compiles - the FPGA developer needs a much more demanding re-build (and that's assuming there is no need to do the route and place part).

Once you get beyond a certain basic level of complexity, sequential work is best done with a sequential processor.

"parallel" is just a matter of perception. If a processor does one thing every microsecond, then it does a thousand things every millisecond - just as if it did those thousand things in parallel once per millisecond. You will notice that your PC is quite happy running multiple programs "in parallel" even if it only has one CPU.

Try /thinking/ rather than /measuring/.

FPGAs are Turing complete. So are cellular automata - but they are pretty hopeless for anything other than the type of simulation that is a "natural fit" for that type of computer. The same applies to FPGAs.

On processors that have appropriate hardware support, it is easy to do the double precision floating point because the hardware supports it - it is a single line of C code, or a few assembly instructions. On processors that don't have the support, it is /also/ easy to do because it is part of the basic toolchain for the chip (a compliant C compiler /must/ provide it).

On an FPGA, you either have to write the FP stuff yourself - taking a great deal of time and effort - or you have to use third-party blocks that are often expensive to buy, and take a lot of resources. A quick check of some IP blocks on Altera's site suggests that double precision floating point takes of the order of 2000 LE's - that is a /massive/ amount when you are using sanely priced devices. It means that you can pretty much forget about structures using dedicated FP blocks for each operation in an algorithm like KF - there simply aren't enough resources on a device. So you have piles of code (and work, testing, debugging, logic resources, etc.) to funnel everything through a few FP blocks. Alternatively, with a big enough FPGA and enough development time, you could probably write optimised FP blocks of different kinds for different parts of the system, and get it all squeezed in.

Tell me again how Kalman on an FPGA is /not/ massively harder to implement than doing it in software?

See above.

Note also that I think most people here agree that it would be significantly harder to implement KF in an FPGA than in software (even if they might not use the word "nightmare"). Thus it is /you/ who is making the extraordinary claim here, and it is /you/ who need to provide evidence suggesting it is not hard.

In the OP's situation when he only needs a small number of systems, I would probably do as he did and use a big processor (like an SBC) for speed of development. If I wanted to have minimal costs, I would re-implement it with fixed point (which is a minor change, once the algorithm is working, and easy to test and debug) and use a Cortex-M4 for two or three dollars.
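To give a feel for the float-to-fixed step, here is a minimal sketch of a scalar Kalman filter (constant state, noisy measurements) written twice: once in double precision and once in Q16.16 integers. This is a toy, not the OP's filter: a real KF is matrix-valued, and the Q16.16 format, the noise values and the function names are assumptions chosen for illustration.

```python
import math

FRAC = 16
ONE = 1 << FRAC


def to_fix(v):
    """Convert a float to Q16.16."""
    return int(round(v * ONE))


def kalman_float(zs, q, r, x=0.0, p=1.0):
    """Scalar Kalman filter for a constant state, in double precision."""
    for z in zs:
        p = p + q                  # predict: variance grows by process noise
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # correct with the measurement
        p = (1.0 - k) * p
    return x


def kalman_fixed(zs, q, r, x=0.0, p=1.0):
    """The same recursion using integer adds, multiplies, shifts and one divide."""
    q, r, x, p = to_fix(q), to_fix(r), to_fix(x), to_fix(p)
    for z in zs:
        z = to_fix(z)
        p = p + q
        k = (p << FRAC) // (p + r)
        x = x + ((k * (z - x)) >> FRAC)
        p = ((ONE - k) * p) >> FRAC
    return x / ONE
```

Running both on the same measurement sequence and comparing the outputs is the kind of test that makes the change easy to debug: the floating point version serves as the reference for the fixed point one.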

There are chips that are certainly fast enough for under about $10, even if you stick to DP FP.

- Paul Rubin

Wed, Sep 18, 2013 3:25 PM

Weren't you telling us just a few days ago how to implement basic arithmetic from scratch and "you only need to do it once"? That sure sounds more difficult than doing it zero times (because it's already there on CPU's).

Why don't you tell us how to do, say, basic Gaussian elimination in Verilog (double precision floating point of course). Then we can compare the difficulty with doing it in software. There is some pseudocode here:

formatting link
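For the software side of that comparison, a minimal sketch of Gaussian elimination with partial pivoting in double precision (Python floats are IEEE doubles). This is not the pseudocode from the link above, which did not survive in the archive; the function name and the list-of-rows layout are choices made here for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of rows, b a list; both are copied, not modified.
    """
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Pick the row with the largest pivot to limit roundoff growth.
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

Note the data-dependent row swap and the nested loops with varying bounds: these are the kinds of control flow that the thread is arguing about.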

- upsidedown

Wed, Sep 18, 2013 3:41 PM

What's so special about IEEE floats/doubles?

The only, but _significant_, advantage was that finally you could easily transfer floating point data from one computer system (from different vendors) to another in binary format using magnetic tapes and later TCP/IP.

Before this, at least for ad hoc transfers, it was common practice to print out float values as decimal digits in ASCII/EBCDIC onto the magnetic tape and then read those decimal strings such as "+1.23456789E+05" into the other system and convert it to the local floating point representation. This of course caused all kinds of rounding/truncation errors.

Fortunately with IEEE floats, this is no longer required, but only a few years ago, I had to write conversion routines for some old Siemens PLC floating point format with special location of the exponent field, the exponent bias/offset and hidden bit conventions to get the most of the accuracy needed.
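The difference between the two transfer methods can be shown in a few lines. This sketch sends a double once as a decimal string in the style quoted above and once as its raw IEEE 754 bytes; the function names and the choice of 9 significant digits are illustrative assumptions.

```python
import struct


def via_text(x, digits=9):
    """Send x as a decimal string with `digits` significant digits, then parse it."""
    text = "%+.*E" % (digits - 1, x)     # e.g. +1.23456789E+05 for digits=9
    return float(text)


def via_binary(x):
    """Send x as the 8 raw bytes of an IEEE 754 double (big-endian), then unpack."""
    return struct.unpack(">d", struct.pack(">d", x))[0]
```

The binary path returns exactly the value that was sent. The text path only does so if enough digits are printed (17 significant digits are sufficient for a double); with fewer, the receiver gets a nearby value instead.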