Small, fast, resource-rich processor

(snip, someone wrote)

I was pretty surprised by that statement, too.

Now, someone used to doing it can implement an algorithm in an FPGA pretty darn fast, but the way it is done tends to be very different. Well, it does if you want the speed.

For one thing, when you write a loop in C, you execute a statement or group of statements some number of times. A loop in Verilog generates multiple copies of the hardware it describes. The number of copies has to be known at compile (logic synthesis) time.
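To make the contrast concrete, here is a rough software analogy (a made-up 4-tap example, nothing more): the C loop below reuses one multiplier and one adder four times in sequence, whereas a Verilog for-loop with the same fixed bound gets elaborated into four parallel copies of the hardware, much like the hand-unrolled expression.

#include <stdio.h>

#define N 4   /* the trip count must be a compile-time constant for synthesis */

int main(void) {
    int coeff[N] = {1, 2, 3, 4};
    int x[N]     = {5, 6, 7, 8};

    /* Software view: one multiplier and one adder, reused N times in turn. */
    int acc = 0;
    for (int i = 0; i < N; i++)
        acc += coeff[i] * x[i];

    /* What elaborating a fixed-bound HDL loop amounts to: the body is
     * replicated, so N multipliers and adders exist side by side.        */
    int acc_unrolled = coeff[0]*x[0] + coeff[1]*x[1]
                     + coeff[2]*x[2] + coeff[3]*x[3];

    printf("%d %d\n", acc, acc_unrolled);   /* both print 70 */
    return 0;
}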

I have worked on a number of systolic array implementations of dynamic programming algorithms. They look completely different from the software version.

-- glen

Reply to
glen herrmannsfeldt

And what did those experts say about the adequacy of the resolution? I guess that is what I thought needed to be addressed. Are you saying that you are concerned about the implementation working?

--

Rick
Reply to
rickman

The experience of the professional high performance computing (HPC) community indicates that is /not/ the case.

Googling for HPC on comp.arch (especially anything by Nick MacLaren who's been at the sharp end of that since the 60s) will give you indications as to why.

In particular you will get a feeling for why IEEE754 still has inherent problems, even though it is significantly better than the alternatives.

Analysing the algorithm is non-trivial, whether or not IEEE754 is used.

The one ray of light for embedded/dsp work is that the range of inputs is likely to be more constrained.

A ray of darkness for embedded/dsp work is that shortcuts will probably be taken to fit the algorithm into the available resources.

Yes indeed!

Reply to
Tom Gardner

Because if you use IEEE-754 double precision, you're already getting the benefit of the numerics experts who designed the standard. If you use anything else, you have to supply your own numerics experts.

Reply to
Paul Rubin

So, of course, are FPGAs. And operational amplifiers. And discrete logic. And cellular automata. And...

All technologies have their strengths and weaknesses. All projects have their objectives and constraints.

The /only/ important point is to avoid choosing a technology that has unacceptable disadvantages in a given context.

In most cases there is more than one way to skin the cat, and it doesn't really matter which is chosen.

Reply to
Tom Gardner

Unlike hardware? :)

There is actually method in that madness: it is better to acknowledge that re-integration will always break things and to have mentalities and processes to deal with that reality. Little breaks -> little repairs.

The alternative is to have infrequent massive integrations with consequent massive problems and huge repairs.

There's a balance to be struck, depending on the objectives. If the objective was to teach, then it looks like they struck the wrong balance!

In fairness, scalability and reliability requirements do take a toll! If everything can be done in one box the transaction rate can rise impressively. Replace "box" with "chip", and an analogous point can be made in the hardware/embedded arena!

What always amused me was the layering of synchronous comms protocols on asynchronous comms protocols on synchronous comms protocols on asynchronous comms protocols on synchronous comms protocols on asynchronous comms protocols on... You get the drift :)

Reply to
Tom Gardner

Numerics experts, for some level of "expert" depending on the algorithm and application.

Here you're in much better shape if you're just a bit careful, without having to be a numerics expert. Lots of calculations can be done in single precision all the way through if they are designed by experts but will fail if done that way naively, while naive methods in double precision will still work. So the rest of us are better off with double precision or (to listen to Prof. Kahan) even quadruple precision.
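A tiny illustration of that point (the numbers are chosen purely to show the effect): the same naive accumulation that silently stops making progress in single precision behaves fine in double.

#include <stdio.h>

int main(void) {
    /* 2^24 is where float stops being able to represent every integer. */
    float  f = 16777216.0f;
    double d = 16777216.0;

    for (int i = 0; i < 100; i++) {
        f += 1.0f;   /* each add rounds straight back to 2^24: the sum never moves */
        d += 1.0;    /* double has bits to spare: the same naive loop works        */
    }

    printf("float : %.1f\n", f);   /* 16777216.0 - stuck   */
    printf("double: %.1f\n", d);   /* 16777316.0 - correct */
    return 0;
}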

It's a very standard test. Actually I forgot to say, you're supposed to run multiple times in different combinations of IEEE rounding modes, to see how the answers vary. That's part of the reason the selectable rounding modes exist. It leads to sort of a poor man's interval arithmetic, which is apparently the really serious way to test.
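For anyone who wants to try it, C99 exposes those rounding modes through <fenv.h>. A minimal sketch of the run-it-both-ways idea (the calculation here is only a placeholder):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler we touch the FP environment */

/* Whatever calculation is under test -- a toy accumulation will do. */
static double calc(void) {
    double s = 0.0;
    for (int i = 1; i <= 1000000; i++)
        s += 1.0 / i;
    return s;
}

int main(void) {
    fesetround(FE_DOWNWARD);
    double lo = calc();

    fesetround(FE_UPWARD);
    double hi = calc();

    fesetround(FE_TONEAREST);   /* restore the default */

    /* The spread between the runs gives a rough - not rigorous -
     * feel for how sensitive the result is to rounding.           */
    printf("down  : %.15g\nup    : %.15g\nspread: %g\n", lo, hi, hi - lo);
    return 0;
}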

Right, you still have to pay attention to where the algorithm starts getting wonky. But sometimes you can check that this won't happen in the relevant input range.

Basically you have to distinguish between numerical instability throwing your answers way off, and inaccuracy arising from the accumulation of small errors. Double precision simplifies your life by mostly getting rid of the second problem, and it helps with the first.

Yes, this is true.

Reply to
Paul Rubin

They say IEEE double precision is good enough for most things (I guess they want you to use extended precision for intermediate results as it's done on the x86), but single precision often goes astray. This has to do with the sizes of reasonable calculations and therefore the amount of error buildup they're likely to see. And yes, if it's for something critical, it takes more careful analysis. I don't know where Tim's KF thing would come into the picture.

Well, precision is relevant for two reasons: 1) numerical instability, e.g. from an ill-conditioned matrix in the KF example; 2) error buildup in an otherwise well-behaved calculation. For the first of these you have to think carefully about the application (e.g. does it get into highly nonlinear regimes?). For the second, double precision mostly takes care of it, given that these calculations are relatively small. If you're running something like a long range PDE solver it might be different.

I just mean there are a lot of mathematical and coding pitfalls you can get into when implementing this stuff, and IEEE arithmetic is designed to protect you from quite a lot of them. It's more forgiving towards straightforward implementations than systems that came before it.

Reply to
Paul Rubin

(snip, someone wrote)

If so, you better really pipeline the MAC. I suppose you could do that.

I don't know Kalman enough to explain that, but I think I can explain an FIR.

Pretty much, you want to unroll the inner loop. In a software implementation, you have a loop over the coefficients; each input value is multiplied by each coefficient and summed.
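For reference, that software version is just this (all names invented):

#include <stdio.h>

#define NTAPS 4

/* Plain software FIR: for each output, loop over the coefficients and
 * multiply each one by the corresponding (delayed) input sample.       */
static double fir_output(const double coeff[NTAPS], const double x[], int n)
{
    double acc = 0.0;
    for (int k = 0; k < NTAPS; k++)
        acc += coeff[k] * x[n - k];   /* x[n] is the newest sample */
    return acc;
}

int main(void) {
    const double coeff[NTAPS] = {0.25, 0.25, 0.25, 0.25};   /* moving average */
    const double x[] = {0, 1, 2, 3, 4, 5, 6, 7};
    for (int n = NTAPS - 1; n < 8; n++)
        printf("y[%d] = %g\n", n, fir_output(coeff, x, n));
    return 0;
}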

For a systolic array, you have a series of processing cells, usually in a linear array, and at each clock cycle data passes from one cell to its neighbor cell. (Usually in one direction, but both directions are also allowed.)

For FIR, probably the best way is with a pipeline such that the input data is clocked through the array, with one filter tap coefficient per cell. On the first clock cycle, the first data sample is multiplied by the first filter coefficient. On the second cycle, the first sample is multiplied by the second filter coefficient and the second sample by the first coefficient. I would pipeline between the multiply and add, but some might not.

On the third clock cycle, the first data point is multiplied by the third coefficient, the second point by the second coefficient, and the third point by the first coefficient. Also, in the add pipeline, the result from the first multiplier is added to the result from the second multiplier. The partial sums travel down the pipeline with the input data, and finally the result comes out the other end, one value per clock cycle.
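A cycle-by-cycle C model may make the arrangement easier to follow (everything here is invented for illustration, and for brevity the adds are collapsed into a single registered stage rather than the full systolic adder chain described above):

#include <stdio.h>

#define NTAPS 4

/* Registers of the pipeline: a data shift register with one coefficient
 * per cell, a register after each multiplier, and a registered output.  */
typedef struct {
    double d[NTAPS];   /* data samples clocked through the array */
    double p[NTAPS];   /* registered products, one per cell      */
    double y;          /* registered output                      */
} fir_pipe;

static double clock_cycle(fir_pipe *s, const double c[NTAPS], double x_in)
{
    /* Combinational logic: next-state values from the current registers. */
    double y_next = 0.0;
    for (int k = 0; k < NTAPS; k++)
        y_next += s->p[k];                /* adder stage                  */

    double p_next[NTAPS];
    for (int k = 0; k < NTAPS; k++)
        p_next[k] = c[k] * s->d[k];       /* one multiplier per cell      */

    /* "Clock edge": every register updates at once. */
    s->y = y_next;
    for (int k = NTAPS - 1; k > 0; k--)
        s->d[k] = s->d[k - 1];            /* data shifts to its neighbour */
    s->d[0] = x_in;
    for (int k = 0; k < NTAPS; k++)
        s->p[k] = p_next[k];

    return s->y;   /* one output per clock, a couple of cycles behind */
}

int main(void) {
    const double c[NTAPS] = {0.25, 0.25, 0.25, 0.25};   /* moving average */
    fir_pipe s = {0};
    for (int t = 0; t < 12; t++)
        printf("t=%2d  y=%g\n", t, clock_cycle(&s, c, (double)t));
    return 0;
}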

That works well if the hardware size is appropriate for the number of filter taps. If the filter gets longer, you either add more hardware or reuse the hardware. (I used to call this a virtual pipeline.) Each cell handles some fraction of the taps, multiplies them sequentially, and then passes the result on only after the appropriate number of clock cycles.

If you want to go faster, and add more hardware, you have to multiply more than one input point by each coefficient on each cycle. It takes a little more work to arrange the registers as appropriate, but it isn't so hard to do.

Note, though, that the arrangement of the pipeline depends on the number of taps.

-- glen

Reply to
glen herrmannsfeldt

The HPC community have different requirements - they are basically the main users of IEEE functionality beyond the basic maths. They /do/ care about the treatment of NaNs, the consistency between different implementations, etc.

But this is c.a.e. - here it is seldom worth the cost or effort to support these features. And here we don't accept that the result of a calculation could be wrong or NaN - and you guarantee that in the same way as you do the rest of the verification, bug checking, testing and qualification of your software. (That is to say, the details and the effort vary enormously - the methods used for designing a singing birthday card and a pacemaker are rather different.)

Are we talking about one particular algorithm here (such as Kalman)?

It is not a "ray of light" - it is a powerful spotlight.

And if your algorithm is so convoluted that you fear values might go to infinity, or lead to divide by zeros, or lose too much precision, then you are screwed anyway. No amount of IEEE "magic" will help you.

Reply to
David Brown

Not having actually tried, the data flow is nice and regular for ordinary Gaussian elimination, so it might not be so hard to do. (Though for fixed N.) I think it will be harder if you want to do pivoting, though. It does have division, but much less than multiply (for reasonable sized N).

-- glen

Reply to
glen herrmannsfeldt

I should have been clearer; my comment was mainly a response to the "...should not be above your pay grade..." statement.

That may be your experience. It was definitely *not* the case for one of my embedded projects.

You're overstating your case to make a point - but I agree with the point in *most* cases.

Yes. :)

In my experience many software weenies take a deep breath when you ask them "if you have two numbers known to 1% and you do an arithmetic operation on them, what do you know about the result?"

Most will eventually stumble towards the divide-by-zero case.

Most will also say "1%", incorrectly.

Many have to have it explained in detail, with examples, that you don't know *anything* about the result - if you are subtracting two nearly equal numbers.
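Put numbers on it (values invented): two quantities each known to 1% can have a difference about which you know essentially nothing - not even the sign.

#include <stdio.h>

int main(void) {
    /* Two measurements, each good to 1%. */
    double a = 100.0, b = 99.0;
    double a_err = 0.01 * a, b_err = 0.01 * b;

    /* Nominal difference and its worst-case bounds. */
    double diff     = a - b;                       /* nominally 1.00 */
    double diff_min = (a - a_err) - (b + b_err);   /* -0.99          */
    double diff_max = (a + a_err) - (b - b_err);   /*  2.99          */

    /* The absolute uncertainties add, so the relative uncertainty of the
     * difference is around +/-200% - even its sign is unknown.           */
    printf("difference %.2f, range [%.2f, %.2f]\n", diff, diff_min, diff_max);
    return 0;
}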

And they are often the ones writing financial software algorithms.

Sad. Very sad.

Agreed, but the algorithm doesn't need to be convoluted for problems to arise. I've seen it when calculating the cost of a phone call!

And the "solution" used to get the code to pass a unit test (and therefore by definition correct) left my lower jaw flapping on my chest.

Reply to
Tom Gardner

(snip, I wrote)

You can, but it is pipelining that gets the speed.

And in most FPGAs the registers are there, used or not, so you might as well use them.

But as with using a CPU, if the combinatorial logic is fast enough, then maybe there is no reason to pipeline it.

-- glen

Reply to
glen herrmannsfeldt

In embedded design, all generalisations are false...

Yes - it all depends on what these arithmetic operations are, and how your 1% is defined (1% of full scale? 1% away from the true value - and if so, what if the true value is 0?). Are you interested in error distributions or just margins? (I am a maths "weenie" as well as a software weenie, so I hope I'd do better than "most" here.)

Well, the economists that make the financial models in the first place have little basis in reality, so bad implementations are not going to make things worse!

Maybe the algorithm wasn't convoluted enough here.

Was it one of these great "solutions" that involves spotting that you are running as a test, and doing special-case code to pass the test? I've seen that done on occasion.

Reply to
David Brown

Including that one :)

Your questions indicate that you are already streets ahead of most software weenies :(

Keep it simple: 1% of the value. That avoids considering what the full scale of a natural number might be!

In this case the customers shouted if the cost was off by a penny per call! Justifiably, IMHO.

But apart from that we are in violent agreement.

It really wasn't that complex: more or less fixed cost per call plus cost per minute. The complexity was in high availability, "low" latency, and deciding what the call's coefficients were in the first place.

No. Worse.

Start by using floating point for monetary values. Find you've got the wrong answer.

Invent a numerically invalid mechanism for getting the right value for the test cases.

Don't worry about other cases ("it passed the unit tests, so it is right", by definition).

Don't worry that it is computationally expensive (reasonable since that didn't affect performance).

And no, I'm not going to say what the invalid mechanism was, and if somebody guessed the answer I couldn't possibly comment.
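Whatever that mechanism actually was, the generic failure in the first step is easy to reproduce (the tariff figures are invented): binary floating point cannot represent most decimal amounts exactly, while integer pence can.

#include <stdio.h>

int main(void) {
    /* Invented tariff: 10p call setup plus 20p per minute, for 3 minutes. */
    double cost_fp = 0.10 + 3 * 0.20;
    printf("float  : %.20f\n", cost_fp);                              /* not exactly 0.70 */
    printf("equal to 0.70? %s\n", cost_fp == 0.70 ? "yes" : "no");    /* no               */

    /* Keeping money in integer pence sidesteps the representation problem. */
    long cost_pence = 10 + 3 * 20;
    printf("integer: %ld pence\n", cost_pence);                       /* exactly 70       */
    return 0;
}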

Reply to
Tom Gardner

Sorry, that escaped by mistake.

and use it for all arithmetic for calculating the cost.

Don't even think that the "solution" might cause as many problems as it "cures".

Don't even think of considering that the algorithms might now be faulty by design, since...

Reply to
Tom Gardner

(snip)

I don't remember the IEEE rounding modes well enough, but the IBM Stretch has an option to shift in zeros or ones during post normalization. You can run a program both ways and look for different results. Seems to me that both in the case of Stretch and IEEE rounding modes you can have cases where the difference cancels out. (Subtract two values that rounded the same way.)

-- glen

Reply to
glen herrmannsfeldt

In a lot of these recent arguments, I see both sides basically arguing the same point - they're in agreement - but coming to different conclusions.

In any DSP solution, you as the designer must analyze the conditions and determine: do I have enough precision (i.e. enough bits in the mantissa)?

This is invariant whether a fixed point or floating point solution is used.

Secondly, you must analyze the conditions to determine if you have enough range. If you choose a floating point solution, this is less important - you're less likely to run into trouble. For fixed point, or custom floating point, you must do the engineering. As an ASIC/FPGA person who's done this a bit, this can be tricky. There are rounding considerations throughout. But that's engineering. You do your best, test extensively with real data, AND contrived data.
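As a small taste of that engineering (a made-up Q1.15 example): every fixed-point multiply forces an explicit decision about rounding, and about the one input pair that overflows.

#include <stdint.h>
#include <stdio.h>

/* Q1.15 multiply: 16-bit signed operands with 15 fractional bits.
 * The raw product has 30 fractional bits, so round, shift back down,
 * and saturate the single overflow case (-1.0 * -1.0).               */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q2.30 intermediate */
    prod += 1 << 14;                          /* round half up      */
    prod >>= 15;                              /* back to Q1.15      */
    if (prod >  32767) prod =  32767;         /* saturate           */
    if (prod < -32768) prod = -32768;
    return (int16_t)prod;
}

int main(void) {
    int16_t half    = 0x4000;    /*  0.5 in Q1.15 */
    int16_t neg_one = -32768;    /* -1.0 in Q1.15 */
    printf("0.5 * 0.5   -> %d (expect 8192, i.e. 0.25)\n", q15_mul(half, half));
    printf("-1.0 * -1.0 -> %d (saturated just below 1.0)\n", q15_mul(neg_one, neg_one));
    return 0;
}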

Then you add margin to be sure.

There's no relying on "They say double precision is good enough".

Paul - this is happening all the time in ASIC/FPGAs. We can't just punt and say "use double precision floating point everywhere." And I don't consider myself to be a "numeric expert" by any means.

Regards,

Mark

Reply to
Mark Curry

(snip, I wrote)

Yes, I believe that they are in the billion transistor range, if you include the on-chip cache. I thought KF at least does some multiply, but there are algorithms that don't even do that.

-- glen

Reply to
glen herrmannsfeldt

(snip)

(snip)

Yes. The reason that double precision is good enough is that the algorithms have been written to avoid enough of the cancellation for double precision to suffice.

If you do the operations in the right order, double precision is usually enough to get single precision results, and that is usually good enough.
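A trivial example of "the right order" (magnitudes invented): add the small terms to each other before adding them to the large one.

#include <stdio.h>

int main(void) {
    double big = 1e16;   /* adjacent doubles are 2.0 apart at this magnitude */

    /* Wrong order: each +1.0 rounds straight back to 1e16 and is lost. */
    double careless = big;
    for (int i = 0; i < 1000; i++)
        careless += 1.0;

    /* Right order: accumulate the small terms first, then add once. */
    double small_sum = 0.0;
    for (int i = 0; i < 1000; i++)
        small_sum += 1.0;
    double careful = big + small_sum;

    printf("careless: %.1f\n", careless);   /* 10000000000000000.0 */
    printf("careful : %.1f\n", careful);    /* 10000000000001000.0 */
    return 0;
}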

-- glen

Reply to
glen herrmannsfeldt
