extracting D from 1 / D*D

Hi Folks, Incredibly busy summer here, so before burning my brain cells, Googling or -worst- digging in my very dusty math courses, I submit this question to the DSP experts who usually float around, hoping some aren't at the beaches ;-)

How would you extract D from a X = K / (D * D) value (16 bits) ?

- in a small FPGA indeed, which has no embedded mult (but a mult could be synthesized)

- please don't answer to synthesize sqrt(1/x) !

- I have lots of clock cycles available to compensate the lack of FPGA resources.

- I would like to avoid a Piece-Wise Linear estimator if possible (though it would allow for coding sensor non-linearities).

In the past, I remember implementing an sqrt operator using a suite (the suite didn't converge as rapidly as it might, but the implementation was easy), so a similar technique would be nice. But I keep an open mind : any idea is most welcome :-)

Thanks, Bert

Reply to
Bert_Paris
Loading thread data ...

(snip)

There is an iterative sqrt algorithm that only does subtraction. For only 16 bits, it shouldn't take a lot of LUTs, and not so many cycles, either. I haven't thought about coding it in logic lately, though. What about the K?

-- glen

Reply to
glen herrmannsfeldt

As mentioned, I have no problem regarding sqrt, but it is 1 / Sqr(D) that I have to transform. The solution I had in mind before posting was to use a non-restore division (for 1/X) followed by the iterative sqrt. But maybe some one has a better idea.

Bert

Reply to
Bert_Paris

Have you figured out what accuracy you need? It will make a /big/ difference to the ease of implementation.

Implementing a linear interpolation means having a table of coefficients. Have you got space to allow lookup in a table of some sort?

You can reduce the coefficients a bit by first left-shifting X by two bits repeatedly until you have a 1 in either of the two MSB's. Once you have calculated D, right-shift it by 1 bit for each 2 bits you left-shifted X. This will save you from having to have high-precision lookups for small X.

You don't just have to do a square root - you also have to do a reciprocal. These are both relatively hard to implement accurately (unlike multiplication, which is fairly easy - and quite small if you have lots of clock cycles and can do it bit for bit).

If you are looking for a series of estimators, Newton-Raphson generally gives fast convergence. The sequence for your problem would be:

D_{n+1} = (3/2).D_n - (X/2K).D_n^3

Unfortunately, that will need a number of multiplies per cycle. You would also need to test how well it worked with limited bit precision.

Reply to
David Brown

Thanks for your feedback !

David Brown a formulé la demande :

10 bits looks good enough.

Not ideal in the context (uninitialized RAMs) :-( but that's doable.

If I calculate the sqrt first, I may not need this.

Absolutely !

While googling, I found a "magical" estimator based on manipulating an IEEE floating point as an integer (which they refine with an iteration of the NR above). Not useful here I guess.

After some thinking I will probably try this :

- calculate sqrt in format 8.2 or 8.3

- use a 1024 entries LUT for the reciprocal or (due to lack of ROM) :

- use a non-restore division 2**N / X

Thanks !

Reply to
Bert_Paris

(snip, I wrote)

In Newton-Raphson, reciprocal square root is easier than just square root, but that might not be true for the subtract based version. Non-restore divide is pretty small, or even restore divide, if you have the clock cycles to spare. How many cycles do you have?

-- glen

Reply to
glen herrmannsfeldt

Of course you'll have to deal with the special case 1/sqrt(0) = infinity. Leaving that aside:

This is very good advice, it makes the problem *much* easier.

This approach works because 1/sqrt(4D) = 1/sqrt(D) >> 1

The HW required is a leading-1 detector that selects whether to left shift the input 16 bits by 0/2/4/8/10/12/14 bits and later to right shift the output by 0/1/2/3/4/5/6/7 bits.

This approach means that you now only have to deal with the input range

16384...65535. Why is that an advantage ? Because the function 1/sqrt(D) is highly nonlinear over the range 1...65535 but only mildly nonlinear over the range 16384...65535. Linear interpolation will work *much* better over the reduced range.

Some numbers to back this up:

derivative of 1/sqrt(D) is -1/2Dsqrt(D)

derivative @ 1 = -0.5

derivative @ 16384 = -2.38e-7

derivative @ 65535 = -2.98e-8

The /gradient/ of the curve changes by a factor of roughly 17,000,000 times between 1 and 65535 but only by a factor of 12.5 between 16384 and

65535. In other words, the curve is *much* more linear over the reduced range.

Applying piecewise linear interpolation over the reduced range should work well.

As an example, using just 1 linear segment will give you around a 20% worst case error (at the middle of the range ie 40960).

Using 64 linear segments will give you about 12 bit accuracy.

Hope that helps.

Stephen Ecob Silicon On Inspiration Sydney Australia

formatting link

Reply to
Steve

(snip, someone wrote)

Usually this would be done with a barrel shifter, but they are not very LUT efficient in many FPGAs. If you aren't so restricted in cycles, though, you can use an ordinary shift register.

I am not sure sure how much that saves in cycles or cells.

-- glen

Reply to
glen herrmannsfeldt

You can use a Multiplier as barrel shifter too (Xilinx xapp195)

--
Uwe Bonnes                bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik  Schlossgartenstrasse 9  64289 Darmstadt
 Click to see the full signature
Reply to
Uwe Bonnes

Given that there are plenty of clock cycles available and not much HW, I'd spend the first 8 clock cycles looking at the two MSBs and left shifting two positions when those MSBs are 00. The detection of the zero input special case can also be folded into this part very efficiently.

Stephen Ecob Silicon On Inspiration Sydney Australia

formatting link

Reply to
Steve

(snip, I wrote)

Usually the barrel shifter problem comes up doing floating point in an FPGA. If you use the hardware multipliers for multiplying, you might not have them avaialble. But it takes two or three for floating point add, which usually makes add bigger than multiply!

For IEEE double, 53 bits is a big shifter!

-- glen

Reply to
glen herrmannsfeldt

y.

How much clock cycles is plenty?

You can:

a) start with some value for D b) compute T1=3DD*D c) compute T2=3DY*T1 d) check whether T2 > K e) use nested intervals to find the exact value of D in 16 iterations.

Using 16 bit iterative multipliers you are in the range of 80 luts for the datapath and 16*32 clock cycles. Using bit serial implementation you will be using even less luts for the data path and about 16*16*32 clock cycles.

An even simpler solution is not to use nested intervals but making use of the fact that: Y*(D+1)*(D+1) - Y*D*D =3D 2*Y*D + 1 and

2*Y*(D+1) - 2*Y*D =3D 2*Y

This means that you can scan all values of D using two additions per try: YDD_next =3D YDD + 2YD

2YD_next =3D 2YD + 2Y

Note that this version uses no multipliers.

resulting in

48 LUTs and 64k clock cycles (parallel) or 6 LUTs and 2M clock cycles (bit serial)

10 bit accuracy is 64 times faster at identical hardware cost.

Kolja cronologic.de

Reply to
Kolja Sulimma

Very nice use of second differences.

I believe there is a successive-approximation variant of this which will calculate an N-bit result in N cycles without multipliers. But it requires some ingenious juggling with shifts, and the margin of this post, etc, etc (meaning: I'm not smart enough to write down the algorithm nicely). I haven't yet done a LUT count. It will be a little larger than Kolja's numbers, but not massively bigger (I think). Bert, we can discuss offline if you wish.

cheers

--
Jonathan
replace spam with jonathan in the published email address.
Reply to
Jonathan Bromley

Hello all !

Sorry, I had to stay away from the forum for a few days.

*Thanks* a lot to all contributors for all their ideas.

I haven't made my mind yet since the sensor still needs some validation, but there is a lot of food for thougt here :-)

For the record, "many" means 100 clock cycles would be fast enough, more (1000 !) clock cycles would still be possible. That's why in theory, very compact implementations are possible. Large accumulators are possible since they can be pipelined / multi-cycled in this context. The "best" implementation is a compact enough one that is easy enough to code :-)

I'll let you know if the project is going to use this sensor, and which implementation I used.

Bert.

Reply to
Bert_Paris

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.