speed... CORDIC vs. pure arithmetic expression

Hello!

I'm a student, planning to work with an FPGA for my thesis. Totally inexperienced. Here I'm wondering about speed: I have no feel for the relative speed of multipliers, shifting, adding, ...

My plan is to implement a circular Hough transform for images. There are many issues; here I'm wondering about one of them.

I'll try to present the mathematical problem briefly for those who are unfamiliar with the Hough transform, and ask which solution you think is better, faster, etc. An FPGA with embedded multipliers will be provided for me. I'll propose two solutions.

Alternatively, you could just tell me how fast multipliers are compared to some other operations, like shifting and adding. Just so I get a feel for these numbers, anything :) I don't need 'ns'; even "this operation is 2x faster than that one" would help. I'll be using 18-bit signals, so even a rough comparison of these operations would be very helpful for me. I still can't read data sheets that well...

What is the speed of these operations on an 18-bit signed number (signal): multiply (with an embedded multiplier), shifting, adding?

-= Problem =-

It will be simplified...

I need to compute this: (1) x = R*gx/mag, y = R*gy/mag

All of these are signed integers fitting in 18 bits. 'R' is a constant; 'gx' and 'gy' are inputs.

'gx' & 'gy' represent the vector (gx,gy) in the Cartesian system.

'mag' is the magnitude of the vector (gx,gy).

I could rewrite (1) like this: (2) x = R*cos(fi), y = R*sin(fi)

'fi' is the angle between the vector (gx,gy) and the x-axis.
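
For reference, a minimal C model of what this computes, in plain floating point (just a behavioural sketch to check results against, not FPGA code; it uses the exact magnitude and assumes (gx,gy) is not the zero vector):

    #include <math.h>

    /* Behavioural reference: scale the gradient vector (gx, gy) to length R.
       Uses the exact magnitude; the hardware only has to approximate this. */
    void ref_scale(double R, double gx, double gy, double *x, double *y)
    {
        double mag = sqrt(gx * gx + gy * gy);   /* exact |(gx, gy)| */
        *x = R * gx / mag;                      /* = R * cos(fi)    */
        *y = R * gy / mag;                      /* = R * sin(fi)    */
    }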

-= CORDIC =-

For CORDIC I like expression (2). Define: grad = vector (gx,gy), sol = vector (R,0).

Using CORDIC I rotate the vectors 'grad' and 'sol'. They are rotated by the angle 'fi': 'grad' in one direction, 'sol' in the opposite. I don't need to compute 'fi'. The vectors are rotated until 'grad' becomes equal to (mag, 0), or in other words until its 2nd component becomes 0. After these rotations, the vector 'sol' holds the solution.

To be honest, I haven't tried implementing this one. I can see that the precision of the rotation angle in each iteration can also affect this a lot. Anyway, if you can give me any kind of input on this problem, I'll be happy to hear it.
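
To make the idea concrete, here is a rough C model of this double rotation (my sketch only, not tested on hardware: the 16 iterations, the 1/K pre-scaling constant and the int32 arithmetic are assumptions, arithmetic right shift of negative values is assumed, and it only covers gx >= 0, i.e. 'fi' inside the basic CORDIC convergence range; other quadrants would need a pre-rotation):

    #include <stdint.h>

    #define ITER 16   /* assumed iteration count for ~18-bit precision */

    /* Vectoring mode drives grad = (gx, gy) onto the x-axis; sol = (R, 0)
       receives the same micro-rotations with opposite sign, so it ends up
       at roughly (R*cos(fi), R*sin(fi)).  sol is pre-scaled by 1/K to
       cancel the CORDIC gain K ~ 1.6468 (0.607253 * 2^15 ~ 19898).       */
    void cordic_scale(int32_t gx, int32_t gy, int32_t R,
                      int32_t *x, int32_t *y)
    {
        int32_t vx = gx, vy = gy;
        int32_t sx = (int32_t)(((int64_t)R * 19898) >> 15);
        int32_t sy = 0;

        for (int i = 0; i < ITER; i++) {
            int32_t d = (vy >= 0) ? -1 : 1;    /* drive vy toward 0            */
            int32_t nvx = vx - d * (vy >> i);  /* rotate grad by d*atan(2^-i)  */
            int32_t nvy = vy + d * (vx >> i);
            int32_t nsx = sx + d * (sy >> i);  /* rotate sol the opposite way  */
            int32_t nsy = sy - d * (sx >> i);
            vx = nvx; vy = nvy;
            sx = nsx; sy = nsy;
        }
        *x = sx;   /* ~ R*gx/mag */
        *y = sy;   /* ~ R*gy/mag */
    }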

-= arithmetic expression =-

I don't know what to call this one; I hope you understand. I use this expression: x = (R*gx)/mag, y = (R*gy)/mag

'mag' is computed like this: mag = abs(gx) + abs(gy). It's precise enough for me.

I'll be using embedded multipliers to compute 'R*gx', so no problem with that. The problem is the division. For division I'll use the Goldschmidt algorithm. I think it should converge to my solution in 7 iterations, where in each iteration I have to multiply two numbers (actually two multiplications, but in parallel, so no problem; I treat it as one operation) and then subtract one of the products from a constant (something like that).

The important part of the pseudo-code, in parallel: a1 = a1*m; b = b*m; followed by: m = constant - b;
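
As a sanity check on that loop, here is a small C model of the iteration (the Q2.16 fixed-point format and the requirement that the divisor be normalised into [0.5, 1.0) are my assumptions; 'constant' is 2.0 here, and the two multiplies per pass are the ones that would run in parallel in hardware):

    #include <stdint.h>
    #include <stdio.h>

    #define ONE (1 << 16)                 /* 1.0 in Q2.16 */
    #define TWO (2 << 16)                 /* 2.0 in Q2.16 */

    static int32_t qmul(int32_t x, int32_t y)   /* Q2.16 * Q2.16 -> Q2.16 */
    {
        return (int32_t)(((int64_t)x * y) >> 16);
    }

    /* Goldschmidt division a/b; b must be normalised into [0.5, 1.0). */
    static int32_t goldschmidt_div(int32_t a, int32_t b)
    {
        int32_t m = TWO - b;              /* m = constant - b             */
        for (int i = 0; i < 7; i++) {     /* 7 iterations, as in the post */
            a = qmul(a, m);               /* these two multiplies can     */
            b = qmul(b, m);               /* run in parallel              */
            m = TWO - b;                  /* b -> 1.0, so a -> a/b        */
        }
        return a;
    }

    int main(void)
    {
        int32_t q = goldschmidt_div((int32_t)(0.75 * ONE), (int32_t)(0.6 * ONE));
        printf("%f\n", q / (double)ONE);  /* prints roughly 1.250000 */
        return 0;
    }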

-= help =-

I'm inexperienced. I wonder whether it's worth implementing both of them and comparing them. It could be interesting, but that's not really my problem; I could implement either one and my mentor would accept any solution, since he's also inexperienced with this. If you think one solution is by far better, can you explain why you think so? If you think they are both good and comparably fast, I'll probably try to implement both of them.

You can also propose anything else, if you feel like it. There is the idea of distributed arithmetic, but I haven't spent time learning about that approach yet.

Reply to
Frater

Here's a link to somewhere for more reading on this type of DSP in FPGAs, and distributed arithmetic.

formatting link

HTH., Syms.

Reply to
Symon

If these are your two choices and the magnitude estimate is "precise enough" for you, perhaps a simple sin/cos lookup would be acceptable, too.

Rather than calculate the sin/cos values at each step to 18 bits, perhaps the +/-0.08% of +/- full scale error in a 1024-element quarter-wave lookup would make you happy. By using a 1kx18 table in a dual-port memory, you can look up both the sine and cosine with one physical on-chip memory (assuming 18-kbit rather than 9-kbit or 36-kbit FPGA families). You need a little manipulation to get the quadrants and sin vs cos figured out, but that's all pretty simple and easily pipelined.
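
A rough C model of that kind of quarter-wave lookup might look like this (not necessarily the exact scheme meant above: the 12-bit phase, the 1025-entry table with an extra endpoint, and the Q1.17 output scaling are assumptions; in the FPGA the table would live in the dual-port block RAM and both reads would happen in the same cycle):

    #include <stdint.h>
    #include <math.h>

    static const double PI = 3.14159265358979323846;
    static int32_t quarter[1025];           /* sin(0 .. pi/2) in Q1.17 */

    void init_table(void)
    {
        for (int i = 0; i <= 1024; i++)
            quarter[i] = (int32_t)lround(sin(i * PI / 2048.0) * 131071.0);
    }

    /* phase: 0..4095 maps to 0..2*pi; returns 18-bit-range sin and cos */
    void sincos_lut(uint32_t phase, int32_t *s, int32_t *c)
    {
        uint32_t quad = (phase >> 10) & 3;  /* which quadrant             */
        uint32_t idx  =  phase & 1023;      /* offset within the quadrant */

        int32_t s0 = quarter[idx];          /* sin of the offset          */
        int32_t c0 = quarter[1024 - idx];   /* cos via the complement     */

        switch (quad) {                     /* fold back out of quadrant 0 */
        case 0:  *s =  s0; *c =  c0; break;
        case 1:  *s =  c0; *c = -s0; break;
        case 2:  *s = -s0; *c = -c0; break;
        default: *s = -c0; *c =  s0; break;
        }
    }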

Embedded multipliers are pretty fast. One multiply takes the same time as 2-4 adds depending on family and placement. Assuming you're talking "shift" in terms of several bits of shift index, a shift of 0-31 would take roughly the same amount of time as the other two situations... roughly. Division is the toughest operation, even compared to the square root, which you don't seem to be interested in :-).

If you know the "fi" angle and don't need to calculate that value from gx and gy, the sin/cos approach using lookup values will probably give you the best result. There was a discussion within the last few days in this newsgroup on sin/cos lookups that provided a couple suggestions - an Atmel link and mention of Xilinx tool capabilities.

I hope you have fun with the project!

- John_H

Reply to
John_H

The problem is that 'fi' is unknown. I would have to calculate it. The only inputs I get are gx and gy.

To calculate 'fi' I could use CORDIC or arctan(gy/gx), which brings me back to the same problem.

Wow! So multiplying is 2x faster than adding? Or did you mean adding 2-bit by 2-bit signals?

Reply to
Frater

Adders are typically implemented with acceleration hardware in the form of "carry chains" such that adding two 4-bit values is almost the same speed as adding two 16-bit values. Almost. You can perform 2-4 add operations ((((A+B)+C)+D)+E) in the time it takes a multiply to complete. Looking at the Virtex-5 data sheet, for instance, the timing data found in

formatting link

shows performance information in table 27 with a 450 MHz 16-bit adder in the slower device. The *pipelined* multiplier in the DSP48E block is also shown as 450 MHz. A later table for the DSP48E shows a 2-register multiply as 275 MHz "without MREG." It's all got to do with pipelining. The 450 MHz *may* be a somewhat arbitrary limit because the larger V5 devices are limited to 450 MHz global clocks.

It might be fun to experiment with different approaches. Nice thing about Cordic is the free multiply and the clean pipelining.

Check out the timing data sheet for the device you're going to use to see if you can get a better feel for some things. People who have been around FPGAs long enough can go to the detailed timing characteristics to estimate delays. Happily, there's information that people new to the discipline can use as well.

- John_H

Different families will give slightly different relative scales.

Reply to
John_H

Your question is formed in a way that is leading people to tell you the *latency* of a multiply vs. an add. If you present your arguments to a combinatorial adder or a multiplier block, after some small amount of time the output will be stable and you could clock it into a result register. That is not usually your utmost concern. Even if you have a multiplier which is clocked and takes several cycles to complete, vs. an adder which completes within one cycle, the multiplier will likely accept input and produce output on every single cycle. It has much more latency than the adder, but it can have equal throughput.

Also, in FPGAs you can almost always trade off area (how much of the FPGA is used) with speed. You are not dealing with a fixed set of operations that take a fixed amount of time, like a CPU. You can just plain have more operations if that's what it takes. In fact, if you use a "wizard" to produce a multiplier you get to choose how fast it works. If you do this in Altera's megafunction generator it will update the resources required as you make the changes.

--
Ben Jackson AD7GD

http://www.ben.com/
Reply to
Ben Jackson

How many values can fi take? For example, if you only need 1024 values between 0 and 2*pi, you can just use blockrams in the FPGA to calculate cos(fi) and sin(fi) (e.g. you can fit 1024 18-bit values into a single blockram in a Virtex-2).

Whether you should do this depends on whether you are constrained by the number of logic blocks or the number of memory blocks in the FPGA. If you are constrained by neither, go with the lookup table, since that will be far easier :)

/Andreas

Reply to
Andreas Ehliar

I do not think that you need multipliers at all. For a circular Hough transform you process each point of your input (possibly preprocessed for gradients, whatever) and draw a circle into an output histogram.

Using Bresenham's circle algorithm you can compute the coordinates of the target pixels at a cost of two additions per pixel. No angles, multiplications, squares, roots, etc. involved.

Not only is Bresenham's algorithm faster than your proposal, it also avoids lots of issues related to rounding, error accumulation, etc. Bresenham's circles are perfect; sweeping phi as you suggest in your second formulation will look really ugly.
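
For what it's worth, a small C model of that per-edge-point circle vote (the accumulator array size and the bounds check are assumptions about how the histogram is stored; the inner loop is the standard midpoint/Bresenham step, additions and comparisons only):

    #include <stdint.h>

    #define W 640
    #define H 480
    static uint16_t acc[H][W];               /* Hough accumulator / histogram */

    static void vote(int x, int y)
    {
        if (x >= 0 && x < W && y >= 0 && y < H)
            acc[y][x]++;
    }

    /* Add one circle of radius r, centred on edge point (cx, cy), to the
       accumulator using the midpoint (Bresenham) circle algorithm.        */
    void hough_circle_vote(int cx, int cy, int r)
    {
        int x = r, y = 0;
        int err = 1 - r;                     /* midpoint decision variable */

        while (y <= x) {
            /* eight-way symmetry: one octant of work covers the circle */
            vote(cx + x, cy + y); vote(cx - x, cy + y);
            vote(cx + x, cy - y); vote(cx - x, cy - y);
            vote(cx + y, cy + x); vote(cx - y, cy + x);
            vote(cx + y, cy - x); vote(cx - y, cy - x);

            y++;
            if (err <= 0) {
                err += 2 * y + 1;            /* stay on the same x    */
            } else {
                x--;
                err += 2 * (y - x) + 1;      /* step one pixel inward */
            }
        }
    }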

As you are drawing many, many circles of the same radius, you could also have a CPU precompute a list of coordinates into a BRAM and then process that list for all input pixels in hardware. This would give you fast Hough transform hardware for arbitrary shapes up to a certain size.

Kolja Sulimma cronologic ohg

Reply to
comp.arch.fpga

That's a different approach. I'll need to think about it.

My implementation uses gradients; that's why I'm not drawing circles. When I detect an edge, because I have a gradient at that point, I know in which direction the center is located, so I'm not drawing a circle. I suppose formula (2) was misleading; I used it to suggest the CORDIC approach.

Still, I'll think about your approach; perhaps it's really a good one for me as well, although it reminds me of CORDIC because of the many iterations involved.

Good idea!

Reply to
Frater

'fi' is unknown to me; I can compute it from 'gx' & 'gy', since they are the vector's coordinates. So it would be: atan2(gy, gx).

Reply to
Frater

In order not to spam, I'd like to thank everyone for their replies. It makes me feel good to see people who are willing to help with their knowledge.

I'm still looking forward to all your replies, this is not the end :)

Reply to
Frater

Also, you can still use the gradient to draw only 2/8ths of the circle.

Note that drawing full circles provides better immunity against different types of noise in the image. It will, however, definitely be slower than drawing only a single point.

Note also that depending on resolution, occupancy and FPGA size, it might be beneficial to implement a many-to-one approach instead of a one-to-many approach.

To do this you implement a portion of the target bitmap in internal memory. For each block RAM you have a processing element that can decide whether the input results in a hit in that RAM. Then you stream the input bitmap through your hardware. This results in linear external memory accesses and should therefore reach a higher throughput. I believe that you can do two pixels per clock cycle for each BRAM that has a hit.

Kolja Sulimma

Reply to
comp.arch.fpga


x = R*gx/sqrt(gx**2 + gy**2), y = R*gy/sqrt(gx**2 + gy**2)

Two multiplies, a sqrt, and two divides should be a lot easier than atan2, cos, and sin.

-- glen

Reply to
glen herrmannsfeldt
