Hi,
I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles, using 2/3rds the resources of a standard Xilinx IP block, with comparable timing.
I've verified that the output is correct over the entire range of 32-bit input values.
I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.
What should I do?
Should I just throw the implementation on a website somewhere as a curiosity?
Publish it in an article?
Pass it to a local student to make a paper from it? (I'm not studying at all.)
Attempt to patent and then commercialize it?
Thanks!
Mike

I think the best option is to write an article -- or a patent.
Simply because it's an extra opportunity to verify that your approach is
correct.
If there's a hidden mistake, a student might be unable to see it.
Gene

I agree with Gene, plus you might consider publishing the IP as open source
code on a website of your own or opencores.org.
--Mike

I'd publish - since you are not already in the IP licensing/patenting
groove I doubt you would make any money from it, but you might gain
kudos which may help your career and business. Xilinx might want to
publish it - which might give it a lot more visibility.
If you have a web site you could put it on that.
MK

Licensing and selling IP comes with a bit of a learning curve and requires
an investment on your part. As Michael mentions, without some of that
framework already in place, a license vetted by an IP attorney, and a good
marketing plan, you might not see a return on that investment.
If you want your name more prominently attached to it, I'd suggest posting
it up on a personal Github account rather than opencores.org, which makes
you conform to their requirements (such as the Wishbone interface, etc.).
Xilinx always welcomes guest articles on their blogs (although those have
been in flux since the recent reorg), and their e-magazine Xcell Journal
(again, that seems to have been discontinued and the Xcell Daily Blog
archived).

I am serious in this reply. You may already know this, if so then I
post for other people to know as well.
I would recommend that if you have means to get by in your life as
you are today without receiving income from this invention / discovery,
then not only give it away to people, but to do so in the name of Jesus
Christ. Tell the people around you that because of what Jesus has done
for you, you are desiring to give to the people here the fruit of what
He first gave you (your intellect, abilities, opportunities to learn,
guiding your path through life, etc.).
By giving Him the credit for your invention / discovery, and not doing
it for money or some other thing here, then you will receive a reward
in Heaven for your labor / knowledge gift to other people, and that re-
ward in Heaven is like the burning bush, or the fishes and the loaves,
meaning it is able to be used without being consumed. In short: re-
wards in Heaven do not wear out or diminish with use. They remain in
their full gift for all time, even under maximum / heavy use.
Remember the Lord in your life, and give things to people citing Him
as the reason why you are doing so. You will receive the maximum re-
turn on your efforts, and it sounds like you will also help out many
people here on this Earth by your gift.
It would be a win-win for everyone in your giving.

OpenCores encourages use of the Wishbone interface for SoC components and
they do offer coding guidelines, but there are no requirements for either.
For example, in the entire DSP core section there are 38 entries, none of
which are marked as Wishbone compliant.

I never thought I would agree with Rick, but....
All sounds like too much work. So here is a quick summary with C-like
pseudo-code. I'll put the HDL code up somewhere soon once I am happy with
it. I am removing the last rounding errors.
I've been playing with CORDIC, and have come up with what looks to be an
overlooked optimization. I've done a bit of googling, and haven't found
anything - maybe it is a novel approach?
I've tested it with 32-bit inputs and outputs, and it is within +/-2, with
an average error of around 0.6. I am sure with a bit more analysis of where
the errors are coming from I can get it more accurate.
This has two parts to it; both by themselves seem quite trivial, but they
complement each other quite nicely.
Scaling Z
---------
1. The 'z' step value that CORDIC uses becomes smaller and smaller as the
stages increase:
The core of CORDIC for SIN() and COS() is:

    x = INITIAL;
    y = INITIAL;
    for(i = 0; i < CORDIC_REPS; i++ ) {
        int64_t tx, ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);
        // Either add or subtract at right angles to the current vector
        x -= (z > 0 ? ty : -ty);
        y += (z > 0 ? tx : -tx);
        z -= (z > 0 ? angle[i] : -angle[i]);
    }
The values for angle[] are all-important, for example:
angle[0] = 1238021
angle[1] = 654136
angle[2] = 332050
angle[3] = 166670
angle[4] = 83415
angle[5] = 41718
angle[6] = 20860
angle[7] = 10430
angle[8] = 5215
angle[9] = 2607
angle[10] = 1303
angle[11] = 652
angle[12] = 326
angle[13] = 163
angle[14] = 81
angle[15] = 41
angle[16] = 20
angle[17] = 10
angle[18] = 5
angle[19] = 3
angle[20] = 1
If you make the following change:

    for(i = 0; i < CORDIC_REPS; i++ ) {
        int64_t tx, ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);
        // Either add or subtract at right angles
        x -= (z > 0 ? ty : -ty);
        y += (z > 0 ? tx : -tx);
        //!!! the angle table is replaced by a single constant !!!
        z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
        //!!! and z is rescaled (doubled) for the next stage !!!
        z <<= 1;
    }

As far as your "revised" angle[i] converging to a constant is concerned,
there's a simple explanation using the first two terms of the Taylor
series for the arctan function:
arctan(x) = x - 1/3 * x^3 + o(x^3)
So that
angle[i] = arctan(2^-i) / pi * 2^i = (1/pi) * (1 - 1/3 * 2^-(2*i)) + o(2^-(2*i))
Based on this you can easily say how many stages of the conventional
CORDIC algorithm you need to skip (i.e. store the outputs in a lookup
table) for a given bit precision.
I don't know the literature well, but I think it would be cool if you
actually write an article detailing your approach!
Gene

If perchance this is related to your recent CORDIC rotator code,
I've seen a number of CORDIC optimization schemes over the years
to reduce the number of rotation stages, IIRC typically either
by a 'jump start' or merging/optimizing rotation stages.
If I ever manage to find my folder of CORDIC papers, I'll post some links...
-------
some notes on your CORDIC implementation


- instead of quadrant folding, given I & Q you can do octant folding
(0-45 degrees) using the top three bits of the phase
- if you pass in the bit widths and stages as generics, you can
initialize the constant arctan table on-the-fly in a function
using VHDL reals
-Brian

oops, for some reason, when first reading this thread I didn't
see the later posts with the explanation... I'd swear they weren't
there, but maybe I was just scroll-impaired.
-Brian

That's good stuff. I wonder why you think using a BRAM is bad? It's good
to use the hard cores in an FPGA instead of the fabric--it's less power
and deterministic routing times.
Is the CORDIC still advantageous in a modern FPGA? The last time I needed
to find sine, I used a coarse BRAM lookup that output sine on one port and
cos on another. Those were used as derivatives for a 2nd-order Taylor.
Two multipliers (more hard cores) are used (using Horner's Rule) for the
1st and 2nd-order interpolations. I don't remember how many digits of
accuracy this yields, but the latency is low.

Cordic is useful to compute high-precision atan2. Otherwise for 2 16-bit
inputs, you'd need a RAM with 2^32 addresses (maybe 16 times less if you
take advantage of the symmetries).

On Thursday, October 11, 2018 at 7:37:37 AM UTC-6, Benjamin Couillard wrote:
Sorry; I didn't see that you were talking about arctan. It's not quite as
easy as sin/cos, but there is still the question as to whether a
Farrow-type architecture using a coarse lookup and Taylor interpolations
would be better than a CORDIC, and I am guessing that with the BRAMs and
multipliers in present-day FPGAs, the answer would be yes.

Yes, I think you're right, it could work for an atan or atan2. You could
implement a divider for atan2 (y/x), a sign look-up (to get the whole 360
degrees), and a BRAM + Taylor interpolation.
For atan, you'd simply skip the divider and sign look-up part.
On the other hand, Xilinx and Altera offer plug-and-play Cordic cores for
atan/atan2.

It's probably better to multiply y/x by x to get a normalized ratio rather
than use a divider, which requires a lot of resources.
I recalled that I had to implement the atan2 function once for a QAM
carrier/symbol recovery circuit. I didn't need great precision, so I split
one quadrant into a grid and put the angle of the center of each grid
square into a 2-dimensional lookup ROM. Then I could put X,Y into the ROM
and get the coarse angle (which was then adjusted for the quadrant) and
could use that for carrier recovery.

One case that comes to mind: you have 2 quadrature signals
x = A(t) * cos(wt + some phase) + noise_x
y = A(t) * sin(wt + some phase) + noise_y
atan2(y, x) = wt + some phase. The variations of A(t) (as long as they
are slow-ish) will cancel each other out.
You can filter or average "wt + some phase" to extract the phase. Or
differentiate "wt + some phase" then filter to get the filtered frequency.
So multiplying "y/x by x" would not make much sense in this case.

On Wednesday, October 17, 2018 at 9:26:38 AM UTC-6, Benjamin Couillard wrote:
No, multiplying by x doesn't make sense. Perhaps using a ROM for 1/x and a
multiplier would be better than a full divider.
