How to speed up my accumulator?

Moti Cohen 21 years ago

Hello all, I have a design that contains an NCO (numerically controlled oscillator). The NCO consists of a 32-bit accumulator. When I write the accumulator in the straightforward way, like this:

process (clk, resetn)
begin
  if resetn = '0' then
    accumulator <= (others => '0');
  elsif clk'event and clk = '1' then
    accumulator

Hal Murray 21 years ago

Google for carry-save adder. Or counter.

The idea is to break the adder into chunks. The carry-out of each chunk goes into a FF and then into the carry-in of the next chunk. Chop it up into chunks that are small enough that they meet your speed requirements.

With modern dedicated carry logic, this doesn't work as well as it did in the old days.
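The chunking idea can be sketched as a small behavioural model (Python rather than HDL, so it can be run and checked). It is only an illustration under stated assumptions: the split into two 16-bit chunks, the class name `ChunkedAccumulator`, and the extra alignment registers (`inc_hi_ff`, `lo_ff`) are not from the thread. The only part taken from the description above is that the carry-out of the low chunk goes through a flip-flop before reaching the carry-in of the high chunk.

```python
MASK16 = 0xFFFF

class ChunkedAccumulator:
    """32-bit accumulator split into two 16-bit chunks (assumed split).

    The carry between the chunks is registered, so no single clock period
    has to cover a carry chain longer than 16 bits. The price is one clock
    of latency on the reassembled value.
    """

    def __init__(self):
        self.lo = 0          # low 16 bits of the accumulator
        self.hi = 0          # high 16 bits, runs one clock behind lo
        self.carry_ff = 0    # registered carry-out of the low chunk
        self.inc_hi_ff = 0   # high half of the increment, delayed to match
        self.lo_ff = 0       # low chunk delayed one clock to realign with hi

    def clock(self, inc):
        """Model one rising clock edge; all registers update together."""
        inc_lo = inc & MASK16
        inc_hi = (inc >> 16) & MASK16
        sum_lo = self.lo + inc_lo
        next_hi = (self.hi + self.inc_hi_ff + self.carry_ff) & MASK16
        self.lo_ff = self.lo
        self.lo = sum_lo & MASK16
        self.carry_ff = sum_lo >> 16
        self.inc_hi_ff = inc_hi
        self.hi = next_hi

    @property
    def value(self):
        """Reassembled 32-bit value, one clock behind a plain accumulator."""
        return (self.hi << 16) | self.lo_ff
```

After n clocks with a constant increment, `value` equals what a plain 32-bit accumulator held after n - 1 clocks. For an NCO that fixed latency is usually harmless, since only the phase of the output shifts.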

The suespammers.org mail server is located in California. So are all my other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited commercial e-mail to my suespammers.org address or any of my other addresses. These are my opinions, not necessarily my employer's. I hate spam.

Antti Lukats 21 years ago

Moti 21 years ago

Hi Hal,

You said -> The idea is to break the adder into chunks..

I know that I need to break up the logic, but my problem is what to do with the feedback path. Should I break it too?

Regards, Moti.

Moti 21 years ago

Hi Antti,

You wrote ->

[link]

NCO with max (virtual) frequency of 11 (eleven) GHz!

I couldn't find any detailed description there (only a features and deliverables description for buying it).

You wrote -> For your speed you possibly can optimize the adder to get the performance

How would you suggest doing this?

You wrote -> similarly in FPGAs with no special serdes there would still be some speed gain using the NCO at lower frequency and calculating maybe 4 or 8 bits per clock and then using a very fast shift register to shift the bits out. That approach would be usable for 400M+ frequencies (within FPGA fabric)

This seems to be a very interesting solution for me (higher frequency = less jitter!), but I didn't really understand how it works, so I would appreciate it if you could provide more details or a link to a detailed description.

Thanks, Moti.

rickman 21 years ago

Antti Lukats 21 years ago

Mike Treseler 21 years ago

Moti 21 years ago

Hi Rickman,

First of all, thanks for the code example. It's always nicer and clearer to get one. There is only one thing bothering me in your code: the "accsingle" register is sampled on each rising edge of the clock and therefore does not improve the setup time (and therefore the clock rate). I suppose it should be sampled on every second clock. So maybe your code contains a typo, but the idea is "almost" clear and it's a very clever one.

I presented this problem to our algorithms guy and he figured out a very nice way of breaking the logic into two or more levels (4, 8, ...), but he is still working on it. I will post the code here when he finishes.

Thanks, Moti.

rickman 21 years ago

Moti 21 years ago

Hi Mike,

Yes, I know that, but in my design inc_value'length is almost the same as accumulator'length (maybe I will be able to drop two bits), so it won't give me much more slack.

Thanks. Moti.

rickman 21 years ago

Yes, both accsingle and accdouble are sampled on the rising edge of the clock, but only when phase is high and so only *every other* clock. I guess I figured that would be obvious. The accfast signal captures the output of a mux on *every* clock, so it still has to run at full speed. But this path has no carry, so it should be faster than your previous result.
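The code being discussed is not preserved in this thread, so the following is only one possible reading of the scheme, written as a behavioural Python model. The signal names `accsingle`, `accdouble`, `accfast` and `phase` come from the thread. The exact update equations, the mux polarity and the class name `DualPathAccumulator` are assumptions.

```python
MASK32 = 0xFFFFFFFF

class DualPathAccumulator:
    """Assumed reconstruction of the two-adder, half-rate scheme.

    Both adders read accdouble and are captured only when phase = 1, so
    each carry chain gets two clock periods to settle. accfast is
    registered every clock, but its input path is only a 2:1 mux.
    """

    def __init__(self):
        self.phase = 0
        self.accsingle = 0   # holds accdouble + inc (odd steps)
        self.accdouble = 0   # holds accdouble + 2*inc (even steps)
        self.accfast = 0     # full-rate output register

    def clock(self, inc):
        """Model one rising clock edge; registers see the old values."""
        if self.phase == 1:
            next_fast = self.accdouble
            next_single = (self.accdouble + inc) & MASK32
            next_double = (self.accdouble + 2 * inc) & MASK32
        else:
            next_fast = self.accsingle
            next_single = self.accsingle
            next_double = self.accdouble
        self.accfast = next_fast
        self.accsingle = next_single
        self.accdouble = next_double
        self.phase ^= 1
```

In this model, from the second clock onward `accfast` steps through 0, inc, 2*inc, ... at the full clock rate, two clocks behind a plain accumulator. In real hardware the two adder paths would also need a multicycle timing constraint for the tools to credit them with the extra clock period.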

In any regard, you can likely improve your results by floorplanning so that the registers involved are in adjacent (or even the same) CLBs to optimize routing. I see no reason that your original design would not run at 200 MHz.

You will find that approach reduces the length of the carry path. But the basic minimum path is from one register output through the LUT and into a second register. This will be the ultimate limit for any adder design if you reduce the carry delay to a single LUT. To reach the full speed capability you likely will need to floorplan to get the optimally fast routing, which will be between registers in the same CLB. At that point your carry delay may not matter with your requirement of 5 ns. Typically the carry delay is < 0.1 ns/bit, or < 3.2 ns for the 32-bit adder.

I guess all those words are trying to say that you can only do so much with pipelining an adder. Pipelining will break up the carry delay; the finer you break it up, the closer you get to the reg -> LUT -> reg delay, not zero delay. My dual parallel approach gets you directly to the minimum delay if that is what's needed. But try floorplanning before you do any more work with the algorithm. That should be sufficient at 32 bits.

Also, you did place and route it, right? The timing results from synthesis are not very accurate since they "estimate" routing times.

Rick "rickman" Collins rick.collins@XYarius.com Ignore the reply address. To email me use the above address with the XY removed. Arius - A Signal Processing Solutions Company Specializing in DSP and FPGA design URL http://www.arius.com 4 King Ave 301-682-7772 Voice Frederick, MD 21701-3110 301-682-7666 FAX

Antti Lukats 21 years ago

[lots snipped]

:) OK, well, your code "as is" did not synthesize, so I tried to guess a fix to get it to synthesize, possibly making an error in the guesswork. Yes, calculating 2 bits per clock is a solution; this is also what I suggested in one of my earlier posts.

I presented the synthesis (and timing) of the code as you wrote it (after the fix). I don't see the output mux in your code, and I did not add it either.

Generally I agree: a similar approach (if the code is correct) runs at about twice the speed.

I posted both the synthesis estimate and the post place-and-route timings; in any case both approaches are 210 MHz+.

No floorplanning, just set the clock constraint to 5 ns, nothing more.

Yes, possibly I corrected your code incorrectly :(

I used all signals 32 bits wide, with inc_value as an input port.

Antti Lukats 21 years ago

[snip]

Hm... out of curiosity I checked the DDSX IP core in 2X mode (that is, calculating 2 bits per clock). The following stats are for:

- 32 bit wide accumulator

- 32 bit variable phase increment value

Synthesis:
Selected Device: 3s1500fg320-5
Number of Slices: 33 out of 13312 0%
Minimum period: 4.577ns (Maximum Frequency: 218.508MHz)

Post P&R Timing:
Timing constraint: TS_clk = PERIOD TIMEGRP "clk" 5 nS HIGH 50.000000 % ;
497 items analyzed, 0 timing errors detected. (0 setup errors, 0 hold errors)
Minimum period is 4.657ns.

----------------------------------------------------------------------------

All constraints were met.
Design statistics:
Minimum period: 4.657ns (Maximum frequency: 214.731MHz)

So the DDSX IP core can calculate 2 bits per clock (to be muxed or serialized) at a maximum frequency of 214 MHz using 33 slices! OK, let's add one more slice for the mux or shifter; that comes to 34 slices :)

The DDSX IP core (in 2x mode) runs completely at 0.5 x the DDS frequency! So if the FPGA fabric can run a 2-bit shifter at 400 MHz, then the DDS would run at a virtual 400 MHz. Real 400 MHz is only used in one slice doing the shift, or not at all when the DDR IO cell uses 2 phases of the clock.

Antti

PS: I just ran a timing check on the 10 GHz version of DDSX, no problems either :) Sure, 10 GHz only with V4FX or V2ProX (using GT10 as the serializer).
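The general "several bits per clock" idea can be sketched in Python. This says nothing about how the DDSX core is actually built; it only shows, under assumed names (`msb_stream_full_rate`, `msb_stream_parallel`, `lanes`), that computing the MSB for several future accumulator values in one slow clock and then serializing them gives the same bit stream as a full-rate NCO.

```python
def msb_stream_full_rate(inc, n, bits=32):
    """MSB of a plain accumulator clocked n times at the full rate."""
    mask = (1 << bits) - 1
    acc = 0
    out = []
    for _ in range(n):
        acc = (acc + inc) & mask
        out.append(acc >> (bits - 1))
    return out


def msb_stream_parallel(inc, n_slow, lanes=2, bits=32):
    """Same bit stream, produced `lanes` bits per slow clock.

    Each slow clock evaluates acc + inc, acc + 2*inc, ... in parallel and
    advances the accumulator by lanes*inc. The resulting word would then
    go to a fast shift register or DDR output cell, lane 0 first.
    """
    mask = (1 << bits) - 1
    acc = 0
    out = []
    for _ in range(n_slow):
        word = [((acc + (k + 1) * inc) & mask) >> (bits - 1)
                for k in range(lanes)]
        out.extend(word)
        acc = (acc + lanes * inc) & mask
    return out
```

With `lanes=2` the accumulator logic runs at half the output bit rate, which matches the 2x mode described above. The multiples of `inc` are constants for a fixed increment, so in hardware they could be precomputed rather than multiplied every clock.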

Moti 21 years ago

Hi Rickman,

I wrote ->

You wrote -> Yes, both accsingle and accdouble are sampled on the rising edge of the clock, but only when phase is high and so only *every other* clock

That's what I meant: as I understand it, accdouble is indeed sampled every other clock, but accsingle is sampled on every clock, as follows: when phase = '1' accsingle is updated: accsingle

Moti 21 years ago

Another question regarding the NCO...

Do any of you guys know the algorithm for calculating the jitter frequency on the NCO output (MSbit)? I know that the jitter magnitude is +/- [reference clock period / 2], and I know that I can also see the frequency on a spectrum analyzer, but I would be glad to have a formula for calculating it in advance.

Thanks again, Moti.

John_H 21 years ago

It's a bit involved to find the largest jitter components, but I've worked the problem in the past and found a direct correlation between my expected jitter components and the FFT of the NCO output. Effectively, your jitter components are at the offsets between the frequency of your NCO and the best fractions that approximate your NCO output-to-input frequency ratio (and the harmonics thereof). If you can figure the closest fractions in sequence, you can get your main jitter components. There is some mixing among these frequencies, but it tends to be significantly lower than the main peaks in the scenarios I ran.
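One way to turn that description into numbers is sketched below. It assumes that the "best fractions" are the continued-fraction convergents of the ratio tuning word / 2^N, and that each candidate component sits at |f_out - (p/q) * f_clk|. Both are an interpretation of the paragraph above, not a confirmed formula, and the function names are made up for the example. Harmonics and mixing products are not modeled.

```python
from fractions import Fraction


def convergents(ratio, max_terms=8):
    """Continued-fraction convergents of a Fraction, coarsest first."""
    out = []
    h0, h1 = 0, 1   # previous two numerators
    k0, k1 = 1, 0   # previous two denominators
    x = ratio
    for _ in range(max_terms):
        a = x.numerator // x.denominator      # next continued-fraction term
        h0, h1 = h1, a * h1 + h0
        k0, k1 = k1, a * k1 + k0
        out.append(Fraction(h1, k1))
        frac = x - a
        if frac == 0:                         # expansion terminated exactly
            break
        x = 1 / frac
    return out


def jitter_offsets(ftw, acc_bits, f_clk):
    """Candidate jitter frequencies as (fraction, offset) pairs.

    ftw is the phase increment, acc_bits the accumulator width, and f_clk
    the NCO clock frequency. Offsets are in the same unit as f_clk.
    """
    ratio = Fraction(ftw, 1 << acc_bits)      # f_out / f_clk
    offsets = []
    for c in convergents(ratio):
        if c == 0 or c == ratio:              # skip trivial and exact cases
            continue
        offsets.append((c, abs(ratio - c) * f_clk))
    return offsets
```

For a toy 3-bit accumulator with increment 3 and an 8 MHz clock, the ratio is 3/8, the intermediate fractions are 1/2 and 1/3, and the candidate offsets come out as 1 MHz and 1/3 MHz. Comparing such candidates against an FFT of the simulated MSB stream would be the way to check whether this reading holds for a given design.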

Moti 21 years ago

Hi John, thanks for your reply, although I have to admit that I didn't entirely understand how to actually calculate the frequency. Best regards, Moti

rickman 21 years ago

The output mux is the two assignments to accfast, one when phase is '0' and the other when phase is '1'.

I took another look at the code and I don't see anything that would not synthesize. What did the tool complain about?

I only see one timing value for each example. What were the critical paths in each design?

That should work. Can you post the code you worked with?

Antti Lukats 21 years ago

[snip]

Rick, you can try your own code with XST; it complains about the sll at least! Maybe there is a better (read: proper) fix to main

Regarding the timings I posted: I always posted both the synthesis estimate and the post-P&R timings. The news posting changed the text alignment, so it was hard to read.

Below is what I used (fast, do-not-think-at-all ... fixed) from your code:

----------------------------------------------------------------------------

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity dds is
  Port ( clk : in std_logic;
         rst : in std_logic;
         inc_value : in std_logic_vector(31 downto 0);
         fout : out std_logic);
end dds;
architecture Behavioral of dds is
  signal accsingle : std_logic_vector(31 downto 0);
  signal accdouble : std_logic_vector(31 downto 0);
  signal accfast : std_logic_vector(31 downto 0);
  signal phase : std_logic;
begin
  process (clk, rst)
  begin
    if rst = '1' then
      phase
