You have obviously been reading some ASIC/VLSI-oriented arithmetic texts, which don't really apply to FPGAs except for the general algorithms. I'd guess the ripple carry is disproportionately faster in an FPGA, by about 3x relative to LUT logic, than it is in an ASIC, since it's given for free and highly optimized. Think about it: an adder might be placed in any set of LUTs lined up in a column, so only a ripple can be provided. All the clever logic schemes are highly irregular and must use LUT logic, which is about 3-5x slower than ASIC logic.
I also wanted a 32-bit add for a CPU. The ripple was too slow, IIRC about 170MHz for a simple a+b registered on inputs & output, using the fastest speed grade for V2Pro.
For a while I used a CSA (carry-select) array, i.e. 7 8-bit ripple add sections with a following stage to combine the proper select carries. It did cycle faster, maybe near 300MHz IIRC, but it used about 3x the area and needs the extra pipeline stage. Another downside is that in order to use CSA, the addition is done twice, with a carry-in of both 1 and 0, for bits 8-15, 16-23 and 24-31. This doubles the fanout on the registers driving the duplicate adders, which limits the speedup and complicates hand placement. CSA, CLA and Ling schemes are better used for ASICs and full custom. I believe some Altera devices have CSA adder logic built in.
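To make the 7-section arrangement concrete, here is a rough behavioral model in Python. It is only a sketch of the arithmetic, not HDL and not my actual design: the names add8 and carry_select_add32 are made up for illustration, and the pipeline register between the adders and the select stage is left out.

```python
MASK8 = 0xFF

def add8(a, b, cin):
    """One 8-bit ripple section: returns (sum, carry_out)."""
    total = (a & MASK8) + (b & MASK8) + cin
    return total & MASK8, total >> 8

def carry_select_add32(a, b):
    """32-bit add built from seven 8-bit sections.

    Bits 0-7 use one section with a real carry-in of 0. Each upper
    byte uses two sections (carry-in 0 and carry-in 1), and the
    carry from the byte below selects which result is kept.
    """
    result, carry = add8(a, b, 0)
    for shift in (8, 16, 24):
        a8 = (a >> shift) & MASK8
        b8 = (b >> shift) & MASK8
        sum0, c0 = add8(a8, b8, 0)   # speculative result, carry-in = 0
        sum1, c1 = add8(a8, b8, 1)   # speculative result, carry-in = 1
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << shift
    return result, carry
```

In hardware all seven sections run in parallel and only the select chain is serial, which is where the speedup comes from. The two add8 calls per upper byte are the duplicate adders that double the register fanout.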
In the end I flipped to an alternate approach: a 2-cycle design that is limited to the 16-bit critical path. This also happens to be very close to the blockram cycle time, so now in 2 clocks I get a 4-port RAM and a 32-bit add at near 150MHz (the actual clock is 2x that). It also uses far less HW and had other simplifying effects elsewhere. Then by placing a pair of 2-cycle CPUs on opposing clocks, it gets the equivalent performance of one 300MHz CPU datapath. This idea can usually be copied for DSP engines pretty well.
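The 2-cycle add can be sketched the same way. Again this is just a Python model of the arithmetic under my own naming (add32_two_cycle, low_sum_reg and carry_reg are illustrative, not names from the real design), and it does not say whether one 16-bit adder is reused or two are used:

```python
MASK16 = 0xFFFF

def add32_two_cycle(a, b):
    """32-bit add split into two 16-bit steps, one per fast clock."""
    # Cycle 1: add the low halves; the 16-bit sum and the carry-out
    # are held in registers for the next cycle.
    low = (a & MASK16) + (b & MASK16)
    low_sum_reg = low & MASK16
    carry_reg = low >> 16

    # Cycle 2: add the high halves plus the registered carry, so no
    # single cycle ever sees more than a 16-bit ripple.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry_reg
    return ((high & MASK16) << 16) | low_sum_reg
```

Each call stands for two ticks of the fast clock, so one datapath finishes a 32-bit add every other tick. Two such datapaths started one tick apart finish one add per tick between them.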
johnjakson at usa dot com