Out of curiosity: I am currently studying some problems which require many population- count operations ov various sizes.
I have used population count in a course I taught last year, as it can demonstrate various design styles (function, recursive function, nested GENERATE statements, etc.) in VHDL. I used a tree of full adders and the performance on a Spartan 3 as reported by the xilinx tools was moderate (using pipelining). * I think that one should get a much faster implementation even on an FPGA if one uses redundant representations very much like multipliers do (which do essentially several population circuits of the partial products).
So, FPGA-experts: is that true, from which pipeline depth/input size is it worth? How is the cost/performance compared to an full-adder implementation?
Theoreticians: I could not find a complexity expression for this operation. But it sounds like textbook knowledge. Maybe I have the wrong textbooks.
Andreas