I started with Ray Andraka's papers, "A Survey of CORDIC algorithms..." and "High Performance DDC for FPGAs". From the second paper, it seems that DDS designs needing more than about 10 bits of phase resolution are better done with other methods, such as CORDIC.
Using the CORDIC should also give you the complex mix for free. I'm thinking that you would put your real input into x, set y to zero, and feed the phase accumulator value for your desired channel LO into the angle (z) input. Then you get a "de-rotated" I and Q from the X and Y outputs. There's some quadrant mapping in there too.
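To make the idea concrete, here is a minimal floating-point sketch in Python of what I have in mind. It is only a behavioral model, not the Xilinx block: the names `cordic_rotate` and `mix_down`, the 16-iteration default, and the choice to fold the quadrant mapping into a 180-degree pre-rotation are all my own assumptions.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians with rotation-mode CORDIC (float model)."""
    # Quadrant mapping (assumed scheme): wrap to [-pi, pi), then pre-rotate
    # by 180 degrees so the residual angle lies within [-pi/2, pi/2].
    angle = (angle + math.pi) % (2.0 * math.pi) - math.pi
    if angle > math.pi / 2:
        x, y, angle = -x, -y, angle - math.pi
    elif angle < -math.pi / 2:
        x, y, angle = -x, -y, angle + math.pi

    z = angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate toward z == 0
        shift = 2.0 ** -i                    # a right shift in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)            # arctan table entry
        gain *= math.sqrt(1.0 + shift * shift)
    # Remove the constant CORDIC gain (about 1.647) so outputs are unscaled.
    return x / gain, y / gain

def mix_down(samples, f_lo, fs, iterations=16):
    """Real input on x, y = 0, phase accumulator on the angle input."""
    phase = 0.0
    step = 2.0 * math.pi * f_lo / fs
    out = []
    for s in samples:
        # Rotating by -phase gives I = s*cos(phase), Q = -s*sin(phase).
        out.append(cordic_rotate(s, 0.0, -phase, iterations))
        phase = (phase + step) % (2.0 * math.pi)
    return out
```

Feeding a real tone at the LO frequency through `mix_down` should put half of its amplitude at DC on I, with the image at twice the LO frequency left for the downstream filter to remove.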
I'm going to start hacking on one of the Xilinx System Generator CORDIC blocks (SINCOS?) to get what I need. In practice, how many iterations or PEs do you need to reach a high SFDR, say 96 dB? For DDS functions there's a formula, ceil(SFDR/6), for the phase width. Is there something similar for a CORDIC implementation?
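For reference, this is how I am applying the DDS rule of thumb, written out as a tiny Python sketch. The function name `phase_width_bits` is just something I made up for illustration, and the table depth shown assumes a plain full-wave lookup with no compression.

```python
import math

def phase_width_bits(sfdr_db):
    """Phase width from the ceil(SFDR/6) rule, about 6 dB per phase bit."""
    return math.ceil(sfdr_db / 6)

bits = phase_width_bits(96)
print(bits)        # 16
print(2 ** bits)   # 65536 entries for an uncompressed full-wave table
```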
For the 96 dB phase-dithered DDS, I'm seeing 15 Block RAMs required, which is expensive. I'm assuming the cores (Sysgen blocks) are using quarter-wave tables? The Taylor-series DDS reduces the RAM usage but requires multipliers.
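By quarter-wave table I mean something like the following Python sketch, which stores only the first quadrant and rebuilds the rest from symmetry. I do not know how the Xilinx cores actually do it. The half-sample phase offset, the function names, and the float table entries are my own assumptions.

```python
import math

def make_quarter_table(phase_bits):
    """First-quadrant sine samples, offset by half a step for exact symmetry."""
    n = 1 << phase_bits
    return [math.sin(2.0 * math.pi * (i + 0.5) / n) for i in range(n // 4)]

def quarter_wave_sin(phase, phase_bits, table):
    """Look up sin for an integer phase word using only the quarter table."""
    quarter = 1 << (phase_bits - 2)
    quadrant = (phase >> (phase_bits - 2)) & 3   # top two phase bits
    idx = phase & (quarter - 1)                  # remaining phase bits
    if quadrant & 1:                             # quadrants 1 and 3: mirror
        idx = quarter - 1 - idx
    value = table[idx]
    return -value if quadrant & 2 else value     # quadrants 2 and 3: negate
```

This cuts the table to a quarter of the full-wave depth at the cost of an index mirror and a sign flip, which is why I would expect it in a RAM-limited core.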
Thanks for any suggestions, Brady