Hi Mike,
This of course depends on how many of these shift registers you have compared to the other logic in your chip, and whether you're going to be using all the block RAMs for another purpose. "Wasted" blocks are irrelevant when your design is using 70% of the logic and 30% of the RAMs in the chip.
You can implement shift registers by using the altshift_taps megafunction in Quartus II. Assuming these are single-tap shift registers, you can fit 36 of them per Cyclone II/III M4K/M9K block RAM. Or you can choose to implement them in LEs if you'd prefer (by changing a flag in the megafunction), which may be the correct choice if your design has spare FFs lying around. Quartus will automatically use the LUTs for other functions in your design, so you are not "wasting" the rest of the LAB unless your design is FF limited.
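To make the "36 per block RAM" point concrete, here is a rough software model of the idea, in Python rather than HDL. It is only a sketch under my own assumptions: that the 36 comes from using the block at a 36-bit word width, with each bit lane acting as one independent single-tap shift register, and that the RAM is addressed as a circular buffer. The class and method names are made up for illustration, and this is not meant to describe how the megafunction is actually built.

```python
class RamShiftRegisterBank:
    """Model of several 1-bit shift registers sharing one RAM block.

    Each RAM word holds one bit per lane, so a 36-bit-wide RAM carries
    36 independent shift registers. The depth is the delay in clocks.
    """

    def __init__(self, lanes=36, depth=16):
        self.lanes = lanes
        self.depth = depth
        self.mem = [0] * depth  # RAM contents, one word per delay step
        self.addr = 0           # single pointer shared by read and write

    def clock(self, word_in):
        """Advance all lanes one step; return the word delayed by `depth`."""
        mask = (1 << self.lanes) - 1
        word_out = self.mem[self.addr]        # read the oldest word first
        self.mem[self.addr] = word_in & mask  # then overwrite it with the newest
        self.addr = (self.addr + 1) % self.depth
        return word_out


bank = RamShiftRegisterBank(lanes=36, depth=4)
outputs = [bank.clock(value) for value in (1, 2, 3, 4, 5, 6)]
print(outputs)  # the first four outputs are the initial zeros
```

The point of the model is that no data actually moves: only the address pointer advances, which is why one block can stand in for 36 separate chains of flip-flops.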
If you use Spartan-3E, it will take you 36 LCs. From a silicon area perspective, an M4K/M9K RAM takes a similar amount of area, so *architecturally* distributed RAM doesn't help in this case -- which doesn't mean much to you since you can't change the number of RAMs we choose to put in a chip.
In case this factors into your decision, remember that with Cyclone III, you can buy a lot more logic / RAMs per $ than you can with an older 90-nm family. And you get lower power to boot. Also, if performance is a factor at all, Cyclone II/III offer significantly higher performance.
Distributed RAM isn't useless. There are certain applications where it makes sense. But applications where many small, narrow, independent memories are (a) needed and (b) dominate the logic/RAM utilization are rare. And in exchange, each logic cell costs more silicon area to build -- would you prefer 10% or 20% more logic cells in your chip, or would you prefer distributed RAM? Different customers will say different things.
Regards,
Paul Leventis
Altera Corp.