Special-purpose compute engines are unavoidably rather expensive.
The second-biggest Spartan3 chip has 96 18x18->36 multipliers, which gives you eight 54x54 multipliers (nine 18x18 blocks each) and a 72x72 (sixteen blocks) to work with. A medium-sized XC2VP50 Virtex2 Pro has 232 of the 18x18 multipliers, and a pair of PPC405 CPUs -- and sixteen Hypertransport links -- but probably costs as much as a Madison 1300MHz/3MB (i.e. $2000 or so). But I don't know what speed you can clock that great array of multipliers at.
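To make the multiplier counting concrete, here is a rough Python sketch of schoolbook tiling: a wide product is split into 18-bit limbs and every pair of limbs costs one 18x18 block. The function names are my own, and it assumes the blocks behave as unsigned 18x18 multipliers; if the hardware blocks are signed, the usable unsigned width per block shrinks and the counts go up, which this sketch ignores.

```python
def tiles_needed(width_bits, tile_bits=18):
    """Count of tile x tile multipliers for a width x width schoolbook product."""
    limbs = -(-width_bits // tile_bits)  # ceiling division
    return limbs * limbs


def mul_from_tiles(a, b, width_bits, tile_bits=18):
    """Multiply two unsigned width_bits-wide integers using only tile-sized products.

    Assumption: each tile is an unsigned tile_bits x tile_bits multiplier.
    """
    mask = (1 << tile_bits) - 1
    limbs = -(-width_bits // tile_bits)
    total = 0
    for i in range(limbs):
        a_limb = (a >> (i * tile_bits)) & mask
        for j in range(limbs):
            b_limb = (b >> (j * tile_bits)) & mask
            # One hardware multiplier per (i, j) pair; the shift is just wiring.
            total += (a_limb * b_limb) << ((i + j) * tile_bits)
    return total


if __name__ == "__main__":
    used = 8 * tiles_needed(54) + tiles_needed(72)
    print("blocks used out of 96:", used)
    print("54x54 products from 232 blocks:", 232 // tiles_needed(54))
```

Under those assumptions, eight 54x54 products plus one 72x72 use 88 of the 96 blocks, and 232 blocks would cover 25 products of 54x54.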
[Note: I've cross-posted this to comp.arch.fpga in case they know the speed and cost details off the top of their heads; the idea is to implement an array of double-precision FMA units on an FPGA, to see how they'd compare to the few much-faster-clocked FMAs on high-end CPUs. I don't know how exotic the Spartan3/4000 or the XC2VP50 are.]

Tom