A number of these papers fail to search out the best space-time tradeoffs. Mistakes like doing 64-bit floating-point multipliers the hard way in an FPGA, or doing an FFT/IFFT as a wide parallel datapath, which isn't always the best space-time tradeoff.
There are MANY other architectures that can be developed to optimize the performance of a particular application on an FPGA, besides brute-force implementation of wide RISC/CISC processor core elements. Frequently bit-serial will yield a higher clock rate (as it doesn't need a long carry chain), and doesn't need extra logic for partial sums or carry lookahead, so it also delivers more functional units per part, at the cost of latency, which can frequently be hidden by the faster clock rate and high functional density per part. It can also remove memory as a staging area for wide parallel functional units, and thus remove a serialization imposed by the solution's architecture.
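For anyone who hasn't played with it, here's a rough bit-accurate C sketch of what a single bit-serial adder cell does (LSB first, one full adder plus a one-bit carry register per functional unit). The 32-bit width and the C modelling itself are just for illustration; the real thing is a LUT and a flip-flop per unit, clocking one bit per cycle with no carry chain to limit Fmax:

/* Bit-serial addition, LSB first: one full adder plus a one-bit
 * carry register per functional unit, one bit per clock. */
#include <stdint.h>
#include <stdio.h>

uint32_t bitserial_add(uint32_t a, uint32_t b, int width)
{
    uint32_t sum = 0;
    unsigned carry = 0;                   /* the single carry flip-flop */
    for (int t = 0; t < width; t++) {     /* one clock per bit position */
        unsigned ai = (a >> t) & 1;
        unsigned bi = (b >> t) & 1;
        unsigned s  = ai ^ bi ^ carry;                    /* full-adder sum   */
        carry       = (ai & bi) | (carry & (ai ^ bi));    /* full-adder carry */
        sum |= (uint32_t)s << t;
    }
    return sum;
}

int main(void)
{
    printf("%u\n", bitserial_add(1234567u, 7654321u, 32));  /* prints 8888888 */
    return 0;
}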
Bit-serial operations using Xilinx LUT FIFOs can be expensive in both power and clock rate, but that is not the only way to use LUTs for bit-serial memory. Consider using some Gray-code counters and using the LUTs simply as 16x1 RAMs instead ... faster and less dynamic power.
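As a rough illustration of the Gray-counter trick (the depth and the read-before-write discipline here are my assumptions for the sketch), a 16-deep bit delay line built from one 16x1 RAM only toggles one address bit per clock instead of shifting all sixteen stored bits:

/* 16-deep bit delay line from a 16x1 RAM addressed by a Gray-code
 * counter: read-before-write at one address per clock gives a fixed
 * 16-cycle delay, and only one address bit toggles per step. */
#include <stdint.h>
#include <stdio.h>

static unsigned gray4(unsigned i) { return (i ^ (i >> 1)) & 0xF; }

int main(void)
{
    uint8_t ram[16] = {0};          /* models one LUT used as a 16x1 RAM */
    unsigned count = 0;

    for (int clk = 0; clk < 48; clk++) {
        unsigned addr = gray4(count);        /* Gray-coded address        */
        uint8_t in  = (uint8_t)(clk & 1);    /* some serial input stream  */
        uint8_t out = ram[addr];             /* bit written 16 clocks ago */
        ram[addr] = in;                      /* read-before-write         */
        count = (count + 1) & 0xF;
        if (clk >= 16)
            printf("clk %2d: out=%u (input at clk %d was %u)\n",
                   clk, out, clk - 16, (clk - 16) & 1);
    }
    return 0;
}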
There are lots of ways to get unexpected performance from FPGAs, but not by doing it the worst possible way.
Be creative. $30M US of FPGAs and memories can easily build a 1-10 petaflop supercomputer that would smoke existing RISC/CISC designs ... we just don't have good software tools and compilers to run applications on these machines, nor have we developed enough programming talent used to getting good/excellent performance from these devices.
There are a few dozen better ideas about how to make FPGAs as we know them today into the processor chip of tomorrow, but that is another discussion.
Consider that distributed arithmetic is what made FPGAs popular for high-performance integer applications, and it's not even a basic construct available from any of the common compilers or HDLs. Consider the space-time performance of three-variable floating-point multiply-accumulate (MAC) algorithms using this approach for large matrix operations.
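For anyone who hasn't run into distributed arithmetic, here's a rough C reference of the classic LUT-and-shift-add form of a three-variable constant-coefficient MAC. The coefficients, the 16-bit inputs, and the fully serial schedule (one bit of every input per clock) are illustrative assumptions, not anyone's production core; the point is that there isn't a multiplier anywhere in it:

/* Distributed-arithmetic 3-input MAC with constant coefficients:
 * precompute all 2^3 partial sums of the coefficients once, then
 * process one bit of every input per "clock" with shift-and-add. */
#include <stdint.h>
#include <stdio.h>

#define NIN   3          /* number of input variables            */
#define BITS  16         /* two's-complement width of each input */

int64_t da_mac(const int32_t c[NIN], const int16_t x[NIN])
{
    /* LUT[j] = sum of c[k] for every k whose bit is set in j. */
    int64_t lut[1 << NIN];
    for (int j = 0; j < (1 << NIN); j++) {
        lut[j] = 0;
        for (int k = 0; k < NIN; k++)
            if (j & (1 << k)) lut[j] += c[k];
    }

    /* Serial shift-add over bit positions; the MSB (sign) term is subtracted. */
    int64_t acc = 0;
    for (int b = 0; b < BITS; b++) {
        unsigned addr = 0;
        for (int k = 0; k < NIN; k++)
            addr |= (((uint16_t)x[k] >> b) & 1u) << k;
        int64_t term = lut[addr] * ((int64_t)1 << b);
        acc += (b == BITS - 1) ? -term : term;   /* two's-complement sign bit */
    }
    return acc;   /* equals c[0]*x[0] + c[1]*x[1] + c[2]*x[2] */
}

int main(void)
{
    int32_t c[NIN] = { 37, -12, 5 };
    int16_t x[NIN] = { 1000, -2000, 300 };
    printf("DA:     %lld\n", (long long)da_mac(c, x));
    printf("direct: %lld\n",
           (long long)c[0]*x[0] + (long long)c[1]*x[1] + (long long)c[2]*x[2]);
    return 0;
}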
Consider this approach for doing high-end energy/force/weather simulations using a traditional red/black interleave, as you would use for these applications under MPI. 3-, 6-, 9-, and 12-variable MACs are a piece of cake with distributed arithmetic, and highly space-time efficient. The core algorithms of many of these simulations are little more than MACs, frequently with constants, or near-constants that seldom need to be changed.
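To make the point concrete, a toy red/black relaxation sweep looks like the sketch below: every interior update is nothing but a four-input constant-coefficient MAC, which is exactly the shape a distributed-arithmetic pipeline handles well. Grid size, boundary values, and iteration count here are made up for illustration:

/* Red/black relaxation sweep on a 2D grid: each interior update is a
 * constant-coefficient MAC over the four neighbours. */
#include <stdio.h>

#define N 16

static void sweep(double u[N][N], int color)   /* color: 0 = red, 1 = black */
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            if (((i + j) & 1) == color)
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                  u[i][j-1] + u[i][j+1]);
}

int main(void)
{
    double u[N][N] = {{0}};
    for (int j = 0; j < N; j++) u[0][j] = 100.0;   /* hot top edge */

    for (int it = 0; it < 200; it++) {             /* red sweep, then black */
        sweep(u, 0);
        sweep(u, 1);
    }
    printf("center value: %f\n", u[N/2][N/2]);
    return 0;
}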
Consider that for many applications the dynamic range needed during most of the simulation is very limited, allowing systems to be built with FP on both ends of the run and scaled integers in the middle of the run, simplifying the hardware and improving the space-time fit even more.
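A rough sketch of that idea (block size, headroom, and data are all illustrative assumptions): convert once from FP to a common power-of-two scale, grind through the MACs as plain integers, and convert back at the end:

/* Scaled-integer "middle of the run": floats in, one common fixed-point
 * scale chosen from the block's dynamic range, integer MACs, float out. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLK 8

int main(void)
{
    double x[BLK] = { 0.11, -0.42, 0.35, 0.07, -0.19, 0.28, -0.33, 0.05 };
    double w[BLK] = { 1.5, 2.0, -0.5, 3.0, 1.0, -2.5, 0.75, 1.25 };

    /* Pick one power-of-two scale from the block's dynamic range. */
    double maxmag = 0.0;
    for (int i = 0; i < BLK; i++)
        if (fabs(x[i]) > maxmag) maxmag = fabs(x[i]);
    int e;
    (void)frexp(maxmag, &e);              /* maxmag ~= m * 2^e           */
    int shift = 23 - e;                   /* keep values inside 24 bits  */

    /* Float in -> scaled integers -> integer MAC -> float out. */
    int64_t acc = 0;
    for (int i = 0; i < BLK; i++) {
        int32_t xi = (int32_t)lrint(ldexp(x[i], shift));
        int32_t wi = (int32_t)lrint(ldexp(w[i], shift));
        acc += (int64_t)xi * wi;
    }
    double y = ldexp((double)acc, -2 * shift);

    double ref = 0.0;
    for (int i = 0; i < BLK; i++) ref += x[i] * w[i];
    printf("scaled-int: %f   float ref: %f\n", y, ref);
    return 0;
}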
The big advantage of FPGAs is breaking the serialization that memory creates in RISC/CISC architectures. Memoryless computing using pipelined distributed arithmetic is the ultimate speedup for many applications, including a lot of computer vision and pattern recognition applications.
So read the papers carefully, and consider whether there might be a better architecture to solve the problem. If so, take the numbers and conclusions presented with a grain of salt.