Comments following.
Berty's general comment corresponds to a key part of solving a critical-path problem: isolate it. Several times when I had trouble meeting timing on a critical path (generally because the tools could not map and/or place things well), I would break out that portion of the design (figuratively, by removing other constraints, or literally, by creating a test design containing only the portion of interest) and play with location and timing constraints. The smaller, or more loosely constrained, design routes MUCH faster, so you can iterate. In the process, I found out that the Xilinx 4k series (XL, XLA?) had a VERY fast input-to-output path if you just needed an inverter. However, you had to choose the correct IOB groupings: the pins not only had to be close, there couldn't be any unbonded IOs between them. I also found that left-to-right was different from right-to-left for adjacent CLBs.
These and other design tricks were learned over the years by experimenting; it's how you learn to get the most performance out of the parts (along with reading the app notes and data sheets, following the relevant newsgroups, etc.). You tend to leave significant performance on the table if you just press "Run" and hope for the best.
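
As a rough illustration of the isolate-and-iterate approach described above, here is a minimal Python sketch that generates one small constraint file per candidate pin pair for a stripped-down test design. Everything in it is an assumption for illustration: the net names (sig_in, sig_out), the pin names, the 5 ns target, and the UCF-style constraint text, which should be checked against the constraints documentation for your tool version. Trying each pair in both directions is simply a cheap way to catch direction-dependent differences.

```python
# Sketch only: builds UCF-style constraint text for an isolated test design.
# Net names, pin names, and the delay target are hypothetical placeholders.

def make_ucf(in_pin, out_pin, max_delay_ns):
    """Return constraint text that pins one input and one output and
    bounds the pad-to-pad delay between them."""
    lines = [
        f'NET "sig_in" LOC = {in_pin};',
        f'NET "sig_out" LOC = {out_pin};',
        f'TIMESPEC "TS_pad2pad" = FROM "PADS" TO "PADS" {max_delay_ns} ns;',
    ]
    return "\n".join(lines) + "\n"


def candidate_pairs(bonded_pins):
    """Pair each pin with its neighbour in the list, in both directions.

    The caller is expected to pass only bonded pins, in physical order,
    so that no unbonded IO sits between the two pins of a pair."""
    pairs = []
    for a, b in zip(bonded_pins, bonded_pins[1:]):
        pairs.append((a, b))
        pairs.append((b, a))
    return pairs


if __name__ == "__main__":
    pins = ["P3", "P4", "P5"]  # hypothetical pin names
    for i, (src, dst) in enumerate(candidate_pairs(pins)):
        print(f"--- trial_{i}.ucf ---")
        print(make_ucf(src, dst, 5), end="")
```

Each generated file would then be run through place and route on the small test design, and the reported delays compared across trials.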
Jason