You only need one fast clock, at whatever frequency you can get near fmax, plus clock enables to move the datapath pipeline forward on every third fast clock. The only thing I don't like about that is the energy needed to clock all the logic at 3x the enabled rate, but it is safe. A more elaborate clock system could produce 3x and 1x clocks without skew too.
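To make the clock-enable idea concrete, here is a rough behavioural model in Python (not HDL, just a sketch). The function name, the two-stage datapath and the sample values are all made up for illustration. The only point is that every register sees the fast clock but only updates when the divide-by-3 enable is high.

```python
def simulate_clock_enable(samples, fast_cycles, ratio=3):
    """Model a 2-stage datapath on one fast clock, advanced by a clock enable.

    Hypothetical illustration: 'ratio' is the fast-clock to datapath-rate ratio.
    """
    phase = 0                # free-running divide-by-ratio counter
    stage1 = stage2 = None   # datapath pipeline registers
    source = iter(samples)
    trace = []
    for _ in range(fast_cycles):
        enable = (phase == ratio - 1)    # high for one fast clock in every 'ratio'
        if enable:
            # Both registers update on the same fast-clock edge.
            stage2 = stage1
            stage1 = next(source, None)
        phase = (phase + 1) % ratio
        trace.append((enable, stage1, stage2))
    return trace


if __name__ == "__main__":
    for cycle, state in enumerate(simulate_clock_enable([10, 20, 30], 9)):
        print(cycle, state)
```

Anything that has to run at the full rate, such as the BRAM side, simply ignores the enable.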
Synthesis will then tell you your new limit is your datapath or control logic, while your BRAMs have good slack. You must have some deep logic paths, which is to be expected for an early design. Perhaps you can now think about using the 3x or full clock to redesign that 100MHz logic to get it to run a bit faster using less hardware and/or more pipeline stages.
Perhaps you have wide adders or multipliers; try pipelining these. Other than that, I can't say without knowing your architecture. I assume you can synthesize each block separately to get a feel for what each block's limit is. When you put them all together they will usually run slower.
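As a sketch of what pipelining a wide adder means, here is a small Python model, assuming a 32-bit add split into two 16-bit halves. The names and widths are only for illustration, not taken from your design. Stage 1 adds the low halves and registers the carry along with the high operand halves; stage 2 finishes the high half one cycle later, so each carry chain is only half as long.

```python
def pipelined_add(pairs, width=32):
    """Behavioural model of a 2-stage pipelined adder (result modulo 2**width).

    Assumes 'width' is even; one new (a, b) pair is accepted per cycle.
    """
    half = width // 2
    mask = (1 << half) - 1
    reg = None       # pipeline register between stage 1 and stage 2
    results = []
    for item in list(pairs) + [None]:    # one extra cycle to flush the pipe
        # Stage 2: high halves plus the registered carry from stage 1.
        if reg is not None:
            low, carry, a_hi, b_hi = reg
            high = (a_hi + b_hi + carry) & mask
            results.append((high << half) | low)
        # Stage 1: low halves only; register the carry and the high operands.
        if item is None:
            reg = None
        else:
            a, b = item
            low_sum = (a & mask) + (b & mask)
            reg = (low_sum & mask, low_sum >> half,
                   (a >> half) & mask, (b >> half) & mask)
    return results


if __name__ == "__main__":
    print(pipelined_add([(0xFFFF, 1), (123456, 654321)]))
```

The cost is one extra cycle of latency and the registers for the high operands, which is usually a good trade when the adder sits on the critical path.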
Remember, in an FPGA the BRAMs are about as good as in an ASIC, but the logic is proportionately, say, 3x slower, so the optimal FPGA architecture can never be the same as for an ASIC. The adders are worse, since you are stuck with ripple designs while real ASIC designs can use all sorts of neat lookahead, carry-save or carry-select schemes, but don't waste time in that direction in an FPGA.
So what is your widest adder or deepest logic path?
John