Somehow, mentioning superscalar just told me your clock is headed downhill. How many LUT levels of logic do you expect, and what is your target frequency? Is this a uni, commercial, or hobby project?
There are really good reasons to consider time driven ports.
A very fast n cycle design can run at the limit of the BRAM, a 16b adder, or about 3 LUT levels of logic, all of which are way faster than, say, a 32b add. Compared to a true simple 1 cycle design, it will use about half the total logic and still execute near 150MHz. Less logic is much easier to floorplan too.
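As a rough illustration of the n cycle idea, a 32b add can be split into two 16b adds on successive clocks, with the carry held in between. This is only a behavioural sketch in Python, not HDL; the function name and the way the carry is kept are assumptions made for the example, not taken from any actual design.

```python
MASK16 = 0xFFFF

def add32_two_cycle(a, b):
    """Model a 32b add done as two 16b adds on successive clocks."""
    # Clock 0: add the low halves; the carry out would be saved in a flip-flop.
    lo_sum = (a & MASK16) + (b & MASK16)
    carry = lo_sum >> 16

    # Clock 1: add the high halves plus the carry saved from clock 0.
    hi_sum = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry

    # Combine the two 16b half results; the carry out of bit 31 is dropped.
    return ((hi_sum & MASK16) << 16) | (lo_sum & MASK16)
```

The critical path in each clock is only a 16b carry chain, which is the point: the adder never has to ripple across all 32 bits in one cycle.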
In my processor design I get 4 effective ports out of 1 BRAM (regRR alternates with regW+fetchI), and that runs at over 300MHz using 2 clocks per register opcode in a V2Pro -5. The datapath combines 2 half 16b results, and the variable length encoded instruction set uses time based muxing to build opcodes rather than lots of mux arrays. The datapath has no register forwarding or hazard logic, since the whole thing runs 4 threads. That's a whole lot of logic not there to slow things down. With 8 clocks per thread opcode, even DRAM cycles don't look so bad, provided only 1 thread does a load/store every 16 cycles or so.
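The 8 clocks per thread opcode figure follows from 4 threads times 2 clocks per opcode. A small Python sketch of that arithmetic, assuming a plain fixed round-robin interleave (the actual thread ordering is not stated here, and the names below are made up for the example):

```python
THREADS = 4            # threads sharing the datapath
CLOCKS_PER_OPCODE = 2  # clocks each register opcode occupies

def owner(clock):
    """Thread that owns the datapath on a given clock (assumed round-robin)."""
    return (clock // CLOCKS_PER_OPCODE) % THREADS

def issue_clocks(thread, n):
    """Clocks at which `thread` starts its first n opcodes."""
    period = THREADS * CLOCKS_PER_OPCODE  # 8 clocks between opcodes of one thread
    return [thread * CLOCKS_PER_OPCODE + k * period for k in range(n)]
```

Because a thread's next opcode starts 8 clocks after the previous one, the result of one opcode is long since written back before the same thread needs it, which is why forwarding and hazard logic can be left out under this assumption.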
This is inspired by commutating, latency hiding DSP design principles rather than the desire to match current full custom CPUs that try (and mostly fail) to get more than 1 opcode per clock. The real problem in computing is not how fast processors might crunch data, but the memory system's ability to feed them.
An earlier design that was straight 1 cycle used 3x the logic, 2x the BRAMs and still couldn't get anywhere near 300MHz/2 with all the side control logic stacking up.
Time driven logic will always run faster than parallel complex logic, but if you are prototyping or just studying computer architecture, clock performance doesn't really matter so much.
FPGAs are good for soft CPU design for true RISC in the John Cocke sense, not the OoO SS VLIW EPIC sense that brute force transistor design makes possible.
John Jakson Transputer guy