Hi there,
I would appreciate some suggestions on how to start making my design faster. My design is a processor (18 bit datapath) and the critical path looks like this:
- Instruction register (containing number of register)
- Register file (distributed RAM)
- Mux (2-way, selects either register or RAM)
- Mult18x18 within the ALU
- Mux (ALU output selector)
- Register file (distributed RAM)
Target is a Spartan3 speed grade 4. I ran PAR at highest effort. Timing report:

Delay type         Delay(ns)  Logical Resource(s)
----------------------------  -------------------
Tiockiq               0.259   EX_Instr_adr1_1
net (fanout=18)       2.114   EX_Instr_adr1
Tilo                  0.608   regs_a10_Mram_RAM_inst_ramx_0.F
net (fanout=2)        0.693   EX_Regs1do
Tilo                  0.608   data1mux_Mmux_q_Result1
net (fanout=5)        2.617   EX_Data1
Tmult                 3.493   alu_Mmult_prod_inst_mult_0
net (fanout=1)        2.378   alu_prod
Tilo                  0.550   alu_result16
net (fanout=3)        1.061   EX_Data3
Tds                   0.519   regs_a14_Mram_RAM_inst_ramx_0.F
----------------------------  ---------------------------
Total                14.900ns (6.037ns logic, 8.863ns route)
                              (40.5% logic, 59.5% route)
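To make it easier to see where the time goes, the figures from the report above can be tallied with a short Python sketch. Python is used here only for the arithmetic. The variable names are my own, and the clock-rate estimate is a rough one that assumes the clock period equals this path delay, with no allowance for skew or margin.

```python
# Delays copied from the timing report: (delay type, ns, resource or net name).
path = [
    ("Tiockiq", 0.259, "EX_Instr_adr1_1"),
    ("net",     2.114, "EX_Instr_adr1"),   # fanout 18
    ("Tilo",    0.608, "regs_a10_Mram_RAM_inst_ramx_0.F"),
    ("net",     0.693, "EX_Regs1do"),      # fanout 2
    ("Tilo",    0.608, "data1mux_Mmux_q_Result1"),
    ("net",     2.617, "EX_Data1"),        # fanout 5
    ("Tmult",   3.493, "alu_Mmult_prod_inst_mult_0"),
    ("net",     2.378, "alu_prod"),        # fanout 1
    ("Tilo",    0.550, "alu_result16"),
    ("net",     1.061, "EX_Data3"),        # fanout 3
    ("Tds",     0.519, "regs_a14_Mram_RAM_inst_ramx_0.F"),
]

# Split the path into logic delay and routing (net) delay.
logic_ns = sum(ns for kind, ns, _ in path if kind != "net")
route_ns = sum(ns for kind, ns, _ in path if kind == "net")
total_ns = logic_ns + route_ns

# Rough clock-rate estimate (assumption: period == path delay, no skew or margin).
fmax_mhz = 1000.0 / total_ns

# Nets ordered from slowest to fastest, to show which ones dominate routing.
nets_by_delay = sorted(
    ((name, ns) for kind, ns, name in path if kind == "net"),
    key=lambda item: item[1],
    reverse=True,
)

print(f"logic {logic_ns:.3f} ns, route {route_ns:.3f} ns, total {total_ns:.3f} ns")
print(f"estimated fmax: {fmax_mhz:.1f} MHz")
for name, ns in nets_by_delay:
    print(f"{name:15s} {ns:.3f} ns")
```

Sorted this way, the fanout-18 address net (EX_Instr_adr1) is only the third-slowest net. EX_Data1 (fanout 5) and alu_prod (fanout 1) are both slower, so fanout alone may not explain the routing delay.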
Now how could I start improving the design? I don't want to split this up into two cycles (because instruction level parallelism is low and I need one result to compute the next).
I notice that the net delay of the instruction register is quite high. Does this have to do with the fanout? Fanout is 18 (because the value is used as an address to 18 parallel distributed RAM LUTs). I've heard of duplicated registers. Would that help? And then, how would I achieve it? Automatically through a setting? Manually? Is there an elegant way to do it?
Another thing I've heard about is RLOC constraints. I have never dared to try them so far. Do you think they could improve the design, and if so, by how much?
Of course, I highly appreciate any (other?) suggestions on how to speed up my design. I might also consider changing the architecture, if it doesn't mean I have to change the whole concept of my processor.
Also, I am looking for good literature on FPGA implementation.
Thanks in advance! K.B.