Clock speed problem. How can I proceed?

I have a 500K-gate, 40MHz design. When I first implemented it on an xc2v6000, P&R succeeded with 2ns of margin. I have now inserted a parallel-to-serial (P-S) converter, using a DCM to generate an 80MHz clock synchronized with the original 40MHz clock. The logic in the P-S converter is quite simple. However, after I ran P&R, the 40MHz clock failed timing: its max path jumped to 45ns, while the 80MHz clock came in at 10.5ns.

I thought the problem was caused by the P-S converter and the output path, so I added a ring of 8 registers to the serial output in order to break up the output path and minimize its effect on the 40MHz design (whose timing was already very tight). Now I get 10.2ns on the 80MHz clock and 37ns on the 40MHz clock, which is still far from my goal. Device resources are still abundant at this point.
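To put those figures against the clock periods, here is a rough Python sketch of the slack arithmetic. It assumes the reported numbers are worst-case path delays measured against a full clock period; `period_ns` and `slack_ns` are just illustrative helper names, not anything from the tools.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(freq_mhz: float, max_path_ns: float) -> float:
    """Positive slack means the path fits in one period; negative means a violation."""
    return period_ns(freq_mhz) - max_path_ns


# (label, clock in MHz, reported max path in ns), taken from the P&R runs above
runs = [
    ("40MHz, after adding P-S", 40.0, 45.0),
    ("80MHz, after adding P-S", 80.0, 10.5),
    ("40MHz, with register ring", 40.0, 37.0),
    ("80MHz, with register ring", 80.0, 10.2),
]

for label, freq, path in runs:
    print(f"{label}: period {period_ns(freq):.1f}ns, slack {slack_ns(freq, path):+.1f}ns")
```

So the 80MHz clock fits within its 12.5ns period in both runs, while the 40MHz clock overshoots its 25ns period by 20ns and then by 12ns.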

Am I doing the right thing? What tricks would you advise?

Thank you

