This excellent website discusses fast division:
I was able to build the tail object so that it uses the carry-chain hardware in a Virtex2 (by K-mapping the logic table). Unfortunately, the triple head and double head are not nearly as easy. Has anyone done any work to make the triple head use the carry-chain logic? If it's not possible, we should look at changing the slice layout in future chips. This is one of those parts that uses way more than its fair share of slices.

My quick implementation of the triple head used two 3-bit adders and a 6-bit-wide, 16-input mux. I recognize that on a Stratix2 both of those adders could be done in fewer LUTs, but I'm still left with a 6x16 lookup table.