That is very familiar from college, but I don't recall the utility. It would be useful for multiplying by constants, but otherwise how would this be used to advantage? It would save add/subtract operations, but I can't think of another situation where this would be useful.
If the algorithm is doing an add and shift, the add does not increase the time or the hardware. If building the full multiplier, an adder is included for each stage, it is either used or not used. When done in software, the same applies. It is easier to do the add than to skip over it.
I asked here and on comp.arch.embedded. It's for a guy who's doing assembly-language programming on a PIC12xxx -- for that guy, and for a small range of constants (4 through 7), it can save time over a full- blown multiplication algorithm.
Oh, sure, I use that all the time for constant multiplication. No magic involved. In some cases (such at multiplying by 4) it is simpler to just use a simple shift and add (which becomes trivial for a single bit). Even for multipliers with two bits set, it is easier to use a simple add the shifted values. In fact, the only case of multiplying by numbers in the range of 4 to 7 where Booth's algorithm simplifies the work is 7. Since there are only four cases, I expect this could easily be implemented as four special cases.
I think you need to look at the algorithm again. The degenerate case is a multiplier with alternating ones and zeros. An add or subtract is needed at each 1->0 or 0->1 transition. Since every bit is a transition you still need an adder/subtractor for every bit.
Of course you could add logic to detect these cases and do fewer adder ops, but then that is not Booth's algorithm anymore and is much more complex. Booth's algorithm looks at pairs of bits in the multiplier, this would require looking at more bits.
If you are thinking in terms of constant multiplication then again, this is a modified method that combines Booth's with straight adds.
Multiplexers are not low cost in any sense in many technologies, but it doesn't matter. Booth's algorithm doesn't use multiplexers.
In an FPGA the unit of size is the Logic Cell (LC), a LUT and a FF. Because there is extra carry logic in the LC one bit of an adder is the same logic as one bit of a multiplexer. The only disadvantage of an adder is the carry propagation time, but they don't cascade, properly designed you end up with one ripple cascade through one adder and then the other logic delays as the adders all cascade in parallel.
I don't see how you use multiplexers instead of adders. If the multiplier changes from one calculation to the next you need adders in every position. If the multiplier is fixed the combinations of sums is fixed and no multiplexers are needed.
and N/2 multiplexers which have the same delay... PLUS the selectable inverter which may or may not be combined with the adder. What's the point? A simple adder has N stages of delay in an FPGA, same as the much more complicated modified Booth's adder. In an ASIC there may be some advantage. In software, I expect the much more complicated control will make the modified Booth's algorithm the slowest of the three.
People so often forget that multiplexers are not trivial logic.
On 9/3/2016 8:46 AM, firstname.lastname@example.org wrote:
Look at your diagram again. The multiplexer is muxing the output of the previous stage or the output of the stage prior to that if the previous stage is skipped. The multiplicand is the other input to the adder and the control signal is derived from the multiplier which means there is an added level of delay into the first stage with then has to ripple through the entire structure. The multiplexers *must* be in the path of the additions.
I'm not sure how much faster a multiplexer is compared to the adder. Even if you bypass a bunch of adders, you pass through an equivalent number of muxes. In an FPGA this is equivalent in speed. In an ASIC you may see some speed advantage.
Bit late but it sounds like Horner decomposition. Yep that guy..
I couldn't find a quick reference on line. The notes in our compilers for inline constant multiplies is implemented with Horners polynomial decomposition. This requires no temporary space and only uses shifts and adds.
I remember writing the code on a weekend on a trip where I was getting caught up on a bunch of potentially interesting papers and it was pouring rain. I decided this would be just as much fun anyway.
Old enough the paper is in a banker box close enough to the center that it hasn't become nesting material for the field mice.
This diagram is not totally clear, but it does not appear to be implementable in an FPGA using the carry chain. Without carry chain the adds get much slower and/or use twice as much space. The decoders and multiplexers do result in a larger design in most FPGAs.
It might be possible to use the carry chain across the adds. This would be difficult to code in a way that is implemented with carry chains. Without the carry chains, the full adder bits require running through an extra LUT both for the input and the output in the devices I have looked at.
If this multiplier were to be pipelined, it would be exceedingly large using 2N FFs for every stage. In that case the multiplier would also have to be pipelined resulting in 3N FFs per stage.
Regardless, this is a fairly pointless exercise for FPGAs since most of them offer hard multipliers which run much faster than multipliers in the fabric typically. I think this would only be useful in an ASIC.