MAC Architectures

- T
- Tim Wescott
  
  Contact options for registered users
posted
18 years ago

Wed, Oct 19, 2005 1:25 PM

Jeorg's question on sci.electronics.design for an under $2 DSP chip got me to thinking:

How are 1-cycle multipliers implemented in silicon? My understanding is that when you go buy a DSP chip a good part of the real estate is taken up by the multiplier, and this is a good part of the reason that DSPs cost so much. I can't see it being a big gawdaful batch of combinatorial logic that has the multiply rippling through 16 32-bit adders, so I assume there's a big table look up involved, but that's as far as my knowledge extends.

Yet the reason that you go shell out all the $$ for a DSP chip is to get a 1-cycle MAC that you have to bury in a few (or several) tens of cycles worth of housekeeping code to set up the pointers, counters, modes &c -- so you never get to multiply numbers in one cycle, really.

How much less silicon would you use if an n-bit multiplier were implemented as an n-stage pipelined device? If I wanted to implement a

128-tap FIR filter and could live with 160 ticks instead of 140 would the chip be much smaller?

Or is the space consumed by the separate data spaces and buses needed to move all the data to and from the MAC? If you pipelined the multiplier _and_ made it a two- or three- cycle MAC (to allow time to shove data around) could you reduce the chip cost much? Would the amount of area savings you get allow you to push the clock up enough to still do audio applications for less money?

Obviously any answers will be useless unless somebody wants to run out and start a chip company, but I'm still curious about it.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

- P
- Pramod Subramanyan
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 1:48 PM

There's no lookup table. Its just a BIG cascade of and's. This might help:

formatting link

I think this would lead to lousy performance on small loops - such as those found in JPEG encoding.

Quite a lot of the chip cost depends on the design complexity and the amount of time and money spent in R&D, not to mention the quantity of chips the company hopes to sell, so its not a direct proportional relation between cost and size of chip. If you're trying to save money, you could try using a fast general purpose microcontroller instead of a DSP.

- N
- Noway2
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 1:59 PM

Your question got me thinking, trying to recall the discussions I had in the microprocessor architecture classes. So here is some food for thought:

I seem to recall that (back then? - 99 -> 01) that multipliers were assumed to take multiple cycles, I think for the class purposes we usually assumed three or four cycles. Sometimes the premise was that there were multiple multipliers and other ALU units that could be used simulataneously. If an instruction was set to execute and there weren't resources available, this resulted in a pipeline stall, but otherwise the apparent output was single cycle. I even believe we had test problems dealing with determining how many multipliers a processor required versus other resource items (each with a $ value attatched), given a certain mix of instructions and having to determine the optimal resource mix.

In the latter portions of the class, we got away from the CPU architecture and spent a lot of time dealing with the concept of maintaining single cycle execution through the use of compiler scheduling. A lot of emphasis was placed on scheduling algorthims that scanned for data and resource dependancies and how code will get executed out of sequence to maximize resource utilization.

Another concept that was raised is the idea of sub cycle (clocking) or micro-operations where in a single "instruction cylce" multiple processor cycles would occur while still maintaing the apparent single cycle execution.

I would imagine that modern DSPs rely on techniques like these, or some totally new ones, to maximize the throughput.

- T
- Tim Wescott
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 2:42 PM

Interesting. So that's what they actually do in practice, just copy a page out of a textbook? Wouldn't the stages of adders really cause a speed hit? To have your signal ripple through so many stages would require you to slow your clock way down from what it could be otherwise

-- it seems an odd way to build a chip who's purpose in life is to be really fast while doing a MAC.

Good point. Yes it would, unless you used some fancy pipelining to keep the throughput up (which would probably require a fancy optimizer to let humans write fast code).

Yet DSP chips cost tons of money, which disappoints Jeorg who designs for high-volume customers who are _very_ price sensitive. The question was more a hypothetical "what would Atmel do if Atmel wanted to compete with the dsPIC" than "should I have a custom chip designed for my

10-a-year production cycle".

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

- B
- Bob Monsen
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 6:02 PM

Actually, I believe that prices have little to do with cost, particularly in high volume, low material cost items like ICs. This is true until the item has become a commodity, where anybody can make it. At that point, market factors start to bring the prices down. Until that point, pricing is more closely related to the cost of what the item replaces.

In the case of DSP chips, the replacement is a traditional microprocessor, with its fast external memory, PC design and debug time, etc.

So, cutting down on the silicon area won't help prices; it'll just increase the profits of the chipmakers. What helps prices is stiff, fair competition, and lots of it. So, chip makers try to differentiate their designs, making it hard to 'jump ship' and head off in new directions, thus keeping a a particular group of users a 'captive audience'. Once standardization sets in, they are doomed to compete.

--
Regards,
  Bob Monsen

Let us grant that the pursuit of mathematics is a divine madness of the 
human spirit.
- Alfred North Whitehead

- P
- Peter Alfke
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 6:14 PM

Newer FPGAs have lots of fast 18 x 18 multipliers. The humble XC4VSX25 has, among other goodies, 128 such multipliers running at max 500MHz single-cycle rate. The mid-range SX35 has 192, and the top SX55 has 512 such fast 18 x 18 multipliers each with its associated 48-bit accumulator structure. We invite you to keep that kind of arithmetic performance busy... No wonder these FPGAs can outperform sophisticated and expensive DSP chips.

Peter Alfke, Xilinx

- L
- langwadt
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 6:17 PM

Tim Wescott skrev:

snip

afair the delay for the straight forward N*N bit parallel multiplier is

only around double the delay of a N bit adder, i.e. the longest path in the multiplier is lsb to msb plus top to bottom

I think its more likely that they look at different options and find the smallest that is fast enough ;)

have a look at

formatting link

snip

I'm not sure the size of the multiplier makes a big difference, my guess is that if you look at the die you would see that most of it is memory

what price are you looking for?, how much memory?, how fast?

Not that I will build you one, but I'm curious :)

-Lasse

- E
- Eric Jacobsen
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Wed, Oct 19, 2005 7:56 PM

A while back when I was doing such things Wallace Trees and Booth Multipliers were all the rage. Doing a search on those turned up Ray Andraka's page (no big surprise :)) which has a really good discussion on alternatives.

Since then things have gotten even smaller and faster and, as someone else pointed out, the FPGA companies now find it prudent to splatter large numbers of very fast single-cycle multipliers around their parts just because they can (and becuase they know people will use them). I've no clue what they're doing there, but efficient single-cycle multipliers have been around for a long time in various flavors. I'm sure they're not all the same.

Eric Jacobsen Minister of Algorithms, Intel Corp. My opinions may not be Intel's opinions.

formatting link

- D
- David Brown
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 1:45 AM

Single-cycle multipliers in small microcontrollers are frequently 8x8, which is obviously much easier. The chip mentioned, the msp430, does

16x16, but it is not actually single-cycle (as far as I remember). The other big difference compared to expensive DSPs is the speed - it is a lot easier to do 16x16 in a single cycle at 8 MHz (the top speed of the current msp430's) than at a few hundred MHz (for expensive DSPs).

- J
- Jon Harris
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 8:08 AM

In addition to the single-cycle MAC, there is also all the structure needed to keep that MAC busy, i.e. dual busses with single-cycle access to memory.

True, a fast MAC is most useful when you are doing a bunch of them in a row, like for example in a FIR filter. But a fast multiply is pretty darn useful too, and often can be taken advantage of 1 or 2 at a time.

BTW, on a SHARC, the set-up is basically 3 cycles: set-up 2 pointers (typically one for data, one for coefs) and initialize your loop counter. You may need to pre-fetch the data too, so that could add another cycle, or could be built into the main loop. Not too bad, though. It doesn't take too long of a filter, matrix, vector op, etc. to start paying dividends. A idiosyncratic feature of the SHARC is that for fixed point, there is a true single-cycle MAC, whereas for floating point, you have a parallel multiply/add instruction, and you can't use the result of the multiply in the add. At first glance it seems quite restritive, but in practice it just means one more stage of pipelining before you can rip off those single-cycle FIRs. I'm guessing a single-cycle 40-bit floating-point MAC would have been the slowest instruction on the chip by a mile, and would have forced a much slower max clock rate.

I don't think it would save that much total real estate. I'm basing this on the fact that Analog Devices started including two complete multipliers and ALUs in all their new SHARCs. Or maybe with die shrinks, silicon area is not such a big deal? When you can take advantage of the second unit, you can get FIRs in 1/2 cycle per tap, or parallel operation on a second data stream "for free"--nice!

My guess is that going to a 2-stage multiplier would not allow you to get you anywhere near twice the clock frequency. On most DPSs, it's not just the MAC but every other instruction is also single cycle, so unless you can get them all to run 2x, you can't double the clock rate. The MAC is probably the slowest path, but I would guess there are a lot of other "close seconds". I've often wanted to see data from a DSP manufacturer on the speed of different instructions, just out of curiosity. It would also allow you to determine if you could reliably overclock a chip based on what instructions your application used.

- B
- Bevan Weiss
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 2:08 PM

It's much much harder than just copying a page out of a textbook. There's small optimizations that depend strongly on data distributions, etc etc. Even before the designer can begin laying out the multiplier, which is pretty much the hardest part, they have to work out whether it has the characteristics required.

As an example I recently designed a 4bit*4bit multiplier as a class project. It's much harder than many people realise to do, and it's complexity grows exponentially (in most cases) to the input bit width.

Sometimes it may be as simple as laying down a standard multiplier block (from one of many IP libraries around) however in most DSPs this will be the critical timing path for single cycle operation and so must be hand modified to produce acceptable path delays, then assessed under all conditions.

Certainly not a lookup table, that would indeed be simply copying from a book, and would also require (2^(2*N))*N/4 bytes of storage. For anything but small N this would be enormous, and not very efficient in terms of chip real estate.

As an aside, the other members of my class implemented their multipliers in a pipeline configuration, whilst I did mine in a completely parallel configuration (with ripple adder as high speed wasn't a design consideration). This means that others had 2/3/4 cycle latencies whilst mine was a single cycle. The trade-off is that the upper frequency of mine was more limited than was their's due to the increased path delays.

Getting single cycle high speed multipliers is a very challenging prospect, and one which much research is still ongoing.

You should have a go at making up a simple 3bit*3bit multiplier using single transistors on a PCB sometime.. it's quite similar to the layout flow used in IC design.

- T
- Tim Wescott
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 3:20 PM

The original question was for an under-$2 DSP chip capable of doing audio frequency stuff, including FFTs. I'm not the fellow who asked; it just sparked a tangential thought in my head about why there isn't some intermediate step on the way to a full-speed DSP.

So I couldn't tell you exactly how much speed and memory, given I don't know how often he was considering doing the FFTs, or how many points, etc.

Hey Jeorg! You listening? What did you need to do, exactly?

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

- R
- rhnlogic
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 6:50 PM

Well, the original question was about $2 DSP's, presumably for low cost products. So are there any FPGA's currently available for $2 or less which include any built-in single-cycle 18 x 18 multipliers, plus enough logic cells to replace the rest of the functionality of a single low cost DSP?

Thanks.

--
rhn A.T nicholson d.O.t C-o-M

- L
- langwadt
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Thu, Oct 20, 2005 8:50 PM

Tim Wescott skrev:

snip

I never have to buy stuff so I don't know anything about prices, but philips recently announced a couple of 70MHz ARM7TDMIs in the 2$ range, it's not DSPs but at 70MHz and one cycle per 8bits of

32*32->64bit multiply it'll do some dsp

-Lasse

- K
- Kolja Sulimma
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 21, 2005 2:02 PM

If you leave out FFT-based multipliers which are only benefitial for really large mutliplicands (a few hundred bits) you need to perform N^2 1-bit additions for a multiplication. These N^2 1-bit adders can be arranged in different ways called for example array multipliers or wallace tree multipliers to change the critical path and the layout. But we proved that they all can be transformed into each other just by swapping wires around.

formatting link

The only thing you can do about area is trading off area for delay. You can forexample reuse the same N adders over N clock cycles.

Pipelining does not save area. But it can increase performance a little. You can cut the critical path in half by adding N pipeline stages. (Beware that without pipelining while you perform N additions of N bits the length of the critical path is only O(N) for all multiplier architectures achieved by reordering)

The multiplier is a large part of the CPU core, probably the largest in a DSP, but the area of most processor chips including DSPs is dominated by caches and other memory like reordering buffers, shadow registers, TLA buffers, etc.

Kolja Sulimma

- J
- Jim Granville
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 21, 2005 6:01 PM

TI have just volume-released their 100MHz FLASH controllers, start at sub $5, so not quite a $2 target, but these have FLASH(not ROM) and include 12 bit 6Msps ADCs, a 150ps resolution PWM, and CAN bus

150ps PWM is a challenge even for FPGA ....

formatting link

The sub $2 Philips devices have quite low code sizes, but they could do some 'audio frequency stuff'...

-jg

- R
- Rich Grise
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 21, 2005 8:44 PM

I don't know how they do it this days, but I do know that with a whole shitpot load of adders, you could do it in n propagation delays, where n is the width of whichever operand you arrange to come in sideways. I almost drew a schematic. You have a set of adders as wide as operand "A", and its inputs are operand "A" and the "latest" partial product - and its outputs go to another bank of adders whose other inputs are either "A" again or 0, and so on - the other operand, "B", would be presented down the side of the array, deciding which partial products get added to and which don't. The LSB, of course, gets sent out as "product", and the carry is the MSB of the next partial product. They form a parallelogram.

I just fired up that Xilinx S/W to see what it's got in the way of symbols, and it already has a 16-bit adder. With 16 of them, and 256 AND gates, I could build a 16 x 16 multiplier that would have an answer in about 16 or 17 propagation delays. :-)

OK, it's gonna be a day or so. Please be gentle, it's my first time. :-)

Thanks! Rich

- K
- Kolja Sulimma
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Fri, Oct 21, 2005 9:45 PM

Actually, if you cannot do full custom circuit optimizations (e.g. because you do standard cell design or because you are using LUTs in an FPGA) swapping wires is the only possible structural optimization. All other multiplier transformations can be reduced to swaps.

An extremely nice property of swapping wires is, that it can be done after placement. This is such a huge advantage that we were able to beat sophisticated multiplier generators with a simple greedy algorithm when applying it after placement:

formatting link

Kolja Sulimma

- H
- Hal Murray
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Sat, Oct 22, 2005 12:31 AM

Wouldn't it go faster (log N) if you used a tree rather than a long skinny chain?

--
The suespammers.org mail server is located in California.  So are all my
other mailboxes.  Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer\'s.  I hate spam.

- R
- Radioman
  
  Contact options for registered users
Vote on answer
posted
18 years ago

Sat, Oct 22, 2005 8:06 AM

The DSP is a TMS57002, which up until now is not obsolete. It's a 24 bit fixed point DSP whichis sold in the US also used by Line6, Zoom and others. Behringer is using it on older designs but recently I saw much of the Motorla 56364 and very powerful Shark processors in their products. Also I heard they designed their own DSP which are used in the current stomp boxes.