Cortex-M3 vs PIC32 divide instruction

I am similarly curious and would like to know the theoretical details.

If I do uncover the details in the Cortex-M3 case, I'll write a little something about it here. It's possible that there are some university docs I'll find that clue me in. I might even get lucky and someone at ARM may respond kindly. It may be that someone here knows, too, but just hasn't said as much, yet. Chances are this isn't some deep dark secret. Just that I haven't yet come across it, is all. I am remarkably ignorant.

If you are interested in the details regarding the M4K (PIC32) method, then I can write a lot on that unremarkable topic. That one is easy. I could design the hardware myself almost in my sleep.

Jon

Reply to
Jon Kirwan

The basic division algorithm is a "subtract conditionally" loop, which runs at roughly one cycle per loop iteration, i.e., one cycle per quotient bit. You can double the speed by simply doing two bits at a time. That takes more hardware - you have to do three comparisons in parallel rather than just one, but it's not too expensive (multiplying by 0b00, 0b01, and 0b10 is essentially free, and multiplying by 0b11 is not hard). That trick does not scale well, however - doing another bit in each cycle means much more hardware, and the depth of the combinational logic involved will mean slower clock speeds.
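To make the shape of that loop concrete, here is a minimal C model of the radix-2 "subtract conditionally" scheme (my own sketch, not anyone's RTL); in hardware each loop iteration corresponds to roughly one clock:

    #include <stdint.h>

    /* Radix-2 restoring division: one quotient bit per iteration.  In
     * hardware each iteration is one shift, one compare, and one optional
     * subtract, i.e. roughly one clock per bit.  Behaviour for den == 0 is
     * left undefined, as it is on most integer dividers. */
    static uint32_t div_radix2(uint32_t num, uint32_t den, uint32_t *rem)
    {
        uint32_t q = 0, r = 0;

        for (int i = 31; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1);   /* bring in the next dividend bit */
            if (r >= den) {                    /* the "subtract conditionally" step */
                r -= den;
                q |= 1u << i;
            }
        }
        *rem = r;
        return q;
    }

The radix-4 version described above would compare r against den, 2*den, and 3*den in parallel each iteration and retire two quotient bits at a time.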

You can also save a clock cycle or two at the ends of the algorithm by careful setup - the cycles are usually still there in the latency, but get hidden within the rest of the instruction pipeline.

Early-exit testing can also be done - typically once the remaining part of the numerator has been reduced to 0 (or the partial remainder is less than the denominator), you can do a fast exit. Some implementations may also have tricks like barrel-shifting at the start to "cancel out" any factors of 2 in the operands.
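A software sketch of the leading-zero trick, assuming GCC/Clang's __builtin_clz (a core with a CLZ instruction does the same normalisation in one cycle); again, this only illustrates the idea, not any particular core's implementation:

    #include <stdint.h>

    /* Skip the quotient bits that are guaranteed to be zero by comparing
     * the leading-zero counts of the operands, then run the usual
     * subtract-conditionally loop over the bits that remain.  den must
     * be non-zero. */
    static uint32_t div_early_exit(uint32_t num, uint32_t den, uint32_t *rem)
    {
        if (num < den) {                       /* quotient is zero: fast exit */
            *rem = num;
            return 0;
        }

        int top = __builtin_clz(den) - __builtin_clz(num);  /* highest possible quotient bit */
        uint32_t q = 0;
        uint32_t r = (top == 31) ? 0 : (num >> (top + 1));  /* dividend bits already consumed */

        for (int i = top; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1);
            if (r >= den) {
                r -= den;
                q |= 1u << i;
            }
        }
        *rem = r;
        return q;
    }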

Beyond that, faster division is usually done by computing the reciprocal, then multiplying. That is particularly useful for large bit widths and floating point (e.g., hardware 64-bit floating point).

For integer work, it is usually best to leave that to the compiler's optimiser - for division by a constant, a compiler may turn "x/3" into "(x * (2^n / 3)) >> n" for a suitable n (with the reciprocal constant rounded appropriately). This generally only makes sense if the CPU can quickly multiply numbers of twice the bit length of x.
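Spelled out for unsigned 32-bit division by 3, the arithmetic works out like this (the particular constant and shift are just what this case needs; a given compiler may choose a different but equivalent pair):

    #include <stdint.h>

    /* "(x * (2^n / 3)) >> n" made concrete: 0xAAAAAAABu is ceil(2^33 / 3),
     * so the 64-bit product shifted right by 33 gives exactly x / 3 for
     * every 32-bit unsigned x.  This is the kind of sequence a compiler
     * emits when the CPU has a fast 32x32->64 multiply. */
    static uint32_t div3(uint32_t x)
    {
        return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
    }

    /* e.g. div3(100) == 33, div3(0xFFFFFFFFu) == 0x55555555 */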

Reply to
David Brown


Oh, I know what can be done, and Jon apparently does too. Like I said, I have written numerous division routines on various machines. I did not use "count leading zeroes" on the Power 64/32 division, though (I was tempted, but my time was a lot more important than the CPU's back then). Dropping out early on a 0 dividend (and/or shifting in advance by the smaller of the leading-zero counts) can reduce execution time in some (many) cases, but it does add cycles to the worst case, so I am not sure I would go for it anyway; it would be application specific, I guess, if it ever came down that close to the wire.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

formatting link

------------------------------------------------------

formatting link

Reply to
dp

Yup.

And David especially knows I have done so. We've talked about it. And he's aware of both my ignorance and skill. So I think he knows my boundaries.

One doc I saw gave "11 to 32" clock cycles for the division, but that was more of a marketing piece. Another, a datasheet, gives a maximum of 35 cycles. That one I believe, as it accounts for the leading-byte checks and result posting. I've looked a little bit at the 5-stage pipe docs and it's pretty fancy. It includes two separate result bypass paths to forward results to following instructions so they avoid waiting for the register posting to take place. Nice.

Interestingly, although the PIC32 achieves a fair pace (up to 80MHz), it isn't up to the MIPS synthesizable core, claimed to be capable of over 400MHz at 90nm and over 200MHz at 130nm.

Anyway, I'd love to delve into the details. Too bad the .v RTL modules aren't public. (Unless some kind soul can point me to them? ;)

Per other suggestions, I already plan to spend about $1000 or so getting various tools set up for the PIC32 (swapping in my older Pro Mate II tools, buying a REAL ICE and ICD3, and then some 'stuff' to update my toolset.) I already have the tools purchased for the midrange Energy Micro EFM32, which is a Cortex-M3, and have started testing development there. It was some work to get unlimited, free tools up and working -- thanks very much to CodeSourcery for that, by the way, as I am now in debt to them. Next week or two, I will start up on the PIC32, as well.

The PIC32's MDU operates "autonomously." It continues on with a division _and_ the following instructions, so long as an IU pipeline stall isn't triggered by the use of an MDU op. I am curious about how it behaves in the presence of interrupts. But there is a LOT OF DOC to read, yet, so I am behind on that score. In any case, from a cursory glance over the IEMAW pipeline, the divide doesn't seem to require a stall by itself. So I remain unsure.
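In C terms, that autonomy is what allows code like the following to hide the divide latency behind independent work. This is only a hedged illustration of the scheduling idea; whether a given compiler actually orders things this way is up to its optimiser, and den is assumed non-zero:

    #include <stdint.h>

    /* The divide is started early; the independent summation can run
     * while the MDU grinds away, and only the first use of the quotient
     * can stall the integer pipeline. */
    uint32_t scale_and_sum(uint32_t num, uint32_t den,
                           const uint32_t *buf, int n)
    {
        uint32_t q = num / den;        /* MDU starts working here */

        uint32_t sum = 0;              /* independent work overlaps the divide */
        for (int i = 0; i < n; i++)
            sum += buf[i];

        return sum + q;                /* first use of q: stalls only if the
                                          divide has not finished yet */
    }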

I haven't read up on the Cortex-M3 UDIV and SDIV. The doc seems spread over several different places, too. But the Cortex-M3 may not be autonomous. Looking at DDI0337G, page 1-12, Figure 1-2 seems to suggest it isn't autonomous and sits squarely in the "Ex" pathway. So I'd guess a Cortex-M3 must wait for it. Either way, it's always a two-edged sword, so I don't think one is necessarily better than the other.

Noting from that same figure, register write-backs must complete in Ex -- no need for the PIC32's bypass routes. But that is probably also part of the reason why the Cortex-M3 would clock slower on the same feature size/process.

I am curious about the Cortex-M3 UDIV and SDIV implementation in hardware. I'd love to see the details. But it is looking as though the CPU waits for it. So if you have a 12-cycle UDIV in progress, that figure seems to suggest to me that there will be a series of Ex pipeline stalls while the division completes and posts its results to registers.

Anyway, this is all on my 'off hours' for hobby work and will be some joy ahead.

Jon

Reply to
Jon Kirwan

Division not stalling the pipeline may be pretty rare; at least I have not seen it on the Power parts I use (the 5200B, a really nice one). There, division just takes up everything.

As a side note, I found out the hard way how deep the pipeline is. I needed a MAC - FMADD, as they have it, FP multiply and accumulate. Naively, I did it in a loop as with a DSP and expected to get the specified 2 cycles per 64*64 FMADD. Got > 10. Ouch, this was close to ruining the entire design effort. So I spent a day or two and eventually wrestled down the data dependencies that were causing this; it took using 24 of the 32 FP regs to do so, though. Well, as a side benefit I saved some loads (once I had 8 samples and 8 coefficients in the regs I did not have to waste them and load again). So eventually I got 5.5 ns (at 2.5 ns per cycle) doing the loop (this includes everything, perhaps loading from DDRAM to cache at times, etc.).
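The fix being described -- breaking the dependency chain so consecutive FMADDs don't wait on each other -- looks something like this in plain C (a sketch only; four accumulators is an arbitrary choice, and the real register allocation is up to the compiler or hand-written asm):

    /* Single accumulator: every multiply-add waits for the previous one,
     * so the loop runs at FMADD latency. */
    double fir_naive(const double *x, const double *c, int n)
    {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += x[i] * c[i];
        return acc;
    }

    /* Several independent accumulators: the chains don't depend on each
     * other, so the FPU can issue back-to-back and approach FMADD
     * throughput.  The tail loop handles n not divisible by 4. */
    double fir_pipelined(const double *x, const double *c, int n)
    {
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            a0 += x[i + 0] * c[i + 0];
            a1 += x[i + 1] * c[i + 1];
            a2 += x[i + 2] * c[i + 2];
            a3 += x[i + 3] * c[i + 3];
        }
        for (; i < n; i++)
            a0 += x[i] * c[i];
        return (a0 + a1) + (a2 + a3);
    }

(Note that the summation order changes slightly, which can matter for FP rounding.)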

Well, this went way off on a tangent. I won't delete it, though; not so many chances to talk shop :-).

Dimiter

Reply to
dp

Just as an aside, Bipolar Integrated Technologies (BIT) made some FP hardware that included a fully combinatorial FP division chip. Remarkable piece of work. I don't think ANYONE else has done it. Of course, I think they are long gone now. I worked with an engineer who was part of the design process, though.

Getting back to the MIPS M4K/PIC32, their "blurbs" all use the term "autonomous," and the detailed pipeline descriptions I've laid eyes on appear to support that. I would post the details here, but I'm sure it would bore most folks. It would run a few pages and cover a lot of detail about each of the IEMAW pipeline stages.

Ouch, as you say.

This implies a GHz processor of some kind, doesn't it? And here all I'm talking about is 30-80MHz clocks. I won't be getting any 2.5ns per cycle.

I love hand-crafting code to make the most of a processor that isn't on the bleeding edge of technology. Part, not all, of that includes meticulous attention to detail -- perhaps the kind of thing that appeals to those who build ships in bottles? I like it when some part of a project requires that kind of thing. It adds another fun dimension to a project that is otherwise bounded on all sides by worries and considerations that all must be weighed.

By the way, I've taken note of the barrel shifter in the Cortex-M3. There appears to be a 'lane changer' in the A stage of the M4K. But I am not seeing a barrel shifter. So more to look at and compare, I guess.

Jon

Reply to
Jon Kirwan

Well, the 5200B is a 400MHz part; it would be nice to have it in the GHz range, but I don't see that coming. (It has an unbelievably convenient/flexible/you-name-it DMA engine which Freescale have abandoned, probably because it was perceived as "too complex" for "most" users by the top brass or something.)

Not sure how either of these (lane changer/barrel shifter) is designed; I never designed any. But MIPS surely does have single-cycle, barrel-like shifts? Probably something like rlwnm and rlwimi on Power (rotate left then AND with mask, rotate left then insert under mask control). I have never used MIPS, but it looks much closer to what Power is. The greatest ARM disadvantage I see is that it has only 16 registers, which can be rather limiting on a load/store machine.
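For anyone who hasn't met those Power instructions, here is a rough C model of the two operations named above. It deliberately ignores PowerPC's bit numbering and mask encoding; the point is just what a barrel shifter plus mask logic buys you in a single operation:

    #include <stdint.h>

    static inline uint32_t rotl32(uint32_t v, unsigned n)
    {
        n &= 31;
        return n ? (v << n) | (v >> (32 - n)) : v;
    }

    /* "rotate left then AND with mask" (rlwnm-like) */
    static inline uint32_t rot_and(uint32_t src, unsigned sh, uint32_t mask)
    {
        return rotl32(src, sh) & mask;
    }

    /* "rotate left then insert under mask control" (rlwimi-like):
     * rotated source bits replace the destination bits selected by mask.
     * e.g. rot_insert(dst, src, 8, 0x0000FF00) places the low byte of
     * src into bits 15..8 of dst. */
    static inline uint32_t rot_insert(uint32_t dst, uint32_t src,
                                      unsigned sh, uint32_t mask)
    {
        return (rotl32(src, sh) & mask) | (dst & ~mask);
    }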

Dimiter

Reply to
dp

Yeah, that 2.5ns figure. I just meant a lot closer to GHz than I will likely see soon.

Interesting you bring up DMA. I was using the SiLabs part to gain access to an internal, 1MHz, 16-bit SAR ADC (which, if external, would have cost me as much as the chip or more), and it required DMA (given that this is an 8051-style CPU, that won't surprise anyone). Turns out the documentation on the DMA section is poor, and if you really want to know exactly what you are doing when you copy someone else's supposedly working code, the doc is not adequate to the task. So I sent off my questions, worked with the local disty, they pressed, I pressed, we all pressed. But SiLabs (US, anyway) didn't know. Thing is, they had to track down the DMA section's designer, who apparently is now offshore; I think in Singapore or something. Took them months. He'd designed it 8 years before that time. But I got my answer, at least. Took maybe three months to do?

Thing is, no one else had ever asked the questions I asked. Yet they were the kinds of "off by 1" questions that I think anyone doing ANYTHING OTHER THAN USING BOILERPLATE CODE would have asked. So it is clear to me that those using the chip either used some canned library or else just copied and pasted sample code. I must have been the only person actually caring about understanding how to write my own stuff for the DMA. Otherwise the answer would have been much easier to get.

There's a point. I need to think about all this in the context of a few specific algorithms to see how it all pans out. I am enjoying this.

Jon

Reply to
Jon Kirwan

Unless you're using "fully combinatorial" in a way I'm not expecting, that seems improbable, unless the FP numbers were very short.

It's certainly theoretically possible, but the number of terms that you'd get would be astronomical. Assuming 24-bit mantissas and (incorrectly) assuming you didn't need to look at anything else, you'd be looking at 48 input bits affecting each output bit, and probably on the order of 2**44 to 2**45 product terms (each one an AND gate averaging about 45 inputs) for each of the output bits.

Now if they implemented a full hardware divider, that's certainly possible, although if they went through that much trouble I hope they'd pipeline it for throughput.

Reply to
Robert Wessel

I'm dredging my memory, and I'm glad you questioned this. I'm now thinking I must have been wrong about the combinatorial nature -- it almost must have been some kind of internal free-running oscillator detail that escaped my notice then, and my memory converted that into 'fully combinatorial' when it shouldn't have. I recall that one "set inputs, waited out a path delay, and got the result." It's possible that I interpreted that poorly through the fog of memory and that there was some kind of free-running internal clock mechanism. If so, I would not have been savvy enough at the time to have asked that question, and today I have lost the other details that might have corrected my memory.

(These were part of a set of ECL FP chips from around the late 1980s up through about 1990 or so. And very fast.)

I am with you on this. You make a good point.

I now also remember something about a triple-diffused poly-Si bipolar process with some kind of special "self-aligning" nature that allowed for very small feature sizes at the time. It was impressive for its time, though. Very much so.

Now I'm mad at myself for not verifying my memory before writing. Your point is good. I'm very curious about exactly what they did do and didn't do. So I'm going to pay the price and look right now and see if there are any details on the web....

Hmm...

Okay, there is US Patent 5153848 (and others, too). That one is titled "Floating Point Processor with Internal Free-Running Clock," and the Abstract says, "The multiplier and divider are pipelined internally, driven by a fast, two-phase internal clock that is transparent to the user."

So I think I know where my confusion came from.

Thanks, Robert.

Jon

Reply to
Jon Kirwan

I couldn't find any information in either the PIC32 docs or the MIPS architecture docs, but I'm leaning towards the MDU ignoring interrupts altogether. If the ISR uses the MDU, then the pipeline will stall until the in-flight MDU operation completes.

That is my understanding. You could try asking for information regarding the division implementation on ARM's tech forum[1]. There's a lot of noise, but also some interesting posts there.

-a

[1]
Reply to
Anders.Montonen

Replying to myself, but this is apparently how the MDU worked in pre-MIPS32/64 days; nowadays it's a bit smarter. See pp. 108-109 in See MIPS Run, 2nd ed. (It seems to be easy enough to find naughty PDF versions if you don't own a paper copy.)

-a

Reply to
Anders.Montonen

I was much luckier with the SDMA on the 5200B. I got in touch with the guy who had architected it; he was as helpful as he could possibly be and sent me some more data in addition to what was to be found on the web. It took me a few days to grasp it all and begin using it, but once I did it was really useful. The final thing I made for it was the Ethernet engine; the on-board Ethernet controller is only FIFO-ed both ways, and the rest must be done by the SDMA. It took me only a day or so to implement the queue of buffers with pointers to packets, etc. (obviously it took me a lot longer to integrate Ethernet completely, but that was CPU code).
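For readers who haven't built one, a queue like that usually looks something like the following in C. This is only a hedged sketch of the general descriptor-ring idea; the field names and ownership convention are mine, not the 5200B's actual SDMA descriptor layout.

    #include <stdint.h>
    #include <stddef.h>

    #define RING_SIZE  16
    #define BUF_SIZE   1536          /* room for one Ethernet frame */

    struct dma_desc {
        volatile uint32_t flags;     /* bit 0: owned by DMA when set */
        volatile uint16_t length;    /* bytes in this buffer         */
        uint8_t          *buf;       /* pointer to the packet data   */
    };

    #define DESC_OWNED_BY_DMA  0x1u

    static uint8_t         rx_bufs[RING_SIZE][BUF_SIZE];
    static struct dma_desc rx_ring[RING_SIZE];
    static unsigned        rx_next;  /* next descriptor the CPU will look at */

    static void rx_ring_init(void)
    {
        for (unsigned i = 0; i < RING_SIZE; i++) {
            rx_ring[i].buf    = rx_bufs[i];
            rx_ring[i].length = 0;
            rx_ring[i].flags  = DESC_OWNED_BY_DMA;   /* hand it to the DMA */
        }
        rx_next = 0;
    }

    /* CPU side: harvest any completed buffers, process, then recycle them. */
    static void rx_poll(void (*handle)(const uint8_t *pkt, size_t len))
    {
        while (!(rx_ring[rx_next].flags & DESC_OWNED_BY_DMA)) {
            handle(rx_ring[rx_next].buf, rx_ring[rx_next].length);
            rx_ring[rx_next].flags = DESC_OWNED_BY_DMA;  /* give it back */
            rx_next = (rx_next + 1) % RING_SIZE;
        }
    }

Transmit works the same way in the other direction: the CPU fills a buffer, flips the ownership bit, and kicks the DMA.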

Me too.

Dimiter

Reply to
dp

Well, it wasn't all that easy to find. Lots of Chinese-language versions floating about. But my Chinese vocabulary is about 30 characters or so, so I'm very limited. I can count, draw "man," "big," and "too much," and things like "mouth," "door," and "window." Then I start running out.

I did find this as the only 'good' version easily findable:

formatting link

2007, apparently, for the book. Some years, now.

I read the relevant material. It mostly addresses older designs. For example, when it brings up the newer MIPS32 architecture at the top of page 109, it says "the instructions behave themselves," which is not entirely descriptive for me. Then it goes on to talk about "older CPUs" for the entire rest of that paragraph, the next one, and the next one, before going on to another section. It never does talk, in detail, about the very architecture I want to know about. Microchip and the PIC32 aren't even mentioned. The R4000 is, but the M4K is mentioned just once near the top in a long list of names. On page 38 they say that the multiply takes 4-12 clocks, so I know we are on "different pages" already. It's an old book, by now.

But I love it, too!!! Thanks.

By the way, it says at the bottom of page 38, "Integer multiply and divide operations never produce an exception; not even divide by zero..." Maybe I can take it that the PIC32 follows this guideline? Also, I read in this book that pipeline exceptions are delayed (allowed to flush through the pipeline) before being acted upon. But since, per the earlier note, a DIV cannot cause a divide-by-zero error, divides never generate an exception. In fact, since the MDU is "autonomous," it probably would be very hard for it to "insert" any kind of exception into that pipeline flow anyway. So that makes sense, too.
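The practical consequence, if that guideline does carry over to the PIC32, is that any divide-by-zero handling has to happen in software. A trivial hedged sketch (some toolchains insert their own check or trap sequence, so this is only to make the point concrete):

    #include <stdint.h>

    /* If the hardware never traps on divide-by-zero, the result of n/0 is
     * simply unpredictable, so guard it explicitly where it matters.
     * (INT32_MIN / -1 is another case worth guarding in signed code.) */
    static int32_t safe_div(int32_t num, int32_t den, int32_t fallback)
    {
        if (den == 0)
            return fallback;
        return num / den;
    }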

Elsewhere, they do talk about floating-point coprocessor operations taking multiple clocks. I can probably use their handling of that, by analogy, to help me understand what would happen in the MDU case. Here, it says on page 52 that "FP computations are allowed to proceed in parallel" with the execution of later instructions, and the CPU is stalled if an instruction reads a result register before the computation finishes. Which suggests the answer to a question I'd posed earlier as a possible difference between the Cortex-M3 and the M4K -- that the Cortex-M3 stalls until complete but the M4K does NOT stall. But it also suggests that even in the face of an interrupt event, the division continues and doesn't cause a stall unless there is a need for it in the new instruction stream.

Thanks for the book suggestion. It supplements what I already know about the old R2000 TLBs, too. This is bringing back lots of memories, now.

Jon

Reply to
Jon Kirwan

It is more concerned with the architecture (i.e., MIPS32/64) than any particular implementation. The first edition touched more on different cores, as the system-level architecture wasn't standardised yet, but that's at least supposed to be fixed by now.

Well, the MIPS32 architecture manual states that they don't, so I would assume so.

That was my conclusion as well.

I think it's a good computer architecture book, with a more practical bent than the usual textbooks. I actually like the 2002 first edition better, but the second one is definitely more relevant to modern CPUs.

-a

Reply to
Anders.Montonen

I'll look for the earlier one, as well, then. Knowledge like this only very gradually fades away, if at all. I also like paper, especially for something like this where I just lay out on the floor to read, so I will likely see about getting genuine editions.

Jon

Reply to
Jon Kirwan

That's more a flash artifact than a process one.

Flash-based uCs seem to have been rather stuck, for the last half decade, at speeds in the 80/100/120MHz region, and only the RAM-based ones get into the hundreds of MHz (like the sub-$2 DSPs I mentioned above).

Even if the flash limits the CPU speed, one of my peeves is that very few parts allow the peripherals to run at the silicon process speed, instead forcing the peripheral clock to be a fraction of the core clock.

Reply to
Jim Granville

Reply to
Mark Borgerson


Perhaps, but that ceiling is rather above the flash-speed limit I was mentioning, and it clearly is not too much of an actual problem, as SOME uC vendors can manage peripheral clocks faster than core speeds. It is a slow trend that I'd like to see become faster...

-jg

Reply to
Jim Granville

I'm aware. A wide flash bus is often used these days to compensate (with a little bit of RAM to hold at least one line of it).

It's true for the MSP430's new FRAM devices as well. A wide FRAM bus is used with a little bit of cache RAM to help reads. And on any day, FRAM writes are much faster than flash writes.

I didn't know which ones from TI and ADI you were referring to. I had done a quick check on Digikey: selected DSPs, selected ADI and TI, sorted on price, and filtered for parts they actually have one or two of in stock. But nothing under $2, or close to it, showed up on the first sorted page.

Flash read cycle times may be slow (and writes so slow it needn't be mentioned), but anything external to the device will also be slower than the CPU core can achieve. Within the chip, outputs have known loads, and the transmission gates and inverters can be sized exactly for that known situation. Anything leaving the chip must go through oversized drivers which drive unknown loads and must therefore be sized for the worst design case. That must also include the wire bonds, chip carrier, and leads, as well as whatever an end user might add to them. Trace widths must also be wide enough, at least in microcontrollers whose pins these days must often handle tens of milliamps, to sustain those currents and survive metal migration for some given lifetime.

I have a hard time imagining that drivers/inputs facing unknown external loads can ever (on the same die) achieve the speeds that internal signals can, with their known-in-advance loads used to size them.

But a closer reading of your comment might be that you mean to talk about cases where the processor itself is built on a fast, clocked process, let's say 400MHz given worst-case pathways and pipelining limits (M4K at 90nm), but where the flash and cache (and, I suppose, also the necessity these days, for marketing purposes, that a microcontroller include bullet-proof, class-A crystal drivers rather than specify some more complex high-speed design) may limit the useful speed to something less, say 80MHz. (Though I've read someone say they've clocked the PIC32 at 120MHz; they just had to set wait states for the flash.) That in this case, say, a sampling ADC of the SAR style might still be clocked at 400MHz to process a captured sample at whatever resolution, without regard to the CPU clock rate? Is that it?

Jon

Reply to
Jon Kirwan
