Cortex-M3 vs PIC32 divide instruction

- Jon Kirwan

Tue, Sep 6, 2011 7:39 AM

I've finally been considering a project to use either a Cortex-M3 or a PIC32 processor and I've a technical question unrelated to any "business issues" between these options -- the divide instruction operation. Both of these cores include one but I'm interested in any remarkable technical details between them, including cycle counts but not limited to that (load-store time is fair game.)

From what I've been able to garner from skimming the docs, the Cortex-M3's MDU executes an SDIV or UDIV in anywhere from 2 to 12 clock cycles, with a comment suggesting that it takes less time when the operand sizes are similar. That doesn't tell me what the typical time may be. Also, it's been a bit of a pain searching for good assembler docs on the Cortex-M3. But I've only been at it for about an hour or so, so it's likely I am just slow and ignorant -- not that there aren't good caches out there I should have found.

On the PIC32, the docs are clearer. It's "one bit per clock" and it includes an "early detection" of sign/zero bits in the upper bytes to help goose that along where 7, 15, or 23 bits worth might be skipped. Worst case, it says, is 35 clocks. It also stalls the 5-stage pipe if another division is issued before the earlier one completes.
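As a rough illustration of that description, the timing could be modelled as below. This is only a sketch under stated assumptions: the names `div_skipped_bits` and `div_estimated_clocks` are hypothetical, the skip is assumed to depend only on how many upper bytes of the dividend are pure sign extension, and the 35-clock worst case is assumed to apply when nothing is skipped. The real core may differ in fixed overhead and in which operand is examined.

```c
#include <stdint.h>

/* Hypothetical cycle model for a one-bit-per-clock divider with early
 * detection. ASSUMPTIONS (not taken from a datasheet): the skip depends
 * only on the dividend's significant width, and the worst case of 35
 * clocks applies when no bits are skipped. */
enum { DIV_WORST_CASE_CLOCKS = 35 };

/* Iterations skipped: 23, 15, 7 or 0, depending on whether the dividend
 * fits in 8, 16 or 24 signed bits. */
int div_skipped_bits(int32_t dividend)
{
    if (dividend >= -128 && dividend <= 127)
        return 23;
    if (dividend >= -32768 && dividend <= 32767)
        return 15;
    if (dividend >= -8388608 && dividend <= 8388607)
        return 7;
    return 0;
}

/* Estimated clocks for one divide under the assumptions above. */
int div_estimated_clocks(int32_t dividend)
{
    return DIV_WORST_CASE_CLOCKS - div_skipped_bits(dividend);
}
```

Under this model a dividend that fits in one byte would cost 12 clocks and a full 32-bit dividend 35; measured figures on real silicon should take precedence over the model.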

I am wondering if anyone has had direct experience playing with either of these in the area of writing floating point libraries and has had a chance to compare their relative utility for that purpose and might comment on any relatively significant details related to that effort -- speed being the main question here.

At first blush, I'd say

- David Brown

Tue, Sep 6, 2011 7:54 AM

There are many tricks that can be employed with hardware division to make it faster in all or some cases - there is no good way to guess how they are implemented in these two CPUs. But there will not be any "hidden issues" - the division instructions on both architectures work, they are both slow, and the time varies depending on the operands in a way that is difficult to predict and virtually impossible to utilise. And in both cases, the timing of the divide instruction will be only a small part of a software floating point division routine - the variations between different toolchains' floating point routines will be much higher than the variation between run-times for divide on either processor.

I don't know what more you are looking for. If you want to divide unknown integers, use the CPU's divide instruction. If you want to divide by a known constant integer, let the compiler handle it - either it will use the hardware divide instruction, or it will do something fancier like multiplying by the reciprocal scaled by a power of two. Knowing the nasty details of the hardware division implementation will not change that.
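To make the "reciprocal scaled by a power of two" idea concrete, here is a minimal sketch for unsigned division by 10. The function name `div10` is only illustrative; the constant 0xCCCCCCCD is ceil(2^35 / 10), and a compiler may well choose a different constant or shift for other divisors.

```c
#include <stdint.h>

/* n / 10 for any 32-bit unsigned n, using a multiply and a shift instead
 * of a divide instruction. 0xCCCCCCCD is ceil(2^35 / 10); the full 64-bit
 * product is shifted right by 35 to recover the quotient. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```

Both cores have a 32x32 multiply with a 64-bit result, so this kind of sequence has a fixed cost, unlike the operand-dependent divide.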

If you want to do very fast floating point, get a processor that has hardware floating point (Cortex-M4 will be available soon, there are real MIPS CPUs available instead of PIC32, there are plenty of PPC-based microcontrollers with hardware floating point, etc.).

- Jon Kirwan

Tue, Sep 6, 2011 9:45 AM

I have other reasons that factor into this decision that preclude any other choice, right now. I'm not looking for the fastest FP, anyway. So that's not the primary goal here. I am curious about the details. That's all. And I'd like to make my _own_ judgment, not simply compare other people's FP packages that already exist. I'm looking at gaining a deep understanding of these two processors' approaches in the NARROW case of these particular instructions.

I do not need an education about "time varies" and "let the compiler handle it." You should know me well enough by now for that. I'm already prepared to examine flash, sram, and cache issues. I need to know the specific details here. Part of where I may be going is into things you may not think to consider, such as interrupt latency, for example, or simply for self-education about how the Cortex-M3 does it (I already _know_ how the PIC32 does it internally.) Don't presume too much about my purposes -- they are not run of the mill at the very least.

I simply need very detailed information. I've been having a little difficulty laying hands on it in the Cortex-M3 case. I'm hoping someone can point me well.

But thanks for the time. It is appreciated.

Jon

- Arlet Ottens

Tue, Sep 6, 2011 9:53 AM

In the ARM reference there's the following comment: "Division operations use early termination to minimize the number of cycles required based on the number of leading ones and zeroes in the input operands."

That looks similar to what the PIC32 does, but with more bits/cycle.

- David Brown

Tue, Sep 6, 2011 10:21 AM

Yes, I know that - that made it a particularly odd question from you.

When you ask for unusual information like this, the real purpose is important - otherwise I can only guess that it is /pure/ curiosity (and I can understand that as a reason, and wish I could help you there).

I would be surprised if you can get the detailed information you would like - such implementation details tend to be well hidden from mere mortals.

One thing you might be able to find out about is how the division affects pipelining - but on an M3, with its short pipeline, that won't make a big difference.

Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not interruptible (unlike some m68k CPUs, for example), so maximum interrupt latency will be affected by division instructions.

- Anders.Montonen

Tue, Sep 6, 2011 10:32 AM

You want the ARMv7-M Architecture Reference Manual off of ARM's website.

-a

- Jon Kirwan

Tue, Sep 6, 2011 11:17 AM

I think I have that for the assembly part of things. If you are referring to the near-end where the Appendices are at, then I'm already aware of those sections (B, C, F, G, H.) I did also look at the timing information in Chapter 18-1, for example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and r2p1. Though perhaps I haven't read it well enough.

I think I have been there. But I may have missed something, too, and I appreciate the suggestion.

Jon

- Jon Kirwan

Tue, Sep 6, 2011 11:30 AM

The purpose is due diligence and to illuminate speculations I may yet develop. It's not a crystal clear process that I can readily explain. But I do know _what_ I want to know.

If it helps, imagine that I'd like to develop a cycle-accurate simulator.

Appears to be hidden from me, tonight. So maybe you are right.

I _am_ able to garner better information from the M4k. I still need to find out if the DIV can be interrupted.

Yes, 3 stage vs 5 stage on the M4k. I also took note that Microchip licensed the M14k, too.

Yes, that is one of several considerations I have in mind. Only one of them. But an important one. I am not yet certain about the M4k on this point.

Anyway, thanks for the thoughts. I will see what I can find out there. It is an omen that you don't know. So that suggests your earlier point about the difficulty here may be correct.

Jon

- Jon Kirwan

Tue, Sep 6, 2011 12:14 PM

So far, I've found the phrase "Autonomous multiply/divide unit" in the datasheet for the 5xx, 6xx, and 7xx units from Microchip. Their dual bus choice also supports transaction aborts to improve interrupt latency. I already know that issuing another MDU instruction before an earlier divide has completed will result in an "IU pipeline stall." But this doesn't make it clear what happens if another MDU instruction is NOT issued in the interrupt routine, for example. It may be possible that the "autonomous" unit works in parallel, so long as no attempt is made to access the MDU until it is done. If so, that would be fine to learn.

I'll write Microchip on this point to get clarification. You may be right about all this. Might as well dot that i, cross that t.

BTW, I am also considering porting my own O/S to either the Cortex-M3 or the PIC32. But again, this is only one facet of what I'm thinking about. It is NOT the totality. But this question is germane here, too.

Jon

- Anders.Montonen

Tue, Sep 6, 2011 12:40 PM

Footnote e to table 18-1 in the Cortex-M3 r2p0 TRM states that "DIV is interruptible (abandoned/restarted), with worst case latency of one cycle."

-a

- Jon Kirwan

Tue, Sep 6, 2011 1:01 PM

Thanks!

Jon

- David Brown

Tue, Sep 6, 2011 1:31 PM

The M4K is an older architecture (or at least it is closer to the older MIPS architectures), with a simpler structure and lots more information about it. You'll get better luck there.

The key thing to look for here is the data that is stored on the stack, or in dedicated registers, when an interrupt or other exception hits. On the m68k, for example, the processor can generate a rather extensive stack frame including the state of internal registers that are not otherwise accessible, holding partial results for division, progress counters for move-multiple instructions, etc. On RISC architectures you don't get a stack frame for exceptions, but critical context data is put into dedicated registers that must be preserved if you are going to enable nested interrupts. You should be able to see from the details of these registers where things can be interrupted.

While I know many things, I don't know everything! My knowledge of MIPS is based on a book on the "MIPS RISC Microarchitecture" I found in a second hand bookstore 20 years ago, and read for fun before I had even thought of doing embedded programming as a job.

- David Brown

Tue, Sep 6, 2011 1:34 PM

OK, it is interruptible in that way - that's good for avoiding long interrupt latency. Some CPUs (such as some m68k devices) can be interrupted in the middle of an instruction like divide, and then continue where they left off rather than starting anew.

- Jon Kirwan

Tue, Sep 6, 2011 1:51 PM

ARM has been around a LONG time. But I worked on MIPS R2000 back circa 1986/1987. Was that before the ARM/Acorn? I don't recall when the R4000 came out but it must have been after the Acorn. I think trying to decide which is older is going to be a bunch of quibbling.

There's a point for me to go look up.

Mine all comes from working with the R2000 and a nice, long lecture for a couple of days from Hennessy when I visited them back when they first opened up an office near Weitek (their first office.) I'm very comfortable with the R2000.

Jon

- Arlet Ottens

Tue, Sep 6, 2011 2:31 PM

Interestingly, the Cortex isn't very pure RISC anymore, and it does have a stack frame for exceptions. It doesn't save partial results, but it does save a couple of registers, which allow an interrupt handler to be written in pure C, and it allows hardware nesting of interrupts. The link register which normally contains the return address is set to a magic value, so on function return, the core knows to do a return from exception instead.
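For reference, the stacked frame and the "magic" link register values can be sketched as follows. This assumes ARMv7-M without the floating point extension (as on the M3); the type name `exception_frame_t` and the helper `is_exc_return` are hypothetical names chosen for illustration, not part of any vendor header.

```c
#include <stdint.h>

/* Registers pushed by the core on exception entry (ARMv7-M, no FP
 * extension), listed from the lowest stack address upwards. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* address to resume at */
    uint32_t xpsr;  /* program status register */
} exception_frame_t;

/* On exception entry LR is loaded with an EXC_RETURN value rather than a
 * real return address. On the M3 the defined values are 0xFFFFFFF1
 * (return to handler mode), 0xFFFFFFF9 (thread mode, main stack) and
 * 0xFFFFFFFD (thread mode, process stack). */
int is_exc_return(uint32_t lr)
{
    return lr == 0xFFFFFFF1u || lr == 0xFFFFFFF9u || lr == 0xFFFFFFFDu;
}
```

Because the stacked registers are exactly the caller-saved ones in the ARM procedure call standard, an ordinary C function can serve as a handler, and its normal return with one of these LR values triggers the unstacking.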

- David Brown

Tue, Sep 6, 2011 2:56 PM

Yes, ARM has been around for ages - it was probably around 1988 that I first used an ARM (Acorn RISC Machine) on an Archimedes. But the architecture has gone through a great many changes since then - the Cortex M3 is significantly different both in programming model and in implementation. MIPS has remained a lot more constant. So the M3 is really only a few years old, while the R4000 is /much/ older, and much more studied.

- Tim Wescott

Tue, Sep 6, 2011 4:28 PM

Sometimes the goal is to write fast-enough floating point in a processor that won't otherwise break the system budget, be it power consumption/dissipation, size, BOM cost, etc.

Jon's asking about _writing_ a floating point library, so I assume he's working at a project front-end, counting clock cycles to make sure that things will work.

--
www.wescottdesign.com

- Jim Granville

Wed, Sep 7, 2011 11:42 AM

If the speed of this matters a lot, you are best to simply get a device, and try it. 'Modern data' tends to be more and more superficial, and that is one reason there are more cheap Eval/Starter kits. Note that other devices are not standing still either - I see both TI and ADI are now boasting of sub $2 DSPs (tho RAM based)

TI's strangely lacks Timer capture (they must want you to buy other variants there), but does have high speed USB for a small cost adder. ADI's has good timers, but no USB. Both, of course, have very fast maths support, and quite large ROMs with floating point as well. -jg

- dp

Wed, Sep 7, 2011 5:27 PM

Last (only...:) ) time I used a TI DSP was approx. 10 years ago, the 5420. Their divide was straightforward: use "subtract conditionally" in a repeat (penalty free) loop. I also have wondered - just vaguely, though - how they accelerate division on various architectures, e.g. the power core I use now needs only 14 (or was it 16?) cycles for a 32/32, while older implementations of that core (the original 603e, that is) needed 30+, 37 IIRC. I have been moaning so many times about having to write yet another division (I think the only architecture which saved me that was the 68k; ppc didn't, it does not have the 64/32 the 68k has, not on 32 bit machines) that I use the chance to ask Jon to share his findings. I am also really curious about it.
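The conditional-subtract loop mentioned above is, in generic form, plain restoring division that produces one quotient bit per iteration. The following is a portable C sketch of that technique, not the TI instruction sequence itself; the name `udiv32` is hypothetical and the caller is assumed to guarantee a non-zero divisor.

```c
#include <stdint.h>

/* Restoring (shift and conditionally subtract) division: one quotient bit
 * per loop iteration, 32 iterations for a 32/32 divide.
 * The caller must ensure divisor != 0. */
uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint64_t rem = 0;   /* 64 bits so the shifted partial remainder cannot overflow */
    uint32_t quot = 0;

    for (int i = 31; i >= 0; i--) {
        /* Bring down the next dividend bit, most significant first. */
        rem = (rem << 1) | ((dividend >> i) & 1u);
        quot <<= 1;
        /* Subtract only if the divisor fits; that sets this quotient bit. */
        if (rem >= divisor) {
            rem -= divisor;
            quot |= 1u;
        }
    }
    if (remainder)
        *remainder = (uint32_t)rem;
    return quot;
}
```

Hardware dividers that finish in fewer cycles typically retire more than one quotient bit per clock or skip leading iterations, which is the behaviour the thread is trying to pin down.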

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------

- Jon Kirwan

Wed, Sep 7, 2011 8:38 PM

Jim, there is a difference between knowing something through theory and knowing something only through experimental result. Although it is _practical_ and often _sufficient_ to know through result, it is also true that all I'd learn is the results for the specific cases I'm able to spend time testing. Theory informs a volume. Results inform specific points within that volume. I want both. Just buying a device only gives me a few data points. That's not enough.

In the case of the PIC32, I have the theory. So I am fully able to predict just about any situation I'm given. (Except that I still don't have the theory about what happens in the presence of an exception -- but I will get that from Microchip directly.)

Anyway, I know you are being practical. But I want to go beyond knowing only what a few tests may tell me.

Yes, but the designers _know_ the theory. So it is available somewhere. And I'm not really wanting to poke out experimental results and try and develop theories of my own that match what I observe when it might just be nice to get the low-down from someone who actually knows what is going on. Which is why I decided to just ask here. (The other option would be to write ARM, I suppose -- and I will do that if nothing comes of the details here and simply hope they are moved to respond to me. I _know_ Microchip will respond, from past experience with them.)

I am familiar with older families from both through coding applications -- the ADSP-21xx from ADI; the TMS320C30 and C40 from TI. I'm not completely unaware of newer parts, too.

But like most projects, there are a number of boundary conditions involved and the DIV details I mentioned are only one of many. But DSP processing is decidedly NOT the main focus nor is floating point. I merely mentioned FP as a segue, because I felt that anyone writing assembly coded FP would possibly know the theory I was looking for. That doesn't mean that is my focus. I also mentioned interrupt latency issues, later. There are many considerations.

Jon