Cortex-M3 vs PIC32 divide instruction

I've finally been considering a project to use either a
Cortex-M3 or a PIC32 processor and I've a technical question
unrelated to any "business issues" between these options --
the divide instruction operation.  Both of these cores
include one but I'm interested in any remarkable technical
details between them, including cycle counts but not limited
to that (load-store time is fair game.)

From what I've been able to garner from skimming the docs,
the Cortex-M3's MDU executes an SDIV or UDIV in anywhere from
2 to 12 clock cycles, but with a comment suggesting that it
takes less time when the operand sizes are similar.  Which
doesn't tell me what the typical time may be.  Also, it's
been a bit of a pain searching for good assembler docs on the
Cortex-M3.  But I've only been at it for about an hour or so,
so it's likely I am just slow and ignorant -- not that there
aren't good caches out there I should have found.

On the PIC32, the docs are clearer.  It's "one bit per clock"
and it includes an "early detection" of sign/zero bits in the
upper bytes to help goose that along where 7, 15, or 23 bits
worth might be skipped.  Worst case, it says, is 35 clocks.
It also stalls the 5-stage pipe if another division is issued
before the earlier one completes.
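
A rough way to picture the "one bit per clock" with "early detection" description is a small cost model. The sketch below is not taken from Microchip or MIPS documentation. It assumes the 35-clock worst case quoted above, assumes the skipped iterations (7, 15, or 23) come straight off that figure, and assumes it is the dividend whose upper sign/zero bytes are examined. The function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes needed to hold x in two's complement (1..4). */
static int significant_bytes(int32_t x)
{
    if (x >= -128 && x <= 127)         return 1;
    if (x >= -32768 && x <= 32767)     return 2;
    if (x >= -8388608 && x <= 8388607) return 3;
    return 4;
}

/* Hypothetical estimate: 35-clock worst case minus skipped iterations. */
int pic32_div_cycles_estimate(int32_t dividend)
{
    /* index = significant bytes; 23, 15 or 7 iterations skipped (assumed) */
    static const int skipped[5] = { 0, 23, 15, 7, 0 };
    return 35 - skipped[significant_bytes(dividend)];
}
```

Under these assumptions a dividend that fits in one byte would cost about 12 clocks, and a full 32-bit dividend the full 35.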

I am wondering if anyone has had direct experience playing
with either of these in the area of writing floating point
libraries and has had a chance to compare their relative
utility for that purpose and might comment on any relatively
significant details related to that effort -- speed being the
main question here.

At first blush, I'd say at most 12 clocks is better than at most 35.  But
there may be other issues.  And while I already know how the PIC32
approach must be done internally, I'm curious about exactly what method
is used in the Cortex-M3 approach for its division operation -- it's
not clear to me.
(VHDL or Verilog code would make that very clear to me, if
anyone has it or a pseudo version of it.)

Jon

Re: Cortex-M3 vs PIC32 divide instruction

There are many tricks that can be employed with hardware division to
make it faster in all or some cases - there is no good way to guess how
they are implemented in these two cpu's.  But there will not be any
"hidden issues" - the division instructions on both architectures work,
they are both slow, and the time varies depending on the operands in a
way that is difficult to predict and virtually impossible to utilise.
And in both cases, the timing of the divide instruction will be only a
small part of a software floating point division routine - the
variations between different toolchains' floating point routines will be
much higher than the variation between run-times for divide on either
processor.

I don't know what more you are looking for.  If you want to divide
unknown integers, use the cpu's divide instruction.  If you want to
divide by a known constant integer, let the compiler handle it - either
it will use the hardware divide instruction, or it will do something
fancier like multiplying by the reciprocal scaled by a power of two.
Knowing the nasty details of the hardware division implementation will
not change that.
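
As an illustration of the "multiplying by the reciprocal scaled by a power of two" trick, here is a minimal sketch for unsigned division by 10. The constant 0xCCCCCCCD is ceil(2^35 / 10). The function name is made up, and what a given compiler actually emits may differ.

```c
#include <stdint.h>

/* n / 10 without a divide: multiply by ceil(2^35 / 10), then shift right by 35. */
uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```

The cost is one widening 32x32 -> 64 multiply plus a shift, which is typically much cheaper than an iterative hardware divide.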

If you want to do very fast floating point, get a processor that has
hardware floating point (Cortex-M4 will be available soon, there are
real MIPS cpu's available instead of PIC32, there are plenty of
PPC-based microcontrollers with hardware floating point, etc.).


Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 06 Sep 2011 09:54:00 +0200, David Brown

I have other reasons that factor into this decision that
preclude any other choice, right now.  I'm not looking for
the fastest FP, anyway.  So that's not the primary goal here.
I am curious about the details.  That's all.  And I'd like to
make my _own_ judgment, not simply compare other peoples' FP
packages that already exist.  I'm looking at gaining a deep
understanding of these two processors' approaches in the
NARROW case of these particular instructions.

I do not need an education about "time varies" and "let the
compiler handle it."  You should know me well enough by now
for that.  I'm already prepared to examine flash, sram, and
cache issues.  I need to know the specific details here. Part
of where I may be going is into things you may not think to
consider, such as interrupt latency, for example, or simply
for self-education about how the Cortex-M3 does it (I already
_know_ how the PIC32 does it internally.)  Don't presume too
much about my purposes -- they are not run of the mill at the
very least.

I simply need very detailed information.  I've been having a
little difficulty laying hands on it in the Cortex-M3 case.
I'm hoping someone can point me well.

But thanks for the time.  It is appreciated.

Jon

Re: Cortex-M3 vs PIC32 divide instruction

Yes, I know that - that made it a particularly odd question from you.

When you ask for unusual information like this, the real purpose is
important - otherwise I can only guess that it is /pure/ curiosity (and
I can understand that as a reason, and wish I could help you there).

I would be surprised if you can get the detailed information you would
like - such implementation details tend to be well hidden from mere mortals.

One thing you might be able to find out about is how the division
affects pipelining - but on an M3, with its short pipeline, that won't
make a big difference.

Regarding interrupts, AFAIK instructions on the M3 (and MIPS) are not
interruptible (unlike some m68k cpus, for example), so maximum interrupt
latency will be affected by division instructions.


Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 06 Sep 2011 12:21:08 +0200, David Brown

The purpose is due diligence and to illuminate speculations I
may yet develop.  It's not a crystal clear process that I can
readily explain.  But I do know _what_ I want to know.

If it helps, imagine that I'd like to develop a cycle-
accurate simulator.

Appears to be hidden from me, tonight.  So maybe you are
right.

I _am_ able to garner better information from the M4k.  I
still need to find out if the DIV can be interrupted.

Yes, 3 stage vs 5 stage on the M4k.  I also took note that
Microchip licensed the M14k, too.

Yes, that is one of several considerations I have in mind.
Only one of them.  But an important one.  I am not yet
certain about the M4k on this point.

Anyway, thanks for the thoughts.  I will see what I can find
out there.  It is an omen that you don't know.  So that
suggests your earlier point about the difficulty here may be
correct.

Jon

Re: Cortex-M3 vs PIC32 divide instruction

Footnote e to table 18-1 in the Cortex-M3 r2p0 TRM states that
"DIV is interruptible (abandoned/restarted), with worst case latency of
one cycle."

-a

Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 6 Sep 2011 12:40:00 +0000 (UTC),

Thanks!

Jon

Re: Cortex-M3 vs PIC32 divide instruction

OK, it is interruptible in that way - that's good for avoiding long
interrupt latency.  Some cpu's (such as some m68k devices) can be
interrupted in the middle of an instruction like divide, and then
continue where they left off rather than starting anew.



Re: Cortex-M3 vs PIC32 divide instruction

The M4K is an older architecture (or at least it is closer to the older
MIPS architectures), with a simpler structure and lots more information
about it.  You'll get better luck there.

The key thing to look for here is the data that is stored on the stack,
or in dedicated registers, when an interrupt or other exception hits.
On the m68k, for example, the processor can generate a rather extensive
stack frame including the state of internal registers that are not
otherwise accessible, holding partial results for division, progress
counters for move-multiple instructions, etc.  On RISC architectures you
don't get a stack frame for exceptions, but critical context data is put
into dedicated registers that must be preserved if you are going to
enable nested interrupts.  You should be able to see from the details of
these registers where things can be interrupted.

While I know many things, I don't know everything!  My knowledge of MIPS
is based on a book on the "MIPS RISC Microarchitecture" I found in a
second hand bookstore 20 years ago, and read for fun before I had even
thought of doing embedded programming as a job.


Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 06 Sep 2011 15:31:58 +0200, David Brown

ARM has been around a LONG time.  But I worked on MIPS R2000
back circa 1986/1987.  Was that before the ARM/Acorn?  I
don't recall when the R4000 came out but it must have been
after the Acorn.  I think trying to decide which is older is
going to be a bunch of quibbling.

There's a point for me to go look up.

Mine all comes from working with the R2000 and a nice, long
lecture for a couple of days from Hennessy when I visited
them back when they first opened up an office near Weitek
(their first office.)  I'm very comfortable with the R2000.

Jon

Re: Cortex-M3 vs PIC32 divide instruction

Yes, ARM has been around for ages - it was probably around 1988 that I
first used an ARM (Acorn RISC Machine) on an Archimedes.  But the
architecture has gone through a great many changes since then - the
Cortex M3 is significantly different both in programming model and in
implementation.  MIPS has remained a lot more constant.  So the M3 is
really only a few years old, while the R4000 is /much/ older, and much
more studied.


Re: Cortex-M3 vs PIC32 divide instruction

Interestingly, the Cortex isn't very pure RISC anymore, and it does have
a stack frame for exceptions. It doesn't save partial results, but it
does save a couple of registers, which allow an interrupt handler to be
written in pure C, and it allows hardware nesting of interrupts. The
link register which normally contains the return address is set to a
magic value, so on function return, the core knows to do a return from
exception instead.
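
To make the "interrupt handler in pure C" point concrete, here is a minimal sketch. The eight-word layout (r0-r3, r12, lr, pc, xPSR) is the basic ARMv7-M exception frame. The struct name and the counter are invented for illustration, and SysTick_Handler assumes a CMSIS-style vector table that refers to that name.

```c
#include <stdint.h>

/* Words pushed by hardware on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exception_frame_t;   /* hypothetical name */

volatile uint32_t tick_count;   /* hypothetical counter */

/* An ordinary C function: no assembly prologue or epilogue is required,
 * since the hardware has already saved the caller-saved registers. */
void SysTick_Handler(void)
{
    tick_count++;
}
```

The registers in the frame are exactly the ones the C calling convention lets a callee clobber, which is why a plain function can serve as the handler.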

Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 06 Sep 2011 12:21:08 +0200, David Brown

So far, I've found the phrase "Autonomous multiply/divide
unit" in the datasheet for the 5xx, 6xx, and 7xx units from
Microchip.  Their dual bus choice also supports transaction
aborts to improve interrupt latency.  I already know that
issuing another MDU instruction before an earlier divide has
completed will result in an "IU pipeline stall."  But this
doesn't make it clear what happens if another MDU instruction
is NOT issued in the interrupt routine, for example.  It may
be possible that the "autonomous" unit works in parallel, so
long as no attempt is made to access the MDU until it is
done.  If so, that would be fine to learn.

I'll write Microchip on this point to get clarification.  You
may be right about all this.  Might as well dot that i, cross
that t.

BTW, I am also considering porting my own O/S to either the
Cortex-M3 or the PIC32.  But again, this is only one facet of
what I'm thinking about.  It is NOT the totality.  But this
question is germane here, too.

Jon

Re: Cortex-M3 vs PIC32 divide instruction

Sometimes the goal is to write fast-enough floating point in a processor
that won't otherwise break the system budget, be it power consumption/
dissipation, size, BOM cost, etc.

Jon's asking about _writing_ a floating point library, so I assume he's
working at a project front-end, counting clock cycles to make sure that
things will work.

--
www.wescottdesign.com

Re: Cortex-M3 vs PIC32 divide instruction

In the ARM reference there's the following comment: "Division operations
use early termination to minimize the number of cycles required based on
the number of leading ones and zeroes in the input operands."

That looks similar to what the PIC32 does, but with more bits/cycle.
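
One way to see why leading ones and zeroes matter: the quotient can never have more significant bits than the difference in operand lengths plus one, so a divider that inspects the operands first can skip the iterations that would only produce leading zero quotient bits. The sketch below computes that bound for unsigned operands. It only illustrates the arithmetic and is not ARM's actual circuit; the function names are made up.

```c
#include <stdint.h>

/* Number of significant bits in x (0 for x == 0). */
static int bit_length(uint32_t x)
{
    int n = 0;
    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Upper bound on the significant bits of n / d, assuming d != 0. */
int max_quotient_bits(uint32_t n, uint32_t d)
{
    int diff = bit_length(n) - bit_length(d);
    return diff < 0 ? 0 : diff + 1;
}
```

This would also fit the remark that the divide takes less time when the operand sizes are similar: in that case the bound is only a bit or two.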

Re: Cortex-M3 vs PIC32 divide instruction

You want the ARMv7-M Architecture Reference Manual off of ARM's website.

-a

Re: Cortex-M3 vs PIC32 divide instruction
On Tue, 6 Sep 2011 10:32:53 +0000 (UTC),

I think I have that for the assembly part of things.  If you
are referring to the near-end where the Appendices are at,
then I'm already aware of those sections (B, C, F, G, H.)  I
did also look at the timing information in Chapter 18-1, for
example, of DDI0337 on the Cortex-M3 for r1p1, r2p0, and
r2p1.  Though perhaps I haven't read it well enough.

I think I have been there.  But I may have missed something,
too, and I appreciate the suggestion.

Jon

Re: Cortex-M3 vs PIC32 divide instruction

 If the speed of this matters a lot, you are best to simply get a
device, and try it.
 'Modern data' tends to be more and more superficial, and that is one
reason there are more cheap Eval/Starter kits.
 Note that other  devices are not standing still either - I see both
TI and ADI are now boasting of sub $2 DSPs (tho RAM based)

 TI's strangely lacks Timer capture, (they must want you to buy other
variants there) but does have high speed USB for a small cost adder.
 ADIs has good timers, but no USB.
 Both, of course, have very fast maths support, and quite large ROMS
with Floating point as well.
 -jg

Re: Cortex-M3 vs PIC32 divide instruction

Last (only...:) ) time I used a TI DSP was approx. 10 years ago,
the 5420. Their divide was straightforward: use "subtract
conditionally" in a repeat (penalty free) loop.
I also have wondered - just vaguely, though - how they accelerate
division on various architectures, e.g. the power core I use now
needs only 14 (or was it 16?) cycles for a 32/32, while older
implementations of that core (the original 603e, that is) needed
30+, 37 IIRC.
I have been moaning so many times about having to write yet another
division (I think the only architecture which saved me that was the
68k; ppc didn't, as it does not have the 64/32 that the 68k has, not
on 32 bit machines) that I use the chance to ask Jon to share his
findings. I am also really curious about it.
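
For reference, a "subtract conditionally" loop corresponds to classic restoring division, producing one quotient bit per iteration. A minimal C sketch of that idea follows. It is not the TI instruction sequence, the function name is made up, and a zero divisor is assumed to be rejected by the caller.

```c
#include <stdint.h>

/* Unsigned 32/32 restoring division, one quotient bit per iteration.
 * Precondition (assumed): d != 0. */
uint32_t udiv_restoring(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint64_t r = 0;   /* partial remainder; 64 bits so the shift cannot overflow */
    uint32_t q = 0;

    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next dividend bit */
        if (r >= d) {                     /* conditional subtract */
            r -= d;
            q |= (uint32_t)1 << i;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

The loop always runs 32 times, which is the baseline that early-termination schemes try to shorten.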

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments

http://www.tgi-sci.com
------------------------------------------------------
http://www.flickr.com/photos/didi_tgi/sets/72157600228621276/

Re: Cortex-M3 vs PIC32 divide instruction

I am similarly curious and would like to know the theoretical
details.

If I do uncover the details in the Cortex-M3 case, I'll write
a little something about it here.  It's possible that there
are some university docs I'll find that clue me in.  I might
even get lucky and someone at ARM may respond kindly.  It may
be that someone here knows, too, but just hasn't said as
much, yet.  Chances are this isn't some deep dark secret.
Just that I haven't yet come across it, is all.  I am
remarkably ignorant.

If you are interested in the details regarding the M4K
(PIC32) method, then I can write a lot on that unremarkable
topic.  That one is easy.  I could design the hardware myself
almost in my sleep.

Jon
