PIC vs ARM assembler (no flamewar please)

True enough.

Marketing terms are not necessarily the same thing as technical terms - I was careful to say "Freescale refers to as ..." rather than "is".

However, there is no fixed distinction between RISC and CISC. The two terms refer to a range of characteristics commonly associated with RISC cpus and CISC cpus. Some chips clearly fall into one camp or the other, but most have at least slightly mixed characteristics. The ColdFire core is very much such a mixed chip - in terms of the ISA, it is noticeably more RISCy than the 68k (especially the later cores with their more complex addressing modes), and in terms of its implementation, it is even more so. Even the original 68k, with its multiple registers and (mostly) orthogonal instruction set is pretty RISCy.

So the ARM is moving from a fairly pure RISC architecture, through the Thumb (with its more CISCy smaller register set and more specialised register usage) and now Thumb-2 (with variable length instructions). It's gaining CISC attributes in a move to improve code density at the expense of more complex instruction decoding.

The ColdFire, on the other hand, has moved from the original 68k to a more RISCy core, with a much greater emphasis on single-cycle register-to-register instructions and a simpler and more efficient core, in order to improve performance and lead to a smaller implementation.

There are still plenty of differences between the architectures, but there is no doubt that there are a lot more similarities between the ARM Thumb-2 and the ColdFire than between the original ARM and the original 68k.

The AVR? I can't think of any others.

Reply to
David Brown

I think you missed my uC = microcontroller. (AVR CPU) What you state is correct for megabyte CPUs, with all the cache and SDRAM fruit, but certainly NOT true for single-chip microcontrollers.

CPUs being pressed into uC service is one of the drawbacks of some approaches. Quick and dirty, yes; efficient, no.

Sounds like a poor example of how anyone would do this today.

Look at the XC166 and eZ8 for examples of how you can do very efficient memory overlays.

In a uC, you are talking of a few K's of memory, so speed should not be an issue at all.

-jg

Reply to
Jim Granville

I too miss the TMS9900/99000 ISA; I was always impressed by the performance of TI's DX-10 o/s running on a 64-kbyte, 3.3MHz 9900 servicing sixteen terminals with decent response time; context switching was fast. TI provided some good multitasking realtime executives for industrial control as well.

I've recently enjoyed working on the (older) Intel 8096/80x196 with its 256 registers addressed as memory and three-operand-capable instructions; it is somewhat of a challenge to limit tasks to sets of working registers within the on-chip set for fast context switches without using a stack. For the small-ish uC projects I'm doing, the 9900 ISA would be far more efficient and useful.

Regards,

Michael

Reply to
msg

I understand. The TMS9995 was much closer to an MCU with onboard RAM and it still was much slower than register based CPUs.

But for RAM to be as efficient as a register file it has to be triple ported so you can read two operands and write back another... or you have to go to an accumulator-based design. Once you have triple-ported RAM, you have just added a register file! A rose by any other name still smells as sweet...
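As a rough sketch of why this holds: a single-cycle register-to-register ALU operation needs two reads and one write in the same clock, which is exactly the port structure of a register file. The names and sizes below are purely illustrative, not taken from any real core:

```c
#include <stdint.h>

#define NREGS 16
static uint32_t rf[NREGS];   /* the storage, whatever we choose to call it */

/* Models one clock of a register-to-register add: two simultaneous
   reads (two read ports) plus one write (one write port). Memory that
   can do all three per cycle is a register file in all but name. */
uint32_t alu_add_cycle(int rd, int rs1, int rs2)
{
    uint32_t a = rf[rs1];    /* read port 1 */
    uint32_t b = rf[rs2];    /* read port 2 */
    rf[rd] = a + b;          /* write port  */
    return rf[rd];
}
```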

Reply to
rickman

Aha. And the PDP-11s running RSX-11 or "Young-UNIX". Or HP-1000 under RTE-II/III. Or NOVAs, Burroughs, or ...

But the real reason for the good performance of these very limited (by today's standards) systems was not their "advanced" architectural features, but that the people writing their software were aware of the system's limitations, and acted accordingly...

Roberto Waltman

[ Please reply to the group, return address is invalid ]
Reply to
Roberto Waltman

Correct, that's the hardware level detail.

The really important point is at the SW level: you can now access any small cluster of Register-Mappable-RAM variables VERY efficiently indeed, using register opcodes.

- Such clusters of variables are very common in code

- eg a Real time clock subroutine, could be fully coded using register opcodes, with a single Ram-locate operation on entry.

Fast context switching is also now built in. Stack usage drops. Lots of benefits, but you DO have to design the chip more as a system, and not simply buy and paste-in an IP core.

It's also backward compatible. If you are uncomfortable with the overlay, or the tools are catching up, just leave the register pointer alone, and you have plain-old-vanilla-RISC.

See the XC166, and IIRC the Sun CPUs used to allow a partial page overlap, so you could pass params in Ram.Registers, and allow locals as well, with very low pointer thrashing.
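A minimal model of that partial-overlap idea (SPARC-style register windows, as mentioned above) can be sketched like this — the window and overlap sizes are made-up parameters for illustration, not the actual XC166 or SPARC values:

```c
#include <stdint.h>

#define WINDOW  16   /* registers visible to one routine (illustrative) */
#define OVERLAP  4   /* caller's top 4 regs overlap callee's bottom 4   */

static uint32_t backing[256];  /* RAM holding all the windows */
static int cwp = 0;            /* current window pointer (base index)  */

/* "Register" r of the current routine is just RAM at an offset. */
uint32_t *reg(int r) { return &backing[cwp + r]; }

/* On a call the window slides by WINDOW - OVERLAP, so the caller's top
   OVERLAP registers become the callee's bottom registers: parameters
   pass in Ram.Registers with no copying and no stack traffic. */
void win_call(void)   { cwp += WINDOW - OVERLAP; }
void win_return(void) { cwp -= WINDOW - OVERLAP; }
```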

-jg

Reply to
Jim Granville

These are not examples of a RAM mapped register file, just of a hardware assisted context switch. So the contents of the RAM are copied to/from the register file but are not kept in sync until the next context switch.

Even a few KB of SRAM is much slower than a register file.

Wilco

Reply to
Wilco Dijkstra

Which are not ? - perhaps you are talking about the TMS9900 ?

If you meant the eZ8, then perhaps reading up on the Register Pointer operation would assist. In the eZ8, the register pointer adds to the 4-bit register operand, to map/overlay those 16 registers into up to 12 bits of RAM.

Slower, yes. 'Much slower' is moot - the bottleneck in most CPUs/uCs is code access from flash, and on-chip SRAM speeds are MUCH FASTER than flash speeds, so SRAM is not looking like the speed-determining path.

There seems to be no practical speed impact from this, when you look at the MHz speeds of real devices like the ST10/XC166 cores?

-jg

Reply to
Jim Granville

Indeed. The Register Pointer itself is made up of two separate parts, and those parts are added to the register operand, as JG said. It allows you to have 4-bit addressing (a group of 16 working registers, with the full RP being used for the complete address), or 8-bit addressing (a page, with only half of the RP being used), or the absolute 12-bit address. Throw in compatibility with older code from when the Z8s could only address 2^8 bytes of RAM (or register file), and you've got a pretty good blend of power and low code size.
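Going by that description, the effective-address arithmetic can be sketched as below. This is a paraphrase of the scheme as described in this thread, not a transcription of Zilog's documentation - in particular, which half of the RP the page mode uses is an assumption here:

```c
#include <stdint.h>

/* 4-bit working-register mode: the whole 8-bit RP supplies the upper
   bits of the 12-bit address; the 4-bit operand picks one of the 16
   registers in that window. */
uint16_t ea_working(uint8_t rp, uint8_t r4)
{
    return ((uint16_t)rp << 4) | (r4 & 0x0F);
}

/* 8-bit page mode: only half of the RP is used, as a page number
   (assumed here to be the low nibble), combined with the full 8-bit
   operand to form the 12-bit address. */
uint16_t ea_page(uint8_t rp, uint8_t r8)
{
    return ((uint16_t)(rp & 0x0F) << 8) | r8;
}
```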

It's very logical and intuitive, when you think of it.

Regards, D.

Reply to
D.

RISC and CISC are about instruction set architecture, not implementation (although it does have an effect on the implementation).

Well, let's look at 10 features that are typical for most RISCs today:

  • large uniform register file: no (8 data + 8 address registers)
  • load/store architecture: no
  • naturally aligned load/store: no
  • simple addressing modes: no (9 variants, yes for ColdFire?)
  • fixed instruction sizes: no
  • simple instructions: no (yes for ColdFire)
  • calls place return address in a register: no
  • 3 operand ALU instructions: no
  • ALU instructions do not corrupt flags: no
  • delayed branch: no

So that is 0 for 68K, 2 for ColdFire. ARM scores 8, Thumb scores 6, Thumb-2 7. MIPS scores 10 (very pure). This clearly shows 68K and ColdFire are CISCs, while the rest are RISCs.

Yes, RISCs have become more complex. However that doesn't make them CISC! Although ARM is not a pure RISC to start with, Thumb-1 and Thumb-2 are only slightly more complex and still have most of the RISC characteristics.

Indeed, it has gained 2 points by removing some of the complex microcoded instructions and addressing modes, thus allowing a simpler, more pipelined implementation. But that clearly doesn't make it a RISC, as the marketing people want us to believe...

I'd say that any similarities only exist on a superficial level. For example the variable length instructions in Thumb-2 are easier to decode than 68K or ColdFire.

Hitachi SH and ARC for example.

Wilco

Reply to
Wilco Dijkstra

Generally true, but there are exceptions. TI's 54xx DSPs have some registers memory-addressable (not all, e.g. not the accumulators, just the so-called "auxiliary registers"). Whether they are really memory addresses or not I don't know; the RAM is on-chip at address 0 (about where these registers are), and this RAM allows 2 accesses per cycle, so there is no slowdown out of that. But given that this architecture allows 3 RAM accesses per cycle (or was it 4?), this is hardly surprising - it is designed not to have a memory bottleneck.

Dimiter

Reply to
Didi

No, I meant the XC166 (SPARC, AMD29K etc) register windows.

The eZ8 is really weird indeed: you can either call it a CPU with a large register file or a CPU with no registers and direct memory addressing. The instruction cycle timings are pretty slow, so it's either fetch speed or the register access that is holding it back.

While SRAM is faster than flash, it wouldn't be fast enough to be used like a register in a simple MCU. On ARM7 for example, register read, ALU operation and register write all happen within one clock cycle. With SRAM the cycle time would become 3-4 times as long (not to mention power consumption).
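That 3-4x figure is essentially port arithmetic: with single-ported SRAM the two operand reads and the result write must serialize. A back-of-envelope sketch (illustrative only, not timing data for any real part):

```c
/* Clock cycles one register-to-register ALU op needs if the "registers"
   live in memory with the given number of ports: ceil(3 accesses / ports).
   Ignores flash fetch, wait states, and everything else real parts face. */
int cycles_per_alu_op(int ports)
{
    int accesses = 3;  /* read rs1, read rs2, write rd */
    return (accesses + ports - 1) / ports;
}
```

With one port you get 3 cycles per op; with a triple-ported structure (i.e. a register file) you are back to 1.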

That's because the XC166 uses registers and not RAM.

Wilco

Reply to
Wilco Dijkstra

To get a handle on what on-chip, small-RAM speeds can achieve in real silicon, look at the FPGA block sync RAMs - those are smallish blocks, dual ported, and plenty fast enough to keep up with the cycle times of a CPU. I don't see FPGA CPUs being held back by their 'slow SRAM', as you claim? RAM-based DSPs are now pushing 1GHz, and that's larger chunks of RAM than are needed for register-mapped memory.

-jg

Reply to
Jim Granville

I just dumped my message in progress on this -- you said what I wanted to say very clearly. I use such DSPs. I think Wilco must be stuck thinking in terms of external bus drivers where what is connected is unknown and the bus interface designer must work to worst cases. Too much ARM, perhaps?

Jon

Reply to
Jonathan Kirwan

I respect your knowledge and skill, Wilco, but I cannot agree with this as I understand you writing it here based upon my experiences.

I spent 1-on-1 time with Hennessy and listened to the reasoning he used. RISC was all about thinking in detailed terms of practical implementation. They were faced with access to lower-technology FABs (larger feature sizes, fewer transmission gates and inverters, etc.) and wanted to achieve more with less. Doing that was everything about implementation and the instruction set architecture was allowed to go where it must. That this worked out to being a 'reduced instruction set' was something that came out of achieving competing performance out of lower-tech FAB capability than folks like Intel or Motorola had available to their flagship lines of the day.

There was a design philosophy based upon theory -- that was simply the realization that many of the things that slowed down a CISC were also a matter of perceived convenience for programmers, so the policy was then to get rid of anything and everything that slowed down the clock rate without paying _well_ for that delay. A focus on throughput. The fact that removing barriers to speed also happened to reduce the need for more transistor equivalents was the happy coincidence that fueled the initiative. The instructions were a result of the application of focusing on implementation details -- not some instruction set theory under which the implementation then followed. If higher level features were cheap to implement and paid for themselves in performance, they were simply kept. Very practical, hard-nosed approach.

If you ever listened to such a lecture by those actually doing the work, you'd see this narrow focus. The register flags that signalled whether or not a register was in-use as a destination were tossed as too expensive -- they required infrastructure in order to delay the processor and the combinatorial worst-case path of the whole of that meant additional __delay__ in each clock cycle, whether or not this interlock was useful instruction to instruction. You paid for it on every cycle, need it or not. So out it went. No interlocks. Sorry. Similar thinking was involved in the Alpha's refusal to do 'lane changes,' for example.

Hennessy had a huge blow up of the 68020 CPU in one room at MIPS (which was quite near Weitek, at the time), when I visited. He would go through each and every detail of the implementation there and talk about it, at length, and explain why it was worthwhile... or not... and what the exact quantitative cost was in each cycle's timing and over the broader arch of an application.

Some of the difficulties were higher memory bandwidths required, once you started tossing out stuff like register interlocks, microstore and its associated sequencing overhead, lane changing, etc. But if that could be satisfied, and that was kind of possible at the time with some static ram from performance semi, it would perform like a bat out of hell. So to speak.

But the focus was on implementation on lower-tech FABs and, while doing that, still competing with CISC and beating it.

Of course, FABs got a lot better and access to high tech FAB resources became increasingly brokered to keep them running 24/7, and the driving need for lower-tech feature sizes became relaxed. Also, CISC looking external designs could now be designed with internal RISC processors, built-in TLBs, re-order buffers, registration stations with multiple functional units to share, jump prediction, .... so much so, that in fact Intel started putting L1 cache memory on-die. There was so much excess available, they ran out of nifty ideas and the best they knew to do with it was suck up die space with cache memory.

So the RISC drive relaxed. At least, on the consumer market area.

But for those making cheap embedded controllers, I suspect that die size and effectively using somewhat lower FAB technology remains useful. So the low-transistor count approaches once the much lauded domain of RISC remain important.

Jon

Reply to
Jonathan Kirwan

All one can really derive in meaning from RISC is Reduced Instruction Set Computer - any other assertions become in the eye of the beholder, or worse, spin doctoring - so there is little point in slicing and dicing the details of what is, or is not, RISC.

-jg

Reply to
Jim Granville

Real meaning is found in the details of how things work, not in some banner or ideology. Which is, I suppose, about what I said.

Thanks, Jon

Reply to
Jonathan Kirwan

Jim, you earlier wrote "I think you missed my uC = microcontroller." - I don't think 1GHz DSPs/FPGAs are microcontrollers. Yes, high-end SRAMs on advanced processes easily reach 1GHz, but my point (and I think Rick's) is that registers are much faster still.

No, not at all. I'm talking about needing to access the SRAM several times per cycle to read/write the registers (as explained in the first paragraph). Therefore the speed of a CPU using SRAM rather than registers becomes a fraction of the cycle time of the SRAM.

A register file is a small dedicated structure designed for very high random access bandwidth. SRAM simply can't achieve that.

Wilco

Reply to
Wilco Dijkstra

The whole point of RISC is to be able to make a more efficient implementation - it is an architectural design philosophy aimed at making small and fast (clock speed) implementations.

Typical CISC is 4 to 8 registers, each with specialised uses. Thus the 68k is far from typical CISC, and is much more in the middle.

The 68k can handle both operands of an ALU instruction in memory, which is CISC. The ColdFire can have one in memory, one in a register, which is again half-way.

That is purely an implementation issue for the memory interface. It is common that RISC cpus, in keeping with the aim of a small, neat and fast implementation, insist on aligned access. But it is not a requirement - IIRC, some PPC implementations can access non-aligned data in big-endian mode. The ColdFire is certainly more efficient with aligned accesses, but they are not a requirement.

The addressing modes for a ColdFire "move" instruction are:

Rx, (Ax), (Ax)+, -(Ax), (d16 + Ax), (d8 + Ax + Ri*SF), xxx.w, xxx.l, #xxx

The source and destination addressing modes can be mixed as long as only one of them needs an extension word.
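For illustration, the most complex of the modes listed above, (d8 + Ax + Ri*SF), computes an effective address as below. This is a sketch of the arithmetic only, not ColdFire reference material - SF here is assumed to be a scale factor of 1, 2, or 4:

```c
#include <stdint.h>

/* Effective address for the (d8 + Ax + Ri*SF) mode: address register
   as base, index register scaled by the scale factor, plus a
   sign-extended 8-bit displacement. */
uint32_t ea_indexed(uint32_t ax, uint32_t ri, unsigned sf, int8_t d8)
{
    return ax + ri * sf + (uint32_t)(int32_t)d8;
}
```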

The 68k had several other modes in its later generations, and they could be freely mixed for the source and destination.

I am not familiar enough with the ARM (it's 17 years since I programmed one), but if we look at the PPC, it has addressing modes roughly equivalent to:

Rx, (Rx), (d16 + Rx), (Rx + Ry), xxx.w

Using update versions of the instructions, you get something much like the (Ax)+ and -(Ax) modes as well as more complex modes.

All in all, the CF modes are only marginally more complex than the PPC modes.

The big difference, however, is that the CF can use these modes on ALU instructions and not just for loads and stores - but that has already been counted above.

The instruction set for the PPC contains much more complicated instructions than the CF. The 68k has things like division instructions, which the CF has dropped.

A far more useful (and precise) distinction would be to look at the implementation - does the architecture use microcoded instructions? RISC cpus, in general, do not - that is one of the guiding principles of using RISC in the first place. Traditional CISC use microcode extensively. The 68k used microcode for many instructions - the CF does not.

More generally speaking, CISC has specific purpose registers, while RISC have mostly general purpose registers. Yes, the CF has extra functionality on A7 to make it a stack pointer. Putting the return address in a register, as done in RISC cpus, is not an advantage - it is a consequence of not having a dedicated stack.

If we add in some other features that are a little more implementation-dependent (and therefore entirely relevant, since that is the reason for RISC in the first place), things are a bit different:

  • Single-cycle register-only instructions: yes
  • Short execution pipeline: yes
  • (Mostly) microcode-free core: yes
  • Short and fast instruction decode: half point
  • Low overhead branches: yes
  • Stall-free for typical instruction streams: yes

Suddenly the scores are looking a bit different.

Perhaps we could compare the CF to traditional CISC features:

  • Specialised accumulator: no
  • Specialised frame pointer: no
  • Specialised index registers: no
  • Microcoded instructions: no
  • Looped instructions: no
  • Direct memory-to-memory operations: no
  • Bottlenecks due to register or flag conflicts: not often
  • Long pipelines: no
  • Register renaming needed for fast implementation: no
  • Unaligned code: no
  • Highly variable instruction length: half (only 1, 2, or 3 16-bit words)
  • Instruction prefix codes: no

I could go on - and I expect you could too.

As I said, with the Thumb-2, the ARM is gaining the CISC feature of variable length instructions - I did not say it is changing into a CISC architecture. The real world is grey - there is no dividing line between CISC and RISC, merely a collection of characteristics that some chips have and others don't. Adding these variable length instructions is a good thing, if it doesn't cost too much at the decoder. It increases both code density and instruction speed, since it opens the path for 32-bit immediate data (or addresses) to be included directly in a single instruction.

My point is not that the CF is a RISC core - I never claimed it was. But neither is it a CISC core in comparison to, say, the x86 architecture. If there were such a thing as a scale running from pure RISC to pure CISC, then the CF lies near the middle. It is not as RISCy as the ARM, but is somewhat RISCier than the original 68k.

My original comment was pretty superficial.

The hard ones from the 68k were dropped in the ColdFire, precisely to allow a faster, more RISC-style decoder.

I haven't looked at them, so I'm happy to take your word for it.

Reply to
David Brown

You're free to disagree but there is consensus about what RISC and CISC are. It's unfortunate that many confuse ISA and implementation... Please read this excellent article by John Mashey:

formatting link

Hennessy & Patterson's "Computer Architecture: A Quantitative Approach" is well worth reading too.

It is true that in those early days they wanted to cram a complete CPU on a single die (including caches to speed up memory access) and the only way to achieve that was to throw out everything unnecessary.

Those days are long gone, transistor budgets are much larger now. Today all CPUs, whether RISC or CISC, use the same implementation techniques to achieve high performance.

Again, it is true that in the early days the focus was on getting performance without much regard for anything else. However saying that the instruction set design followed from the implementation is incorrect. RISC started as a reaction against the CISC goal of "closing the semantic gap" after IBM studies showed only a few simple instructions were used 90% of the time. It's about taking a quantitative approach to instruction set design.

RISC takes the interaction between the various components of a complete system into account (compiler, ISA, implementation). The result of this is a particular set of features in the ISA, not in the implementation. A microcoded RISC is still a RISC, a pipelined CISC is still a CISC!

Those were mistakes indeed that were corrected in later RISCs. Some of the early ideologies were taken too far, and concentrated too much on a single implementation rather than on the ISA (which lives for many implementations). Going for all-out clock speed without thinking about power consumption, code size, ease of compiler design etc. is a bad idea.

Many early RISCs ended up with features that were found to have a negative impact in the end (either in software or in later CPUs). Alpha byte access is a great example of this; delayed branches are another. MIPS quickly realised the silliness of omitting interlocks. :-)

At the time, yes. Nowadays it is accepted that while RISC still has some advantages over CISC (eg. area, power consumption, design effort), CISC CPUs can be made as fast as RISC CPUs as long as you put enough effort into it. Of course CISCs can compete on one out of power, area or speed, not on all at the same time!

Correct. It's no surprise most 32-bit embedded CPUs are RISC.

I think the real lesson was not to adhere to the early dogmas too strongly. RISC has evolved over time, and so has CISC. RISCs have fixed their early mistakes of thinking too much about the first implementation rather than the ISA. As we discussed before, RISCs have taken on more complex features as transistor budgets grew. CISCs have moved from mostly microcode to mostly pipelined single-cycle instructions. The key features that differentiate RISC from CISC, both then and now, are all about instruction set architecture, not implementation.

Wilco

Reply to
Wilco Dijkstra
