PIC vs ARM assembler (no flamewar please)

On Feb 20, 9:53 pm, Jonathan Kirwan wrote: [snip interesting comments]

With the note that code density is also a factor. If the area saved by simpler decoding comes at the cost of more area in Icache (for the same performance) or FLASH, then simpler decode is a net loss. Simpler decoding can also save power, but reading larger instructions consumes more power. RISC also reduces the design effort required and the testing complexity. At higher volumes design cost becomes less significant, so the balance point in the trade-offs between code density and implementation complexity changes (e.g., per-chip design cost savings can be translated into larger chip area). Per-chip design cost savings can also be translated into a better (faster and/or more power-efficient and/or smaller) process technology.

(Greater design effort [whether from ISA factors or greater effort to optimize the design for power, performance, and/or area] also increases scheduling risks; so a sub-optimal ISA or implementation might be safer. Safer probably means easier access to start-up capital [a double-whammy because a simpler design also requires less start-up capital]. Of course, one also cannot trade costs [number of designers] for time to completion at a fixed rate.)

It might also be noted that a move to multiple cores per chip multiplies the decode area savings (but does not reduce Icache costs) while shared FLASH cost remains constant.

As you implied, the trade-offs for a real product are much more complex.

Paul A. Clayton
just a technophile and babbler

Reply to
Paul A. Clayton

My point is that there is no difference between registers in triple-ported RAM and a large register file. If I have 1kB of triple-ported RAM, I can play the same game and statically allocate memory to interrupt routines for zero-overhead context switching.
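
As a rough sketch of that idea in C (illustrative names, not from any real part): each context gets its own statically allocated register block, and a "context switch" is nothing more than repointing the workspace pointer.

    #include <stdint.h>

    #define NUM_REGS 16

    static uint16_t main_ws[NUM_REGS];  /* register block for main code */
    static uint16_t irq_ws[NUM_REGS];   /* register block for the ISR   */
    static uint16_t *wp = main_ws;      /* the "workspace pointer"      */

    #define REG(n) (wp[(n)])            /* register access goes via wp  */

    void irq_entry(void)
    {
        uint16_t *saved = wp;
        wp = irq_ws;                    /* zero-overhead context switch:
                                           nothing is pushed or popped  */
        REG(0)++;                       /* ISR uses its own registers   */
        wp = saved;                     /* restore on exit              */
    }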

Stack usage drops because the TMS9900 ISA did not support stacks very well. There was no stack usage on subroutine calls because a link register was used (R12 or R13, IIRC). Before another routine was called the register had to be saved or a full register context change had to be done. This was over 20 years ago, so I may not remember the details correctly. But I remember distinctly that I was initially impressed with the 9900, but eventually realized that this was outdated technology as CPU speeds and RAM densities increased.

With more limited capabilities due to the register linking for subroutine calls.

Passing params in registers is still used without a special register block pointer. With a significant number of registers being used for housekeeping, there is limited utility to using a block pointer. Say you allocate the top 8 registers as "important" registers (I'm sure there is a term for this, but I can't recall it) that must be saved when a routine is called. The lower 8 are considered volatile and can be reused as required or used to pass data. To use the lower 8 and save the upper 8 you need to adjust the register pointer down by 8 cells. You still need to copy some of this data since some of the new upper registers (the old lower 8) are hardware dedicated and will clobber data otherwise.
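
A minimal C sketch of that convention (hypothetical names, with the array standing in for the multi-ported RAM): the upper 8 registers are callee-saved, the lower 8 volatile, and a call slides the register pointer down by 8 so the caller's lower half becomes the callee's upper half.

    #include <stdint.h>

    static uint16_t regfile[256];        /* large multi-ported RAM       */
    static uint16_t *rp = regfile + 240; /* register block pointer       */

    #define R(n) (rp[(n)])               /* R0-R7 volatile, R8-R15 saved */

    void window_call(void)
    {
        rp -= 8;  /* caller's volatile R0-R7 become the callee's R8-R15;
                     any hardware-dedicated cells in the new upper half
                     still have to be copied by hand, as noted above    */
    }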

Yes, it can be made to work, but I never saw a big savings in speed or memory usage. Perhaps you saw different applications.

Reply to
rickman

... snip ...

Since the fundamental limitation today is propagation time, premium performance is found in small devices. A chip is obviously a smaller device than a PCB. Once you shuffle everything, including the dog, into one chip you can gain performance by adding features. So I predict that evolution will tend in the CISC direction. We see this in spades in the embedded arena.

Also remember that NOT having to drive off-chip lines produces heavy reduction in power and area, and increase in speed, all at the same time!

Imagine a chip with 2G memory, a PPC instruction set, and a USB external interface. It needs 6 pins, possibly 8 to allow for a clock. The memory would be ECC, since the cells would be rather small and highly subject to bit drops. All in a 1/4 inch square package! External HD access would suffer.

Probably a simple stack oriented instruction set would be better. Memory access on such a machine would not be significantly slower than register access.

--
 "A man who is right every time is not likely to do very much."
                           -- Francis Crick, co-discoverer of DNA
 "There is nothing more amazing than stupidity in action."
                                             -- Thomas Matthews
Reply to
CBFalconer

That's a good summary.

There are various CISCs (eg VAX, MSP430) that have 16 registers, while most RISCs have 32 or more.

The ColdFire is no different from 68K in this aspect. Most ALU operations can do read/modify/write to memory and the move instruction can access two memory operands.

Unaligned accesses are non-trivial, so most RISCs left them out. However, modern CPUs have much of the required logic anyway (due to hit-under-miss, OoO execution etc), so a few RISCs (ARM and POWER) have added them. Hardware designers still hate the complexity, but it is often a software requirement. Quite surprisingly, it gives huge speedups in programs that use memcpy a lot.
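
One way to see why, sketched in portable C (my illustration, not from the post): with hardware unaligned access, the word loop below compiles to plain word loads and stores whatever the pointer alignment; without it, the compiler must fall back to byte-by-byte sequences.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    void copy_words(void *dst, const void *src, size_t n)
    {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n >= sizeof(uint32_t)) {
            uint32_t w;
            memcpy(&w, s, sizeof w);   /* unaligned load on capable HW */
            memcpy(d, &w, sizeof w);   /* unaligned store              */
            s += sizeof w; d += sizeof w; n -= sizeof w;
        }
        while (n--) *d++ = *s++;       /* trailing bytes               */
    }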

...

It's the (d8 + Ax + Ri*SF) mode that places it in the complex camp. That mode not only uses a separate extension word that needs decoding but also must perform a shift and 2 additions...
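
Spelled out, the address arithmetic that mode implies looks like this in C (a sketch; sf_log2 is my encoding of the 1/2/4/8 scale factor):

    #include <stdint.h>

    /* ea = Ax + Ri * SF + sign-extended d8: one shift, two additions */
    uint32_t ea_d8_ax_ri(uint32_t ax, uint32_t ri,
                         int8_t d8, unsigned sf_log2)
    {
        return ax + (ri << sf_log2) + (uint32_t)(int32_t)d8;
    }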

What PPC instructions are complex? PPC is a subset of POWER just like CF is a subset of 68K, so most of the complex instructions were left out.

This is misguided. RISC *enables* simple non-microcoded implementations. One can make a microcoded implementation of a RISC, but that doesn't make it any less RISC.

It is an advantage as it avoids unnecessary memory traffic - a key goal of RISC.

I don't see how the scores change at all. Most of the features you mention are "yes" for 68K implementations (except for the original 68000, which scores 4 out of 6), ColdFire and ARM.

Many famous CISCs are not accumulator based, eg PDP, VAX, 68K, System/360 etc. Accumulators are typically used in 8-bitters where most instructions are 1 or 2 bytes for good codesize.

Implementation detail. CF is still complex enough that microcoded implementations might be a good choice.

Loop mode is just an implementation optimization that could be done on any architecture.

Eh, what does move.l (a0),(a1) do? It's valid on CF.

Longer than an equivalent RISC (mainly due to needing 2 memory accesses per instruction and more complex decoding). And likely longer than a simpler microcoded implementation.

Sure, there is always a grey area in the middle, but most ISAs clearly fall in either camp. If you use my rules, can you mention one that scores 4 or 5?

Actually, embedding large immediates in the instruction stream is bad for codesize because they cannot be shared. For Thumb-2 the main goal was to allow access to 32-bit ARM instructions for cases where a single 16-bit instruction was not enough. Thumb-2 doesn't have immediates like 68K/CF.

I agree CF is less CISCy than 68K but it is still more CISCy than x86. If it dropped 2 memory operands, removed ALU+memory operations, 32-bit immediates and the absolute and (d8 + Ax + Ri*SF) addressing modes, then I would agree it is a RISC...

Wilco

Reply to
Wilco Dijkstra

FPGAs are certainly used as microcontrollers, and in increasing volumes. CPU designers had better be aware of just what an FPGA soft CPU can do these days, as they are replacing uCs in some designs.

?

OK, I'll try one more time. You seem to be stuck on a restrictive use of SRAM, so I'll use different words. Let's take a sufficiently skilled chip designer: he knows various RAM structures, and he will not use vanilla SRAM (as Rick has already mentioned) but something more like the dual-port sync RAM of the FPGAs I gave as an example. Yes, this RAM is more complex than the simplest RAM (which is why Infineon keeps the size to 1-2K), but it buys you many benefits on a uC, and the die size impact of such RAM is still tiny.

Q: What percentage of a RISC (eg ARM) die is taken by the registers themselves?
A: A minuscule fraction.

Reply to
Jim Granville

They can indeed, but FPGA prices need to come down a lot before it becomes a good idea in a high volume design. I've worked with big FPGA stacks for CPU emulation, and large/fast FPGAs can cost well into five figures apiece. Even the smallest ARM uses a big chunk of a large FPGA. So you can only use very simple CPUs in a small FPGA.

I'm with you. Inventing a new kind of SRAM with 3 read and 2 write ports would do the job indeed. But it is going to be big compared to using standard single ported SRAM, so there needs to be a major advantage.

What bottleneck? You lost me here... Adding more registers doesn't automatically improve performance.

At 2KB it would double the size of an ARM7TDMI and slow it down a lot without a redesigned pipeline (it needs to support 2 accesses in less than half a cycle at up to 120MHz, so the SRAM would need to run at 500MHz). I think you're assuming register read/write is not already critical in existing CPUs - it often is.

However what use do you have for 256/512 fast registers? You can only access 16 at any time...

Sure, it just gets progressively slower with size and number of ports.

Wilco

Reply to
Wilco Dijkstra

Yes, and to do that zero-overhead context switch, you need a register frame pointer (or similar). You do not want to use this as a mere smart stack, but to allow all the reg opcodes to work on any window into that larger memory. Triple-ported or dual-ported depends more on the core in question.

Even the lowly 80c51 has a register frame pointer (all of 2 bits in size), and it does overlay the registers with RAM. The Z8 expands this to 8 bits, and I think the XC166 uses a 16-bit one.
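
Modelled in C, that 80c51-style overlay looks roughly like this (illustrative names): a small bank field selects which slice of on-chip RAM the registers currently alias.

    #include <stdint.h>

    static uint8_t iram[256];       /* on-chip RAM                  */
    static unsigned bank;           /* 2-bit register bank select   */

    #define R(n) (iram[bank * 8 + (n)])  /* R0-R7 overlay the RAM   */

    void select_bank(unsigned b) { bank = b & 0x3; }  /* 4 banks of 8 */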

? - you've lost me here. In subset mode, you simply ignore the register frame pointer, and it is _exactly_ the same as your un-enhanced core.

-jg

Reply to
Jim Granville

I think of SRAM as "static RAM." Nothing more than that. This means it can be single-ported or multi-ported. The only distinguishing issue is whether or not it is static and can retain its contents down to a DC clock. This includes latches. The actual cell of an SRAM can be implemented in a variety of ways and with a variety of surrounding control logic.

So again I think you are considering __external__ SRAM packages commonly found, and a bus upon which they operate, or are otherwise locked into some mental viewpoint you aren't escaping just yet, one that doesn't reflect actual cpu design practice. Registers are, in fact, almost always implemented as SRAM in an ALU design today, whether multiported or not. (They used to be NMOS DRAM in some processes, but I don't know of any of those now.) Saying "SRAM simply can't achieve that" sounds silly to me. It does, because that's what registers actually happen to be.

Jon

Reply to
Jonathan Kirwan

Well, with SRAM I'm thinking of standard single ported SRAM like 99.9% of the SRAM that exists, whether as external packages, on-chip RAM, part of a cache, cell libraries etc. Dual ported SRAM is pretty rare (note dual ported caches don't actually use dual ported SRAM). Anything else you'll likely have to design yourself at the transistor level.

Not for synthesized CPUs. Standard cell libraries don't provide the right number of ports or the right pitch, so ARMs typically have register files created from flops and muxes. CPUs that are largely handcrafted can obviously create specially designed RAMs with enough ports to achieve the required bandwidth. Even then they use various techniques to reduce the number of ports.

Wilco

Reply to
Wilco Dijkstra

My first reaction to the above is that ASIC cpu designs have control over all this and they use that flexibility as a matter of course, too. And none of this addresses itself to the fact that registers are, in fact, SRAM. So your differentiation is without a difference.

So now you bring this in? I thought you were simply saying SRAM and registers are different, which I don't agree with because registers are sram. And now you talk about synthesized cpus to see if that might help your case?

What exactly is the difference in your mind between a flip-flop and an SRAM bit cell? I'm curious.

Not sure what to say to all that.

Jon

Reply to
Jonathan Kirwan

If it is an async vanilla SRAM cell there is a slight difference.

- with a register you can read and write on the same clock edge.

If it is a Sync SRAM cell, there is no difference. Both have a Clock, and a Tsu and Th.

Dual port Sync SRAMs are quite common fare, across most FPGA vendors.

A one-generation-back FPGA (so as not to be too unfair) specs Tsu of 0.52ns, Th of 0ns, Tco of 0.6ns, and Tclk of 572MHz on a 2K byte dual-port RAM. These devices deliver 100-200MHz soft CPU speeds, and the sync SRAM speed does not look like the main bottleneck.

-jg

Reply to
Jim Granville

I'm aware. I was curious what Wilco was thinking about.

Jon

Reply to
Jonathan Kirwan

It's nice that we don't entirely disagree!

I still don't see the point of trying to make black-and-white classifications of cpus as *either* CISC, *or* RISC. You could divide them into load/store and non-load/store architectures, which is perhaps the most important difference (although there are no doubt hybrids there too). Using that definition, the msp430 is CISC - but it has plenty of RISC features (such as 16 registers - a lot for its size).

IIRC, the 68k could do some ALU operations with both operands in memory (such as ADDX), and MOVE operations can use any addressing mode for both operands. The CF is more limited to simplify decoding and operand fetch.

Another example of the simplifications is that the CF no longer supports byte or (16-bit) word sizes for most operations - about the only instructions that support sizes other than the native 32 bits are MOVEs. So for other data sizes, you effectively have a load/store architecture.

I've worked for years with the 68332, and in recent times I've worked with the ColdFire. I've studied generated assembly code, often made with the same compiler, from the same source code. There is no doubt whatsoever - the generated CF code makes much heavier use of register-to-register instructions, with code strategies more reminiscent of compiler-generated RISC code. This is partly because some of the more costly memory operation capabilities were dropped from the 68k, and partly because the CF is more heavily optimised for such RISC style instructions. If you were to think of the CF as a RISC core with a bit too few registers, but some added direct memory modes to compensate, you'd program fairly optimal code - the same is not true for the 68k.

That *is* surprising - the memcpy() implementations I have seen either use byte-for-byte copying, or use larger accesses if the pointers are (or can be) properly aligned.

Yes, that's a complex one, and it's slightly surprising that it survived the jump from 68k to CF. I think it was included as it is the only mode that can get its address from the sum of two registers, which is a common requirement (the PPC has such an addressing mode). Since an extension word is needed, the 68k architecture put the extra bits to good use - a scale factor of 1, 2, 4 or 8, and the remaining bits giving an offset which is probably seldom used.
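
For illustration, the "sum of two registers" case is ordinary array indexing; on 68K/CF the C below can collapse to a single instruction using exactly this mode (the instruction in the comment is my guess at the output, not compiler-generated):

    #include <stdint.h>

    int32_t fetch(int32_t *base, int32_t i)
    {
        return base[i];   /* e.g. move.l (0,a0,d0.l*4),d0 */
    }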

The mask and rotation instructions are examples of complex ALU instructions, and there are several multi-cycle data movement instructions (such as the load multiple word, and the load string word).

Again, I don't see RISC vs. CISC as a black and white division, but as a set of characteristics. Microcoding is a CISC characteristic - it is perfectly possible to have a mostly RISC core with CISCy microcode.

It avoids an extra memory write (and subsequent read) in leaf functions, at the cost of extra instruction fetches for the code to save and restore the link register for non-leaf functions. I can't give you a detailed analysis of the costs and benefits here, but I'd be surprised if it is a distinct advantage.
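
As a rough illustration of that trade-off (any link-register machine, typical code generation assumed):

    int leaf(int x)        /* return address stays in the link register:
                              no stack write or subsequent read needed  */
    {
        return x * 2 + 1;
    }

    int nonleaf(int x)     /* link register must be saved and restored
                              around the inner calls - the extra
                              instruction fetches mentioned above       */
    {
        return leaf(x) + leaf(x + 1);
    }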

Exactly the point - when you include these typical RISC features as well as your chosen features, the CF scores much more like the ARM. I'm not claiming in any way that the CF is RISCier than the ARM, or even *as* RISCy - just that it has far more typical RISC features than you give it credit for.

Specialised accumulators are a typical CISC feature, even though they are by no means universal.

I intended to refer to ALU operations, sorry.

Are you making this up out of thin air?

I don't have any details of the CF pipeline. But a mispredicted branch that hits the instruction prefetch cache (thus avoiding instruction fetches) executes in 3 cycles. That's definitely a short pipeline.

A fair proportion of CF instructions are single-word, and a single memory access reads two such instructions. I'd estimate that you'd have slightly less than one memory access per instruction on average, but of course that's highly code dependent. Instructions are aligned with their extension words as they are loaded into the prefetch cache, so decoding is not any more complicated or time-consuming than for a RISC instruction set - the coding format is nice and regular.

I wouldn't use your rules - they are picked specifically to match your argument (and even then, you placed the ARM Thumb at 6). Add in the six I picked, and the ColdFire is at 8 out of 16. Of course, my rules, like yours, are arbitrary and unweighted, so they hardly count as an objective or quantitative analysis.

Most ISAs can certainly be classified as roughly RISC or roughly CISC - I'll not deny that, and given a choice of merely RISC or CISC, I'd classify the CF as CISC without hesitation. All I am trying to say is that there are characteristics that are typical for each camp, and that architectures frequently use characteristics from the "opposing" camp to make a better chip. The CF has a lot more RISC features than most CISC devices, and the ARM is picking up a few more CISC features with their newer developments. My original statement, that the inclusion of variable-length instructions in Thumb-2 makes the ARM more like the CF, is true.

Embedding large immediates in the instruction stream is good for code size if there is no need to share them. If they are shared, then the typical RISC arrangement of reading the values from code memory using a pointer register and 16-bit displacement is more code efficient (for 3 or more uses of the 32-bit data), but less bandwidth efficient (taking a 32-bit instruction and a 32-bit read, compared to a single 48-bit instruction).

Of course, that would require support for 48-bit instructions rather than just 32-bit, which might not be worth the cost.
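
Working out the arithmetic behind that break-even (using the sizes given above):

    k uses of one 32-bit constant, sizes in bytes:
      embedded immediate : 6 bytes/use            -> 6k total
      literal pool       : 4 bytes/use + 4 shared -> 4k + 4 total
    The two are equal at k = 2; the pool wins from k = 3 on,
    matching the "3 or more uses" figure above.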

I must have misread that - are you saying the CF (and 68k) is more CISCy than the x86 ??

That's true - but then it would not be nearly as good a core. Just because there are some truly horrible CISC architectures, does not mean that all things RISC are better!

Regards,

David

Reply to
David Brown

The ARM is a very poor choice for a CPU in an FPGA, and it most certainly does not follow that only simple CPUs can be used in small FPGAs. The most common soft processors are the Nios II (Altera) and the MicroBlaze (Xilinx) - both are designed specifically for FPGAs, and will give much more processing power for the same area in the FPGA than a "standard" CPU core. The ARM, like the ColdFire, is designed to be efficient and easily synthesizable on ASICs and other fine-grained architectures - FPGA-optimised designs are significantly different.

On the ColdFire, the register set is a very significant part of the die area - so much so that, in designing the ColdFire v1 core, Freescale considered halving the number of registers.

That's true - look at the rather mediocre real-world performance of the Itanium, for an example.

Reply to
David Brown

You're using a different definition of SRAM.

Wikipedia defines SRAM as a regular single ported cell structure with word and bitlines which is laid out in a 2 dimensional array and typically uses sense amps. It mentions dual ported SRAM and calls it DPRAM.

An SRAM bit cell is designed to be laid out in a 2 dimensional structure sharing bit and word lines thus taking minimal area. A flop is completely different. There are lots of variants, but they typically have a clock, may contain a scan chain for debug and sometimes have special features to save power. Note that flops are typically used in synthesized logic rather than latches and precharged logic. They are irregular and much larger than an SRAM cell, but they have a good fanout and can drive logic directly unlike SRAM.

So while logically they both store 1 bit, they have different interfaces, characteristics, layout and uses. I hope that clears things up...

Wilco

Reply to
Wilco Dijkstra

I am referring to the way the TMS9900 links subroutines. They save the return address in a register. This is in part because the use of the register pointer partially negates the need for a stack. But you still have to save the return address before you link to another subroutine. If you are changing the register pointer, you either have to save the old one on a stack or in a register, or you have to hard-code the register pointer restore.

I seem to recall TI having set a convention that used an extra location in memory at the start of a routine. I believe this was to load the register pointer, but I'm not certain. I just recall that the overall effect was not really any better than using a stack with internal registers.
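
Here is a rough C model of what I believe is being half-remembered - the 9900-style two-word call vector (my reconstruction, illustrative names): the routine is preceded by a new workspace pointer and the real entry address, so a single call switches register blocks and jumps.

    #include <stdint.h>

    static uint16_t main_ws[16];
    static uint16_t *wp = main_ws;  /* current workspace pointer       */

    struct vector {
        uint16_t *new_wp;           /* register block for the callee   */
        void    (*entry)(void);     /* actual entry point              */
    };

    void call_via_vector(const struct vector *v)
    {
        uint16_t *old_wp = wp;      /* real hardware stashes the old
                                       WP/PC/status in the new
                                       workspace's last registers      */
        wp = v->new_wp;             /* switch register blocks          */
        v->entry();
        wp = old_wp;                /* the RTWP-style return           */
    }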

If you are using internal, multiport RAM for memory mapped registers, then you are really just using a large register file. Don't some of the RISC CPUs do that?

Reply to
rickman

The PDP-11's JSR allowed something like that, as well. The instruction was/is:

JSR reg, destination

It would: push reg onto the stack, copy the return address (the PC) into reg, and then jump to the destination. The subroutine could later use reg as its linkage back, or even as scratch.

Or, if it had supported the more flexible PDP-11 JSR, you could choose the PC as the linkage register, and that saving would not be needed unless you used some non-PC register as your linkage.

Yikes! If I gather you correctly, that's what we got away from with the concept of using a stack! Storing the return linkage in a location at the start of the subroutine means program lifetime, and it's not recursive or re-entrant that way!

Jon

Reply to
Jonathan Kirwan

It's the definition I was taught in the 1970s, both by engineers who practiced at the time and by manufacturers who made the parts I used. I can refer you to data books on the subject, I suppose. Not that it would change your point... or mine.

I retain the general classification of the term 'SRAM' from the roots by which it got its name. Not the wiki definition, where new terms are applied and old ones redefined.

And that explains your use of the term and my difference with it. I don't think I'll change my use, yet.

Jon

Reply to
Jonathan Kirwan

Sidebar: The reason I won't is that I need a term that retains the general classification. It's meaningful to me. And if I adopted your use, I'd lose that word's denotation without another to replace it. Unless you can tell me what replaces that usage today...

Jon

Reply to
Jonathan Kirwan

You (and Jim) are free to use your definition - it just may cause some confusion every now and again... I have no idea whether there are any terms with the meaning you use; I think each kind of memory got its own name. There are so many variations, and new memories are appearing all the time which don't fall into existing categories...

Wilco

Reply to
Wilco Dijkstra
