PIC vs ARM assembler (no flamewar please)

Most CPUs, from low to high end, use a Harvard architecture, so there is no issue with unified memory. If you're accessing unified main memory a lot, you may need to add an L2 cache etc.

I've done that on various high-end ARMs. ARMs are often fetch limited due to branches even when they are perfectly predicted. The reason is that it takes several cycles to predict and read the branch destination, while the branch itself is often removed from the instruction stream. So you waste 1-3 fetch slots for each branch, which is precious bandwidth.

I bet ColdFire implementations are also severely fetch limited. Assuming fetch is 32-bit, it would not be able to execute a sequence of 32-bit instructions at 1 cycle per instruction for very long. Now imagine a sequence of 48-bit instructions...

Unless we're talking about serious OoO CPUs, speculative fetching doesn't go beyond a few instructions (2-5), and stops at a cache miss (for both power consumption and performance reasons).

That's wishful thinking - speculative fetching often fails to hide branch latency, so it definitely can't hide a cache miss. x86 CPUs can't fetch even close to 150 cycles ahead. In fact x86 CPUs are fetch limited too; that's why they doubled fetch width to 256 bits.

About the same number of cycles would be lost on an I-cache miss. All cache misses are expensive, it hardly matters whether they are code or data.

Clearly RISC designers decided it wasn't worth the complexity. Interestingly, compilers have moved on to use the load method. 32-bit immediates in the instruction stream are a bad idea, but 64-bit immediates would be insane...

Wilco

Reply to
Wilco Dijkstra

Say again? Especially the part about the low end.

On the lowest of the low-end, 4-bit processors?

Reply to
Everett M. Greene

I was precisely using a scale between 0 and 10 to avoid black/white classifications. Scores between 4 and 6 are in a grey area indeed. MSP430 and CF/68K score well below 4, so they are clearly CISCs irrespective of what the marketing departments claim.

Yes, on CF only MOVE can have 2 memory operands. But almost all ALU operations can still read/modify/write memory.

Yes. I'm not sure why they didn't remove the 32-bit operations too; if they had, it would definitely simplify hardware. The removal of 16-bit memory operations has little effect otherwise (they could have kept them for better 68K compatibility).

I think the RISC-style code would run pretty well on 68K, especially on the later implementations. Compilers have improved a lot since those early days, and keeping variables cached in registers is pretty much essential nowadays. So while it made sense to use complex instructions at the time on the 68000, it probably doesn't anymore.

The difficult case is when source and destination are not aligned. A good memcpy never uses byte copy, not even in this case. Unaligned accesses allow this case to be sped up to almost the same speed as word aligned copies (only one of the pointers is unaligned).
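The shift-and-merge trick behind that can be sketched in C. This is an illustrative sketch, not code from any particular memcpy: it assumes a little-endian machine, a word-aligned destination, a source misaligned by 1-3 bytes, and it reads up to 3 bytes past the end of the source (a real implementation guards against that).

```c
#include <stdint.h>
#include <stddef.h>

/* Copy nwords 32-bit words to a word-aligned dst from an unaligned src.
   All loads are word-aligned: we read from src's word-aligned base and
   splice neighbouring words together with shifts, so the inner loop does
   one aligned load per word stored instead of four byte loads.
   Precondition: src is misaligned (1-3 bytes), little-endian host. */
void copy_src_unaligned(uint32_t *dst, const uint8_t *src, size_t nwords)
{
    size_t misalign = (uintptr_t)src & 3;              /* 1, 2 or 3 */
    const uint32_t *s = (const uint32_t *)(src - misalign);
    unsigned lo = 8 * misalign;                        /* bits taken from prev */
    unsigned hi = 32 - lo;                             /* bits taken from next */
    uint32_t prev = *s++;                              /* prime with first word */

    for (size_t i = 0; i < nwords; i++) {
        uint32_t next = *s++;                          /* may over-read at end */
        /* little-endian splice: top bytes of prev, bottom bytes of next */
        dst[i] = (prev >> lo) | (next << hi);
        prev = next;
    }
}
```

With both the priming load and one load per iteration, the loop runs at nearly the speed of a fully aligned word copy, which is the point being made above.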

The mask and rotate are not very complex. Many RISCs have similar operations, including bitfield insert and extract, and execute them in a single cycle. The same is true for ARM's shift and ALU instructions.
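For illustration, the bitfield extract and insert operations meant here reduce to a mask and a shift; the names and details below are mine, not from the post, and width is assumed to be between 1 and 31.

```c
#include <stdint.h>

/* Extract 'width' bits starting at bit 'lsb' (what ARM's UBFX does
   in a single instruction). Precondition: 0 < width < 32. */
static inline uint32_t bf_extract(uint32_t x, unsigned lsb, unsigned width)
{
    return (x >> lsb) & ((1u << width) - 1);
}

/* Insert the low 'width' bits of 'val' into x at bit 'lsb' (what
   ARM's BFI does in a single instruction). */
static inline uint32_t bf_insert(uint32_t x, uint32_t val,
                                 unsigned lsb, unsigned width)
{
    uint32_t mask = ((1u << width) - 1) << lsb;
    return (x & ~mask) | ((val << lsb) & mask);
}
```

Each function is a couple of ALU operations, which is why hardware can fold the whole thing into one cycle.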

Load/store multiple is indeed complex, but it is one of the most useful instructions there is. It is perfect for memcpy and for efficiently saving and restoring a large number of registers on function entry/exit at virtually no codesize cost. Some implementations even transfer 2 registers per cycle, thereby doubling memory bandwidth. For Thumb-2 I invented a special variant where you combine 2 load instructions to consecutive addresses into a single instruction.
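A rough C analogue of the memcpy use: each iteration loads four words into locals and stores them back, which is the access pattern a single LDM/STM pair with four registers expresses in just two instructions. A sketch under the assumption of word-aligned buffers and a word count that is a multiple of 4:

```c
#include <stdint.h>
#include <stddef.h>

/* Word-aligned copy, four words per iteration - the pattern one
   LDMIA/STMIA pair with four registers performs on ARM.
   Precondition: nwords is a multiple of 4. */
void copy_words4(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    for (size_t i = 0; i + 4 <= nwords; i += 4) {
        /* "load multiple": four loads into registers */
        uint32_t r0 = src[i],     r1 = src[i + 1];
        uint32_t r2 = src[i + 2], r3 = src[i + 3];
        /* "store multiple": four stores from registers */
        dst[i]     = r0; dst[i + 1] = r1;
        dst[i + 2] = r2; dst[i + 3] = r3;
    }
}
```

The same load-group/store-group shape is what makes the instruction so compact for register save/restore on function entry and exit.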

So their cost/benefit ratio is so good that it's a no brainer. A CPU can treat them as a sequence of loads or stores so it fits fine in a typical RISC pipeline.

What I mean is that micro code is an implementation detail like pipelining. Implementations vary over time depending on the available chip technology. In the early days of RISC, pipelining, caches and no micro code were indeed RISC characteristics. Few would call a CPU RISC today just because it is pipelined or has caches... Nowadays CPUs micro sequence complex instructions rather than micro code.

It is. Most functions are leaf functions, so as long as you don't need the register you avoid having to save/restore it, thus speeding up the call and return. When calling another function you need to save it indeed, but you can save several registers in one go using the load/store multiple instructions. Returning and reloading is done in a single instruction again. At worst (when you don't already need to save some registers) it takes one extra instruction, on average it is a win.

I think you mean CF v2 which has a 4 stage pipeline. It achieves about the same performance as the 3 stage ARM7. However memory instructions are so slow (it's more a "micro coded" than a pipelined implementation) it is better to avoid them altogether.

ColdFire v4 uses a 10 stage pipeline to execute "most" instructions in 1 cycle (I don't think it can do 2 memory accesses per cycle). It is claimed to give similar performance to the 5 stage ARM9.

This clearly shows that ColdFire needs far more pipeline stages than a RISC to get similar performance, while a simpler micro coded implementation has fewer pipeline stages.

If literals aren't shared you break even on codesize on Thumb/Thumb-2. My statistics showed that on ARM literals are shared over 3 times on average (sharing happens across functions within source files), making it a definite win (3 * 48 > 3 * 32 + 32, i.e. 144 vs 128 bits).

Yes. x86 has at most one memory operand while CF/68K need 2. IMO they should either have kept full binary compatibility or removed all of the complex instructions.

Wilco

Reply to
Wilco Dijkstra

How about some popular ones like PIC, 8051 or AVR?

We were talking about 32-bit CPUs. 4-bit is dead-end, not low-end.

Wilco

Reply to
Wilco Dijkstra

Some popular VN ones:

MSP430, 6502, 8080, Z80, 6805, 6811,

--
Grant Edwards                   grante             Yow!  Have my two-tone,
                                  at               1958 Nash METRO brought
                               visi.com            around...
Reply to
Grant Edwards

Both 6805 and 6811 have Harvard variants too. Does that make them a hybrid? Many 8/16-bit MCUs are Harvard because it allows 64K for code and 64K for data without horrible paging tricks. Not sure whether these are popular but some more 8/16-bit Harvards are XA, MAXQ, C166, ST10, Z-8, ...

Wilco

Reply to
Wilco Dijkstra

The ST10's I've looked at have all been Von Neumann. No hint of separate code and data spaces that I can recall.

Robert

--
Posted via a free Usenet account from http://www.teranews.com
Reply to
Robert Adsett

Is the ST10 what the Transputer became? Von Neumann for sure - odd stack-based register set, but certainly not Harvard.

Deep.

Reply to
Deep Reset

No, the transputer just drifted off to that PCB in the sky. They had problems getting the T9000 ready for production and never made it. By the time it would have been ready to ship, the Pentium would have been out.

I worked on a large signal collection system for NSA that was transputer based. They used both the fixed point and the floating point processors along with TI DSPs. I don't know how many of these systems they ever built, but I suspect it was not many. Each system used maybe a hundred total CPUs/DSPs. So there was at least one customer who used these parts.

Reply to
rickman

And where did you find these Harvard variants?

Reply to
Everett M. Greene

I've never seen any except for home-made ones. Since there's an external pin that is asserted for code-fetch operations, you can include that in your chip-select logic and create a harvard architecture externally. If you do that, it's advisable to leave one chunk of ROM space that's also in the data space to make the compiler's job easier.

--
Grant Edwards                   grante             Yow!  .. here I am in 53
                                  at               B.C. and all I want is a
                               visi.com            dill pickle!!
Reply to
Grant Edwards

dcd.com.pl - they sell versions of many CISC CPUs for use in FPGAs. They are typically much faster both in max frequency and CPI.

Wilco

Reply to
Wilco Dijkstra

Look for ST10's with a MAC unit - they are sometimes called ST10-DSP.

At university we had a 16-way T-800 system. Although I love the Transputer idea (and 16 cores at the time were very impressive), each Transputer was pretty slow. One of the demos was a parallel Mandelbrot calculation which was far from impressive considering I could achieve similar speeds using one 8MHz ARM2. It's a shame really, if they'd used a faster RISC they might have been more successful...

Wilco

Reply to
Wilco Dijkstra

I just did. Again no hint of separate code and data space. In fact the memory organization sheet of the ST10F269 makes a point of it being a unified memory space.

I'm kind of curious about the C166 as well, since the ST10/C167 are just grown-up versions of the C166. I didn't think they had separate code and data spaces, and the quick look I took didn't find any mention of it. I could easily have overlooked it though, and I haven't used them.

Robert

Reply to
Robert Adsett

See [formatting link]. It can do 1 code and 3 data accesses per cycle - necessary to get good DSP performance.

The C166 can do 1 code and 2 data accesses every 2 cycles to the internal ROM/flash and dual ported RAM. It's sluggish - multiply takes 10 cycles! XC166 is a proper DSP with 1-cycle MACs.

So they're all Harvards with multiple memories and a unified memory space.

Wilco

Reply to
Wilco Dijkstra

I see what you are getting at, although it does seem limited to the versions with the DSP added. It's also not what I would have considered a Harvard architecture, since there is no separation of address spaces, 'just' multiple paths into the same address space.

I always thought of Harvard architecture as splitting address space not a bandwidth optimization. Am I wrong in that?

Robert

Reply to
Robert Adsett

That's my definition as well.

I've known one or two people who are of the opinion that multiple busses means "Harvard" even though there's only one address space. I think they're wrong.

--
Grant Edwards                   grante             Yow!  MMM-MM!! So THIS is
                                  at               BIO-NEBULATION!
                               visi.com
Reply to
Grant Edwards

The classic Harvard architecture has separate instruction and data buses accessing different memories and never the twain shall meet. The AVR is an example with its 16-bit instruction bus and 8-bit data bus. Most of the Harvard architecture processors have a quantum tunneling mechanism for accessing the instruction memory as data to circumvent one of the drawbacks of the architecture.

The use of extra signals to select memory components is a mechanism for getting another bit or so of addressing range but does not in any way convert a von Neumann architecture to Harvard.

Reply to
Everett M. Greene

So you're saying it has nothing to do with the number of address spaces?

Something like a TMS320C4x with a single address space but 3 different physical memory busses is a Harvard architecture?

--
Grant Edwards                   grante             Yow!  I'm having
                                  at               a tax-deductible
                               visi.com            experience! I need an
                                                   energy crunch!!
Reply to
Grant Edwards

Indeed. The original goal was to allow simultaneous access to both code and data memories. How they fit in the memory map is a different matter. The early pure Harvards didn't allow data access to instruction memory at all, but few (if any) of these exist today.

Having 2 or more address spaces doesn't make a CPU a Harvard. Many CISCs have a separate IO space for example.

Yes. Most Harvards have a single address space. Without that you'd need special instructions to access the different address spaces (e.g. LPM on AVR, MOVC on 8051). Not good for C...

Wilco

Reply to
Wilco Dijkstra
