Pipelined 6502/Z80 with cache and 16x clock multiplier

EETimes had an interesting article asking if 4 bits is dead.

formatting link

These chips have been pad limited for 2 decades, and as such are probably manufactured at fabs that are beyond obsolete.

You could take a public CPU design like OpenRISC and replace the instruction decoder, and get an easy ~4x performance jump running 65c802/65c816 code.

Compatibility with the Apple// disk controller would be poor. ;) But let's ignore that for the moment; Apple made some workarounds for the Apple IIgs, so that can be fixed.

Step 2 would be to add a boot loader to set the cache modes up correctly for your memory spaces, so that everything is not write-through. That will get you another ~2x speedup.

One should be able to do at least a 16x local clock multiplier, especially if the base clock is a pathetic 2 MHz. That will get you another ~8x speedup.

The end result will be a ~quarter of the speed of the native OpenRISC opcodes, due to being register starved, but close enough not to matter? Compatibility is important.
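A quick sanity check on how those estimates stack up (my arithmetic, using the rough figures above; one plausible reading is that the ~4x register-starvation penalty applies on top of the combined gains):

```python
# Back-of-envelope combination of the claimed speedups (estimates from
# the post above, not measurements): new decoder ~4x, proper cache
# modes ~2x, the 16x clock multiplier contributing a further ~8x, then
# a ~4x penalty for emulating a register-starved 65xx ISA on OpenRISC.
decoder_gain = 4.0      # OpenRISC core with a 65xx instruction decoder
cache_gain = 2.0        # write-back instead of write-through
clock_gain = 8.0        # from the 16x local clock multiplier
register_penalty = 4.0  # "~quarter of the speed of native opcodes"

net_speedup = decoder_gain * cache_gain * clock_gain / register_penalty
print(net_speedup)  # 16.0 -- same ballpark as the "~10x faster" claim
```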

The end result would still be pad limited and tiny, and made at a still obsolete if newer fab.

Is there a market for a 6502 era CPU that ran ~10x faster at ~10% more cost?

I think I just described the AVR8 family of CPUs, so the answer would be yes...

Brett

- Merry Christmas

Reply to
Brett Davis

Perhaps, but I don't think that is the target market for 4-bit processors. AFAIR the market for 4-bit processors is in the really low-requirements consumer areas, such as remote controls for TVs and things like microwave ovens, clothes washers, etc. As such, speed isn't an issue, but low cost and perhaps low power (i.e. longer battery life) is. So the market you are talking about may very well exist, but that doesn't mean it will replace the 4-bit market.

--
  - Stephen Fuld
(e-mail address disguised to prevent spam)
Reply to
Stephen Fuld

I no longer read EETimes... are they considering the 6502 and Z80 "4 bit" parts because they have 4-bit ALUs within? Or are they restricting their "4 bit" handle to things like the 4004?

Or anything else that uses timing loops or implicitly counts on execution speed wrt peripheral interfaces, etc. (e.g., some devices have internal synchronizers which won't run faster just because the processor wants to run faster -- "recovery times").

The military, IIRC, still uses 6502's in some weapon systems (?)

Rabbit tried this approach with the Z80. But, decided to make something that wasn't 100% compatible with the original Z80. At least the Z180 devices didn't suffer this fate.

Reply to
D Yuniskis

People keep saying this, and never mind that it is totally untrue. Many 4-bit applications are speaking toys, which have a huge (multi-megabit) mask ROM array on-chip. Yes, the processor core would be pad-limited if it was made on a die by itself. But it never is.

Reply to
larwe

I would add that anyone working on a 4-bit implementation probably wouldn't be discussing it on comp.arch.embedded.

Reply to
Jim Stewart

They are pad limited only for the app. Why build/bond more pads than necessary? You can build the exact number of pads in an ASIC.

The article never mentions the 65XX as 4-bit.

Who cares about Apple compatibility? Not even Apple.

Reply to
linnix

> http://www.eetimes.com/discussion/other/4211452/Is-4-Bits-Dead-

In this context pad-limited most probably means that the die size is decided strictly by the number of pads one has to put around the die, even if significant core area is left unused. Of course there are technologies where one doesn't need to put the pads around the die and they can be placed inside, but the cost of such a technique would probably be too high for the applications mentioned.

--
Muzaffer Kal

DSPIA INC.
ASIC/FPGA Design Services

http://www.dspia.com
Reply to
Muzaffer Kal

Pads around the die could be small, allowing many of them (~500). For example, some LCD and OLED controllers are very thin and long, in order to have as many pads as possible. So, if the app requires it, it could be done.

Reply to
linnix

I looked at the Rabbit CPUs; it's quite a nice upgrade that makes using C code for it viable.

It's not pipelined: 2 cycles per opcode byte and per data byte.

Something I would have expected ~6 years after the Z80 came out, not ~26 years. ;)

The 20-year RISC fad has caused designs at the low end to needlessly fall behind the times.

Brett

Reply to
Brett Davis

I didn't see that it bought you anything "appreciable". I.e., if you aren't going to make a "100% compatible" device, then why not come up with an entirely different design (instead of reheating one that was decades old)?

Most of its "improvements" come from a full 8-bit ALU (the original Z80 had a 4-bit ALU, so it had to push things through twice).

You can blame that on Zilog's unbelievable ineptitude. They announced many "nice" (potentially) successors to the Z80 but failed to deliver on any of them. The company had the market in its hands to *lose* -- and promptly *did* exactly that!

Nowadays, you can take many of these low end design cores and roll your own processor. E.g., I think there is an "open core" version of the Z80 (T80?). Go play with it! :>

Reply to
D Yuniskis

I guess they wanted an almost-but-not-quite Z80 to run their almost-but-not-quite C :-)

The 6502 was a pipelined design - IIRC it overlapped at least part of each instruction's fetch with the previous instruction's execution, while competing designs (like the Z80) were entirely non-pipelined. This was one of the reasons that the Acorn BBC Micros were faster than many other home micros of that era.

Reply to
David Brown

Well, "6502 faster than Z80" is only valid for small values of "faster"; are we measuring a tight loop around a NOP, or an actual useful function?

Of course I did love the Beeb.

Reply to
larwe

It also depends a lot on how you normalize "operating conditions" (same clock frequency? same memory access time? same set of available I/O's? etc.).

6502/68xx vs 808x/Zx80 typified the early split between processor designs. Memory mapped vs. dedicated I/O space; interrupt handling; "single accumulator" vs. register file; etc.

I had a friend who worked in the 68xx camp while I was dealing with 808x's... watching each other write code was almost an "anxious" event -- wondering what was going to happen next. E.g., I would plan ahead so everything I needed ended up in registers; he would grab what he needed *as* he needed it...

Then TI came along suffering some major hallucinations with their 99xx(x)'s... :-/ (clean idea but technology went a different way).

(sigh) It's too bad how *few* designs we have now to choose from.

Reply to
D Yuniskis

Note that I didn't say the "6502 is faster than the Z80". I said the BBC Micro was faster than most contemporary home micros - several of which happened to be Z80-based (such as the ZX Spectrum). There were many reasons for this. The fact that the 6502 was pipelined, which made it significantly faster than you would expect from an 8-bit register-based processor at a slower clock rate than the Z80 (2 MHz vs. 3.5 MHz, IIRC), was only one of the reasons.

Looking purely at the cpu, the 6502 was fast for some things and slow for others. It had fast zero-page access - if you could hold your important data there, code size was small and speed high. But if you needed to do a lot of data movement or 16-bit arithmetic, it was a lot slower than the Z80 (which was partly 16-bit).

Reply to
David Brown

When you place pads around the die, you can pack them a lot closer together (staggered, minimal pitch 40µm) than when you place them on top of the die for bumps (minimal pitch 250µm, but now it's a 2D mesh, not just pads around the edges).
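As a rough illustration of those two pitches (my sketch; the 5 mm die size is a made-up example, and real pad rings have corner keep-outs and other constraints):

```python
# Rough pad-count comparison for a hypothetical square die, using the
# pitches quoted above: 40µm peripheral pad pitch vs. a 250µm area
# array of bumps. Purely illustrative geometry.
die_mm = 5.0  # assumed 5 mm x 5 mm die

peripheral = int(4 * die_mm * 1000 / 40)    # single row of edge pads
area_array = int(die_mm * 1000 / 250) ** 2  # bumps over the whole face

print(peripheral, area_array)  # 500 vs 400
```

So even a single peripheral row beats the area array at this die size, and staggering into a double row can roughly double the peripheral count.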

IMHO, there's absolutely no point - with actually still useful processes - in putting a CPU on a chip without the memory. When you add the memory, how many bits your CPU uses doesn't matter that much - it's more about how complex your CPU is. My b16 is small enough that there is really little total area benefit from making it even smaller (and it's 16-bit). The main contributor to the area on my projects is the actual memory - and there, what you want is a compact program, not a low-bitsize CPU (where the program is significantly larger to achieve the same thing, since more instructions are necessary).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
Reply to
Bernd Paysan

On Dec 19, 11:12 pm, Stephen Fuld wrote: [snip]

Performance does seem to be unimportant. Note also that low power can also exploit energy harvesting.

It is surprising to me, however, that a mask-configurable 8-bit or 4-bit processor is not implemented (or would dynamic selection be more appropriate/sufficiently power-efficient? or fuse-configurable?), given that the registers/ALU/etc. (or even the entire processor core) take up a small fraction of the chip area. (At larger bit widths, providing a half-width double-threaded mode could be useful.)

It does seem unlikely to me that a 4-bit processor makes sense: I strongly suspect that the area savings relative to an 8-bit processor are not significant (when ROM, RAM and peripherals take up most of the die space), and it seems likely that memory accesses would consume a significant fraction of the power, reducing the impact of a wider processor. If the processor tends to be either fully active or fully off, then differences in leakage (static) power would not be significant either, it seems. (For very simple processors, would wave pipelining and asynchronous methods be attractive?)

The article linked to by the mentioned article (formatting link) states that "EM Microelectronics . . . approaches a developer and works to demonstrate how the 4-bit device can provide differentiation to the developer's design and end product." Perhaps I am being excessively cynical, but this sounds like the technique of some software vendors--approach less-technically-knowledgeable managers to make the sale. I admit that being limited to a small number of products by ROM mask validation cost constrains the methods available to market the product, but it seems that a bidding process would also work and still be open.

Having recently read Stanley Mazor's "The History of the Microcomputer - Invention and Evolution" (formatting link), it seems that, like the 4004 ("Dynamic RAM memory cells were also used inside the CPU for the 64-b index register array and 48-b program counter/stack array."), some parts of these ultra-low-power processors could be implemented in DRAM, especially for data that does not need to persist between active periods and is accessed regularly during active periods (e.g., the PC might be loaded from a vector table). (For such specialized processors, it seems that it might be reasonable for at least some interrupts to load values from a local ROM table into some registers.)

(Xuejun Yang, Nathan Cooprider, John Regehr, "Eliminating the Call Stack to Save RAM" proposed putting return addresses into ROM to reduce RAM requirements, which might also be useful for odd architectures with a tiny return address stack [the paper also proposed allocating local variables into the global variable section, using global liveness analysis to minimize memory usage].)

Paul A. Clayton just a technophile

Reply to
Paul A. Clayton

The original Z80 was 2.5 MHz. IIRC, the early Z80 machines fared rather poorly in a number of comparisons against the Apple's 1 MHz 6502.

Of course, the 6502 needed some clever coding to beat the Z80 in a general mix of tasks (see below). And when Zilog introduced the 4 MHz part, the Z80 became faster in general.

[At least until the 65816 came along ... it took a 16 MHz Z80 to best a 3 MHz 65816 at most tasks. The 8 MHz 65816 was an even match to a (real mode) 10 MHz 80286 at many tasks. The 286 was faster at software FP arithmetic and, of course, offered protected-mode multitasking. The 816 could also multitask, but the implementation was (usually) more complex due to the hardware stack being restricted to the first 64KB of memory. And, of course, there was no memory protection.]

Yes, the 8080/Z80 had a number of dual 8/16-bit registers ... but IIRC 16-bit arithmetic took 2 extra cycles.

The 6502 had only 8-bit registers. To do multi-byte arithmetic quickly, the data had to be in the zero page - addresses 00h..FFh - for which the 6502 had a special 3-cycle address mode vs. 4..6 cycles for accessing a general 16-bit address, depending on the index mode.
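To make that concrete, here is a rough cycle tally (my sketch, not from the post above) for a 16-bit memory-to-memory add, using the standard published cycle counts:

```python
# 16-bit memory-to-memory add on the 6502, using published cycle
# counts: CLC = 2 cycles; LDA/ADC/STA are 3 cycles each in zero page
# and 4 cycles each with absolute addressing. (For comparison, the
# Z80's ADD HL,DE is 11 T-states, with operands already in registers.)
clc = 2
zp_add16  = clc + 6 * 3   # CLC; LDA/ADC/STA low byte; LDA/ADC/STA high
abs_add16 = clc + 6 * 4   # the same sequence with absolute addresses

print(zp_add16, abs_add16)  # 20 vs 26 cycles
```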

George

Reply to
George Neuner

The Apple ][ disk controller has precisely clocked subroutines that require the 6502 to run at 1MHz. The Apple //gs forced the 65816 into 6502 "compatibility mode" whenever it accessed addresses in E00000h..E1FFFFh (where the bus slots, video and Apple ][ compatible devices lived).

The 65816 brought out address valid lines for cache and DMA implementations. The stock Apple //gs had DMA and the various accelerator boards for it added cache to the CPU.

I have no idea whether 6502 compatible chips are in demand for any purpose.

Even so, I wouldn't bother with the 6502; rather I would implement the 65816 (or the 65802 if you need 6502 pin compatibility). The 658xx ISA is a superset of the 6502 that is cleaner and easier to work with, even for 8-bit code.

George

Reply to
George Neuner

At least until a year or two ago, Sunplus had a range of chips that were 6502 core with some restrictions (IIRC no Y register and maybe a couple of other oddities). Winbond and a couple of others also use 6502 or 65816 cores in their toy chips. I guess it depends on whether you already have proprietary dev tools (for compiling proprietary languages, building audio projects, assembling LCD data, etc.) that target the 6502.
Reply to
larwe

Wasn't the 2A03 also a 6502 derivative?

Reply to
D Yuniskis
