Clearly some explanation required...
The original design did in fact have a clean simple 64k block of RAM, where no page was special, and yes, it was hung off the address/data buses and life was simple. The timing, however, was not. The original 6502 had a 2-phase clock, in which it could present the address at the beginning of the clock cycle, and the data would be on the data-bus ready for use at the end of the same clock cycle. In this way, instructions that took two memory accesses {read opcode, read argument} could be retired in just 2 clocks.
I don't have that luxury :) I wanted everything to be simple synchronous logic that presented the data on clock N and the result was available at clock N+1. In fact I wanted it even simpler. I wanted to keep the internal logic as simple as possible as well, making everything synced to always @ (posedge(clk)), which means that after a decode operation, it takes 3 clocks to read the data ([abus register. I'm already running the CPU internal state at 4x the nominal 'cpu clock' to get the clock-cycle accuracy I need for the instructions that the 6502 took 2 clocks to process, and I'm inserting wait states to sync up the longer (up to 7, so 28 clocks in my world) ones. Once the basic system is in place, I'll allow that to optionally relax, and I can run it in cycle-accurate or "turbo" mode. Perhaps I'll have a "turbo" button (showing my age, here).
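As a rough sanity check on the clock arithmetic above, here is a small Python model. It is not part of the Verilog, and the function names are made up for illustration. It assumes exactly 4 internal clocks per nominal 6502 cycle, as described, and treats `clocks_used` as a hypothetical count of internal clocks the synchronous logic actually needs for an instruction.

```python
# From the post: the internal state runs at 4x the nominal 'cpu clock'.
INTERNAL_CLOCKS_PER_CPU_CYCLE = 4


def internal_clock_budget(cpu_cycles: int) -> int:
    """Internal clocks available for an instruction the 6502 retires in cpu_cycles."""
    return cpu_cycles * INTERNAL_CLOCKS_PER_CPU_CYCLE


def wait_states(cpu_cycles: int, clocks_used: int, turbo: bool = False) -> int:
    """Idle internal clocks to insert so the instruction ends on the original boundary.

    clocks_used is a hypothetical figure: how many internal clocks the
    synchronous logic really needs. In "turbo" mode no padding is added.
    """
    if turbo:
        return 0
    budget = internal_clock_budget(cpu_cycles)
    if clocks_used > budget:
        raise ValueError("instruction does not fit in its cycle-accurate budget")
    return budget - clocks_used


if __name__ == "__main__":
    # A 2-cycle instruction gets 8 internal clocks, a 7-cycle one gets 28.
    for cycles in (2, 7):
        print(cycles, "cpu cycles ->", internal_clock_budget(cycles), "internal clocks")
```

So a 7-cycle instruction that only needs, say, 20 internal clocks would be padded with 8 wait states in cycle-accurate mode and none in "turbo" mode.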
My solution to the 2-cycle instructions was to declare 2 pages-worth of registers: page-0, (which is special for the 6502, with special opcodes that take less time to run if they access there) and the stack (which is page-1). The 6502 has an 8-bit stack-pointer, that it always prepends 01h to (to form 16'h01xx), providing a 256-deep stack. The use of a register array for both these pages significantly helps when I only have 2 clocks to play with. Obviously when the CPU wants to store or read values, I need to determine if it's page-0 or page-1 and redirect accordingly, but that's not a high price to pay.
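To make the redirect concrete, here is a minimal Python sketch of the idea. It is only a behavioral model, not the Verilog itself: the `Memory` class, its field names and `stack_address` are all hypothetical, and it assumes the only decision needed is the top byte of the 16-bit address.

```python
class Memory:
    """Behavioral model: pages 0 and 1 live in separate 'register arrays'."""

    def __init__(self):
        self.zero_page = bytearray(256)   # page-0 register array
        self.stack_page = bytearray(256)  # page-1 (stack) register array
        self.ram = bytearray(0x10000)     # main RAM; pages 0 and 1 stay unused here

    def _route(self, addr):
        """Pick the backing store and index from the high byte of the address."""
        addr &= 0xFFFF
        page, offset = addr >> 8, addr & 0xFF
        if page == 0x00:
            return self.zero_page, offset
        if page == 0x01:
            return self.stack_page, offset
        return self.ram, addr

    def read(self, addr):
        store, index = self._route(addr)
        return store[index]

    def write(self, addr, value):
        store, index = self._route(addr)
        store[index] = value & 0xFF  # the data bus is 8 bits wide


def stack_address(sp):
    """Prepend 01h to the 8-bit stack pointer, giving an address of the form 01xx."""
    return 0x0100 | (sp & 0xFF)


if __name__ == "__main__":
    mem = Memory()
    mem.write(stack_address(0xFD), 0xAB)  # lands in the stack register array
    print(hex(stack_address(0xFD)), hex(mem.read(0x01FD)))
```

The rest of the CPU still sees one flat 64k address space; only the read/write path knows that two of the pages are backed by registers.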
"Relatively complete" is an interesting term. I have a CPU that will execut e (at least in simulation [grin]) all the opcodes I have implemented - it d oes the decode, figures out the addressing mode of the instruction (up to 8 of them), processes the result, and updates the {memory, registers, proces sor-state-flags, stack} accordingly. The issue is that I've implemented abo ut 1/3 of the opcodes right now... So, the answer is "it depends on what yo u mean" :)
se the instruction decode logic is not implemented. This can
optimized because the output is never gated into the next register.
The full code is actually available (see link above, or go to http://0x0000ff.com/6502/)
Cheers Simon.