6502 FPGA core

I've implemented a first version of a 6502 core. It has a very simple architecture: first the command is read, and then for every command a list of microcodes is executed, controlled by a state machine. To avoid redundant VHDL typing, the VHDL code is generated with a Lisp program:

formatting link

This is the output:

formatting link
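As a rough illustration of the generator idea (this is not the Lisp program linked above), here is a minimal Python sketch that turns a table of opcodes into a VHDL case statement. The table layout, the micro-step names and the signal names microcode and microcode_count are all made up for this example; only the opcode values $A9 (LDA immediate) and $8D (STA absolute) are real 6502 facts.

```python
# Hypothetical opcode table: opcode -> (description, list of micro-step names).
# The micro-step names are invented for illustration only.
INSTRUCTIONS = {
    0xA9: ("LDA immediate", ["fetch_operand", "load_accu"]),
    0x8D: ("STA absolute", ["fetch_address_low", "fetch_address_high", "store_accu"]),
}


def emit_case(instructions):
    """Return VHDL text with one 'when' branch per opcode in the table."""
    lines = ["case command is"]
    for opcode, (name, steps) in sorted(instructions.items()):
        lines.append(f'  when x"{opcode:02X}" =>  -- {name}')
        for index, step in enumerate(steps):
            # 'microcode' and 'microcode_count' are assumed signal names.
            lines.append(f"    microcode({index}) <= {step};")
        lines.append(f"    microcode_count <= {len(steps)};")
    lines.append("  when others => null;")
    lines.append("end case;")
    return "\n".join(lines)


vhdl_text = emit_case(INSTRUCTIONS)
print(vhdl_text)
```

Adding an instruction then means adding one table row instead of typing another VHDL branch by hand, which is the redundancy the generator approach is meant to avoid.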

I've tested some instructions, like LDA, and it looks like it works, but I'm sure there are many bugs, and not all features are implemented (e.g. BCD mode or interrupt handling). It uses 2,960 LEs with Quartus 7.1, which is too much compared to the 797 LEs of the T65 project. Any ideas how to improve it? My idea was that the synthesizer would be able to merge the addressing mode implementations for the commands, but maybe this has to be refactored by hand.

My goal is to beat the T65 project in LE usage. Speed and 100% compatibility with the original 6502 (e.g. the strange SO pin and V-flag feature or the original hardware reset vectors) are not important for me, but code compiled with

formatting link
must work.

Most FPGAs have a few kilobytes of memory (>5 kByte, even for inexpensive FPGAs, freely configurable as ROM and RAM), so maybe a good idea would be to store some microcode in memory? What instruction set would be useful for implementing the 6502 instruction set? Maybe a Forth-like microcode?

Any ideas how to improve the Lisp code? I like my idea of using a lambda function in addressing-commands, because this looks cleaner than a macro, which I tried first, but I don't like the explicit call of emit-lines. How can I refactor it to a more DSL-like approach?

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

That's a lot of ground to make up. Is the 'fat' in any one area ? Does the 797LE version have BCD and Interrupts ?

Err, why not use/improve the T65 work ?

-jg

Reply to
Jim Granville

Looks like it has interrupts, but no BCD.

It's more fun to implement it myself :-)

I've started a new version, see below. The VHDL code is now cleaner and it should need very few LEs, but a ROM of maybe 1 kbyte. Every microcode is executed in one clock cycle. I also plan to implement a call/return microcode with a call stack size of 1 address, to help reduce the ROM size (e.g. most addressing modes can be implemented as subroutines). For writing the microcode program and creating the MIF file, I'll write a Lisp program again.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use work.ALL;

entity t_rex_test is
  port(
    clock_50mhz: in std_logic;
    led: out unsigned(7 downto 0);
    button: in unsigned(3 downto 0);
    dip_switch: in unsigned(3 downto 0);
    neg_reset: in std_logic);
end entity t_rex_test;

architecture rtl of t_rex_test is

  -- bit position in microcode for indicating the last
  -- microcode command in a program
  constant mcode_stop_bit : integer:= 7;

  -- type for CPU addresses
  subtype address_type is std_logic_vector(15 downto 0);
  -- type for CPU data words
  subtype data_type is std_logic_vector(7 downto 0);

  -- microcode commands
  constant mcode_load_pc : data_type := x"01";
  constant mcode_store_address : data_type := x"02";

  -- CPU RAM signals
  signal address : address_type;
  signal data : data_type;
  signal q : data_type;
  signal wren : std_logic := '0';

  -- microcode ROM signals
  signal mcode_address : std_logic_vector(8 downto 0);
  signal mcode_q : data_type;
  signal mcode_code : data_type;
  signal mcode_stop : boolean;

  -- scratch register
  signal working : address_type := x"0200";

  -- current command
  signal command : data_type;

  -- CPU registers
  signal pc : address_type := x"0200";
  signal sp : address_type := x"01ff";
  signal accu : data_type;
  signal x : data_type;
  signal y : data_type;
  signal z_flag : std_logic;
  signal n_flag : std_logic;
  signal c_flag : std_logic;
  signal v_flag : std_logic;
  signal i_flag : std_logic;
  signal d_flag : std_logic;

  -- CPU statemachine
  type cpu_state_type is (
    read_command_state,
    wait_for_read_state,
    read_memory_state,
    wait_for_mcode_index,
    read_mcode_index,
    execute_mcode,
    read_mcode
  );
  signal cpu_state : cpu_state_type := read_command_state;

begin

  -- CPU RAM
  instance_ram: entity ram
    port map (
      address => address(11 downto 0),
      clock => clock_50mhz,
      data => data,
      wren => wren,
      q => q
    );

  -- microcode ROM
  instance_microcode: entity microcode
    port map (
      address => mcode_address,
      clock => clock_50mhz,
      data => x"00",
      wren => '0',
      q => mcode_q
    );

  -- read command and execute microcode
  process(clock_50mhz, neg_reset)
  begin
    if neg_reset = '1' then
      pc

Reply to
Frank Buss

OK.

ROM makes sense; pretty much every FPGA these days has these for free, and they should be used more in soft CPU designs.

Let us know how the different approach impacts LE count.

-jg

Reply to
Jim Granville

Nice work Frank!

I haven't looked at your work in detail, but the general idea of doing nostalgic implementations, doing it for fun and doing it in a minimum-resource fashion is my cup of tea. Just one suggestion, which is something I'm working on right now (or... one of the things I'm working on right now): if you are willing to sacrifice a LOT of FMAX to save FPGA resources, maybe an inner CPU with a very simple instruction set could do? By doing this and building the instruction set out of reusable pieces, I think there are resource gains to be had. And if you remember the speeds these hogs ran at in the wild days (1, 2, 4, 8 MHz), maybe similar performance is still OK.

OK, you won't win prizes for minimum power usage, for readability or probably for anything else, but I have a hunch this is the way to achieve MAXIMUM usage of resources. I have been working on implementing a minimal 68K core this way myself. There is still a lot of work left to do, but the current reading of 8% of a Spartan3-200K is quite nice. My goal is to at least get something working in about 20% of a Spartan3-200K.

/Magnus

Reply to
spartan3wiz

Yes, this was my idea. I have enhanced my FPGA implementation to a Forth-like CPU, this is the current version:

formatting link

It has the following microcodes:

call load-pc load-address load-q load-accu load-x load-y store-pc store-address store-data store-accu store-x store-y inc dec add lshift-8 or nop

The load and store commands load and store from the specified register to an internal stack (stack size is configurable). "call" executes a program at the location specified in the next byte. A return is performed if bit 7 is set in the microcode. The rest are instructions needed to make it simpler to implement the 6502 instruction set, e.g. "or" pops the top two values from the stack, does a binary OR and pushes the result back onto the stack.
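To make the description above concrete, here is a minimal Python sketch of such a stack-based microcode interpreter. It is not the Lisp emulator from this thread. The numeric encodings, the ROM layout and the assumption that "load" pushes a register onto the stack while "store" pops into it are guesses for illustration; "load-q" is assumed to push the RAM byte at the current address register. How "store-data" triggers an actual RAM write is not described, so the sketch only sets a data register. Bit 7 on "call" itself is not used here.

```python
# Microcode names from the list above; the numbering is hypothetical.
OPS = ["nop", "call",
       "load-pc", "load-address", "load-q", "load-accu", "load-x", "load-y",
       "store-pc", "store-address", "store-data", "store-accu", "store-x",
       "store-y", "inc", "dec", "add", "lshift-8", "or"]
CODE = {name: number for number, name in enumerate(OPS)}
RETURN_BIT = 0x80  # bit 7 set: return after executing this microcode


def run_microcode(rom, start, regs, ram):
    """Run the microcode program at `start` until its final return."""
    stack, returns, mpc = [], [], start
    while True:
        byte = rom[mpc]
        mpc += 1
        op = OPS[byte & 0x7F]
        if op == "call":
            target = rom[mpc]        # call target is in the next byte
            returns.append(mpc + 1)  # resume after the target byte
            mpc = target
            continue
        if op == "load-q":
            stack.append(ram[regs["address"]])
        elif op.startswith("load-"):
            stack.append(regs[op[5:]])
        elif op.startswith("store-"):
            regs[op[6:]] = stack.pop()
        elif op == "inc":
            stack.append((stack.pop() + 1) & 0xFFFF)
        elif op == "dec":
            stack.append((stack.pop() - 1) & 0xFFFF)
        elif op == "lshift-8":
            stack.append((stack.pop() << 8) & 0xFFFF)
        elif op in ("add", "or"):
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFFFF if op == "add" else a | b)
        # "nop" does nothing
        if byte & RETURN_BIT:
            if not returns:
                return               # top-level program finished
            mpc = returns.pop()


# Subroutine at 0: address := pc, then pc := pc + 1.
FETCH = 0
rom = [CODE["load-pc"], CODE["store-address"], CODE["load-pc"], CODE["inc"],
       CODE["store-pc"] | RETURN_BIT,
       # LDA immediate at 5: fetch the operand, then accu := q.
       CODE["call"], FETCH, CODE["load-q"], CODE["store-accu"] | RETURN_BIT]
LDA_IMMEDIATE = 5

ram = bytearray(0x2000)
ram[0x0200:0x0202] = bytes([0xA9, 42])   # lda #42
regs = {"pc": 0x0201, "address": 0, "data": 0, "accu": 0, "x": 0, "y": 0}
run_microcode(rom, LDA_IMMEDIATE, regs, ram)
print(f"a: {regs['accu']:02X}, pc: {regs['pc']:04X}")
```

Putting the operand fetch into a subroutine is the ROM-saving trick mentioned earlier in the thread: every instruction with an immediate operand can reuse the same five microcodes.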

Testing the microcodes with a simulator is too time-consuming, so I've implemented an emulator in Lisp, which also creates the opcode ROM and the constant list for the microcodes for pasting into the VHDL code:

formatting link

Playing with it is really nice, e.g. this is the output of an interactive session:

CL-USER > (dump #x1f00)

1F00: 00 00 00 00 00 00 00 00
NIL

CL-USER > (execute-command)
current registers:
a: 00, x: 00, y: 00
pc: 0200, mcode-address: 0000,

executing microcode: CALL
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 0119,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011A,

executing microcode: STORE-ADDRESS
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011B,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011C,

executing microcode: INC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011D,

executing microcode: STORE-PC
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 011E,

executing microcode: LOAD-Q
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 012B,

executing microcode: STORE-ACCU
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

NIL

CL-USER > (execute-command)
current registers:
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 010C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 0106,

...

NIL

CL-USER > (dump #x1f00)

1F00: 2A 00 00 00 00 00 00 00
NIL

This was the execution of the following small program, assembled with cc65:

.org $200
lda #42
sta $1f00

Now I can implement and test the rest very fast, because I can add debugging output very easily, implement all addressing modes of one instruction with a higher-level instruction to avoid (Lisp) code duplication etc.

On the FPGA side I only have to simulate the microcodes. If every microcode does what it should do, then all microcode programs should work immediately, because they were tested with my Lisp program before.

I think readability is very good (ok, maybe because I know Forth and Lisp) and power usage should be good, too, because fewer LEs are used. My current Forth FPGA implementation needs 319 LEs (about 5% of the small Cyclone EP1C6Q240C8). But I expect it to be 10 times slower than e.g. the T65, so all in all the cycles per power would not be so good.

Nice. What does your microcode look like? Some instructions are very similar to the 6502, so maybe we can develop the perfect microcode for both :-)

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

[...]

(Followup was set to comp.arch.fpga, but since this is Lisp-related, I've changed followup to c.l.l.)

It seems to me that a more natural way to represent this code in a Lisp program would be to use some form of syntax trees.

For example, the VHDL statement

if q = x"00" then z_flag

Reply to
Sven-Olof Nyström

Ahh nice! I think we have connected brains! :-)

Actually my idea isn't that refined yet. I am taking the long way around the problem. After seeing some of the different retro cores out there (6502, 8051, Z80 and 68K), which I think took just too many resources from my poor small FPGA, I decided that it must be possible to implement something smaller and still have full functionality. The people who have implemented these cores straight in VHDL/Verilog are fantastic people, way beyond my current limit I think, so all cheers to you!

If I/we don't need the 50MHz (or so) that can be achieved in today's standard hobby FPGA-development-cards, maybe the FMAX CAN be, in some way, converted into reused blocks.

I started my 68K project by using the fantastic, already super-optimized Picoblaze by Ken Chapman, which is Xilinx-specific and SMALL:

formatting link
formatting link

By shifting over to, for example the Pacoblaze:

formatting link

I can then make it run on anything and still be quite small!

My work is only half finished and I'm not sure it'll fit into the tight program space, but by taking on hard projects at least you learn something, I think. To fit this into the program space of the Picoblaze I need MAXIMUM reuse of the assembler code, thus crystallizing out the nice reusable parts into assembler subroutines. Maybe even finding reuse in places where it otherwise might be missed..

By using an 8-bit CPU to emulate a 16-bit CPU I can save resources but take a hard performance hit. It is very time-consuming doing these tests and there is lots of stuff left to fix, for example the memory access problem, but I'll keep trying until... well, until I don't feel like it! :-)

Then when I'm finished (and have something working), I have several next step possibilities of which I would like to try all.

1) Just keep the slow small core, making sure it runs on Picoblaze (Xilinx hardware) as well as Pacoblaze (anything..) and does its job.

2) Try removing all unused instructions from the 8-bit CPU's instruction set, thus making it even smaller BUT ruling out possible future upgrades and removing the compatibility of the internal parts (this would only be published as already compiled BIT files, I think..)

3) Try adding extra instructions (by turning the assembler subroutines into new instructions), guided by profiling solution 1) while it runs. By doing this we can find the perfect balance (or several balances) between size and speed, depending on the demands of the target circuit.

4) A combination of 2) and 3)

5) Maybe building something completely new out of the things learned from all the above..

But your thoughts on doing something generic just sounds NICE!

A nice tool that kept me going this far is:

formatting link

/Magnus


Reply to
spartan3wiz

Just checking if you have seen the work of Jan Decaluwe ?

formatting link

If this runs slower, one of my pet ideas for FPGA cores is to design them to run from serial flash memory. Top-end ones (Winbond) run at 150 MBd of link speed, so can feed nearly 20 MB/s of streaming code. Ideally, the core has a short-skip opcode, as a jump in such memory has a higher cost.
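The arithmetic behind those numbers can be sketched in a few lines of Python. The 150 MBd link speed is from the post; the jump overhead (8 command bits, 24 address bits and 8 dummy bits to restart a read at a new address) is an assumption modelled on a typical SPI flash fast-read sequence, and the sketch assumes every overhead bit costs one link bit time.

```python
LINK_BITS_PER_SECOND = 150_000_000  # 150 MBd, as quoted in the post


def stream_bytes_per_second(link_bits_per_second=LINK_BITS_PER_SECOND):
    """Sustained code bandwidth when the flash just keeps streaming."""
    return link_bits_per_second / 8


def jump_overhead_bytes(command_bits=8, address_bits=24, dummy_bits=8):
    """Link time lost on a jump, expressed in streamed-byte equivalents.

    The bit counts are assumptions, not taken from a specific datasheet.
    """
    return (command_bits + address_bits + dummy_bits) / 8


def cheaper_to_skip(distance_bytes, overhead_bytes):
    """True if streaming past the bytes costs no more than a real jump."""
    return distance_bytes <= overhead_bytes


rate = stream_bytes_per_second()
overhead = jump_overhead_bytes()
print(f"stream rate: {rate / 1e6:.2f} MB/s")
print(f"jump overhead: {overhead:.0f} byte times")
print("skip 3 bytes instead of jumping:", cheaper_to_skip(3, overhead))
```

Under these assumptions a forward branch over a handful of bytes is cheaper if the core simply ignores the bytes as they stream past, which is the case a short-skip opcode would cover.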

-jg

Reply to
Jim Granville

Or a "four address instruction" like the Pilot Ace, with SerialFlash in place of a tube full of mercury?

- Brian

Reply to
Brian Drummond


Don't worry too much about speed. You will be amazed how easy it is to optimize uCode as soon as the processor really works. And nobody says that you can have only one execution unit in the system. (I once had something like this with 2.5 execution units, controlling a 36-bit processor (data width).)

But, excellent project !

Reply to
emu

You've lost me ?

-jg

Reply to
Jim Granville

Me too, but this looks relevant

formatting link

Reply to
Tommy Thorn

formatting link

Wow, that's quite impressive. A 1MHz clock, back in 1951!

I had not thought of Serial Data, only Serial code access, as those speeds are getting tolerable, and the pin/pcb savings are massive.

Most FPGAs have some SRAM, and uC projects commonly need less DATA than Code, but it raises a good point: Serial data _could_ also be used, and the Ramtron FRAM devices would be good candidates - up to 64K bytes of Data, in 20MHz SPI. So, you'd set that up on separate pins.

-jg

Reply to
Jim Granville

In some designs of that era, three address instructions were common, source1, source2 and dest, very like the register addresses in a RISC. The innovation here was a fourth address; for the next instruction, coded to appear out of the delay line (or drum memory) just when it was needed. Important because the next location in program memory would have flashed past, and you'd have to wait for the memory's cycle time (or a whole drum revolution) before it came round again.
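A toy Python model shows why the explicit next-instruction address pays off. It assumes a circulating memory in which word n passes the read point once per revolution, a fetch takes one word time, and execution takes a known number of word times; the line length of 32 words and the function names are illustrative assumptions, not details taken from the Pilot ACE documents.

```python
def wait_words(current_slot, next_slot, execute_words, line_length=32):
    """Word times spent idle before `next_slot` passes the read point."""
    ready = current_slot + 1 + execute_words   # fetch takes one word time
    return (next_slot - ready) % line_length


def best_next_slot(current_slot, execute_words, line_length=32):
    """Slot that arrives exactly when the machine is ready again."""
    return (current_slot + 1 + execute_words) % line_length


current, execute = 5, 3
sequential = wait_words(current, current + 1, execute)
optimal_slot = best_next_slot(current, execute)
optimal = wait_words(current, optimal_slot, execute)
print(f"sequential next slot waits {sequential} word times")
print(f"slot {optimal_slot} waits {optimal} word times")
```

With plain sequential placement the next word has already gone by, so the machine idles for almost a full revolution; naming the next address in the instruction lets the programmer place it where the wait is zero. The same reasoning would apply to a core streaming from serial flash, where the "revolution" is the cost of re-addressing the device.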

Apparently it was a headache to hand-code for maximum performance, or "offered great scope for programmer ingenuity" :-) but worthwhile for heavily used code. (I believe it had the first floating point library, coded this way)

But it could still be useful for streaming instructions from serial memory.

- Brian

Reply to
Brian Drummond

formatting link

"it is not thought wise to design for higher speeds than this as yet"

formatting link
(from 1945)

May 1950 according to

formatting link
which has some details. Apparently both code and data, but the "fourth address" was specifically to optimise code location.

Surprisingly small, according to

formatting link

- Brian (wondering how many tubes you can fit in a CLB)

Reply to
Brian Drummond

Somewhere around here I have a (very old) reference manual for the 6502 - one of my all-time favourite processors - that actually listed the instruction decode by bit positions. I'll have to dig it out and amuse myself by writing some code to actually do the decode using straight combinational logic ;)
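For readers who have not seen that kind of table: many 6502 opcodes follow an aaabbbcc bit layout, where cc selects an instruction group, aaa the operation and bbb the addressing mode. The following Python sketch decodes only the cc = 01 group and is a software illustration of the idea, not the combinational logic mentioned above; the function name is made up for this example.

```python
# Operation (aaa) and addressing mode (bbb) tables for the cc = 01 group.
GROUP_ONE_OPS = ["ORA", "AND", "EOR", "ADC", "STA", "LDA", "CMP", "SBC"]
GROUP_ONE_MODES = ["(zp,X)", "zp", "#imm", "abs",
                   "(zp),Y", "zp,X", "abs,Y", "abs,X"]


def decode_group_one(opcode):
    """Split an opcode into aaa/bbb/cc fields; None if not in this group."""
    aaa = (opcode >> 5) & 0b111
    bbb = (opcode >> 2) & 0b111
    cc = opcode & 0b11
    if cc != 0b01:
        return None
    if aaa == 4 and bbb == 2:
        return None  # $89 would be "STA #imm", which the NMOS 6502 lacks
    return GROUP_ONE_OPS[aaa], GROUP_ONE_MODES[bbb]


for opcode in (0xA9, 0x8D, 0x6D, 0xB1):
    print(f"${opcode:02X}:", decode_group_one(opcode))
```

In hardware, each field would simply feed its own small decoder, which is why this regular part of the opcode map is cheap to decode with plain logic.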

Cheers

PeteS

Reply to
PeteS

formatting link
contains the full archive.

Some documents from 1951 contain information on the final machine.

formatting link
"Report on the Pilot Model"
formatting link
"Programming and Coding for the pilot model"
formatting link
appendix describing 1954 modifications, inc. drum store.

The first of these refers to the 1948 "Progress Report"

formatting link
which defines the terms and symbols used.

- Brian

Reply to
Brian Drummond
