Simulation vs Synthesis

Clearly some explanation required...

The original design did in fact have a clean simple 64k block of RAM, where no page was special, and yes, it was hung off the address/data buses and life was simple. The timing, however, was not. The original 6502 had a 2-phase clock, in which it could present the address at the beginning of the clock cycle, and the data would be on the data-bus ready for use at the end of the same clock cycle. In this way, instructions that took two memory accesses {read opcode, read argument} could be retired in just 2 clocks.

I don't have that luxury :) I wanted everything to be simple synchronous logic that presented the data on clock N and the result was available at clock N+1. In fact I wanted it even simpler. I wanted to keep the internal logic as simple as possible as well, making everything synced to always @(posedge clk), which means that after a decode operation, it takes 3 clocks to read the data ([abus register ... I'm already running the CPU internal state at 4x the nominal 'cpu clock' to get the clock-cycle accuracy I need for the instructions that the 6502 took 2 clocks to process, and I'm inserting wait states to sync up the longer (up to 7, so 28 clocks in my world) ones. Once the basic system is in place, I'll allow that to optionally relax, and I can run it in cycle-accurate or "turbo" mode. Perhaps I'll have a "turbo" button (showing my age, here).

My solution to the 2-cycle instructions was to declare 2 pages-worth of registers: page-0 (which is special for the 6502, with special opcodes that take less time to run if they access there) and the stack (which is page-1). The 6502 has an 8-bit stack-pointer that it always prepends 01h to (to form 16'h01xx), providing a 256-deep stack. The use of a register array for both these pages significantly helps when I only have 2 clocks to play with. Obviously when the CPU wants to store or read values, I need to determine if it's page-0 or page-1 and redirect accordingly, but that's not a high price to pay.
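[Editor's sketch: the redirect described above might look something like the following. The names (zp, stack, ram) match ones used later in the thread, but the surrounding structure is illustrative, not Simon's actual code.]

```verilog
// Route a CPU write either to the page-0/page-1 register arrays
// or to main RAM, based on the high byte of the address.
module mem_redirect (
    input  wire        clk,
    input  wire        we,
    input  wire [15:0] addr,
    input  wire [7:0]  wdata
);
    reg [7:0] zp    [0:255];    // page-0 as a register array
    reg [7:0] stack [0:255];    // page-1 (the stack) likewise
    reg [7:0] ram   [0:65535];  // everything else

    always @(posedge clk)
        if (we) begin
            if (addr[15:8] == 8'h00)
                zp[addr[7:0]] <= wdata;      // page-0 hit
            else if (addr[15:8] == 8'h01)
                stack[addr[7:0]] <= wdata;   // page-1 (stack) hit
            else
                ram[addr] <= wdata;
        end
endmodule
```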

"Relatively complete" is an interesting term. I have a CPU that will execut e (at least in simulation [grin]) all the opcodes I have implemented - it d oes the decode, figures out the addressing mode of the instruction (up to 8 of them), processes the result, and updates the {memory, registers, proces sor-state-flags, stack} accordingly. The issue is that I've implemented abo ut 1/3 of the opcodes right now... So, the answer is "it depends on what yo u mean" :)

...because the instruction decode logic is not implemented. This can be

optimized because the output is never gated into the next register.

The full code is actually available (see link above, or go to http://0x0000ff.com/6502/)

Cheers Simon.

Reply to
Simon

At this stage, it's probably OK to simply trust synthesis until the design is largely complete. If your simulation tests are thorough enough, that's what matters.

You can mess with a temporary framework of attributes to preserve signals, but IMO it's wasted time and effort, especially since what's "preserve"d through synthesis can still be trimmed by the mapper, so you might have to push the rope a little harder.
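[Editor's note: for reference, the preservation attributes Brian mentions are tool-specific. A couple of common spellings, as examples only - check your own tool's documentation:]

```verilog
// Tool-specific attribute spellings vary between vendors.
(* keep = "true" *) wire debug_tap;   // Xilinx Vivado / XST
(* preserve *)      reg  debug_ff;    // Intel/Altera Quartus
```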

You can possibly stub out blocks (containing some dummy observable, like an OR gate) and fill them in later.

But I'd probably press ahead with proving the design in simulation until there was enough to be worth synthesis.

-- Brian

Reply to
Brian Drummond

Is this because the 6502's stack pointer is only 8 bits long? It can only address 256 bytes of RAM, so bits 8-31 /cannot/ be used.

From

formatting link

Stack Pointer
-------------

When the microprocessor executes a JSR (Jump to SubRoutine) instruction it needs to know where to return when finished. The 6502 keeps this information in low memory from $0100 to $01FF and uses the stack pointer as an offset. The stack grows down from $01FF and makes it possible to nest subroutines up to 128 levels deep. Not a problem in most cases.

Reply to
Tom Gardner

Sorry, I didn't really explain the 32-bit newSPData register, did I? In my defence, my 3-year-old was clamouring for his evening meal, and his mother was busy :)

What I'd been trying to do was split up the code into separate areas by module, so generally speaking:

- there's a module ("decode.v") which takes in raw opcodes and outputs the instruction type, and the addressing mode of the opcode (one of {accumulator, immediate, relative, absolute, zero-page, absolute-indexed-x, absolute-indexed-y, zero-page-indexed-x, zero-page-indexed-y, indirect, indirect-x, indirect-y})

- there's a module ("execute.v") that handles doing the actual work of each opcode, placing the results in intermediate registers (output ports of the module)

- and there's an overall harness-it-all-together module ("cpu_6502" in 6502.v) which instantiates the above

The stack and page-zero are special (as mentioned before) for speed reasons, and I don't know of a way to share an array of registers between modules. For zero-page this isn't an issue; there's only one byte to write, and it can be passed back as 'storeValue' with an 'action' of {UPDATE_A, UPDATE_X, UPDATE_Y} and that byte will be placed in the correct processor register based on the action.

For the stack, though, I need to pass back (so far) up to 3 bytes of data. The BRK instruction simulates an interrupt, pushing (in order) {PC high-byte, PC low-byte, Processor-status-flags} onto the stack, then reading the 2 bytes at the interrupt vector {16'hFFFE,16'hFFFF}, and setting the contents of those two bytes into PC.

The 32-bit (I went for 4 bytes, not 3; if it needs to be changed, I can do so later) 'newSPData' register, combined with the 2-bit count 'numSPBytes', is how I implemented passing back the bytes from "execute" to the overall module to update the array of registers that constitute my stack. The "execute" module can pass back up to 4 bytes, and the overall module ("cpu_6502") that actually contains the stack register array will do the right thing, based on 'numSPBytes', if the 'action' contains the bit 'UPDATE_SP'. This is all done in the `EXECUTE stage of the overall module in 6502.v
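[Editor's sketch of how that hand-off might look. 'newSPData' and 'numSPBytes' are Simon's names; everything else here is a guess at the structure, and the encoding of the 4-byte case is not described in the thread, so it is omitted.]

```verilog
// In cpu_6502's EXECUTE stage: commit the bytes that execute.v
// handed back, pushing them onto the stack register array.
module sp_commit (
    input  wire        clk,
    input  wire        do_update_sp,   // stands in for (action & UPDATE_SP)
    input  wire [1:0]  numSPBytes,
    input  wire [31:0] newSPData
);
    reg [7:0] stack [0:255];
    reg [7:0] sp = 8'hFF;

    always @(posedge clk)
        if (do_update_sp) begin
            // up to 3 bytes committed in a single cycle
            if (numSPBytes >= 2'd1) stack[sp        ] <= newSPData[ 7: 0];
            if (numSPBytes >= 2'd2) stack[sp - 8'd1 ] <= newSPData[15: 8];
            if (numSPBytes >= 2'd3) stack[sp - 8'd2 ] <= newSPData[23:16];
            sp <= sp - {6'b0, numSPBytes};   // the 6502 stack grows down
        end
endmodule
```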

The addition I made last night was to expose the 255th byte of the stack as an external top-level port (the stack grows downwards, so this is the first byte of the stack) and synthesise. The lower 8 bits of the 'newSPData' register are those that would be inserted into the first position on the stack, and indeed those lower 8 bits were not optimised away. From this, I conclude it is the stack being optimised away that is the root cause of my warning messages.
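[Editor's sketch of the trick described above - exposing an internal register through a top-level port so synthesis has a reader for it. Names are illustrative.]

```verilog
// A top-level output gives synthesis a reader for stack[255],
// so at least that byte cannot be trimmed away.
module stack_probe (
    input  wire       clk,
    input  wire       push,
    input  wire [7:0] data,
    output wire [7:0] dbg_stack_top   // temporary observation port
);
    reg [7:0] stack [0:255];
    reg [7:0] sp = 8'hFF;

    always @(posedge clk)
        if (push) begin
            stack[sp] <= data;
            sp <= sp - 8'd1;          // stack grows downwards
        end

    assign dbg_stack_top = stack[255]; // first byte ever pushed
endmodule
```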

'stack' was otherwise totally internal to the "cpu_6502" module, and although it had a writer (the `EXECUTE stage can write up to 3 bytes to it), there is currently no reader for those registers. I'm up to 'EOR' (the 6502 uses EOR for what the rest of the world calls XOR), and the first instruction to implement reading the stack is 'PLA'. I might try jumping ahead to implementing that instruction rather than going strictly alphabetically (I didn't want to miss one :)

Hope that clears things up a little.

Cheers Simon

Reply to
Simon

Thanks :)

As I mentioned just above, I might jump ahead and implement PLA (which will force a *read* of the stack values rather than just the current writes) and see if that has an effect.

The simulation tests at the moment are me going through (for every opcode) (for every addressing mode) ...

- Check the decoding
- Check the timing (varies based on addressing mode)
- Check the results

... in the simulator using the waveforms. It is, however, getting to the point where writing a formal test of each of the above would start to become beneficial. I want to make sure that any later additions don't affect any previous results. It's effort to do so, and my time is limited, but it will actually save time in the long run.

Cheers Simon

Reply to
Simon

That's what suites of test benches are for. The software world has triumphantly reinvented the concept and called them "unit tests".

It is normal to have a hierarchy of test suites. Some can be run frequently because they are a fast "sanity check" that just tests simple externally observable behaviour of whatever unit is being tested. Some tests are run at major points in the design because they test the internal operation in detail, and hence are slow.
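[Editor's note: in Verilog terms, the fast "sanity check" tier can be as small as a self-checking testbench. A generic sketch, not tied to Simon's module names:]

```verilog
// Minimal self-checking testbench: drive one case, assert one result.
module tb_sanity;
    reg  [7:0] a = 8'h21, b = 8'h21;
    wire [7:0] y;

    // unit under test: an XOR (EOR, in 6502-speak) of two bytes
    assign y = a ^ b;

    initial begin
        #1;
        if (y !== 8'h00) begin
            $display("FAIL: expected 00, got %02x", y);
            $finish;
        end
        $display("PASS");
        $finish;
    end
endmodule
```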

Reply to
Tom Gardner


[grin] I'm well aware what unit tests are for; I've written a *lot* of them in my day job over the last few decades, although admittedly not in verilog :) The problem is not the lack of knowledge (for once), it's the will to sit down and do something that doesn't seemingly advance the project... It's a lot more fun to write code than to write code that tests code...

As I said though, it is getting to the point (in all honesty, it's way past the point) where manual checking of things like this is no longer viable. Unit tests feature in my future ...

Cheers Simon.

Reply to
Simon

:)

Ah, but are you ready for the softies' next dogma, "TDD"? Take a good thing, unit tests, and confidently state that they are necessary /and sufficient/ for a good product.

None of this BUFD (big[1] up-front design) nonsense. Write a test, create something that passes the test, and move on to the next test. Never mind the quality/completeness of the tests, if you get a green light after running the tests then /by definition/ it works.

Yup, ignorant youngsters are taught that and believe it :(

[1] in typo veritas: I first wrote "bug" :)
Reply to
Tom Gardner

(snip)

(snip)

Early in the processing the synthesis tools flatten the netlist.

That is, all the modules go away. Just a big collection of gates like one big module. We find it easier to think about logic one module at a time, but it seems not easier for the computer.

Not long after that, duplicate logic, including duplicate registers, is detected. If you have two registers in different modules with the same inputs and clocks, one will be removed. (The same applies within a module, but that is more obvious to us.)

Later, any logic where the output doesn't go anywhere is removed, recursively. Also, any logic that has a constant output is removed, and replaced by the constant.
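[Editor's illustration of the duplicate-register and dead-logic cases described above - my example, not glen's:]

```verilog
module dup_demo (
    input  wire clk,
    input  wire d,
    output wire q
);
    // Identical inputs and clock: after flattening, synthesis keeps
    // one FF and removes the other as a duplicate.
    reg q_a, q_b;
    always @(posedge clk) q_a <= d;
    always @(posedge clk) q_b <= d;

    assign q = q_a;

    // q_b drives nothing, so it is trimmed; trimming then proceeds
    // recursively back through any logic that only fed it.
endmodule
```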

You might find:

formatting link
interesting.

(Even though it has ended, it looks like it will still let you sign up.)

-- glen

Reply to
glen herrmannsfeldt

As I said before, one reason to turn off the optimization is to see how big it will be when it isn't optimized out.

It is sometimes useful to know early how big an FPGA is needed.

But for actual use, you might just as well let it optimize away.

-- glen

Reply to
glen herrmannsfeldt

(snip)

I believe it is one of Brooks' laws of software engineering (applies here, even though it isn't software):

"Writing the code takes the first 90% of the time, debugging takes the second 90%."

formatting link

-- glen

Reply to
glen herrmannsfeldt

Often the auto-wire "feature" will generate a replacement. If you go through the logs, it is noted, and usually the auto-wire will be a single wire instead of a bus, so it shows up that way too.

Reply to
BobH

If I understand correctly, the root of the problem you are describing is that you are trying to use an array of registers as RAM, and it is optimizing out big chunks or all of it. Trying to build a synthesizable array of addressable registers is a pain in the butt in Verilog. There is probably a way to do it with genvars or maybe a for loop, but in the past I have just brute forced it. Using genvars seems like a promising path, but the only exposure to them that I have had is debugging cases where Xilinx ISE (v14) would not handle them as expected.

The brute force might look like:

module reg_ram (
    input  wire [1:0] address,
    input  wire [7:0] write_data,
    input  wire       write_en,
    input  wire       clk,
    input  wire       rstn,
    output reg  [7:0] read_data
);

reg [7:0] cell0, cell1, cell2, cell3;

always @(posedge clk or negedge rstn)
    if (~rstn) begin
        cell0 <= 8'h00; cell1 <= 8'h00;
        cell2 <= 8'h00; cell3 <= 8'h00;
    end
    else if (write_en)
        case (address)
            2'd0: cell0 <= write_data;
            2'd1: cell1 <= write_data;
            2'd2: cell2 <= write_data;
            2'd3: cell3 <= write_data;
        endcase

always @(*)
    case (address)
        2'd0: read_data = cell0;
        2'd1: read_data = cell1;
        2'd2: read_data = cell2;
        2'd3: read_data = cell3;
    endcase

endmodule

Reply to
BobH

That is why VHDL has strong typing; errors like this are made *very* clear.

--

Rick
Reply to
rickman

So this evening I implemented the PLA instruction, which reads from the stack (at the current location of the stack pointer) and stores the value there into A. Synthesis took about 3x as long, and at the end of it there's a whole bunch of Info messages about how it wasn't storing the stack in a block ram for this reason or that.

Looking at the registers, I jumped from ~260 to ~520, so it looks as though the variably-indexed (via SP) set of stack registers were incorporated into the design again :) Phew!

I guess I'll just get on with it and implement more instructions - I was just afraid that as the design got larger, it would be harder to debug. Looks like it might have been easier :)

Thanks again for all the help everyone, especially the verilog examples Bob :)

Simon

Reply to
Simon

You do have an issue if a block RAM is not being used. The code I've seen looks like you are writing from a functional perspective rather than structural. I would suggest you write a module for a block RAM using example code provided by your chip manufacturer. Then incorporate that RAM module into your code as appropriate.

Block RAM must have a register delay in the RAM itself. There are other restrictions as well, the details depending on the vendor. If you code the module by the provider's example you should get a block RAM. This should also help you see the limitations of how you can use that RAM.
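[Editor's sketch of a typical inference template that most tools map to block RAM - a generic example, not any particular vendor's official code, and the exact form that infers correctly does vary by vendor, so check your own chip maker's coding guide:]

```verilog
// Synchronous RAM with a registered read: the one-cycle delay is
// exactly the "register delay in the RAM itself" mentioned above.
module bram_256x8 (
    input  wire       clk,
    input  wire       we,
    input  wire [7:0] addr,
    input  wire [7:0] din,
    output reg  [7:0] dout
);
    reg [7:0] mem [0:255];

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout <= mem[addr];   // read-before-write behaviour, registered
    end
endmodule
```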

I have had similar problems coding adders when I was trying to use the carry out. One small issue with how I was using the adder resulted in a second adder being generated just to produce the carry out.

--

Rick
Reply to
rickman

But I don't want a block-ram. I don't want to pay the penalty of a clock-cycle for access to the values. I want a block of 256 registers, which I can access with as-close-to-zero time cost as possible. Block-rams are great, but in this case I really want just a whole bunch of registers.

I'm conscious that something is screwy. I don't understand why an array of registers declared as...

///////////////////////////////////////////////////////////////////////////
// Set up zero-page as register-based for speed reasons
///////////////////////////////////////////////////////////////////////////
reg [`NW:0] zp[0:255];                  // Zero-page

... should exhibit a whole bunch of warnings along the lines of

INFO: [Synth 8-5545] ROM "zp_reg[255]" won't be mapped to RAM because address size (32) is larger than maximum supported(25)

Um, que? Address size == 32? Even if you treat it as a 1-bit array, that's only 11 bits of address (8 * 256 = 2048) to access any given bit. Hmm, now there's a thought. I wonder if declaring:

reg [2047:0] zp;

... and doing the bit-selections might be a way to do it. No array, just a freaking huge register. I wonder how efficient it is at ganging up LUTs to make a combined single register...
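[Editor's sketch: the bit-selection on a flat register would use Verilog-2001's indexed part-select. Illustrative only, not tested against Simon's code:]

```verilog
// 256 bytes flattened into one vector, addressed with +: part-selects.
module zp_flat (
    input  wire       clk,
    input  wire       zp_we,
    input  wire [7:0] addr,
    input  wire [7:0] wdata,
    output wire [7:0] zp_byte
);
    reg [2047:0] zp;

    assign zp_byte = zp[addr*8 +: 8];   // read byte 'addr'

    always @(posedge clk)
        if (zp_we)
            zp[addr*8 +: 8] <= wdata;   // write byte 'addr'
endmodule
```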

I actually might try implementing a module along the lines of BobH's code above, rather than just declaring the register array, and see how that works out. At the moment I'm busy writing unit tests :)

Cheers Simon

Reply to
Simon

Ok, I understand better now.

Now I am lost again. Why are you trying to change the code that is giving you 256 registers? The only RAM in FPGAs these days is synchronous RAM. If you don't want the address register delay then your only choice is to use fabric FFs.

--

Rick
Reply to
rickman

You don't need VHDL, just Verilog 2001 and use `default_nettype none to prevent auto-wire generation.
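[Editor's note: for anyone unfamiliar with the directive, a quick sketch of its effect:]

```verilog
`default_nettype none   // undeclared identifiers become elaboration errors

module top (
    input  wire clk,
    output wire q
);
    wire a;
    assign a = clk;
    assign q = a;
    // assign b = clk;  // would now be an error: 'b' was never declared
endmodule

`default_nettype wire   // restore the default for any later files
```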

--
Gabor
Reply to
GaborSzakacs

Huh. I missed what led up to this, but explicitly coding up each case like this is entirely unnecessary in verilog.

reg [7:0] cell [3:0];

// NO ASYNC RESET - messes up optimization;
// no reset at all is actually preferred
always @(posedge clk)
    if (write_en)
        cell[address] <= write_data;

Reply to
Mark Curry
