more than one clock

Hi all,

Up until now, everything I've done has been synced to a single clock running around the FPGA. I now want to add a hardware divider (64 or 32 bit dividend, 32 bit divisor) as a peripheral to the CPU, and it's going to be s...l...o...w - still faster than doing it in s/w of course :-)

A nice way to speed it up, then, would be to clock the divider circuit at a multiple of the rest of the CPU's clock... Now, I've read of things like 'metastability' and advice to 'never use gated clocks' and such, so I was wondering whether the following would be safe if the divider clock is running at M times the CPU clock (using a DCM)?

clock   action
0       cpu writes dividend & divisor to divider module input ports
+1      cpu sets the 'go' input to the divider high and waits for 'rdy'
+0.x    divider starts (N internal cycles) to perform the division
+0.N    divider writes the result to the module output port
+0.M    divider writes 'rdy' to the module output
+1      cpu reads the result from the divider, sets the 'go' signal low
+1.x    divider sets 'rdy' low since 'go' has gone low

The notation for the clock column: numbers before the '.' are CPU clocks, numbers after it are divider clocks. The 'x' in stages 3 and 7 just reflects the fact that the divider clock may be a few periods (internal cycles) ahead of the CPU clock - not really important. Also, '+0.N' really means +(N/M).(N%M), since N/M is quite likely to be > 1...

Since the divider waits for M internal clocks (1 whole cpu clock) after writing the result and before writing 'rdy', doesn't that mean the result will be stable before the cpu reads it ? Is M clocks delay sufficient ? Would less do ?

Or is the whole idea a complete idiocy and should I scuttle back to completely synchronous designs [grin] ?

Thanks in advance for any help :-)

Simon.


Hi Simon,

Living a clean and pure, fully synchronous life is one answer to your last question, but it isn't strictly necessary here.

Unless your CPU is really slow for some reason, I would expect the max clock rate of your divider circuit to be similar to the max clock rate of your CPU. Both tend to be dominated by carry chains: add/sub in the CPU, and subtract/compare in a divider.

Anyway, your description of passing data back and forth between the CPU and the divider is reasonable, except that you are missing a pair of synchronizers. Each is typically a pair of flipflops, so you are four flipflops short.
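In case it helps, a two stage synchronizer is nothing more than two flipflops in series, clocked by the receiving domain. A sketch in Verilog (the module and signal names here are mine, not anything standard):

// Two-flipflop synchronizer, clocked by the RECEIVING clock domain.
// Only safe for single-bit control signals like GO and RDY - never
// pass a multi-bit bus through one of these.
module sync2 (
    input  clk_dst,    // clock of the domain that wants to use the signal
    input  d_async,    // signal arriving from the other clock domain
    output q_sync      // version that is safe to use in clk_dst's domain
);
    reg ff1, ff2;
    always @(posedge clk_dst) begin
        ff1 <= d_async;   // this FF is the one that may go metastable
        ff2 <= ff1;       // gives it a full cycle to settle
    end
    assign q_sync = ff2;
endmodule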

If you want to read about metastability and synchronizers, I recommend:

formatting link

If you run the two sections of your design at different clock rates, there are special cases where it can be made to work if the clocks are phase locked (with a DCM, perhaps), but difficulties with clock skew and jitter may conspire to defeat you anyway.

It is far easier to just say that the clocks are asynchronous, and then use standard techniques (synchronizers) to deal with it.

Let's look at your plan:

Clocks 0 and +1 (write the operands, then raise GO): good. No race condition here, since the data is guaranteed stable before you set the GO signal.

The point where the divider starts is where you need the first synchronizer. The CPU's GO bit should feed a two stage synchronizer (two FFs), both clocked by the divider clock. Otherwise the GO signal arrives asynchronously in the divider's clock domain, and metastability or race conditions could occur. Synchronizing the GO signal reduces the probability of problems - not to zero, but with good synchronizer design it becomes vanishingly rare (like once per megayear).

You don't need any synchronization on the data, since it was all set up prior to the GO signal, so it is guaranteed stable by the time the divider tries to look at it, after it receives the synchronized version of GO.

Writing the result and then RDY: good. No race condition here, since the data is guaranteed stable before you set the RDY signal.

Same story as above: the RDY signal needs to be synchronized with a two stage synchronizer, this time clocked by the CPU's clock.

Right. The '0' before the dot in '+0.N' is really several cycles later for the CPU. The exact number doesn't really matter, as you have the CPU waiting for the synchronized version of the RDY signal.

The wait of M internal clocks avoids a race condition by guaranteeing that the data is stable before RDY is raised, but the CPU still needs to see RDY in its own clock domain, so RDY must be synchronized.

Nope, no need to scuttle back. This is fine, and you got it mostly right. Add the 4 FFs and you should have a reliable system.
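Putting it together, the two crossings might look something like this (again just a sketch using the sync2 module above; 'go', 'rdy', 'cpu_clk' and 'div_clk' are placeholder names, and your state machine will differ):

// GO crosses from the CPU clock domain into the divider clock domain:
wire go_sync;
sync2 sync_go  (.clk_dst(div_clk), .d_async(go),  .q_sync(go_sync));
// The divider's state machine only ever looks at go_sync, never raw go.
// The dividend and divisor need no synchronizers: the CPU wrote them
// before raising GO, and go_sync arrives at least two div_clk edges later.

// RDY crosses back from the divider clock domain into the CPU clock domain:
wire rdy_sync;
sync2 sync_rdy (.clk_dst(cpu_clk), .d_async(rdy), .q_sync(rdy_sync));
// The CPU waits for rdy_sync before reading the result. The result bus
// needs no synchronizer either, because it stopped changing well before
// RDY was raised (that is what your M-cycle wait buys you).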

Next project, add floating point multiply, divide, add, subtract.

Hope this helps. Philip

Philip Freidin Fliptronics


First off, Philip - thanks very much for the help - it's clear to me now why you need the synchronizers :-)

The CPU module itself is happy to run at ~70MHz, but when I wrap the rest of the SOC around it, it drops right down to ~35 MHz, so it is pretty slow :-( I've yet to convince myself of the reason for the slowdown (I've thought it was lots of things, worked around the issue and not fixed the problem!). Now that Jim Wu has shown me how to do hierarchical floorplanning, I might be able to make the mess that is my cpu a little more logically-laid-out, which might help :-)

I thought I might be able to get 2x (hey, maybe 3x :-) performance from the divider module if it's running as a separately-clocked domain loosely coupled to the main CPU.

I'll bet there are some "old-timers" looking at those figures and thinking to themselves, "what's the problem ? Only 70 MHz ? I can do that in my sleep!" [grin]

Wow! Lots of info - excellent stuff :-)

I've just spent 3 of the last 4 days trying to port gcc to the architecture - that is one beast of a program! Instead I settled for 'lcc', and got it working in half a day :-) I ended up having to rewrite my assembler quite a bit to cope with the degenerate assembly syntax that 'lcc' produces, and along the way it turned into a linker rather than an assembler ('as' is now implemented with 'cp' :-). Now I can type 'lcc file.c -o file.out' and end up with a Motorola S-record file ready for upload.

What's next to do:

o First, redo the internals a bit, so the data bus is only driven when necessary (at the moment it's all the time, and I want peripherals to be master-capable on the SOC - even to have more than one CPU sharing the SOC bus). This implies a priority controller, and bus-request lines etc.

o After that, I want more interrupt control (a bit like the m68k 'trap' or arm 'swi' calls, so the CPU can raise an interrupt for itself when it gets a bus error, for example). The idea is that any peripheral (including a cpu) can send a (32 bit) message to any other, the address defining the message type, the data-bus defining the message value. Eg: network controller peripheral buffer is 3/4 full, so it sends IRQ to an OS driver on a CPU to read the data and create some more space. Traps, interrupts and signals all with the same technique :-)

o Next up is an SDRAM controller (although I ought to be able to take one of the existing Xilinx ones), because at the moment, the whole thing is running out of 4 blockrams. At that point, it'll probably become important to have a pipelined arch. just to hide the IF delay and load/store times.

o Then I want an instruction cache, probably not a data cache (again due to the desire for multiple processors on the same SOC, coherency would become an issue without a bus-snooping protocol).

o When all that's in place, I'll want to think about an RTOS, although I'll probably port freertos or something rather than write my own. Of all the tasks in the whole project, this is the only one I've had some experience with! [grin]

o Finally, I can start to think about adding floating-point support :-)

So, lots to keep me busy :-)

Simon.

