Memory controller design

Hi,

I would like to connect many independent data sources/targets to a common data stream. There will be a 36-bit static RAM block of 2^20 words (9x IDT71V428-12) running as fast as possible, i.e. at ~83 MHz, which is supposed to be the main storage of the system, and a number of completely unsynchronized components trying to send/receive their data streams to/from the RAM block. The FPGA chip will be a Spartan-3 or 3E; I haven't chosen it yet. The FPGA will host, among other things, the following components:

a) a 2-way 18-bit SIMD fixed-point complex math processor running at 65 MHz. All its simple scalar instructions should complete in 1 cycle, which is doable, as there are hardware 18x18 multipliers. It will thus consume 292.5 MiB/s of the available bandwidth.

b) a high-speed USB 2.0 bidirectional 8-bit datalink running at 48 MHz, which gives 48 MiB/s.

c) an Ethernet 100 controller, full duplex mode => ~20 MiB/s.

d) an LCD display driver, about 2 MiB/s.

e) many slow links (SPI-like, AC-97 TDMA etc.), won't consume much bandwidth.

The total available bus bandwidth is ~373 MiB/s (83 MHz x 36 bits), which easily covers the requirements. My idea is to implement a static DMA-like RAM transaction slot allocator, which will grant the bus to the CPU in 65 slots out of every 83, to the USB link in 11, etc., but how do I implement a bunch of low-latency half-duplex bridges between the 83 MHz domain and the remaining ones? I don't want to waste my precious BRAMs for that purpose, so what should I do?
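
To make the numbers concrete, here is my back-of-the-envelope check plus a toy software model of the slot allocator I have in mind (plain Python rather than HDL; the slot counts beyond the CPU's 65 and the USB's 11 are just guesses for illustration):

    # Bandwidth budget: the SRAM bus moves one 36-bit (4.5-byte) word per cycle.
    BUS_MHZ    = 83.0
    WORD_BYTES = 36 / 8.0                       # 4.5 bytes per bus word
    total = BUS_MHZ * WORD_BYTES                # ~373.5 MB/s of raw bus bandwidth

    cpu = 65.0 * WORD_BYTES                     # 2-way 18-bit SIMD @ 65 MHz -> 292.5 MB/s
    usb = 48.0                                  # 8-bit link @ 48 MHz        -> 48 MB/s
    eth = 20.0                                  # 100 Mbit full duplex       -> ~20 MB/s
    lcd = 2.0                                   # display refresh
    print(total, cpu + usb + eth + lcd)         # 373.5 vs. 362.5 -> it fits

    # Static slot allocator: out of every 83 bus cycles grant 65 to the CPU,
    # 11 to the USB link, and spread the rest over the slower clients.  A
    # credit ("Bresenham"-style) scheduler interleaves the grants smoothly:
    def slot_table(slots_per_frame, shares):
        acc   = {c: 0 for c in shares}
        table = []
        for _ in range(slots_per_frame):
            for c in shares:
                acc[c] += shares[c]             # every client earns its share
            winner = max(acc, key=acc.get)      # the most "credit" wins this slot
            acc[winner] -= slots_per_frame
            table.append(winner)
        return table

    table = slot_table(83, {"cpu": 65, "usb": 11, "eth": 5, "lcd": 1, "misc": 1})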

Best regards Piotr Wyderski

Hi Piotr,

Piotr Wyderski wrote:

IMHO you should use at least one BRAM for staging data to/from the SRAM bus. The bus side should run at 83 MHz, but the inner side should run faster, so that the data can be multiplexed into the appropriate "slots". I don't know how you intend to match the address bus and the data bus; if I were you, I would use a second BRAM for matching the addresses.

Best Regards,

Jerzy Gbur

Yes, but this way the fast random access time will be lost and the whole system will behave like a DRAM-based system with a tiny cache. Another option is to clock the CPU at 83 MHz to match the bus speed and add the HLD signal, like in the good old DMA controllers. It simplifies a lot of things, but the initial question "how to connect many slower participants to the bus?" remains open. In this design some of them can be easily attached, as 83/2 = 41.5 and 83/4 = 20.75, so my USB and Ethernet links could work synchronously with the bus, but many other sources (AC-97 codecs, display) cannot be synchronized this way.

It depends on what you call the "inner side". The CPU is supposed to work at a 2--3 times higher frequency than I said, to hide its internal simple pipeline and appear to be a one-cycle design. But its memory interface is bounded by the available bandwidth. There is a large data source/target domain that _must_ be clocked at 65 MHz, but I can connect it via a BRAM to the CPU domain.

What do you mean by "bus matching"?

Best regards Piotr Wyderski

"Piotr Wyderski" wrote in message news:en6qbk$p2s$ snipped-for-privacy@news.dialog.net.pl...

The function that you're describing is an arbitrator; you have multiple sources that need to share access to a shared resource (the SRAM), and the management of who gets control of that resource at any particular time is up to whatever arbitration function you choose to implement.

If you view it in that context, your 'bunch of low-latency half-duplex bridges' will not present as much of a challenge as you may think. The best way to go about this is to start with the entity definition for the SRAM arbitration function. Each potential master requires a private interface to the arbitrator, and the arbitrator also has a master interface to the external SRAM itself. So if you have 10 potential sources to the SRAM, then the arbitrator will have 10 slave interfaces (one to each of those sources) plus an SRAM master interface.

Next consider the requirements of each of those sources. Do they have some sort of 'wait' signal that will cause them to hold address and write data (during a write) and to hold the address while waiting for a read to complete? What kind of read cycle time performance is required? It sounds like you have a handle on the bandwidth requirements, but are there any latency requirements (i.e. how long can something 'wait')?

If you go about the process by figuring out the requirements of the arbitration function and working through the requirements that each master presents, as well as those of the target SRAM slave, then it should start to fall into place.
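
As a rough sketch of that entity boundary (a little Python stand-in here rather than the VHDL entity, with invented signal names), each master gets its own slave port on the arbitrator and the arbitrator completes one SRAM access per bus cycle:

    from dataclasses import dataclass

    @dataclass
    class MasterPort:
        """One slave interface on the arbitrator (one instance per SRAM master)."""
        req:    bool = False    # master asserts: addr/wdata/we are valid
        we:     bool = False    # write enable
        addr:   int  = 0        # 20-bit word address
        wdata:  int  = 0        # 36-bit write data
        wait:   bool = True     # arbitrator asserts: hold your request
        rdata:  int  = 0        # read data returned to the master
        rvalid: bool = False    # read data is valid this cycle

    def arbitrate(ports, sram):
        """One bus cycle: pick a requesting master and perform its SRAM access.
        'sram' is a plain dict standing in for the external 1M x 36 SRAM."""
        for p in ports:                          # default: everybody waits
            p.wait, p.rvalid = True, False
        for p in ports:                          # fixed priority: list order
            if p.req:
                if p.we:
                    sram[p.addr] = p.wdata
                else:
                    p.rdata, p.rvalid = sram.get(p.addr, 0), True
                p.wait = False                   # this master's access completed
                break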

Kevin Jennings

Yes, they do.

The CPU must run as fast as possible because of its computationally-intensive tasks, but no access time restriction is required, i.e. it is not important whether a particular single load or store takes one or ten cycles to complete, as long as they statistically complete in 1.28 cycles (83/65) on average for a truly random access pattern. The USB and Ethernet links work similarly, as their master controllers are in the FPGA itself (i.e. no external component screams "feed me!"), so again, there are no real-time requirements. The only real-time components are the AC-97 codecs and the display (that is, its pixel bus), but they are slow.

Fortunately not, only the bandwidth matters. Well, several channels have a bounded maximal latency, but it is so long compared to the RAM bus cycle that it could easily be fulfilled by an appropriate arbitration function. A simple round-robin prioritizer will be perfectly sufficient.
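
Something with the behaviour of this little model, say (software only; the real thing would be a small rotating priority encoder in CLBs):

    def round_robin(reqs, last):
        """Grant the first requesting channel after 'last', wrapping around."""
        n = len(reqs)
        for i in range(1, n + 1):
            ch = (last + i) % n
            if reqs[ch]:
                return ch           # this channel gets the next RAM slot
        return None                 # nobody is asking -> idle cycle

    # once per bus cycle:
    #   g = round_robin(reqs, last)
    #   if g is not None: last = g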

Well, think of many DMA channels connected to much slower clock domains, it's a good model. The problem is how to pass their data and configuration parameters between the main clock domain and their respective domains.

Now I think that a separate RAM clock domain is too hard to implement reliably, so I can redesign the system to run the CPU at the same clock rate. It will allow me to implement the arbitrator in the old way, i.e. to add the HLD signal to the CPU and state that the DMA controller has higher priority, but it will require more (mostly unidirectional) synchronization bridges elsewhere. They must be made of CLBs, because I need the BRAMs for better purposes.

Best regards Piotr Wyderski

This is not so difficult to implement. Using a priority encoder and a state machine which performs a memory transaction, the entire arbiter is almost finished. The trick is to design the state machine in such a way that the maximum bandwidth can be used and the bandwidth is shared properly.
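
In pseudo-code (Python here only for readability, the real thing is of course an HDL state machine) the control flow is roughly:

    IDLE, ACCESS = 0, 1

    def step(state, grant, requests):
        """One clock tick of the arbiter: priority encode, then one SRAM access."""
        if state == IDLE:
            # priority encoder: the lowest-numbered requester wins
            grant = next((i for i, r in enumerate(requests) if r), None)
            return (ACCESS, grant) if grant is not None else (IDLE, None)
        else:
            # ACCESS: drive the granted master's address/data onto the SRAM pins;
            # the SRAM completes in one cycle, so release immediately
            return IDLE, None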

There is also a different approach which has been discussed in this group before. I believe it is called a ring bus. It seems pretty clever and I will consider using it the next time I have to share a memory between different devices.

Daniel Sauvageau wrote something about it before in a thread called 'ddr with multiple users':

Why use a ring bus?

- Nearly immune to wire delays since each node inserts bus pipelining FFs with distributed buffer control (big plus for ASICs)

- Low signal count (all things being relative) memory controller:
  - 36-bit input (muxed command/address/data/etc.)
  - 36-bit output (muxed command/address/data/etc.)

- Same interface regardless of how many memory clients are on the bus

- Can double as a general-purpose modular interconnect, this can be useful for node-to-node burst transfers like DMA

- Bandwidth and latency can be tailored by shuffling components, inserting extra memory controller taps or adding rings as necessary

- Basic arbitration is provided for free by node ordering

The only major down-side to ring buses is worst-case latency. Not much of an issue for me since my primary interest is video processing/streaming - I can simply preload one line ahead and pretty much forget about latency.

Flexibility, scalability and routability are what makes ring buses so popular in modern large-scale, high-bandwidth ASICs and systems. It is all a matter of trading some up-front complexity and latency for long-term gain.
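
To make the idea concrete, here is a toy software model of such a ring - my own rough rendition, not Daniel's actual design, and the node names and message format are made up:

    # Each node owns one pipeline register; a "slot" (None = empty, or a small
    # dict) advances one node per ring clock.  A node consumes slots addressed
    # to it and may inject only into an empty slot, which is what gives the
    # free arbitration-by-node-ordering mentioned above.
    class Node:
        def __init__(self, name):
            self.name, self.txq, self.rxq = name, [], []
        def step(self, slot):
            if slot is not None and slot["dst"] == self.name:
                self.rxq.append(slot)            # consume a slot meant for us
                slot = None
            if slot is None and self.txq:
                slot = self.txq.pop(0)           # inject into the empty slot
            return slot                          # feeds the next node's register

    nodes = [Node("cpu"), Node("usb"), Node("eth"), Node("memctl")]
    regs  = [None] * len(nodes)                  # the per-node pipeline FFs
    nodes[0].txq.append({"dst": "memctl", "cmd": "rd", "addr": 0x1234})

    for _ in range(10):                          # ten ring clocks
        regs = [nodes[i].step(regs[i - 1]) for i in range(len(nodes))]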

--
Reply to nico@nctdevpuntnl (punt=.)
You can find businesses and shops at www.adresboekje.nl

Why would you think that? It's not true.

The arbitrator will get implemented however you want it to be; you're not constrained to a simple priority scheme... the arbitration function is what you're designing.

As a general approach, transferring between different clock domains is accomplished with a dual-clock FIFO. In many cases a full-blown FIFO is not really needed; mostly you just need some handshaking signals to go back and forth, and those are the tricky part. In any case, the address, data and control signals get sampled in the clock domain that they originate from, and the handshake acknowledge signal gets generated in the 'other' clock domain. All of this, though, has absolutely nothing to do with memory controllers or arbitration; it has to do with clock domain transfers.
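
For reference, the classic scheme looks roughly like this (a behavioral model only; the point is that the only thing crossing the boundary is a level, re-sampled through two flip-flops):

    class Sync2:
        """Two-flop synchronizer in the destination clock domain."""
        def __init__(self):
            self.ff1 = self.ff2 = 0
        def clock(self, async_level):
            self.ff1, self.ff2 = async_level, self.ff1
            return self.ff2          # safe-to-use copy of the foreign level

    # Four-phase handshake, source-domain side:
    #   1. load addr/data into ordinary registers, raise 'req'
    #   2. wait until the synchronized 'ack' is seen high  (destination sampled)
    #   3. drop 'req'
    #   4. wait until the synchronized 'ack' is seen low   -> ready for the next one
    # The destination mirrors this: when its synchronized 'req' goes high it
    # samples addr/data (stable by then) and raises 'ack'.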

Although the SRAM can operate without a clock, it is very unlikely (also read as: very difficult, many times just not possible in an FPGA environment) that you can design an arbitrator to operate without a clock, so you'll most likely need to choose one clock from your system; if this clock happens to be the same as the CPU's (or any of the other SRAM masters'), then that particular master will be able to operate synchronously and directly with the arbitrator. Any master device that operates on a different clock will need to be synchronized first.

Whether the clock domain crossing FIFOs require CLBs or BRAMs depends mostly on how quickly data comes in from a particular master and how much latency it can tolerate. Usually, the quicker the memory requests come in, the more depth you need in your FIFO, otherwise you compromise system performance. You can implement the FIFO in CLBs or BRAMs as you see fit; it's a tradeoff you make based on resource usage, clock speed and system performance.
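
A quick sizing rule of thumb, with made-up numbers (plug in your own):

    import math
    # How deep must a request FIFO be so that a master never overflows while it
    # waits out a worst-case arbitration gap?
    fill_rate  = 48e6 / 4.5          # words/s the master can generate (e.g. the USB link)
    worst_wait = 20 / 83e6           # worst-case grant gap, e.g. 20 bus cycles
    depth = math.ceil(fill_rate * worst_wait)    # -> 3 entries: CLB registers, no BRAM needed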

Kevin Jennings

I mean "Inner" = into FPGA,

The address and the corresponding data must be transferred at the same time; that is what I meant.

If you don't want to build a cache, you may connect the CPU to the SRAM only and route all the other signals through the CPU.

If you want, we can meet in person and talk about it - we live in the same town :)

Best Regards,

Jerzy Gbur
