DDR with multiple users

Hi,

I have about 4 different independent things that each need to access a DDR SDRAM.

On one hand, it seems I could make them all Wishbone compliant and then just have a Wishbone DDR interface.

Would it be workable/advisable to instead just have each device control the DDR itself, using the DDR's own interface directly?

I'd only need one complicated mechanism to initialize the DDR after reset; from then on, each of the user processes can just request access to the DDR and, when granted, take over the lines.

One concern is that DDR timing at 100 MHz is pretty tight. Having the logic to combine 4 different sources into control signals for the DDR might add too much overhead. Of course, it could be accomplished with a single LUT level just doing an unregistered OR, if all 4 sources know to drive their control lines to zero when they're not the master...
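Roughly what I have in mind, as a sketch only (signal names are made up, and the 8-bit "control bundle" is just a placeholder for whatever RAS/CAS/WE/address grouping ends up being used):

library ieee;
use ieee.std_logic_1164.all;

entity ctrl_combine is
  port (
    -- one control bundle per master; a master drives all zeros unless granted
    m0_ctrl, m1_ctrl, m2_ctrl, m3_ctrl : in  std_logic_vector(7 downto 0);
    ddr_ctrl                           : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of ctrl_combine is
begin
  -- Unregistered OR of the four sources: one 4-input LUT level per output bit.
  ddr_ctrl <= m0_ctrl or m1_ctrl or m2_ctrl or m3_ctrl;
end architecture;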

Any tips/advice welcome.

Thanks-- Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

Seems to me that your second idea would involve each device having a complete DDR access controller in it. That sounds like quite a bad idea to me; if you're going to make good use of SDRAM you need to keep track of the RAM's internal state to some extent (which banks are active, current row address, that sort of thing) and it would be very difficult for all four accessors to keep that kind of internal state in step.

Of course, passing each client's requests to the RAM controller is sure to cost some latency if you use a common controller. If you have a custom controller (rather than a standard single-port controller on a common bus) then you can hide most of that latency in the arbitration delay, at the expense of some extra complexity.

It's an interesting question, though. I need to deal with something quite similar in the immediate future, so any other ideas would be gratefully received! Oh, and does anyone have any strong opinions (positive or negative) about any of the available open-source DDR controllers?

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.
Reply to
Jonathan Bromley

No matter what happens, 4 separate widgets need to gain access to memory. If they have to interface to some other controller anyway, what's the advantage? Why not make the DDR's own interface the one they share?

Take Wishbone, for example: the idea behind putting a Wishbone interface in between, rather than using the DDR's own interface, would be based on the assertions that

1) DDR is overly complex, Wishbone is simpler
2) IP core reuse -- Wishbone is more standard

I *need* to be able to burst large blocks of memory to the DDR. If I use a Wishbone interface, then there needs to be some mechanism to translate a burst from one clock domain (the Wishbone's) to the DDR's clock domain. That might involve some sort of FIFO...and sounds complicated to me.

On the other hand, the DDR controller is actually not that complicated. Using the DDR interface itself would allow known, easy burst accesses, and memory bandwidth could be maximized.

Regarding the DDR's internal state, I'm planning on all widgets doing burst accesses, and each access would only be to a single row. If each widget just precharged the row upon exit, the overhead would be minimal, yet bandwidth would still be good.
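In other words, each granted access would just walk through something like this (a sketch of the intended flow only; the timing counters such as tRCD, CAS latency and tRP are left out, and the names are made up):

library ieee;
use ieee.std_logic_1164.all;

package widget_burst_pkg is
  -- Per-access flow each widget would follow once it owns the DDR lines:
  -- open the one row it needs, run its burst, then precharge before releasing.
  type burst_phase_t is (IDLE,          -- waiting for grant
                         ACTIVATE_ROW,  -- ACTIVATE the single row for this access
                         RUN_BURST,     -- issue READ/WRITE burst(s) within that row
                         PRECHARGE_ROW, -- close the row on exit
                         RELEASE);      -- hand the lines back to the arbiter
end package;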

Finally, if a refresh cycle needs to be imposed, that can be done with a 5th widget that just does a refresh cycle, or it could be a function of the arbitrator itself.

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

I have created something similar. I created a FIFO in a block RAM which is used to source or sink data between the DDR memory and multiple devices running at different speeds. In my application I need to write or read large bursts of data, so I created a FIFO which can work in only one direction at a time. I use interleaved fixed burst sizes of 8 (16 bits per DQ line in one transaction), so the overhead is minimal.

You might want to look into some sort of caching scheme anyway, because accessing the DDR to read or write just one address is dead slow.

I created a hook into the DDR state machine which allows me to execute any type of 'instruction' on the DDR memory from a microcontroller. This reduces initialisation to a software thing. Don't forget about refresh.

You can run the DDR controller at 50 MHz (half the DDR memory clock), which relaxes the timing for the address and control signals a lot (also good for meeting EMC limits, because the drive strength can be reduced and the signals carry lower frequencies). The only line that actually needs tight timing is CS. Fortunately, this is the least loaded line in large multi-chip memory setups.

The data lines are a different story. Using a clock with a fixed phase offset (in my case 90 degrees is just fine) to capture the data gives more than enough margin.
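In sketch form it is something like this (simplified; the 90-degree clock comes from a DCM, the entity name and widths are just examples, and a real design also needs the capture registers constrained into the IOBs):

library ieee;
use ieee.std_logic_1164.all;

entity dq_capture is
  port (
    clk90  : in  std_logic;                     -- read clock, shifted ~90 degrees
    dq     : in  std_logic_vector(15 downto 0); -- DDR data pins
    q_rise : out std_logic_vector(15 downto 0); -- word captured on the rising edge
    q_fall : out std_logic_vector(15 downto 0)  -- word captured on the falling edge
  );
end entity;

architecture rtl of dq_capture is
begin
  -- The fixed 90-degree offset puts both sample points near the middle of the
  -- data eye, which is where the extra margin comes from.
  rise : process(clk90)
  begin
    if rising_edge(clk90) then
      q_rise <= dq;
    end if;
  end process;

  fall : process(clk90)
  begin
    if falling_edge(clk90) then
      q_fall <= dq;
    end if;
  end process;
end architecture;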

As you can read, I didn't use the MIG tool (doesn't work for a Spartan3/200).

--
Reply to nico@nctdevpuntnl (punt=.)
Companies and shops can be found at www.adresboekje.nl
Reply to
Nico Coesel

Some off-the-shelf options that spring to mind:

Firstly there's the opb_mch_ddr interface core that comes with Xilinx EDK - in addition to the OPB bus interface (which you can ignore), it has 4 independent channels that support a fairly simple cacheline fetch protocol (Xilinx CacheLink). In reality it's just their FSL port, with a specific access protocol.

It's intended for interfacing MicroBlaze CPUs to memory, but there's no reason you couldn't use it for something else. Current versions of the core use fixed priority on the 4 ports, but I believe that round robin and other priority schemes are on the roadmap (if you read the VHDL sources, anyway).

Xilinx also has an MPMC (multi-port memory controller) that was developed for the gigabit serial reference design (GSRD), which is also worth a look. I think you need to register to download this design.

MPMC uses something called LocalLink, again just a sort of cacheline read/write protocol, nothing too tricky I don't think, and there should be full source examples of how to drive it in the reference design.

Ultimately you have to arbitrate somewhere, be that on a bus, in the memory controller, or at the DDR pin/signal stage. But as others have suggested, that may be more trouble than it's worth.

Regards,

John

Reply to
John Williams

Hi David, I had a similar problem where 3 differently clocked users needed to access an SDRAM interface running at 166 MHz. The main problem was that every user needed full-page bursts into/out of the RAM. So every one of the users got its own dual-clock FIFO. A single arbitration unit took care that all users got access at the right time. As the SDRAM controller I used the OpenCore from Altera. This design runs very well in Cyclone and Cyclone II devices. Hope that helps.

Regards Christian

Reply to
Christian Kirschenlohr

Then you need a 4 port arbitrator to control access to the DDR.

The arbitrator would then have 4 user ports and one DDR port, all can be Wishbone compliant and connect up nicely.

Probably not. Arbitration, by its nature, needs 'global' knowledge of the scope of what it is arbitrating in order to be effective. Some of the things it needs are:

- It needs to know about the 4 (or however many) users

- Preferred burst sizes for each port

- How long the other ports can afford to wait for their turn (i.e. how important latency is).

- Arbitration scheme (round robin, etc.)

All of the above can be implemented in a single arbiter where, basically, all of those things can (and should, IMO) be input as generic parameters, and it will be very efficient in terms of logic resources. Without expanding the arbiter design beyond the considerations that are important to your particular application, you can write such an arbiter and still parameterize it enough that it is either directly useful the next time this situation comes up, or at least provides a solid baseline for generalizing it a bit more for that next application without having to totally re-invent the wheel.
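To make the 'generic parameters' point concrete, here is the kind of skeleton I mean (names invented; only the request/grant core is shown, the address/data muxing selected by the grant is left out, and things like burst length limits, latency limits and the priority scheme would become further generics):

library ieee;
use ieee.std_logic_1164.all;

entity ddr_arb is
  generic (
    NUM_PORTS       : positive := 4;  -- how many users share the DDR
    MAX_BURST_BEATS : positive := 16  -- would bound how long a grant is held (not used below)
  );
  port (
    clk  : in  std_logic;
    rst  : in  std_logic;
    req  : in  std_logic_vector(NUM_PORTS-1 downto 0);  -- one request per user
    done : in  std_logic;                               -- granted user finished its burst
    gnt  : out std_logic_vector(NUM_PORTS-1 downto 0)   -- one-hot grant
  );
end entity;

architecture rtl of ddr_arb is
  signal last_grant : integer range 0 to NUM_PORTS-1 := 0;
  signal busy       : std_logic := '0';
begin
  process(clk)
    variable idx : integer range 0 to NUM_PORTS-1;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        busy       <= '0';
        gnt        <= (others => '0');
        last_grant <= 0;
      elsif busy = '0' then
        -- round robin: search for a requester, starting after the last winner
        for i in 1 to NUM_PORTS loop
          idx := (last_grant + i) mod NUM_PORTS;
          if req(idx) = '1' then
            gnt        <= (others => '0');
            gnt(idx)   <= '1';
            last_grant <= idx;
            busy       <= '1';
            exit;
          end if;
        end loop;
      elsif done = '1' then
        -- granted user says it is finished; release the bus
        gnt  <= (others => '0');
        busy <= '0';
      end if;
    end if;
  end process;
end architecture;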

Each device (i.e. user of the DDR) should not be concerned with this type of information; it should just think that it has exclusive access to a Wishbone port. Gumming up the user design with this info would be counterproductive at best. By moving bits and pieces of the arbitration into each user's design code you're most likely to create something that is:

- Less efficient in terms of logic resource usage than it should be (at best it would be no worse but I doubt it could be better)

- A less than useful 'device', since now it is only applicable when used in a system with three other users all sharing a DDR.

Again, what is needed is the arbitration logic.

The four port inputs into the arbiter can be just as fast or faster, since they will all be inside the FPGA. DDR may be fast, but FPGA-internal logic is faster.

KJ

Reply to
KJ

The arbitration is separate from the interface. Wishbone probably already has an arbiter capability built in, I'd guess.

It's either USER -> Arbitrator -> Wishbone -> DDR, or USER -> Arbitrator -> DDR.

Actually the arbitration logic is not really *in* the chain, it just selectively allows the USER to access the next stage on the other side.

It's critical that bursts be handled well. DDR effectively has a minimum burst of 2, and the 2 addresses are always at A and A+1. Probably A is even also, but I don't remember at the moment.

Also a burst within DDR can't cross a row. Then there is arbitrary CAS latency.

Wishbone supports bursting, but there probably aren't any restrictions. That is, bursts probably don't need to start on an even address, and they don't have to end before they cross some arbitrary boundary.

But I can easily make the 4 users of the DDR work within the DDR's limitations. They can also take full advantage of the DDR's capabilities.

With the Wishbone approach I get a generic piece of logic I can reuse with other DDRs. But at what cost?

Complications:

1) To support bursting, it needs some sort of FIFO. An easy way would be for the core to store up the whole burst, then transact it to the DDR once everything is known. But to reduce latency, the DDR transaction probably ought to start while the USER is still pushing data into the Wishbone interface. The whole goal is to get as close as possible to 2 memory accesses per clock, since that's what DDR supports.

2) The Wishbone core must deal with page-crossing bursts somehow. This would mean breaking up a burst into 2 DDR bursts (see the sketch after this list). Otherwise, if I impose address/burst restrictions on the Wishbone core, it's not 100% compliant, I'd expect.

3) The Wishbone core must deal with the even/odd address limitation, otherwise it's not 100% compliant, I expect.
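For the page-crossing case (#2) I picture something like this helper, just as a sketch; ROW_WORDS is a placeholder for however many words one row holds at the configured width:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package burst_split_pkg is
  constant ROW_WORDS : positive := 256;  -- words per row; device-dependent placeholder
  -- Given a word address and a requested burst length, return how many words
  -- fit before the row boundary, so a long Wishbone burst can be broken into
  -- two DDR bursts (the second one starting at the next row).
  function beats_before_row_end(addr_word : unsigned; req_len : positive) return positive;
end package;

package body burst_split_pkg is
  function beats_before_row_end(addr_word : unsigned; req_len : positive) return positive is
    constant offset    : natural  := to_integer(addr_word) mod ROW_WORDS;
    constant remaining : positive := ROW_WORDS - offset;
  begin
    if req_len < remaining then
      return req_len;        -- whole burst fits in the current row
    else
      return remaining;      -- transfer up to the row end; issue the rest separately
    end if;
  end function;
end package body;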

The disadvantages of involving wishbone are

1) More complicated, more work, later time to market
2) Almost certainly will introduce latency in the pipeline
3) To implement it, I've got to learn Wishbone AND DDR, as opposed to just DDR now and perhaps Wishbone at a later date

The advantages are

1) Single logic driving the DDR pins, so supposedly clock timing can be met more easily.
2) More general for code reuse, since lots of things already support Wishbone.

Also of note: the end result of all this is a system meant to be released as open source. That's why, if I were going to use Wishbone, I'd feel compelled to do it right.

Anyway, it still isn't clear to me that the Wishbone approach is automatically right for this particular application.

Thanks for everyone's input on this so far.

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

We may not be meaning the same thing when we say 'interface'. In your example, you have four 'users' who need to share a common resource, the DDR. Maybe you're seeing this as all one 'interface' but in reality it consists of several of what I would call 'interfaces'. One way to approach this problem would be to use a single DDR controller code and arbitrate access to the input. In that scenario you would have 11 interfaces:

#1-4 are the individual master interfaces out of each of the four 'users'.
#5-8 are the individual slave interfaces that are the single, individual targets of #1-4.
#9 is a single master interface that connects to #10.
#10 is the slave side interface of a DDR controller.
#11 is the master side interface from your DDR controller that is intended to hook up to the actual DDR itself.

The task then would be to...

- Implement the function Arb() that performs the translation from interfaces #5-8 to create #9.

- Connect up all of what are now point to point connections.

Now, there is some function that I'll call f() that implements whatever is necessary to go from interface #10 to interface #11. Presumably this is simply the OpenCores DDR controller or some other commercial controller. In any case, those cores all fit the basic interface structure that I've defined above to get from #10 to #11. They don't fit mapping more than one input to the DDR directly.

What I thought you were suggesting is that you take this function f() and replicate it 4 times and then add the arbitration between the outputs of those four f() functions before applying it to the physical DDR and putting this code in with the four users.

You could implement it this way, but if you do I'm confident that you'll be chewing up many more logic resources than you would if you instead focused on creating the arbitration function Arb() which performs the magic to connect interfaces #5-8 to interface #9.

Guess again. Wishbone is strictly a point to point interface. By that I mean that it simply defines the signals to/from the master, the signals to/from the slave and how those signals accomplish data transfer. The logic for multiple masters off a slave or multiple slaves off of a master is outside of the Wishbone specification.

What Wishbone brings to the table is a common interface. This is handy since, by my definition of 'interface', there are 11 of them in play...with the exception of #11, all can be Wishbone or any other standard you want to code to. Wishbone doesn't bring a lot of baggage, so IF you need to have multiple masters/slaves you don't end up with cumbersome extra logic. Altera's 'Avalon' and OpenCores' 'SimpCon' interfaces are all similar in that regard. They are all point to point but can be used in a multi-master/multi-target system quite easily.

My point was that, viewed in this light, the arbitration function which connects #5-8 to #9 can be both somewhat generic (i.e. it could be used to arbitrate other devices besides DDR) and yet still be parameterized to handle the peculiarities of your particular application (i.e. bursting whenever possible to DDR).

Again, given what I consider to be an 'interface', this puts the arbitration directly into the chain. If you go the route of implementing multiple f() functions inside the four users, you still end up with the same 11 interfaces, but now some of them are buried inside each of the four users, so they only go away in the sense that you're drawing the border line around a slightly bigger area. Now you would have four 'users' that do not have native Wishbone interfaces but instead have a native DDR interface, which then needs to be arbitrated and translated into the same final output interface to the DDR.

Still using the approach that I suggested, all of that can be handled with the Arb() function as well.

At the expense of now making those 4 users tuned specifically to the nuances of DDR. If you migrate this to some other memory technology then you have to retune each of these four for the new nuances of that memory.

Good question. I can't really give details, but I'll say that I've implemented the approach that I mentioned for interfacing six masters to DDR and the logic resources consumed were less than but roughly comparable to that consumed by a single DDR controller. I had all the same issues that you're aware of regarding how you need to properly control DDR to get good performance and all.

The Arb() function that I implemented is also parameterized, so that I could use it to interface effectively with a PCI bridge as well without changing any code (only the parameter settings when instantiating the entity).

I'd suggest keeping along that train of thought as you go forward but keep refining it.

Here is where Wishbone lets you down a bit. There was a discussion on this newsgroup called 'JOP on Avalon' or something like that. It was primarily between myself and two others where we discussed the relative merits of Wishbone, Avalon and SimpCon. You might want to peruse that a bit since with Wishbone you have to go a bit outside of the normal spec by using what Wishbone calls 'tags' to get the full performance on the DDR. It's not really violating Wishbone, it's just not built into it as cleanly as it is with Avalon or (from my limited knowledge of) SimpCon. The issue is that 'tags' are not required to be implemented in any specific way but Wishbone has sort of set aside a particular way of tagging that will help you get the full performance.

The key in any of this though is the realization that the address phase and the data phase of any transaction are independent. A master device can initiate a second command on the address bus even before the first has completed. Even if you're not considering Altera's Avalon as a bus for your design, their documentation of that bus and how those two phases of a bus cycle are treated is very good and worth the read. Pay attention to the section regarding 'slave side read latency' and then compare that to Wishbone. It's good reading and may give you a somewhat different perspective and can certainly help with this arbitration function even if you don't implement using Avalon.

Nope, that's up to the arbitrator...if it has been given the knowledge of the concept of 'bursts' and further parameterized by 'burst sizes' and 'address boundaries'.

Shouldn't need to go that way though....crossing a page boundary should at worst cause wait states on the user's master side if it is hammering memory. If the user is lightly touching it then it shouldn't even cause that.

It will probably consume more logic resources your way which could impact price.

Guess I'm confused a bit. If it needs to be 'open source' then it would seem that standardizing on Wishbone would be a good thing, and having tuned the 'users' to a DDR interface would be less flexible. This might be just a case of where you draw the boundary around the 'box'. Maybe from the perspective of someone using your widget they don't care directly about 'user1'...'user4', just that there are 4 of them and they can all talk to DDR, and whether or not you use a standard interface to implement them is about as relevant as whether you code your state machines in the 'two process' template or the 'one process' template.

What I've found, though, is that using a logically complete handshaking protocol does not impose much if any extraneous logic resource usage, and using that protocol even on internal interfaces that nobody really cares about is actually an aid in getting everything debugged and working properly, with the added benefit that other people can more readily understand those internal interfaces (if needed) since they conform to an established protocol.

It may not be given your particular constraints.

You're welcome. Good luck on your design

KJ

Reply to
KJ

hi

in/out busses: low-master, high-master, bottleneck-chain

and you would need three instances.

cheers

Reply to
jacko

I'm looking at the OpenCores DDR controller for reference and educational purposes; that's what it appears most suited for. I'm all new to VHDL and FPGA design, mind you, but one has to start somewhere. I did a fair amount of 7400-series logic design around 1980-1982, but things are a bit more intricate now.

This is what I consider the most important paragraph of your response, and based on this I'll probably abandon the multiple DDR aware master idea.

My understanding of Wishbone is certainly incomplete; I had forgotten it was a master/slave system for connecting 2 endpoints.

What I had been thinking of was sort of like one of the Xilinx buses (OPB?) where they just wire-OR the control signals together, and all inactive bus drivers are supposed to drive their signals to logic 0 when they don't own the bus.

This boils down to: each of the 4 "masters" just has some representation of the DDR's pins, plus a mechanism to request the bus. Until the bus has been granted, each master shuts up. Once the bus is granted, the owning master can diddle the lines, and it's as if that single master were controlling the DDR itself.

Now in retrospect it occurs to me the main benefit of something like that would be in minimizing latency, but only in the case where the DDR is mostly inactive. If it's frequently being used, then each master must wait for its turn anyway and latency is out the window.

I'm starting to like this approach. Each master could then just queue up an access, say:

WRITE = address + some number of 32-bit words of data to put there
READ = address + the number of words you want from there

In either case the data gets shoved into a FIFO owned by the master. Once the transaction is queued up, the master just needs to wait until it's done.
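The queued-up command I'm picturing would look something like this (a sketch only; the field widths are guesses and the names are made up):

library ieee;
use ieee.std_logic_1164.all;

package master_cmd_pkg is
  -- What a master pushes into its command FIFO.  For a WRITE the command word
  -- is followed by 'length' data words in the same FIFO; for a READ, 'length'
  -- words come back later through the master's return path.
  type ddr_cmd_t is record
    is_write : std_logic;                      -- '1' = WRITE, '0' = READ
    addr     : std_logic_vector(25 downto 0);  -- word address in the DDR
    length   : std_logic_vector(7 downto 0);   -- number of 32-bit words
  end record;
end package;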

Let's see what the masters are:

1) CPU doing cache line fills + flushes, no single-beat reads/writes
2) Batches of audio data for read
3) Video data for read
4) Perhaps DMA channels initiated by the CPU, transferring from BRAM to memory, say for ethernet packets

For #2, #3 and #4, latency isn't an issue. For #1, latency can be minimized if the CPU uses BRAM as a cache, which is the intent.

Thanks for taking the time to write all that! Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

Since routing multiple 32+ bit buses consumes a fair amount of routing and control logic that needs tweaking whenever the design changes, I have been considering ring buses for future designs. As long as latency is not a primary issue, the ring bus can also be used for data streaming, with the memory controller simply being one more possible target/initiator node.

Using dual ring buses (clockwise + counter-clockwise) to link critical nodes can take care of most latency concerns by improving proximity. For large and extremely intensive applications like GPUs, the memory controller can have multiple ring bus taps to further increase bandwidth and reduce latency - look at ATI's X1600 GPUs.

Ring buses are great in ASICs since they have no a priori routing constraints. I wonder how well this would apply to FPGAs, since these are optimized for linear left-to-right data paths, give or take a few rows/columns. (I did some preliminary work on this and the partial prototype reached 240 MHz on a V4LX25-10, limited mostly by routing and 4:1 muxes IIRC.)

--
Daniel Sauvageau
moc.xortam@egavuasd
Matrox Graphics Inc.
1155 St-Regis, Dorval, Qc, Canada
514-822-6000
Reply to
Daniel S.

Hi Daniel, here is my suggestion. For example, say there are 5 components which have access to the DDR controller module. What I would like to do is:

  1. Each of the 5 components has an output buffer shared with the DDR controller module;
  2. The DDR controller module has an output bus shared by all 5 components as their input bus.

Each word has an additional bit to indicate whether it is data or a command. If it is a command, it indicates which component the output bus is targeting. If it is data, it belongs to the targeted component.

Output data streams look like this: Command; data; ... data; Command; data; ... data;

In the command word, you may add any information you like. The biggest benefit of this scheme is that it has no delays and no performance penalty, and it uses the minimum number of buses.
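For example, one word on the shared output bus may be laid out like this (only a sketch; the widths here are examples):

library ieee;
use ieee.std_logic_1164.all;

package outbus_pkg is
  -- One word on the shared DDR output bus.  The top bit marks command words;
  -- a command word names the target component, and the data words that follow
  -- belong to that component until the next command word arrives.
  subtype outbus_word_t is std_logic_vector(32 downto 0);
  constant CMD_BIT   : natural := 32;  -- '1' = command word, '0' = data word
  constant TARGET_HI : natural := 31;  -- command word: target component number
  constant TARGET_LO : natural := 29;  --   (3 bits, so up to 8 components)
  -- the remaining bits of a command word are free for length, address echo, etc.
end package;

Each component just watches for a command word carrying its own number in the target field and then takes the data words that follow.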

I don't see that a ring bus has any benefits over my scheme.

In the ring situation, you must have (N+1)*2 buses for N >= 2. In my scheme, you need only N+1 buses, where N is the number of components, excluding the DDR controller module (for example, with N = 5 that is 12 buses for the ring versus 6 here).

Weng

Reply to
Weng Tianxiang

Weng,

Your strategy seems to make sense to me. I don't actually know what a ring bus is. Your design seems appropriate for the imbalance built into the system -- that is, any of the 5 components can initiate a command at any time, but the DDR controller can only respond to one command at a time. So you don't need a unique link to each component for data coming from the DDR.

However, thinking a little more on it, each of the 5 components must have logic to ignore the data that isn't targeted at itself. Also, in order to be able to deal with data returned from the DDR at a later time, a component might store it in a FIFO anyway.

The approach I had sort of been envisioning involved 2 FIFOs for each component: one for commands and data going from the component to the DDR, and the other for data coming back from the DDR. The DDR controller just needs to decide which component to pull commands from -- round robin would be fine for my application. If it's a read command, it need only stuff the returned data into the right FIFO.

I don't know, I think I like your approach. One can always add a 2nd fifo for read data if desired, and I think the logic to ignore others' data is probably trivial...

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

Not sure what is being 'shared'. If it is the actual DDR output pins then this is problematic....you likely won't be able to meet DDR timing when those DDR signals are coming from and spread out to 5 locations instead of just one, as they would be with a standard DDR controller. Even if it did work for 5, it wouldn't scale well (i.e. to 10 users of the DDR).

If what is 'shared' is the output from the 5 components that feeds into the input of the DDR controller, then you're talking about internal tri-states, which may be a problem depending on which target device is in question.

You haven't convinced me of any of these points. Plus, how it would handle the peculiarities of DDRs themselves, where there is a definite performance hit for randomly thrashing about in memory, has not been addressed.

A unique link to an arbitrator though allows each component to 'think' that it is running independently and addressing DDR at the same time. In other words, all 5 components can start up their own transaction at the exact same time. The arbitration logic function would buffer up all 5, selecting one of them for output to the DDR. When reading DDR this might not help performance much but for writing it can be a huge difference.

That's one approach. If you think some more on this you should be able to see a way to have a single fifo for the readback data from the DDR (instead of one per component).

KJ

Reply to
KJ

Hi, my scheme is not only a strategy but a finished piece of work. The following discloses more of it.

  1. What sharing between one component and the DDR controller means is this: the output FIFO of a component is shared between that component and the DDR controller module; the component uses the write half and the DDR controller uses the read half.
  2. The output FIFO uses the same technique I mentioned in the previous email: command words and data words are mixed, but there is more to it: a command word contains either a write or a read command.

So in the output FIFO, the data stream looks like this:
Read command, address, number of bytes;
Write command, address, number of bytes; Data; ... Data;
Write command, address, number of bytes; Data; ... Data;
Read command, address, number of bytes;
Read command, address, number of bytes;
...

  3. On the DDR controller side, there is a small piece of logic to pick read commands out of the incoming command/data stream and put them into a read command queue that the DDR module uses to access read commands. You don't have to worry about why a read command is placed behind a write command: for all components, if a read command is issued after a write command, the read command cannot be executed until the write data is fully written into the DDR, to avoid disturbing the write/read order.
  4. The DDR controller has its own output FIFO and a separate output bus. The output FIFO acts as a buffer that decouples the DDR's own operations from the output function.

The DDR controller reads data from the DDR memory and puts it into its output FIFO. An output bus driver picks up data from the DDR output buffer and puts it on the output bus in the format the target component likes best. The output bus is then shared by the 5 components, which read their own data, like a wireless communication channel: they only listen for and take their own data from the output bus, never interfering with the others.

  5. All components work at their full speeds.

  6. The arbiter module resides in the DDR controller module. It doesn't control which component should output data; it controls which FIFO should be read first, to keep it from filling up, and determines how to insert commands into the DDR command stream that is sent to the DDR chip. In that way, all the output FIFOs work at full speed according to their own rules.

  7. Every component must have a read FIFO to store data read from the DDR output bus. One cannot skip the read FIFO, because you must be able to adjust the read speed for each component, and read data on the DDR output bus disappears after 1 clock.

In short, each component has a write FIFO whose read side is used by the DDR controller, and a read FIFO that picks its data off the DDR controller output bus.

As a result, the number of wires used for communication between the DDR controller and all the components is dramatically reduced, by more than 100 wires for a 5-component system.

What is the other problem?

Weng

Reply to
Weng Tianxiang

Weng,

OK, I'm a bit clearer now on what you have. What you've described is (I think) also functionally identical to what I was suggesting earlier (which is also a working, tested and shipping design).

It's not quite what I suggested, though. A better partitioning would be to have the FIFOs and control logic in a standalone module. Each component would talk point to point with this new module on one side (equivalent to your components writing commands and data into the FIFO). The function of this module would be to select (based on whatever arbitration algorithm is preferable) and output commands over a point to point connection to a standard DDR controller (this is equivalent to your DDR controller's 'read' side of the FIFO). This module is essentially the bus arbitration module.

Whether implemented as a standalone module (as I've done) or embedded into a customized DDR controller (as you've done), you end up with the same functionality; it should result in the same logic/resource usage and in a working design that can run the DDRs at the best possible rate.

But in my case, I now have a standalone arbitration module with standardized interfaces that can be used to arbitrate totally different things other than DDRs. In my case, I instantiated three arbitrators that connected to three separate DDRs (two with six masters, one with 12) and a fourth arbitrator that connected 13 bus masters to a single PCI bus. No code changes were required, only the generic settings when instantiating the module, essentially 'tuning' it to the particular usage.

One other point: you probably don't need a read-data FIFO per component; you can get away with just one small FIFO inside the arbitration module. That FIFO would not hold the read data, just a code telling the arbiter who to route the read data back to. The arbiter would write this code into the FIFO at the point where it initiates a read to the DDR controller. The read data itself could be broadcast to all components in parallel once it arrives back. Only one component, though, would get the signal flagging that the data was valid, based on a simple decode of the above-mentioned code that the arbiter put into the small read FIFO. In other words, this FIFO only needs to be wide enough to handle the number of users (i.e. 5 masters would imply a 3-bit code) and only deep enough to handle whatever the latency is between initiating a read command to the DDR controller and the data actually coming back.
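In sketch form the return path looks something like this (names invented; the small routing-tag FIFO itself is not shown, and this assumes one returned word per tag, so a burst read would hold the tag for its full length with a beat counter that's omitted here):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity read_return is
  generic (
    NUM_PORTS : positive := 5;
    TAG_BITS  : positive := 3;
    DATA_W    : positive := 32
  );
  port (
    clk          : in  std_logic;
    ddr_rd_valid : in  std_logic;                              -- read data arriving from the DDR controller
    ddr_rd_data  : in  std_logic_vector(DATA_W-1 downto 0);
    tag          : in  std_logic_vector(TAG_BITS-1 downto 0);  -- head of the routing-tag FIFO
    tag_pop      : out std_logic;                              -- advance the tag FIFO as data is consumed
    rd_data      : out std_logic_vector(DATA_W-1 downto 0);    -- broadcast to every master
    rd_valid     : out std_logic_vector(NUM_PORTS-1 downto 0)  -- one-hot: whose data this is
  );
end entity;

architecture rtl of read_return is
begin
  tag_pop <= ddr_rd_valid;  -- consume one tag per returned word

  process(clk)
  begin
    if rising_edge(clk) then
      rd_valid <= (others => '0');                   -- default: nobody has data this cycle
      if ddr_rd_valid = '1' then
        rd_data <= ddr_rd_data;                      -- the data itself goes to all masters
        rd_valid(to_integer(unsigned(tag))) <= '1';  -- only the tagged master sees 'valid'
      end if;
    end if;
  end process;
end architecture;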

KJ

Reply to
KJ


Hi KJ,

  1. My design never uses a modular design methodology. I use one big file to contain all the logic statements, except for modules from Xilinx cores.

If a segment is to be reused for another project, a simple copy and paste does the same thing the module methodology does, but the signal names never change across all the functional blocks.

  2. An individual read FIFO is needed for each component. The reason is that issuing a read command and getting the data back are not synchronous, so each component must have its own read FIFO to store its own read data. After the read data has fallen into its read FIFO, each component can decide what to do next based on its own situation.

If only one read buffer is used, big problems arise. For example, if you have a PCI-X/PCI bus and its module has read data, it cannot transfer the read data until it gets control of the PCI-X/PCI bus. That process may take very long, for example 1K clocks, causing other read data to be blocked by the single read buffer.

  3. Strategically, by using my method one has great flexibility to do anything you want at the fastest speed and with the minimum number of wire connections between the DDR controller and all the components.

Actually, in my design there is no arbiter, because there is no common bus to arbitrate. There is only write-FIFO select logic to decide which write FIFO should be picked first to write its data into the DDR chip, based on many factors, not just on whether a write FIFO has data.

The write factors include:
a. write priority;
b. whether the write address falls into the same bank+column as the current write command;
c. whether the write FIFO is approaching full, depending on its source data input rate;
d. ...

  4. Different components have different priorities for access to the DDR controller. Imagine, for example, there are 2 PowerPCs, one PCI-E, one PCI-X, and one gigabit stream. You might set up the priority table like this to handle read commands:
a. the two PowerPCs have top priority and equal rights to access the DDR;
b. PCI-E may be the lowest priority, because it is a packet protocol and any delays do little damage to performance, if any;
c. ...

Weng

Reply to
Weng Tianxiang

Hi KJ, if you like, please post your module interface to the group, and I will point out which wires would be redundant if my design were implemented.

"In my case, I instantiated three arbitrators that connected to three separate DDRs (two with six masters, one with 12) and a fourth arbitrator that connected 13 bus masters to a single PCI bus."

What you did is expand the PCI bus arbiter idea to the DDR input bus. In my design the DDR doesn't need a bus arbiter at all. The components connected to the DDR controller have no common bus to share, and they provide better performance than yours. So from this point of view, my DDR controller interface has nothing in common with yours. Both work, but with different strategies.

My strategy is more complex than yours, but with the best performance. It saves a middle write FIFO in the DDR controller: the DDR controller has no write FIFO of its own, it uses the components' write FIFOs as its write FIFO, saving clocks and memory space and getting the best performance out of the DDR controller.

Weng

Reply to
Weng Tianxiang

This is an interesting point. I just finished "VHDL for Logic Synthesis" by Andrew Rushton, a book recommended in an earlier post a few weeks ago, so I bought a copy. Rushton goes to great pains to say, multiple times:

"The natural form of hierarchy in VHDL, at least when it is used for RTL design, is the component. Do not be tempted to use subprograms as a form of hierarchical design! Any entity/architecture pair can be used as a component in a higher level architecture. Thus, complex circuits can be built up in stages from lower level components."

I was convinced by his arguments and examples. I'd think having a modular component approach wouldn't harm you, because during synthesis redundant interfaces, wires and logic would likely get optimized away. So the overriding factor is choosing which is easiest to implement, understand, maintain, share, etc., i.e. human factors.
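For anyone else as new to this as I am, the pattern Rushton is pushing looks like this in miniature (a toy example, using direct entity instantiation):

library ieee;
use ieee.std_logic_1164.all;

-- lower-level building block: a registered AND gate
entity and_reg is
  port ( clk, a, b : in std_logic; y : out std_logic );
end entity;

architecture rtl of and_reg is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      y <= a and b;
    end if;
  end process;
end architecture;

library ieee;
use ieee.std_logic_1164.all;

-- higher-level design built from the entity/architecture pair above
entity top is
  port ( clk, p, q, r : in std_logic; z : out std_logic );
end entity;

architecture rtl of top is
  signal t : std_logic;
begin
  u0 : entity work.and_reg port map (clk => clk, a => p, b => q, y => t);
  u1 : entity work.and_reg port map (clk => clk, a => t, b => r, y => z);
end architecture;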

Having said that, as a C programmer I almost never create libraries. I have source code that does what I want, for a specific task. Later, if I have to do something similar, I go look at what I've already done and copy sections of code out as needed. A perfect example is the Berkeley sockets layer: the library calls are so obscure that all you want to do is cut and paste something you managed to get working before, to do the same thing again... The alternative would be to wrap the sockets interface in something else, supposedly simpler. But then it wouldn't have all the functionality...

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley
