Sharing BRAM between Xilinx PowerPC's (on data-OCM ports)

- Jeff Shafer
  posted Thu, Feb 2, 2006 10:31 PM

Hi,

Question summary: Can I successfully share a Xilinx dual-port BRAM between two PowerPC data-OCM's where it is possible to write and read the same location at the same time without corrupting the data? (by different processors from different ports of the BRAM) I don't care if the read returns the old data or the new data from the write, but I want the read results to be deterministic and repeatable without corrupting the write.

The (lengthy) details:

I have an XC2VP30 FPGA with two PPC 405 processors. In an EDK project, I am trying to use the data-side OCM port on each processor to connect to a dual-port BRAM. This BRAM will be used as a common scratchpad that is accessible from both processors.

The problem I experience is best demonstrated in a simple producer/consumer software program. PowerPC #1 is the producer and writes each BRAM location with 1 of 13 pre-determined 32-bit values. It runs in a tight loop and repeatedly fills the memory.

PowerPC #2 is the consumer and repeatedly reads each BRAM location in a tight loop. If the value it reads is not one of the 13 pre-determined values that could be written, an error has occurred.

**Because each processor loops forever and runs different code of different lengths, it is possible for PowerPC #1 to be writing the same address at the same time that PowerPC #2 is attempting to read it from the opposite memory port.**

The BRAM is 8kB in size. Out of 1 million BRAM reads, approximately 10-20 have invalid results that should not appear anywhere in memory. These errors are randomly distributed through the entire test length and do not cluster at the beginning or end. Immediately re-reading that error location a second time will typically read a correct value, but not always. (Sometimes we can read for a hundred times without getting a legit value, at least until PPC #1 loops around and re-writes that location again). I interpret this to mean that *most* of the time, just the read is corrupted, but that *sometimes* the write itself failed to fully update the memory.
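For anyone who wants to reason about the pass/fail criterion away from the hardware, here is a minimal host-side Python sketch of the test logic described above. It models only the check itself, not the OCM or the BRAM timing. The 13 pattern values and the names `PATTERNS`, `produce`, and `consume` are placeholders, since the actual values and code were not posted.

```python
# Host-side model of the producer/consumer check (not the real PPC code).
# Assumption: the 13 pattern values below are invented for illustration.
PATTERNS = tuple(0x01010101 * (i + 1) for i in range(13))
WORDS = 8 * 1024 // 4  # 8 kB scratchpad viewed as 32-bit words


def produce(mem, pass_no):
    """Producer: fill every word with one of the 13 known patterns."""
    for addr in range(len(mem)):
        mem[addr] = PATTERNS[(addr + pass_no) % len(PATTERNS)]


def consume(mem):
    """Consumer: return the addresses whose value is not a known pattern."""
    valid = set(PATTERNS)
    return [addr for addr, value in enumerate(mem) if value not in valid]
```

Any address returned by `consume` corresponds to what the post calls an invalid result: a value that neither the old nor the new write could have produced.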

In the VII-Pro Users Guide, I see there are specifications regarding writing and reading to the same address in a dual-port BRAM at the same time. Our PowerPCs (300 MHz) and data-OCM interfaces (100MHz) are all rising-edge aligned and would seem to be vulnerable to this situation.

I tried changing the BRAM write mode from "WRITE_FIRST" to "READ_FIRST" via a UCF constraint in hopes of getting legitimate reads out of the second port. The error behavior improved slightly but was still present. I verified that the constraint was successfully applied in the FPGA Editor.

I also tried running both OCM ports (or both Power PC's) on inverted clocks so they were out of phase. Unfortunately, we want both processors to share the same PLB bus, so this technique is also not possible.

**In the error tests, both the instruction and data caches are on in both processors and are set to only cache the PLB-based BRAM, not the scratchpad in question. However, if I turn the data cache completely *OFF* for both processors (but leave the I-cache on), then no errors are reported. We can run the test for hours. Too bad we need that cache for a real system, although there's not really much data to cache in the test program. We have played for hours with different PPC instructions to make the memory guarded, enforce in-order execution, flush the cache (even though the OCM is supposedly not cached), and have had zero luck.**

In the FPGA editor, the scratchpad BRAM blocks (4 of them) are perfectly placed in between the two processors, not shoved to a far-off corner of the chip or anything like that.

I guess my question is: Has anyone ever successfully shared a BRAM between two PowerPC data-OCM's where it is possible to write and read the same location at the same time? I would be perfectly happy if the read produced the old value at that location and not the new value. (Which is what I thought changing the write mode to READ_FIRST would produce).

Thanks,

Jeff

- Joseph
  posted Fri, Feb 3, 2006 7:28 AM

Jeff,

I experienced some of this same quirkiness with a shared DSOCM BRAM and couldn't pin down the cause. This non-determinism led me to use a shared PLB BRAM, where I experienced no such problems. If your system requirements will allow it, I would switch to using the PLB bus since there isn't much documentation out there on using cooperating PPCs, let alone via sharing a DSOCM BRAM. On the other hand, if you figure out what is going on, I'd love to hear about it!

Good luck, Joey

- Jeff Shafer
  posted Fri, Feb 3, 2006 8:09 AM

Thanks Joey, I'll definitely keep everybody up to date if I get this working. Xilinx support keeps sidestepping the issue. I already have shared memory across the PLB working with no problems, but it's kinda slow. I'd much rather use the 3-cycle OCM instead. It just seems like such an obvious way to connect the two processors. Of course, if I have to implement a lock in the PLB BRAM to get the OCM BRAM to share properly, that might negate much of the performance advantage of using the OCM in the first place.

Jeff

- Sylvain Munaut
  posted Fri, Feb 3, 2006 7:52 PM

As I understand, you use 1 port always for read and another always for write.

So you could clock the write port on the negative edge even while keeping the two processors and their PLB bus on the same clock. Let's say clk is your main clock. Instead of feeding di, dip and wren directly to the BRAM, register them in the clk domain, then connect the outputs of these registers to the respective ports of the BRAM and feed "not clk" to the wrclk pin of the BRAM.
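The timing argument behind this suggestion can be sketched with a small Python model. Time is counted in half-cycles of clk, so rising edges fall on even ticks and falling edges on odd ticks. The function names are made up for illustration, and the model assumes an ideal clock with no skew and a read port that samples only on rising edges.

```python
# Idealized timing model, in units of half a clk period.
# Assumptions: no clock skew, reads sample only on rising edges of clk,
# writes pass through one clk-domain register and commit on "not clk".

def read_edge_times(n_cycles):
    """Rising edges of clk (read port): ticks 0, 2, 4, ..."""
    return [2 * cycle for cycle in range(n_cycles)]


def write_commit_time(capture_edge):
    """Tick at which a registered write reaches the BRAM.

    capture_edge is the index of the rising edge on which the pipeline
    register captures di/dip/wren; the BRAM write port then commits on
    the next falling edge, half a period later.
    """
    return 2 * capture_edge + 1
```

Under these assumptions every write lands on an odd tick and every read on an even tick, so the two ports never clock the same cell on the same edge. The cost is the extra register stage of latency on the write path.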

Sylvain

- Jeff Shafer
  posted Fri, Feb 3, 2006 10:08 PM

Thanks for the idea Sylvain. That would work for the test program I described, but that was written only to illustrate the problem. In reality, we'd like to have this shared scratchpad readable and writable by both PowerPCs. So while registering and inverting the clock would work for the writes, I think it would corrupt reads by that processor....

Thanks,

Jeff

- Sylvain Munaut
  posted Fri, Feb 3, 2006 11:34 PM

It depends on the controller. It would work if you can modify it to tolerate more pipelining (i.e. the read data doesn't appear at T+1 but at T+2, or at T+3, depending on whether your timing margin requires you to re-register dout).

But the two processors are not equal ... one has less latency ...

- Joseph
  posted Thu, Mar 2, 2006 9:37 PM

Not sure if this is relevant to your problem anymore, but I found the problem with my design (briefly described above). There is only one signal that can be assigned in the dsbram_if_cntlr, that is BRAMDSOCMCLK (is this what you were running at 100MHz?). We were running our PPCs at 200MHz and our PLB bus at 100MHz using proc_clk_s and sys_clk_s, respectively. It was such a habit to assign all non-processor clks to sys_clk_s, that we did so for BRAMDSOCMCLK. After reading the data sheet for dsbram_if_cntlr, we found that the BRAMDSOCMCLK signal needed to be 1-4X the processor clk. The slow clock we gave to the BRAM caused our unpredictable behavior. If anything, I have learned to read those data sheets a little better. This may or may not be the same as your problem, Jeff, but thought I would follow up with the fix we came up with for our system. We are running a producer/consumer type system as well and haven't had any problem with inconsistencies. The shared BRAM is a circular FIFO in our system. I could provide details on its operation if you still have trouble with your system.
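The details of that circular FIFO were not posted, so the following is only a generic single-producer/single-consumer ring buffer in Python, not Joseph's actual design. The class and method names are invented. The point it illustrates is that each index has exactly one writer, and the producer only touches slots the consumer is not currently allowed to read.

```python
class SharedRing:
    """Generic SPSC ring buffer model (hypothetical, for illustration)."""

    def __init__(self, capacity):
        self.slots = [0] * capacity
        self.head = 0  # next slot to write; updated only by the producer
        self.tail = 0  # next slot to read; updated only by the consumer

    def push(self, word):
        """Producer side: store a word, or return False if the ring is full."""
        nxt = (self.head + 1) % len(self.slots)
        if nxt == self.tail:
            return False  # full: one slot is always left empty
        self.slots[self.head] = word  # write the data first
        self.head = nxt               # then publish it
        return True

    def pop(self):
        """Consumer side: return the oldest word, or None if empty."""
        if self.tail == self.head:
            return None
        word = self.slots[self.tail]
        self.tail = (self.tail + 1) % len(self.slots)
        return word
```

In a real two-processor system the `head` and `tail` words would themselves live in shared memory and could still be read while being written, so this structure reduces data-slot collisions rather than eliminating every simultaneous access.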

- Jeff Shafer
  posted Fri, Mar 3, 2006 4:19 PM

Hi Joseph,

Hmmmn, let me try that. Nothing else has seemed to work so far, aside from stripping out everything in the system except for the processors and OCM, which obviously is no "solution". That just happened to have "magic" placement and no problems. I have my system set up the same way you had yours originally (with the dsbram_if_cntlr at 100 MHz and the PowerPC at 300 MHz). I'll let you know what happens next week. We've got a big project deadline in the next few days that's keeping me busy, and, aside from this OCM issue, the system is performing great. We optimized our software enough to not need the second processor at the moment, so I can safely ignore this for a little while. :-)

Thanks,

Jeff

- Jeff Shafer
  posted Tue, Apr 4, 2006 5:00 AM

Hi Joseph,

Thanks again for your help. Alas, I'm still not having consistent results with my setup. Sometimes place-and-route will give me a system that passes the producer-consumer tests, and sometimes it won't. Tiny tweaks in the UCF file seem to have a large impact on whether the final result passes or fails (e.g. if I overconstrain a 100MHz clock to 9.9ns versus 10ns).

Anyway, I'm curious about the solution you found for your system. You recommended running the BRAMDSOCMCLK input to the dsbram_if_cntlr at 1-4x the processor clock, right? Where in the data sheet for the controller does it say this is required? Maybe I've got an older version with EDK 7.1 because I can't find it. You're correct, though, that this clock signal is passed directly through the controller module and becomes the clock input to the BRAM (either BRAM_CLK_A or _B).

As far as I can find, the whole point of setting the clock ratio in the C_DSCNTLVALUE parameter in the dsocm module is to allow a *slower* BRAM clock than PowerPC clock, not a *faster* clock. On the VIIPro, you can go up to a ratio of 4:1. This is from the PowerPC block reference guide in the table that shows the ratio of CPMC405CLOCK : BRAMDSOCMCLK.
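As a quick sanity check on the numbers in this thread, here is a small Python helper that computes the CPMC405CLOCK : BRAMDSOCMCLK ratio and rejects anything outside 1:1 to 4:1. The function name is made up, and the requirement that the ratio be a whole number is an assumption on my part; the reference guide table should be treated as the authority. It does not compute the C_DSCNTLVALUE encoding.

```python
from fractions import Fraction

MAX_RATIO = 4  # upper limit quoted above for the VIIPro


def ocm_clock_ratio(cpu_mhz, ocm_mhz):
    """Return the CPU:OCM clock ratio as an int, or raise ValueError.

    Both arguments must be integer MHz values.
    Assumption: only whole-number ratios from 1 to MAX_RATIO are legal.
    """
    ratio = Fraction(cpu_mhz, ocm_mhz)
    if ratio.denominator != 1 or not 1 <= ratio <= MAX_RATIO:
        raise ValueError(f"unsupported CPU:OCM ratio {ratio}")
    return int(ratio)
```

With the figures mentioned here, `ocm_clock_ratio(300, 100)` gives 3 and `ocm_clock_ratio(200, 100)` gives 2, both inside the allowed range, while an OCM clock faster than the processor (a ratio below 1) is rejected.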

I guess the problem with clocking the BRAM : PowerPC at a 1:1 ratio is that it's impossible for me to do. I run my PowerPC's at 300 MHz, and there's no way I can clock the rest of the controller logic and the 4 BRAMs at that speed. (The mapper tells me it's impossible even given perfect routing). I'm actually surprised you were able to route the BRAM successfully at 200MHz, as the best I can get is maybe 125 MHz given the rest of the constraints of the design. Good work!

Thanks,

Jeff

- Joseph
  posted Tue, Apr 4, 2006 8:01 PM

Jeff,

Well, your surprise at my clocking was warranted... I checked and I seem to be running everything at 100Mhz now. Now I am curious if I can get the speeds up as well. My processors do communicate well over the OCM still. I think what happened for me was that I saw the 4:1 ratio in the block reference guide and realized that my different clocks were causing a problem. I probably, at that point, made everything run at 100Mhz and gave BRAMDSOCMCLK the same clock as the processor to be certain I would get 1:1. When I posted to the group, I got the ratio backwards (upside-down?). Thanks for setting me straight!

Like you, I have moved on and around this problem a bit. I was surprised to see the clock values I had in my system. If I get time, I may adjust them... maybe try the whole system at 200Mhz and see what happens. Or just try 4:1 to see if things work. Speed isn't as important in my system as it seems to be in yours; coherent communication is what matters. Oh, those touchy OCMs...

Joey

- Jeff Shafer
  posted Wed, Apr 5, 2006 1:08 AM

Ok, thanks Joey, I just wanted to make sure I wasn't missing something obvious with this OCM problem. At least it works on your system! What I'll probably try is to do a place-and-route with just the PowerPC's and OCMs in the system. The last time I tried that on an otherwise empty system the memory worked just fine. Then, I can use the FPGA editor to capture that routing to a UCF file and just lock the placement down. (Some other post mentioned a "Directed Routing Constraints" command to try).

Wish me luck...

Jeff