There have been some previous threads concerning problems with the OCM bus, like these two:
- "Sharing BRAM between Xilinx PowerPC's (on data-OCM ports)"
- also titled "Sharing BRAM between Xilinx PowerPC's (on data-OCM ports)"
To echo what has been discussed in these threads, our research team has experienced several issues with the OCM bus. Specifically, we have been using BRAMs connected to the OCM bus through the EDK-provided OCM controller, version 3.00.a, on the Virtex 2 Pro family. While these issues did not manifest themselves on the XC2VP30 part (on the XUP board) for this project, we experienced them on the XC2VP70 parts (on the BEE2 board). To understand how our design is structured, please refer to page 4, figure 4a of the paper on this project.

In all, we saw three bugs:
Bug 1
=====
PowerPC 405 pseudo code
....
    store @ address A with data B to OCM BRAM
    couple ALU ops (setting up load/store base addresses in registers)
    store @ address C with data D to OCM BRAM
    some ALU ops (setting up load/store base addresses in registers)
    store @ address E to PLB to our PLB pcore (300 cycles - TCC cache in our design)
    load @ address A, returns data D from OCM BRAM
Issue: In this scenario, the OCM load returns the data of the last OCM store (D), regardless of which address was written and which was read. For some reason, the OCM bus and/or BRAM controller appears to be buffering/caching the last value stored. We have only observed this buffering when the 300-cycle PLB store is present. A shorter PLB store does not reproduce the effect.
Workaround: We inserted a sync instruction after the PLB store.
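To make the workaround concrete, here is a rough C sketch of the Bug 1 sequence with the barrier placed after the PLB store. The function and pointer names (mem_sync, bug1_sequence, ocm_a, ocm_c, plb_e) and the PLB data value e are made up for illustration and are not from our actual code. The non-PowerPC fallback exists only so the sketch compiles on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Full memory barrier. On a PowerPC target this emits the `sync`
 * instruction; elsewhere it falls back to a GCC/Clang builtin so the
 * sketch can be built and exercised on a host machine. */
static inline void mem_sync(void)
{
#if defined(__powerpc__) || defined(__PPC__)
    __asm__ volatile ("sync" ::: "memory");
#else
    __sync_synchronize();
#endif
}

/* Replays the Bug 1 access pattern with the workaround applied.
 * ocm_a, ocm_c: hypothetical pointers into the OCM BRAM (addresses A and C)
 * plb_e:        hypothetical pointer to the slow PLB pcore (address E)
 * Returns the value loaded from address A, which should be b. */
static uint32_t bug1_sequence(volatile uint32_t *ocm_a,
                              volatile uint32_t *ocm_c,
                              volatile uint32_t *plb_e,
                              uint32_t b, uint32_t d, uint32_t e)
{
    assert(ocm_a != ocm_c);  /* A and C must be distinct addresses */

    *ocm_a = b;   /* store @ A with data B to OCM BRAM */
    *ocm_c = d;   /* store @ C with data D to OCM BRAM */
    *plb_e = e;   /* long-latency store @ E to the PLB pcore */
    mem_sync();   /* workaround: barrier right after the PLB store */
    return *ocm_a;  /* load @ A; without the barrier we saw D here */
}
```

On the real hardware the pointers would be fixed addresses in the OCM and PLB address ranges; those values depend on the system's address map and are not shown here.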
Bug 2
=====
Issue: Accesses that fall within the address range for the OCM BRAM would appear on the PLB bus. Unfortunately, since there is no slave device on the PLB bus within that address range, we get a PLB bus error. We discovered this problem using ChipScope.
Workaround: We could not find a software-based workaround as we did for Bug 1, so we moved the OCM BRAMs to the PLB bus.
Bug 3
=====
pseudo-code
    loop_start:
        data = load from A            # port A of BRAM connected to OCM controller
        if (data == flag) goto exit   # write is done thru port B, connected to our pcore
        goto loop_start
    exit:
Issue: Sometimes the processor is stuck in this loop forever, even though the BRAM entry at address A has been written with the flag by our pcore through the BRAM's other port (user_switch in figure 4a of the paper). Both ports use the same clock (100 MHz) and the same clock edge. Moreover, we have conflict resolution logic in our pcore: if the pcore detects that port A's enable signal is asserted, it stalls the write going to port B until port A's enable signal is deasserted. In spite of this, about 1 in a million settings of the flag variable fails (i.e. the processor is stuck in this infinite loop). At first, we suspected that the datapath to our pcore or the pcore itself was failing, but we verified that the datapath was bug-free. Bug 1 led us to suspect that, in the cases where there is a failure, the OCM bus or BRAM controller is buffering the variable's data so that it misses the update from our pcore.
Workaround: We implemented a high-level software time-out mechanism: after the processor has looped for a certain period of time, it sends a retry notification to the flag setter. This solution works reliably.
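For anyone who wants to try the same approach, the time-out loop can be sketched in C roughly as follows. This is not our actual code. The names (wait_for_flag, retry_fn, count_retry) are invented for the example, the time-out is counted in polls rather than measured with a timer, and the upper bound on retries is an addition so that the sketch always terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that asks the flag setter to write the flag again. */
typedef void (*retry_fn)(void *ctx);

/* Polls *addr until it equals flag. After every max_spins unsuccessful
 * polls it calls request_retry, up to max_retries times in total.
 * Returns true once the flag is seen, false if every attempt times out. */
static bool wait_for_flag(volatile const uint32_t *addr, uint32_t flag,
                          uint32_t max_spins, uint32_t max_retries,
                          retry_fn request_retry, void *ctx)
{
    assert(max_spins > 0);

    for (uint32_t attempt = 0; attempt <= max_retries; ++attempt) {
        for (uint32_t spin = 0; spin < max_spins; ++spin) {
            if (*addr == flag)
                return true;          /* normal exit from the polling loop */
        }
        /* Timed out: notify the flag setter unless this was the last attempt. */
        if (attempt < max_retries && request_retry != NULL)
            request_retry(ctx);
    }
    return false;
}

/* Stand-in notifier for experiments: just counts how often it was called. */
static void count_retry(void *ctx)
{
    ++*(uint32_t *)ctx;
}
```

In a real system the callback would signal the flag setter through whatever channel is available, and max_spins would be tuned to the expected write latency.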
Summary: Eventually, we removed all the OCM buses from our design and migrated the BRAMs to the PLB buses. Using PLB BRAMs eliminated all three bugs without requiring the software workarounds we had previously developed. While we cannot comment on the OCM interface of the Virtex 4 or Virtex 5, we recommend against using the OCM BRAMs on the Virtex 2 Pro family (especially on the XC2VP70 parts).
Performance Note: We initially chose to use the OCM bus because of its short latency as compared to the PLB. To compare the latencies of using BRAMs on the two buses, we wrote the following test code in PowerPC assembly:
    turn on instruction cache (use PLB BRAM to store code)   # loop code fits inside PowerPC cache
    set loop iterator register                               # 32k iterations
    set base address
    start PowerPC timer
    loop_start:
        load
        ...
        load                                                 # 100th load
        bdnz loop_start   # decrement loop iterator and goto loop_start if it is not zero
    read PowerPC timer
    cycles per load = timed cycles / (number of iterations * 100)
The following are the results for the OCM and PLB BRAM controllers:
    OCM load  =  2 cycles
    OCM store =  2 cycles
    PLB load  = 11 cycles
    PLB store =  8 cycles
On the PLB, stores take less time than loads because the ack for a store is asserted as soon as the data arrives at the controller, whereas the ack for a load is asserted only once the data has been read from the BRAM. Also, turning on the instruction cache shortens the latencies.
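The last line of the test code (cycles per access = timed cycles / (iterations * accesses per iteration)) can be written as a small C helper. The function name and signature are made up for this sketch, and reading the PowerPC timer itself is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per access from two timer readings.
 * timer_start, timer_end: timer values read before and after the loop
 * iterations:             number of loop iterations
 * accesses_per_iter:      unrolled loads (or stores) per iteration */
static double cycles_per_access(uint64_t timer_start, uint64_t timer_end,
                                uint32_t iterations,
                                uint32_t accesses_per_iter)
{
    assert(timer_end >= timer_start);
    assert(iterations > 0 && accesses_per_iter > 0);

    /* Widen before multiplying so the product cannot overflow 32 bits. */
    uint64_t total = (uint64_t)iterations * accesses_per_iter;
    return (double)(timer_end - timer_start) / (double)total;
}
```

Taking "32k" to mean 32768 iterations (an assumption), the loop issues 3,276,800 accesses, so a timer delta of 6,553,600 cycles works out to 2 cycles per access. The cost of the bdnz branch is included in the measured time, but it is spread over 100 accesses per iteration.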