Making a 32KB BRAM block, Virtex-4

Hi all,

I am working on building a second-level cache for the MicroBlaze processor on an ML403 board, which is in the Virtex-4 family. I need 32KB of space for the cache data; the tags stay in a separate block. The problem I am having is that each BRAM primitive holds only 2KB, so I am in the uncomfortable situation of having to use 16 different variables.

Is there a way I can combine the 16 primitives and get a 32KB block RAM? If so, please give some details and some links with information on this.

Thanks in advance,

Bhanu

Reply to
Bhanu Chandra

Yes, write some code (VHDL or Verilog) that instantiates 16 BRAMs and defines how you want them to be connected.

Decode the upper 4 address bits of your 32K address space into a 'chip select', which you then use to pick which one of the 16 BRAMs you want to write to.

Those same 4 address bits also drive the 'select' input of a 16:1 mux that chooses which BRAM's data output becomes the 'read data' output of your 32K memory.
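
Something along these lines, for example (a rough Verilog sketch only; an 8-bit data path is assumed here, so 16 banks of 2K x 8, and all names and widths are illustrative; the idea is that the synthesizer maps each 2K x 8 array onto one RAMB16):

module cache_data_32k (
    input             clk,
    input             we,
    input      [14:0] addr,    // 32K byte locations
    input      [7:0]  wdata,
    output     [7:0]  rdata
);
    wire [3:0]      bank_sel = addr[14:11]; // upper 4 bits act as the "chip select"
    reg  [3:0]      bank_sel_q;             // delayed to line up with the one-cycle BRAM read latency
    wire [16*8-1:0] rdata_flat;             // per-bank read data, flattened

    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : bank
            reg [7:0] mem [0:2047];          // 2K x 8, expected to infer one RAMB16
            reg [7:0] dout;
            always @(posedge clk) begin
                if (we && (bank_sel == i))   // write only the selected bank
                    mem[addr[10:0]] <= wdata;
                dout <= mem[addr[10:0]];     // registered read, as in a real BRAM
            end
            assign rdata_flat[8*i +: 8] = dout;
        end
    endgenerate

    always @(posedge clk)
        bank_sel_q <= bank_sel;

    assign rdata = rdata_flat[8*bank_sel_q +: 8];   // 16:1 read-data mux
endmodule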

Kevin Jennings

Reply to
KJ

Or, for better timing, split the block RAMs bit-wise.

For example, you need 16 block RAMs, so if you want a 32-bit-wide memory, use the RAMB16_S2_S2 primitive (for dual port) and assign 2 bits of your data bus to each memory. The addresses are common to all of them.
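
A rough Verilog sketch of that organization, assuming a 32-bit data path (8K words) and only a single port for brevity. The primitive can be instantiated directly as described; the sketch below instead infers an 8K x 2 array per slice, which should map onto the same RAMB16_S2, just to show the wiring:

module cache_data_8kx32 (
    input             clk,
    input             we,
    input      [12:0] addr,    // 8K words x 32 bits = 32KB
    input      [31:0] wdata,
    output     [31:0] rdata
);
    genvar i;
    generate
        for (i = 0; i < 16; i = i + 1) begin : bitslice
            reg [1:0] mem [0:8191];     // 8K x 2, expected to map onto one RAMB16_S2
            reg [1:0] dout;
            always @(posedge clk) begin
                if (we)
                    mem[addr] <= wdata[2*i +: 2];  // 2 bits of the data bus per BRAM
                dout <= mem[addr];                 // common address, registered read
            end
            assign rdata[2*i +: 2] = dout;         // no output mux needed
        end
    endgenerate
endmodule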

/MikeJ

formatting link

Reply to
MikeJ

It would be better to set the BRAMs up as 16Kx1, using as many as you need for the data width, and then a simple 2:1 mux to select between two banks for the 32K size. This eliminates a lot of the external logic by using more of the BRAMs' internal decode, so you get considerably better timing and power dissipation. It also turns out to be much easier to route, because each BRAM carries only one read-data bit and one write-data bit rather than the full width of the bus.
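
For instance, a minimal Verilog sketch of this arrangement, assuming an 8-bit data path (32K x 8 built from two banks of eight 16Kx1 BRAMs); names and widths are illustrative only:

module cache_data_32kx8 (
    input            clk,
    input            we,
    input     [14:0] addr,
    input     [7:0]  wdata,
    output    [7:0]  rdata
);
    wire [7:0] rd_bank0, rd_bank1;
    reg        bank_q;    // addr[14], delayed one cycle to match the BRAM read latency

    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : bit_lane
            reg mem0 [0:16383];   // 16K x 1, expected to map onto one RAMB16_S1
            reg mem1 [0:16383];
            reg d0, d1;
            always @(posedge clk) begin
                if (we && !addr[14]) mem0[addr[13:0]] <= wdata[i];
                if (we &&  addr[14]) mem1[addr[13:0]] <= wdata[i];
                d0 <= mem0[addr[13:0]];
                d1 <= mem1[addr[13:0]];
            end
            assign rd_bank0[i] = d0;
            assign rd_bank1[i] = d1;
        end
    endgenerate

    always @(posedge clk)
        bank_q <= addr[14];

    assign rdata = bank_q ? rd_bank1 : rd_bank0;   // the only external logic: a 2:1 mux
endmodule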

Reply to
Ray Andraka

Mike, Ray, Absolutely right. I must've had my 'slow performance' mode hat on for some reason.

Kevin

Reply to
KJ

Just a note for the archive: I agree with Ray, but if power consumption is a consideration, you need to experiment with both implementations. Some architectures have power burn strongly affected by the number of enabled BRAMs.

Reply to
Tim

Either way you have the same number of BRAMs unless your data bus is the right width to take advantage of the parity bits in the wider configuration to reduce the BRAM count. I don't recall what the OP's width was, but I was thinking it was 16 bits, in which case the parity bits aren't used.

In any event, the extra logic needed to mux 32 banks of 16/18-bit-wide BRAMs, rather than the 2:1 mux needed to select from pairs of 16Kx1 banks, is going to consume far more power than an additional BRAM.

I didn't mention it in my original post, but the logic resources used are less for the 16Kx1 implementation as well. Generally speaking, you want to use the deepest aspect ratio that fits with your design. The exceptions come in for special cases where the number of BRAM available is limited and using the parity bits will reduce the BRAM count.

Reply to
Ray Andraka

What you say may be true in almost all cases - I haven't done the comparison across the many and various FPGAs. But it certainly isn't true for at least one Xilinx family - if power consumption is an issue it's worth making the checks and knowing for sure.

Reply to
Tim

Tim, I am failing to see it. If you are building a 16 x 32K memory, for example, you could do it with 32 16Kx1 BRAMs plus 16 2:1 muxes, or you could do it with 32 1Kx18 BRAMs plus 16 16:1 muxes, which occupies about 3x the number of LUTs. Same number of BRAMs, more logic.

If you have a width where you use the parity bits, then yes, there is a difference in the BRAM count: for example, for an 18 x 32K memory you use 36 16Kx1 BRAMs or 32 1Kx18 BRAMs. In that case you use 4 more BRAMs to save a relatively small number of LUTs, and the power consumption is probably less.

In the general case though, the answer depends on the candidate BRAM organizations, and this is only true because the extra memory density is only available in the x9, x18, and x36 configurations.

Reply to
Ray Andraka

The point I saw in his earlier post: if you have 32 16kx1 memories, you're enabling at least 16 memories. If you use 32 1kx16 memories, only one needs to be enabled.

If the power for 15 enabled RAM access cycles (versus disabled cycles) is significantly greater than the power for the multiplexer logic and increased routing burden, the power question isn't a gimme. If all 32 memories are always enabled in both schemes, the point is moot; the 16kx1s will win out. If only the decoded memory is enabled, the difference might be large. Or it might not.

- John_H

Reply to
John_H

Yes. But the power consumption can also vary depending on how many BRAMs are active. It depends on the BRAM implementation and I haven't looked at the V4 case. By effectively freezing most of the BRAMs at any moment, the power hit goes down. It's the same with external SRAM and DRAM, where the power consumption goes up and down with the number of banks active, the number of RAMs active, and the read/write/refresh state.

For the common case, where the finer details of power saving aren't a concern, the point you made is the most relevant - use as much as possible of the BRAM's internal decode logic. It's probably faster, almost certainly it uses less power, and it's free.

Reply to
Tim

External SRAMs usually disable themselves completely after they are deselected and the last burst has ended. Assuming the BRAM functional schematics in the V4/V5 specifications are architecturally accurate for the few details shown, it seems that each BRAM port's address registers (and the address decoder they sit in front of) operate regardless of the port's 'enable' signal, which appears to control only read, write and output-latch operations - not address decoding.

If so, the only ways to completely 'suspend' a V4/V5 BRAM (prevent internal activity) would be to either stop the clock (regional clock mux?) or freeze the addresses (plus data, for a few extra nanowatts, maybe) before the BRAMs. Both approaches imply an extra wait state before accessing a disabled BRAM (which can be hidden by pipelining the BRAM controls), and the second one would also cost far too many extra FFs and routing, most likely undoing any savings in the process.

I think clock gating combined with the 16:1 BRAM mux approach would have the highest chance of achieving a measurable power reduction. On the V5, this might be an even better candidate: 36Kbit BRAMs mean half as many BRAMs to mux/control, and six-input LUTs enable fast, efficient, single-slice 2x(8:1) muxes.
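
As a very rough illustration of the clock-gating part only: one hypothetical bank with its BRAM clock gated by a BUFGCE (simulate against the unisim models). Spending sixteen global buffers this way is almost certainly impractical in a real design (regional clocking would make more sense), the bank select has to be valid a cycle early (the extra wait state mentioned above), and whether the tools preserve the intent would need checking:

module gated_bank #(
    parameter [3:0] BANK_ID = 4'd0
) (
    input             clk,
    input      [3:0]  bank_sel,   // decoded upper address bits, valid one cycle early
    input             we,
    input      [10:0] addr,
    input      [7:0]  wdata,
    output reg [7:0]  rdata
);
    wire sel = (bank_sel == BANK_ID);
    wire clk_gated;

    // Xilinx global clock buffer with clock enable.  When CE is low the
    // output clock stops, so the bank sees no clock edges at all,
    // including its internal address/decode logic.
    BUFGCE u_bufgce (
        .I  (clk),
        .CE (sel),
        .O  (clk_gated)
    );

    reg [7:0] mem [0:2047];       // 2K x 8, one RAMB16 per bank
    always @(posedge clk_gated) begin
        if (we)
            mem[addr] <= wdata;
        rdata <= mem[addr];
    end
endmodule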

In any case, it sounds like quite a bit of trouble only to optimize tens of microwatts away from the bottom line.

--
Is LUT6 the greatest thing since sliced bread? Maybe not - but finally 
having single-LUT 4:1 muxes is certainly great.
Reply to
Daniel S.

It has been 5 years since I did that sort of design, but at the Spartan-2/2e generation the power savings were not microwatts - they dominated power consumption on a 100MHz design. Could have been a design wrinkle - my designs have been known to have wrinkles!

It would be interesting to hear from Peter or Austin what the situation is now - if ENA is deasserted, does BRAM power drop right down? This may be answered in the User Guides - I'll get to read them one day...

Reply to
Tim

The user guide schematics and general description only scratch the surface. As I said, the schematic does look like ENA leaves the address decoder active but user specs are often only rough approximations of the actual hardware implementation's architecture, presented only to give the user an idea of what the thing is about. Having first-hand feedback about the scope of ENA's (near-)gate-level side-effects as a power-saving measure would certainly be ideal.

Since I have some V2Pros here, I went to check whether their datasheets had more concrete details, but they turned out to be even sparser. Maybe I will put together a dramatization of this case using four sets of 32 BRAMs each (16K x 32 bits with 16Kx1 vs. 512x32 BRAMs... and ENA vs. clock gating, assuming this trick works with the V2P) and see what happens.
Reply to
Daniel S.

