Spartan 3E Not enough block ram.

I am trying to port a design (video scaler) from Virtex 4 to Spartan 3E.
Currently I am having trouble with not having enough block RAMs.

My reference (top) design uses only 15 block RAMs, but after wrapping a
"wrapper" module around my top design, the block RAM count shoots up to 60.
Thus, I decided to look into this wrapper module. It instantiates a lot of
dp_bram modules. So I went to look into this dp_bram module and found the
following code. (I took out the relevant portions of it for easier
understanding.)

(Just for note: This wrapper module is the wrapper around the video scaler
for trial synthesis.  It includes two instances of the scaler, as well as a
simple bus interface for the control register inputs.)

--Information in Entity portion--
data_width           :  integer  := 8;
mem_size             :  integer  := 1920
wr_addr  :  in    std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
rd_addr  :  in    std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
din      :  in    std_logic_vector((data_width - 1) downto 0);


--Architecture portion--
process (wr_clk)
   begin
      if (wr_clk'event and wr_clk = '1') then
         if (ce = '1') then
            if (wr_en = '1') then
               mem_array(conv_integer('0' & wr_addr)) <= din;
            end if;
         end if;
      end if;
   end process;

   process (rd_clk)
   begin
      if (rd_clk'event and rd_clk = '1') then
         if (ce = '1') then
            if (rd_en = '1') then
               dout <= mem_array(conv_integer('0' & rd_addr));
            end if;
         end if;
      end if;
   end process;

Now I guess this mem_array (mem_array(conv_integer('0' & wr_addr)) <= din;)
is the main culprit using the block RAMs for data storage, right?
Hmm, so how should I go about solving this problem of not having enough
block RAMs? Since I have some DDR SDRAM on my board, I have tried to learn
about DRAM and DRAM controllers, but whoa, it is a little too overwhelming
to understand (little experience with FPGAs here). Could anyone please
simplify the usage of DRAM? (Or maybe it really is not that simple? Hmm.)

Oh yah, I have also tried looking at Xilinx memory interface generator for
DDR SDRAM controller. Trying to figure how to use it and later how to
integrate into my design.

(Gosh, why can't I just have the mem_array automatically use the
DRAM...)






Re: Spartan 3E Not enough block ram.
As I think has been answered on here before, DRAM has a complex
sequence of commands that must be issued to it; there is no way to
"just know how to use it".  An interface will probably take on the
order of 500-1000 slices in your part.  If this is an eval board, I'd
be surprised if the stuff the board came with didn't have SOME sort of
a DRAM interface in the examples... but I don't know.

Yes, the construct you pointed out is most likely what is inferring
the block rams.  To confirm that, look at the "language templates"
section of ISE.  It will show you exactly the constructs that infer
different pieces of hardware.

It sounds like the only way to do what you want is to use the external
ram.  Which means finding or designing a DRAM interface.  You will
also, however, need to redesign the hardware AROUND the memory, since
it will not operate as fast or as smoothly as the block rams.  Most
likely, the DRAM interface will be buffered by a FIFO to the remainder
of the FPGA.  You will have worse latency and bandwidth as compared to
block ram.

Sorry for the bad news... but it doesn't sound like this is a cut and
paste.  You're gonna have to teach yourself a little bit about
hardware and vhdl and fpgas and then you're gonna have to do some
actual design - go figure.







Re: Spartan 3E Not enough block ram.

1920x8bits = 15.3kbits, this is less than one BRAM... I am guessing
"data_width" and "mem_size" are generics and the actual parameters on the
instance are larger than that or there are multiple instances of it. If
those are the actual parameters and there is only one instance, this code
fails to explain the 44 extra BRAMs. Even if it was x8bytes, this would
still be only 8 BRAMs instead of 45. It seems like your posting is lacking
some critical details that make it impossible for us to make educated
guesses. Also, having a wrapper around your 'top' design for a synthesis
implementation is suspicious.

The code you posted is a BRAM inference wrapper for a dual port RAM with
independently clocked read and write ports, the first real questions are:
how many times is this generic wrapper used, what are the instance
parameters in each case and what are they for?

Since a very decent scaler can be achieved with five lines worth of video
data that would require 15 BRAMs for inputs and three more for output
buffering, chances are that your scaler's wrapper needs a diet unless it
does other fancy things you may not be aware of.


Because DRAMs have refresh cycles, row activation, row precharge and
numerous other quirks designers have to take care of before initiating any
actual data transfers... I told you so last week. Even if Xilinx had
hardware memory controllers, you would still have to work with the variable
latency and possible read re-ordering.

Re: Spartan 3E Not enough block ram.

Well, the reference design data file did say that the purpose of the wrapper
is to minimize the number of external interconnects required by the scaler so
that it can fit into a realistic target device. My wrapper instantiates
2 sequential lookup tables, 6 horizontal coefficient tables and 3
vertical coefficient tables.

Hmm, I guess I will be looking at other designs which have interface with
the DDR SDRAM controller and from there, try to understand the interface and
I hope to be able to do the same for my video scaler design.

Thanks a lot for confirming that the code is a BRAM inference wrapper
for a dual port RAM with independently clocked read and write ports.

Yeah, and sadly the evaluation board doesn't come with any DRAM interface
guide or the like. It only comes with the pin numbers and a few
brief descriptions, alas.



Re: Spartan 3E Not enough block ram.

So your scaler uses an FIR filter for scaling with 3x6 grid sampling and
the coefficient sets for each available scaling factor are stored in
BRAMs... yup, this can cost a handful of BRAMs - and you will have a hard
time dumping these or your video line buffers in DRAMs.

You really need to look into exactly what is consuming how many BRAMs how
and why as I suggested in my previous message. Coefficient tables and video
line buffers will be difficult to shift into DRAM: you will need some BRAMs
to buffer data to/from the DRAMs and are not going to be any better off if
you end up with as many FIFO BRAMs as you originally needed plain BRAMs for
the initial design... actually, you will be worse off given the extra
glue-logic.

If coefficient tables are eating up those ~40 BRAMs, you may be in serious
trouble since DRAMs cannot be programmed by a bitfile - you will need some
method of initializing the DRAM. Analyze the design carefully to see how
the BRAMs are used, there may be a few reduction tricks that can be applied
to spare a few.


Even if you find a suitable DRAM controller to paste into your design
overnight, reworking your scaler to work with it will require some
significant effort.


The coding style is obvious to anyone who has spent any significant amount
of time getting 'fancy' BRAM inferences to work. The independent read and
write port BRAM, independently clocked or not, is the easiest to get right
to the point of being nearly impossible to mess up.


If you want to interface DRAMs, I suggest you start by looking up the
datasheets for your board's DRAMs. At the very least, it will give you an
idea of how kludgey a DRAM interface can be. The road to a fast and stable
DRAM interface is full of bumpy quirks.

Re: Spartan 3E Not enough block ram.

Yeah, true. Well, my supervisor asked me to start from an existing
design and at least learn how a DRAM controller works.
But no matter, I still can't figure out how to even go about "pasting" it. It
really seems complicated with all the addresses, data and control
signals.

Heh, oops. I guess I really haven't spent a significant amount of time on
FPGAs yet.
Well, sometimes I really wonder how designers know which features, and
how many of each, will be used, and why the code they write will
use exactly that feature in the FPGA chip. Like, if I use an array for storing
some data, block RAMs will be used? And if I use *, multipliers will be
used? Sorry if this is just too inane to answer. It's just that for synthesis,
everything seems so automatic. Well, I do know about the language templates,
but they look really different.


Hmm, I guess it looks good to perform some reduction tricks that you
mentioned. I had a question on hand though.
Let's say I have the following instantiated modules under my wrapper module
as shown:
pvs_wrapper
                - h_seqlut_inst - dpbram - rtl (dp_bram.vhd)
                - v_seqlut_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_0_0_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_0_1_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_1_0_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_1_1_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_2_0_inst - dpbram - rtl (dp_bram.vhd)
                - h_coeff_2_1_inst - dpbram - rtl (dp_bram.vhd)
                - v_coeff_0_inst - dpbram - rtl (dp_bram.vhd)
                - v_coeff_1_inst - dpbram - rtl (dp_bram.vhd)
                - v_coeff_2_inst - dpbram - rtl (dp_bram.vhd)
                - pvs_top_structural (pvs_top.vhd)
                    - ...
(erm you can ignore the pvs_top and onwards as those will be for the
original design modules)

Now each dp_bram.vhd has the following code:
mem_array(conv_integer('0' & wr_addr)) <= din;
dout <= mem_array(conv_integer('0' & rd_addr));
which, as you mentioned, will be using the dual port RAMs. Thus, I was
thinking of maybe combining some of the modules together? Hmm, though the
usage of block RAMs would still be the same.
Are there any common techniques that you know people have used when they
wish to reduce BRAMs?
(Well, the trial synthesis did write out a report, and it said that 60 block
RAMs were used, so I guess that they would have already tried to optimize the
usage and 60 block RAMs is the lowest they could go.)


Hmm, can this be worse off? Well, FIFO BRAMs will definitely be needed, but it
would certainly be fewer than without the DRAM, right?
If not, what purpose would the DRAM serve? (Which is to store data.)




Re: Spartan 3E Not enough block ram.

The DRAM control/data/etc. signals are one thing, making the DRAM work
exactly the way you want it to is quite another - you need to familiarize
yourself with DRAMs' internals... learn what row activation and precharge
do, why these are necessary, how they can affect your design and how you
can work around these delays by doing pipelined burst transfers. There are
a bunch of other quirks that can be exploited or have to be avoided, the
ones enumerated are simply the more fundamental ones IMO... and do not
forget those auto-refresh cycles.


These things are automatic only to the extent where the HDL coder follows
some limitations. For SRAMs/ROMs/registers to get mapped onto BRAMs, the
synthesis tools must be able to reduce the access/data logic down to
something supported by the hardware. For a BRAM, this means the logic must
be reducible to no more than two read+write+address+clock sets. Depending
on the target device, there may be additional restrictions such as
read-write policies - write first, read before write or no change.

Multipliers also have their share of quirks, particularly if you want to do
pipelining.


Look at your synthesis reports and pay attention to each BRAM's inference data
and port mappings. Look for memories that are under 8kbits and are not
using both read and write ports - these may be mergeable if they use the
same clocks.


1) FIR filters are often symmetric: the nth tap (n=0..N) has the same
coefficient as the (N-n)th one... it is unlikely that this optimization has
not already been done if applicable to your filter but double-checking is
cheap.

2) If your coefficient tables (ROMs?) use under half a BRAM and only one
port, you should be able to merge two tables into one BRAM by using both
ports for reading: map one address to "'0' & addrA" and the other to "'1' &
addrB".

3) Examine the tables to find redundancies and equivalences, it may be
possible to multiplex accesses to the coefficient tables.
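
As a rough sketch of trick 2), assuming the tables are read-only and each fits in half a BRAM, something like the following could pack two of them into one memory. The entity name dual_rom, the 512x18 table size and the all-zero placeholder contents are made up for illustration and are not taken from the posted design. Whether the tools really map this onto a single dual-port BRAM should be checked in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: two 512x18 read-only tables sharing one memory.
-- Table A lives in addresses 0..511, table B in addresses 512..1023.
entity dual_rom is
   port (
      clk   : in  std_logic;
      addrA : in  std_logic_vector(8 downto 0);
      addrB : in  std_logic_vector(8 downto 0);
      doutA : out std_logic_vector(17 downto 0);
      doutB : out std_logic_vector(17 downto 0)
   );
end dual_rom;

architecture rtl of dual_rom is
   type rom_type is array (0 to 1023) of std_logic_vector(17 downto 0);
   -- Placeholder contents only: replace with the real coefficients.
   -- An all-zero table would simply be optimized away by synthesis.
   constant rom : rom_type := (others => (others => '0'));
   signal fullA : std_logic_vector(9 downto 0);
   signal fullB : std_logic_vector(9 downto 0);
begin
   -- The extra top address bit selects which table each port sees.
   fullA <= '0' & addrA;
   fullB <= '1' & addrB;

   process (clk)
   begin
      if rising_edge(clk) then
         doutA <= rom(to_integer(unsigned(fullA)));
         doutB <= rom(to_integer(unsigned(fullB)));
      end if;
   end process;
end rtl;
```

Both reads share one clock here, which matches the "same clocks" condition mentioned above. If the tables must stay writable, this read-only sketch does not apply as is.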


How much data do you need to put in the DRAMs? 60 BRAMs x 2KB each = 120KB
max., assuming you intended to put everything on the DRAMs. The question
you really need to ask yourself is: can you afford the glue-logic?

Try ripping the memory controller off any project with DRAM controller you
may have handy, do an unconstrained implementation run (use a slow clock
like 50MHz and the memory controller's top as your synthesis project's top)
and see how resource-hungry the memory controller you have is.

Re: Spartan 3E Not enough block ram.

Yeah, I learnt a little about DRAM refreshing in school and it
was difficult.
Hmm, I seriously need some ultra pure basics on how to use DRAM. Any such
books or websites or whatever?


Meaning the structure of the HDL code is written such that synthesis
will infer what kind of devices to use.
(Amazing... write first or write later also has an effect on the device being
used...) Oh, the XST guide offered a lot of help on this.

Wow, the synthesis report is so cool. It tells me a lot of information, like
which warnings belong to which modules, the devices inferred from
each module (like adders, subtractors, etc.) and analysis of different
values for the data types.
Thumbs up!

Anyway got this information on the BRAMs
57 rams
      RAMB16_S2_S2                : 2
      RAMB16_S36_S36            : 18
      RAMB16_S4_S4                : 25
      RAMB16_S9_S9                : 12
All are 16kbit RAMs with different port widths for A and B (as indicated by
16_Sx_Sy)

The nth tap has the same coefficient as the (N-n)th one.
Hmm, I have only an H tap and a V tap. I don't see a series of taps being
used, though.
Just for info's sake, I have already tried reducing the H taps to 4 and
keeping the V taps at the default 4 to fit the multipliers and also to reduce
the number of BRAMs being used.
But I can't find anything anywhere about the coefficients of the taps.

Nah, no chance. From my final synthesis report, all of them are using dual
port RAMs. Anyway, the code was already written for dual port BRAM, so there
shouldn't be any reason that my coefficient tables would be using under half
a BRAM, ya?


Hmm... erm... I'm not sure how to go about this method.
Anyway, regarding the tables, I guess the tables of coefficients have not
yet been used for input in the design (thus I felt kinda puzzled by your 1st
and 3rd methods). Right now, I'm just synthesizing designs which some of
them will process the tables later on. So what the tables have inside are
the issue here.
All the code in each of those sequential lookup and coefficient tables
uses the same dp_bram, and the wrapper instantiating them passes
different values to them.

Oh yah, well, I'm not sure how much data is needed, but going by how many
BRAMs are used, with each BRAM being 16kbits: 57 x 16k = 912k. But anyway, the
board I will be using has four DDR SDRAM memory chips (512Mbit). And since
they could provide so much storage, surely my chip should be able to handle or
afford the glue-logic.

Hmm, oh yah, I have used 4570 out of 14752 slices. Now I read that I can use
distributed RAM from the remaining slices. I think it is not
enough to reduce the BRAMs until there is no overloading, but it should
certainly reduce them to an extent.
I went to check out the language templates for using distributed RAMs,
but... they all support only 1-bit data storage with many address
bits (what use can this be?). Hmm, but then I read that these
distributed RAMs can be combined to form wider data storage (some sort
of combination).
But then again, I don't know how to go about doing it. I tried to find out
more about it and got these 2 methods:
1) Use the "HDL Coding Techniques" in your XST User Guide.
2) Specifying constraints (see your Constraints Guide), or by
"instantiating" library parts that force the logic to be created a certain
way (see your Libraries Guide). For example, the RAM_STYLE constraint lets
you force either Block RAM or Distributed RAM.
Whoa, that RAM_STYLE constraint looks simple to use :) I tried that and not
even a single change happened. :(
So I thought of looking at 1) and found out that my code already had this
constraint in use:
<<attribute   ram_style      of mem_array : signal is "block">>;
Thus, I changed it to <<attribute   ram_style      of mem_array : signal is
"pipe_distributed">>
However, my synthesis now seems to run forever.

Currently, I'm using 57 out of 36 BRAMS (after lowering the H taps)
57 - 36 = 21 => 378000 bits of ram...





Re: Spartan 3E Not enough block ram.

I'll repeat myself: read some DRAM's specs, you can start with those on
your board - most DRAM manufacturers do a reasonably thorough job at
describing how DRAMs work.


The memory access policy will not change the "kind of device", all BRAMs
are the same but not all FPGAs implement all access policies. Devices prior
to the Virtex4 do not support read-before-write while write-before-read is
supported by all Xilinx devices I know of.


Synthesis and implementation logs are your friends, remember to inspect
them thoroughly and you will run a lower risk of getting chastised for
posting newbie questions... I can usually find answers to about 90% of my
would-be questions in there.


Any RAM mapped to block RAM uses at least one whole BRAM and on modern Xilinx
devices, all BRAMs are 16kbits so any sub-16kbits RAM you infer will consume
one 16kbits BRAM even if you use only 256bits.

This part of the report only tells you the port widths of the different
BRAMs used by your design, now you need to hunt down each of these
instances and see how many addresses each has to determine how many kbits
each actually uses.

If you are having a hard time determining which BRAMs are used where, you
should look at the "Macro Statistics" instead: the Macro Stats reports
memories with their actual inference parameters instead of the final
report's raw BRAM usage.

Macro Statistics
# Block RAMs                                           : 17
  256x16-bit dual-port block RAM                        : 1
  256x72-bit dual-port block RAM                        : 4
  512x32-bit dual-port block RAM                        : 8
  512x64-bit dual-port block RAM                        : 4

Macro statistics allow me to easily find out the size of all the memories
present in my design so I can cross-check with the final report to
determine whether or not the synthesis tools have mapped everything as
expected... and here, I do not remember what the 256x16 RAM is for so I'll
have to investigate where it came from next time I work on the project I
pasted this from.


I am not psychic... without knowing exactly how the BRAMs are being used, I
cannot tell if packing is applicable to your case. If your coefficient
table BRAMs use the same inference template and are duplicated to provide
multiple coefficients from a same table but are never written to
(effectively used as ROMs), you can create a dual read port template and
remove half of those BRAMs. If your coefficients need to be programmable,
this trick is still applicable but you will have to implement the trickier
"true dual-port BRAM" template and manage writes somehow. At this point, I
think this probably is your best avenue.


If RAMs are left alone (no pragmas, no attributes, no force options in tool
settings), synthesis tools will automatically map memories to available
resources: they will try to map all large-ish memories to BRAMs until all
BRAMs are used and then start using distributed (LUT) memory for smaller
memories.


It takes forever because Map/PAR is unable to complete routing when all
memory gets forced into LUT memory. Remove (comment) this attribute
altogether to let synthesis tools decide which memories should be dumped in
BRAMs and which ones should use distributed memory. Alternatively, you
could add a generic port to the template to specify block or distributed on
a per-instance basis.
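
One possible way to do the per-instance selection, sketched under assumptions: the generic use_block and the entity name sel_ram are hypothetical, a single clock is used for brevity, and it is assumed that your synthesis tool accepts declarations inside a generate statement (legal VHDL-93, but worth verifying). Each branch declares its own memory signal with a fixed ram_style string, so only literal attribute values are needed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sel_ram is
   generic (
      data_width : integer := 8;
      addr_width : integer := 11;
      use_block  : boolean := true   -- hypothetical: true = BRAM, false = LUT RAM
   );
   port (
      clk     : in  std_logic;
      wr_en   : in  std_logic;
      wr_addr : in  std_logic_vector(addr_width - 1 downto 0);
      rd_addr : in  std_logic_vector(addr_width - 1 downto 0);
      din     : in  std_logic_vector(data_width - 1 downto 0);
      dout    : out std_logic_vector(data_width - 1 downto 0)
   );
end sel_ram;

architecture rtl of sel_ram is
   type mem_array_type is array (0 to 2**addr_width - 1) of
      std_logic_vector(data_width - 1 downto 0);
   attribute ram_style : string;
begin
   -- Only one of the two generate branches is elaborated per instance.
   gen_block : if use_block generate
      signal mem_array : mem_array_type;
      attribute ram_style of mem_array : signal is "block";
   begin
      process (clk)
      begin
         if rising_edge(clk) then
            if wr_en = '1' then
               mem_array(to_integer(unsigned(wr_addr))) <= din;
            end if;
            dout <= mem_array(to_integer(unsigned(rd_addr)));
         end if;
      end process;
   end generate gen_block;

   gen_dist : if not use_block generate
      signal mem_array : mem_array_type;
      attribute ram_style of mem_array : signal is "distributed";
   begin
      process (clk)
      begin
         if rising_edge(clk) then
            if wr_en = '1' then
               mem_array(to_integer(unsigned(wr_addr))) <= din;
            end if;
            dout <= mem_array(to_integer(unsigned(rd_addr)));
         end if;
      end process;
   end generate gen_dist;
end rtl;
```

An instance would then pick its style in the generic map, for example use_block => false for the small tables only.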

Note: large distributed memories will become slow unless you add multiple
pipelining registers on their output so you should be careful when using these.


10k free slices * 2 LUTs per slice * 16bits per LUT = 320kbits of available
distributed RAM = no fit, assuming all your BRAMs are fully used. The fact
that synthesis did not flat out say that there was no fit means some of
your BRAMs are definitely not completely used.

Re: Spartan 3E Not enough block ram.

Yup I saw this
# RAMs                                                 : 24
 16x64-bit dual-port distributed RAM                   : 6
 1920x12-bit dual-port block RAM                       : 9
 1920x12-bit registered dual-port distributed RAM      : 3
 4096x36-bit dual-port block RAM                       : 1
 4096x9-bit dual-port block RAM                        : 2
 8x64-bit dual-port distributed RAM                    : 3
Since then I have been playing around with my code, and I can identify
which instances in the code are using which kind of RAM.


Currently I have tried changing some of the instances to use distributed
RAM by forcing the RAM_STYLE constraint to pipe_distributed.
Going down the list of instances that use block RAM, when I changed
it for my 6 horizontal and 3 vertical coefficient instances, voila, it
immediately dropped down to 39 out of 36 block RAMs!
Hmm, strange that it dropped so much.
Anyway, next I tried changing some more instances to use distributed
RAM. However, I have to be careful to strike a balance and not overshoot
the LUTs I have, nor the block RAMs.
Then I worked on some line buffer modules under my top module and, well,
after synthesis everything was well under the resource limits. However, the
problem came when I tried to implement it. The user constraint file that
belongs to the trial synthesis has a timing constraint, and my design's
timing was twice over this constraint.
Then I used the timing analyzer to cross-probe the problem and could see
that the path looked quite long.
So now I have tried to work on this problem by using some of the optimization
options in ISE.
Under the map properties, I set the map option level to high. The run
took really long and, in the end, I got this message.

The router has detected a very high timing score (5245937) for this design.
It is extremely unlikely the router will be able to meet your timing
requirements. To prevent excessive run time the router will change strategy.
The router will now work to completely route this design but not to improve
timing. This behavior will allow you to use the Static Timing Report and FPGA
Editor to isolate the paths with timing problems. The cause of this behavior
is either overly difficult constraints, or issues with the implementation or
synthesis of logic in the critical timing path. If you would prefer the router
continue trying to meet timing and you are willing to accept a long run time
set the option "-xe c" to override the present behavior.

I am thinking of just trying to meet the timing, but where and how can I set
the option "-xe c"? I don't see any DOS command line anywhere...



Re: Spartan 3E Not enough block ram.

This is not strange: block-RAMs have 36bits-wide ports at most. Since you
have some very small x64 memories that were previously forced into BRAMs,
they ended up costing two BRAMs each. With three 8x64 and six 16x64 RAMs,
this is 18 BRAMs recovered right there.


Distributed RAM is slow unless you give it many output register stages to
redistribute: each LUT can provide 16bits and these are patched together
with muxes to provide larger memories. Your address signals will also have
huge fanout which further contributes to the slowness. Since your 1920x12
distributed RAM probably only absorbed one register, the very long paths
you are seeing run from the address bits part of the way through
the output muxes down to the absorbed FFs, and then from those FFs through
the remaining address muxes to the destination FFs.


Do not bother with increasing PAR effort, this will do you no good. You
need to either put that 1920x12 RAM in BRAMs or add register stages that
synthesis will redistribute within the distributed memory to improve your
timing score. Start by adding two register levels to your 1920x12
distributed memory's output and your score will most likely drop from over
5M to possibly under 200k. Add extra registers until your timings are met
or improvements stall. After this, you will need to realign your processing
pipeline to account for the delays on this large distributed memory.

BTW, what was your LUT and slice-FF usage with that last attempt?

Re: Spartan 3E Not enough block ram.
Ok that's cool. I guess this is indeed an invaluable point to take in the
future for efficient usage of my chip next time.

Yeah, it indeed had high fanout, over a hundred!

Yah, I put that in BRAMs and used another instance (the vertical sequential
table) for distributed RAM.
Hmm, so I can add register levels? How do I go about that?
I tried selecting register duplication in both the synthesis and
implement design options, but to no avail.
So do you mean using FPGA editor?

Talking about FPGA editor, I was wondering about moving the problematic
source CLBs closer to the destination so as to cut down the delay? But there
are quite a few implications due to the other wire connections to those
CLBs.

Ah, forgot to save...
Well, I did another run with the instances of the vertical sequential table
and the horizontal and vertical coefficient tables using distributed RAM.
(Before this change, I remember the block RAM usage was 33 out of 36 with the
line buffer instance using distributed RAM, and the 4 input LUT usage was 84%.)
 Number of Slices:                    6718  out of  14752    45%
 Number of Slice Flip Flops:          9007  out of  29504    30%
 Number of 4 input LUTs:             13229  out of  29504    44%
    Number used as logic:             7010
    Number used as Shift registers:    459
    Number used as RAMs:              5760
 Number of IOs:                        322
 Number of bonded IOBs:                316  out of    376    84%
 Number of BRAMs:                       36  out of     36   100%
 Number of MULT18X18SIOs:               36  out of     36   100%
 Number of GCLKs:                        1  out of     24     4%

I figured this vertical sequential table would be the better choice for
tackling the problem after forcing it to use distributed RAM.



Re: Spartan 3E Not enough block ram.

Adding output registers is simple...

if rising_edge(clk) then
   memout_d1 <= memout;
   memout_d2 <= memout_d1;
   memout_d3 <= memout_d2;
   ...
end if;

You could add a generic port to your memory template to automatically
generate these delays to keep your upper-level HDL clean.
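
A minimal sketch of such a generic-controlled delay chain, assuming a separate helper entity (the names out_pipe and out_stages are invented here, not part of the posted dp_bram). The same loop could equally sit inside the memory template itself.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: delays memout by out_stages clock cycles.
entity out_pipe is
   generic (
      data_width : integer := 12;
      out_stages : integer := 3    -- number of output registers, must be >= 1
   );
   port (
      clk    : in  std_logic;
      memout : in  std_logic_vector(data_width - 1 downto 0);
      dout   : out std_logic_vector(data_width - 1 downto 0)
   );
end out_pipe;

architecture rtl of out_pipe is
   type pipe_type is array (1 to out_stages) of
      std_logic_vector(data_width - 1 downto 0);
   signal pipe : pipe_type;
begin
   process (clk)
   begin
      if rising_edge(clk) then
         pipe(1) <= memout;
         -- Empty loop range when out_stages = 1, so a single stage also works.
         for i in 2 to out_stages loop
            pipe(i) <= pipe(i - 1);
         end loop;
      end if;
   end process;

   dout <= pipe(out_stages);
end rtl;
```

One thing to watch in the synthesis report: a plain register chain like this may get extracted as a shift register (SRL16) instead of staying as individual FFs, in which case it cannot be redistributed into the memory muxes. XST has a shift register extraction option (SHREG_EXTRACT) that can be turned off if that happens.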


1920/16 = 120 LUT-Ms to mux each bit, this means 7 2:1 mux layers. LUTs can
do 2:1, slices do 2:1 and CLBs do 2:1 so you need to go through two full
CLBs and one LUT to do this 120:1 mux. I think you will be fine if you add
three output registers in the RAM's output path. Due to the high fan-out on
the address bits, an extra register there should also help. With all this,
you will get data four cycles after the request.


After you implement the extra registers for the distributed RAMs, your FF
usage should increase by about 2000. With resources currently under 50%, it
should considerably improve your PAR results without any other fancy
footwork. If parts of your design are mostly self-contained, you could
floorplan them to reduce the amount of time PAR will spend making guesses
about the optimal layout.

Re: Spartan 3E Not enough block ram.
Hmm...
I don't suppose you mean it this way?:
architecture rtl of dp2_bram is
type     mem_array_type    is array (0 to (mem_size - 1)) of
std_logic_vector((data_width - 1) downto 0);
signal   mem_array         :  mem_array_type;
attribute   ram_style      :  string;
attribute   ram_style      of mem_array : signal is "pipe_distributed";
signal       dout2 :  std_logic_vector((data_width - 1) downto 0);
signal  dout3 :  std_logic_vector((data_width - 1) downto 0);
signal   din2 :  std_logic_vector((data_width - 1) downto 0);
signal  din3 : std_logic_vector((data_width - 1) downto 0);

begin
   process (wr_clk)
   begin
      if (wr_clk'event and wr_clk = '1') then
         if (ce = '1') then
            if (wr_en = '1') then
     din2 <= din;
     din3 <= din2;
               mem_array(conv_integer('0' & wr_addr)) <= din3;
            end if;
         end if;
      end if;
   end process;

   process (rd_clk)
   begin
      if (rd_clk'event and rd_clk = '1') then
         if (ce = '1') then
            if (rd_en = '1') then
               dout3 <= mem_array(conv_integer('0' & rd_addr));
     dout2 <= dout3;
     dout <= dout2;
            end if;
         end if;
      end if;
   end process;

end rtl;

Well anyway after doing this... my implement design repeated itself for four
times, taking 2 hours of my time.
And then still having about the same timing constraint.

Oh yeah, I also tried the FPGA Editor. Luckily, this time the 3 failing timing
constraints all came from one CLB, so I could focus on just that CLB. I thought
of moving the CLB closer to its sources/destinations to shorten the routes. But
then the timing delays increased, and further moves did not change the timing
at all, no matter where I placed it above the original position.
Another engineer told me that it could have helped, since the routes may
contain buffers that delay the signals, so I had hoped it would work,
but alas it didn't.

Now I am wondering if the way I added the register levels is wrong. Well, I
should hope that it is wrong, since that would mean there is still a chance...



Re: Spartan 3E Not enough block ram.

If you delay the write data, you should also delay the write address,
otherwise you will have problems. Registers on the write side would mostly be
there to reduce the fanout, so you probably do not need more than one extra
input register here (for data, address and enable, since all three need to
be equally delayed).

So, your write process (using your coding style) would resemble this:

process (wr_clk)
begin
   if (wr_clk'event and wr_clk = '1') then
     -- first determine if something needs to be written on the next cycle
     -- register duplication will be applied to these if necessary
     if (ce = '1') then
       wr_en1   <= wr_en;
       wr_addr1 <= wr_addr;
       din1     <= din;
     else
       wr_en1   <= '0';
     end if;

     -- then do the actual write
     if (wr_en1 = '1') then
       mem_array(conv_integer('0' & wr_addr1)) <= din1;
     end if;
   end if;
end process;



You probably do not want to put the delays within your enable block... and
like the write, you probably want one register level on the address. With
both tweaks, the process should look like this:

process (rd_clk)
begin
   if (rd_clk'event and rd_clk = '1') then
     if (ce = '1') then
       rd_en1   <= rd_en;
       rd_addr1 <= rd_addr;
     else
       rd_en1   <= '0';
     end if;

     if (rd_en1 = '1') then
       dout2 <= mem_array(conv_integer('0' & rd_addr1));
     end if;

     dout1 <= dout2;
     dout  <= dout1;
   end if;
end process;


Your slow paths probably were your clock/read/write-enables since you did
not decouple the enable signal from the rest of the logic. The modified
read/write processes in this message should fix this.


When you have signals with large fan-outs and you do pipelining, you need
to decouple your enable signals to keep fan-outs on enables in check. If
you look at the two processes, you can see that I did this by combining all
incoming enables to generate a single-signal enable for the following
pipeline stage.


Since you are new to FPGAs, it is normal that you are not (yet) familiar
with the fundamentals of working around common design issues... but most of
these you should be able to deduce by reading your static timing analysis
and thinking about the simplest ways to fix the problems it reveals.

Re: Spartan 3E Not enough block ram.
Hmm, why in this case are
      dout1 <= dout2;
      dout  <= dout1;
placed outside the enable block? Is it because I do not want any extra
latency when retrieving data?
I hope this does not add any timing delay, otherwise I could keep on
adding register levels in vain.


Hmm yeah, well, I really still have a lot to learn. Well, I am really
happy now that at least I have learnt something about adding register levels
to solve timing problems.

And really, thanks a lot for the code, it was spot on. I guess I could never
have figured out myself that the address and enable needed to be delayed too.
Hmm come to think of it, it looks kind of stupid to do reassignment of
signals in the same block.
From there on, I found out that I need to select register balancing => Yes
in the synthesis options.
After doing so, the timing really dropped by a more significant amount.

Oh, by the way, I have since tried adding more register levels.
I have tried a few structures, and I think this coding should be logical.
I wonder if there is anything wrong. The best I have managed so far is a
slack of about 1 ns.

architecture rtl of dp2_bram is
type     mem_array_type    is array (0 to (mem_size - 1)) of
std_logic_vector((data_width - 1) downto 0);
signal   mem_array         :  mem_array_type;
attribute   ram_style      :  string;
attribute   ram_style      of mem_array : signal is "pipe_distributed";
signal wr_en1 : std_logic;
signal wr_addr1 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
signal din1 : std_logic_vector((data_width - 1) downto 0);

signal wr_en2 : std_logic;
signal wr_addr2 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
signal din2 : std_logic_vector((data_width - 1) downto 0);

signal wr_en3 : std_logic;
signal wr_addr3 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
signal din3 : std_logic_vector((data_width - 1) downto 0);

signal rd_en1 : std_logic;
signal rd_addr1 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

signal rd_en2 : std_logic;
signal rd_addr2 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

signal rd_en3 : std_logic;
signal rd_addr3 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

signal dout2 : std_logic_vector((data_width - 1) downto 0);
signal dout1 : std_logic_vector((data_width - 1) downto 0);
signal dout3 : std_logic_vector((data_width - 1) downto 0);

signal ce2 : std_logic;
signal ce3 : std_logic;

begin

   process (wr_clk)
   begin
      if (wr_clk'event and wr_clk = '1') then
         if (ce = '1') then
            ce2      <= '1';
            wr_en1   <= wr_en;
            wr_addr1 <= wr_addr;
            din1     <= din;
         else
            ce2      <= '0';
         end if;

         if (ce2 = '1') then
            ce3      <= '1';
            wr_en2   <= wr_en1;
            wr_addr2 <= wr_addr1;
            din2     <= din1;
         else
            ce3      <= '0';
         end if;

         if (ce3 = '1') then
            wr_en3   <= wr_en2;
            wr_addr3 <= wr_addr2;
            din3     <= din2;
         else
            wr_en3   <= '0';
         end if;

         if (wr_en3 = '1') then
            mem_array(conv_integer('0' & wr_addr3)) <= din3;
         end if;
      end if;
   end process;

   process (rd_clk)
   begin
      if (rd_clk'event and rd_clk = '1') then
         if (ce = '1') then
            ce2      <= '1';
            rd_en1   <= rd_en;
            rd_addr1 <= rd_addr;
         else
            ce2      <= '0';
         end if;

         if (ce2 = '1') then
            ce3      <= '1';
            rd_en2   <= rd_en1;
            rd_addr2 <= rd_addr1;
         else
            ce3      <= '0';
         end if;

         if (ce3 = '1') then
            rd_en3   <= rd_en2;
            rd_addr3 <= rd_addr2;
         else
            rd_en3   <= '0';
         end if;

         if (rd_en3 = '1') then
            dout3 <= mem_array(conv_integer('0' & rd_addr3));
         end if;

         dout2 <= dout3;
         dout1 <= dout2;
         dout  <= dout1;
      end if;
   end process;

end rtl;

Sorry, it is a bit long to post.
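
One thing that may be worth checking in the listing above: ce2 and ce3 are assigned in both the wr_clk process and the rd_clk process, so each of those signals ends up with two drivers in two clock domains. Synthesis tools typically reject or warn about that. A minimal sketch of one way around it, assuming separate enable pipelines per clock domain (the entity name ce_pipes and the wr_ce*/rd_ce* names are made up for illustration and are not part of the original design):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: one enable pipeline per clock domain,
-- so that no signal is driven from more than one process.
entity ce_pipes is
   port (
      wr_clk : in  std_logic;
      rd_clk : in  std_logic;
      ce     : in  std_logic;
      wr_ce3 : out std_logic;  -- ce delayed by two wr_clk cycles
      rd_ce3 : out std_logic   -- ce delayed by two rd_clk cycles
   );
end ce_pipes;

architecture rtl of ce_pipes is
   signal wr_ce2, wr_ce3_i : std_logic := '0';
   signal rd_ce2, rd_ce3_i : std_logic := '0';
begin
   -- write-side enable pipeline, driven only from wr_clk
   process (wr_clk)
   begin
      if (wr_clk'event and wr_clk = '1') then
         wr_ce2   <= ce;
         wr_ce3_i <= wr_ce2;
      end if;
   end process;

   -- read-side enable pipeline, driven only from rd_clk
   process (rd_clk)
   begin
      if (rd_clk'event and rd_clk = '1') then
         rd_ce2   <= ce;
         rd_ce3_i <= rd_ce2;
      end if;
   end process;

   wr_ce3 <= wr_ce3_i;
   rd_ce3 <= rd_ce3_i;
end rtl;
```

In the dp2_bram architecture itself, the same idea would amount to declaring separate write-side and read-side copies of ce2/ce3 and using each pair only inside its own process.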



Re: Spartan 3E Not enough block ram.

Each added register adds a latency cycle. What you need to do is make
sure that all the data comes together with matched latencies. I put the
extra dout registers outside any ifs because the extra control logic for
bringing the clock/read-enable further down would not have any net effect.
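
As an illustration of matching latencies, here is a rough sketch of a "data valid" flag that is delayed by the same number of cycles as the read data. It assumes the read path from the earlier reply (rd_addr1, dout2, dout1, dout), which is four register stages from rd_addr to dout. The entity name rd_valid_delay, the latency generic and the dout_valid port are made up for this example:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: delays (rd_en and ce) by "latency" clock cycles
-- so the flag lines up with the data coming out of the pipelined RAM.
entity rd_valid_delay is
   generic (
      -- assumed number of register stages between rd_addr and dout
      latency : integer range 2 to 16 := 4
   );
   port (
      rd_clk     : in  std_logic;
      ce         : in  std_logic;
      rd_en      : in  std_logic;
      dout_valid : out std_logic
   );
end rd_valid_delay;

architecture rtl of rd_valid_delay is
   signal valid_sr : std_logic_vector((latency - 1) downto 0)
                     := (others => '0');
begin
   process (rd_clk)
   begin
      if (rd_clk'event and rd_clk = '1') then
         -- shift in the combined enable, one stage per clock
         valid_sr <= valid_sr((latency - 2) downto 0) & (rd_en and ce);
      end if;
   end process;

   -- oldest stage marks the cycle in which dout holds the requested word
   dout_valid <= valid_sr(latency - 1);
end rtl;
```

If more register levels are added on the read side, the latency generic would need to grow by the same amount.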

 > Hmm come to think of it, it looks kind of stupid to do reassignment of
 > signals in the same block.

If you continue to work with FPGAs and relatively high-speed designs, you
will find that it is often a fundamental necessity.


Balancing and the other related options are necessary to let XST move
registers around when you want to use "automatic pipelining". Unless some
of these options are enabled, extra registers get synthesized as plain
extra registers that do nothing more than delay data. Since each extra
register optimization option gives XST more freedom in redistributing FFs,
synthesis will be slower.
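
For reference, XST also accepts a register_balancing attribute in the HDL as an alternative to the project-wide synthesis option. The following is only a sketch under assumptions: the mac_pipe entity and its ports are invented for illustration, and the exact attribute syntax and values should be checked against the XST User Guide for your tool version.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: combinational logic followed by spare registers
-- that the synthesizer is allowed to move into the logic.
entity mac_pipe is
   port (
      clk : in  std_logic;
      a   : in  unsigned(7 downto 0);
      b   : in  unsigned(7 downto 0);
      c   : in  unsigned(7 downto 0);
      y   : out unsigned(15 downto 0)
   );
   -- ask XST to balance registers within this entity (assumed syntax)
   attribute register_balancing : string;
   attribute register_balancing of mac_pipe : entity is "yes";
end mac_pipe;

architecture rtl of mac_pipe is
   signal r1 : unsigned(15 downto 0);
   signal r2 : unsigned(15 downto 0);
begin
   process (clk)
   begin
      if (clk'event and clk = '1') then
         r1 <= (a * b) + c;  -- all the logic sits before the first register
         r2 <= r1;           -- extra registers available for retiming
         y  <= r2;
      end if;
   end process;
end rtl;
```

Without balancing enabled, r2 and y would simply be two delay stages after the multiply-add.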


There are limits to how many registers XST is able to move around when
using automatic pipelining and it appears to vary from two to four
depending on constructs and tool versions.

Re: Spartan 3E Not enough block ram.


Yeah, well, thanks a lot for all your help.
If not for your help, I wouldn't have made so much progress and, most of all,
learnt more about FPGAs.
Hmm, anyway, for my project I guess I am pretty stuck already and am
not able to lower the timing delay any further. Maybe that is
because no one knows exactly whether it is possible to port the
design from a Virtex to a Spartan. Maybe it can, maybe it just cannot be
done.

Hmm, other ways I could make progress would be (maybe) to
find out more about the sequential tables and coefficient tables, and whether
I could do something about the wrapper and learn more about it.
Or I could use the DDR SDRAM (shudders...)

Lastly, another problem would be the I/O ports and how to actually implement
this scaler in a practical sense.

Anyway, I appreciate your help so far. Many thanks!







Re: Spartan 3E Not enough block ram.

Since your only problem here appears to be coming slightly short on BRAMs
for a direct re-implementation, going one step up in FPGA size would solve
your problem.


If there are large duplicated constant tables stored in a BRAM that get
initialized by software, you could make them into a dual-port ROM by
putting the constants in the BRAM's INIT. Actually, the write functionality
could be preserved too as long as the writes are made synchronous to either
read clock.
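
A rough sketch of the read-only variant, assuming the table is described as a constant array with two registered read ports. The entity name coef_rom_dp, the port names and the table contents are placeholders (the real coefficients are not given in this thread), and a real table would be much deeper than eight entries. Whether the tool packs both ports into a single BRAM depends on the synthesizer and its settings, so the synthesis report is worth checking.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical dual-port ROM shared by two readers.
entity coef_rom_dp is
   port (
      clk_a  : in  std_logic;
      addr_a : in  std_logic_vector(2 downto 0);
      dout_a : out std_logic_vector(7 downto 0);
      clk_b  : in  std_logic;
      addr_b : in  std_logic_vector(2 downto 0);
      dout_b : out std_logic_vector(7 downto 0)
   );
end coef_rom_dp;

architecture rtl of coef_rom_dp is
   type rom_type is array (0 to 7) of std_logic_vector(7 downto 0);
   -- placeholder values only; these become the memory's initial contents
   constant rom : rom_type := (
      x"00", x"10", x"20", x"30", x"40", x"50", x"60", x"70");
begin
   -- port A: registered read for the first scaler instance
   process (clk_a)
   begin
      if (clk_a'event and clk_a = '1') then
         dout_a <= rom(to_integer(unsigned(addr_a)));
      end if;
   end process;

   -- port B: registered read for the second scaler instance
   process (clk_b)
   begin
      if (clk_b'event and clk_b = '1') then
         dout_b <= rom(to_integer(unsigned(addr_b)));
      end if;
   end process;
end rtl;
```

Sharing one table between the two scaler instances this way would only help if both instances really do use identical contents.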


You're welcome.
