Spartan-3E: not enough block RAM.

I'm trying to port a design (a video scaler) from a Virtex-4 to a Spartan-3E, and I'm currently having trouble with not having enough block RAMs.

My reference (top) design uses only 15 block RAMs, but after wrapping a "wrapper" module around my top design, the block RAM count shoots up to 60. So I decided to look into this wrapper module. It instantiates a lot of instances of a dp_bram module, so I went to look into dp_bram and found the following code. (I took out the relevant portions for easier understanding.)

(Just to note: this wrapper module wraps the video scaler for trial synthesis. It includes two instances of the scaler, as well as a simple bus interface for the control register inputs.)

--Information in Entity portion--

data_width : integer := 8;
mem_size   : integer := 1920;

wr_addr : in std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
rd_addr : in std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
din     : in std_logic_vector((data_width - 1) downto 0);

--Architecture portion--

process (wr_clk)
begin
  if (wr_clk'event and wr_clk = '1') then
    if (ce = '1') then
      if (wr_en = '1') then
        mem_array(conv_integer('0' & wr_addr)) <= din;
      end if;
    end if;
  end if;
end process;

Reply to
Ken Soon

As I think has been answered here before, DRAM has a complex sequence of commands that must be issued to it; there is no way to "just know how to use it". An interface will probably take on the order of 500-1000 slices in your part. If this is an eval board, I'd be surprised if the material the board came with didn't have SOME sort of a DRAM interface in the examples... but I don't know.

Yes, the construct you pointed out is most likely what is inferring the block RAMs. To confirm that, look at the "language templates" section of ISE. It will show you exactly the constructs that infer different pieces of hardware.

It sounds like the only way to do what you want is to use the external RAM, which means finding or designing a DRAM interface. You will also, however, need to redesign the hardware AROUND the memory, since it will not operate as fast or as smoothly as the block RAMs. Most likely, the DRAM interface will be buffered by a FIFO to the rest of the FPGA. You will have worse latency and bandwidth compared to block RAM.

Sorry for the bad news... but it doesn't sound like this is a cut-and-paste job. You're gonna have to teach yourself a bit about hardware and VHDL and FPGAs, and then you're gonna have to do some actual design - go figure.

Reply to
Paul

1920 x 8 bits = 15.3 kbits, which is less than one BRAM... I am guessing "data_width" and "mem_size" are generics and the actual parameters on the instances are larger than that, or there are multiple instances of it. If those are the actual parameters and there is only one instance, this code fails to explain the 45 extra BRAMs. Even if it were x 8 bytes, this would still be only 8 BRAMs instead of 45. It seems like your posting is lacking some critical details, which makes it impossible for us to make educated guesses. Also, having a wrapper around your 'top' design for a synthesis implementation is suspicious.

The code you posted is a BRAM inference wrapper for a dual-port RAM with independently clocked read and write ports. The first real questions are: how many times is this generic wrapper used, what are the instance parameters in each case, and what are they for?

Since a very decent scaler can be achieved with five lines' worth of video data, which would require 15 BRAMs for inputs and three more for output buffering, chances are that your scaler's wrapper needs a diet unless it does other fancy things you may not be aware of.

Because DRAMs have refresh cycles, row activation, row precharge and numerous other quirks designers have to take care of before initiating any actual data transfers... I told you so last week. Even if Xilinx had hardware memory controllers, you would still have to deal with the variable latency and possible read re-ordering.

Reply to
Daniel S.


Well, the reference design data file did say that the purpose of the wrapper is to minimize the amount of external interconnect required by the scaler so that it can fit into a realistic target device. My wrapper instantiates 2 sequential lookup tables, 6 horizontal coefficient tables and 3 vertical coefficient tables.

Hmm, I guess I will be looking at other designs that interface with the DDR SDRAM controller and, from there, try to understand the interface. I hope to be able to do the same for my video scaler design.

Thanks a lot for confirming that the code is a BRAM inference wrapper for a dual-port RAM with independently clocked read and write ports.

Yeah, and sadly the evaluation board doesn't come with any DRAM interface instruction guide or the like. It only comes with the pin numbers and a few brief descriptions, alas.

Reply to
Ken Soon

So your scaler uses an FIR filter for scaling with 3x6 grid sampling, and the coefficient sets for each available scaling factor are stored in BRAMs... yup, this can cost a handful of BRAMs - and you will have a hard time dumping these or your video line buffers into DRAM.

You really need to look into exactly what is consuming how many BRAMs, how, and why, as I suggested in my previous message. Coefficient tables and video line buffers will be difficult to shift into DRAM: you will need some BRAMs to buffer data to/from the DRAM, and you are not going to be any better off if you end up with as many FIFO BRAMs as you originally needed plain BRAMs for in the initial design... actually, you will be worse off given the extra glue logic.

If coefficient tables are eating up those ~40 BRAMs, you may be in serious trouble, since DRAMs cannot be programmed by a bitfile - you will need some method of initializing the DRAM. Analyze the design carefully to see how the BRAMs are used; there may be a few reduction tricks that can be applied to spare a few.

Even if you find a suitable DRAM controller to paste into your design overnight, reworking your scaler to work with it will require some significant effort.

The coding style is obvious to anyone who has spent any significant amount of time getting 'fancy' BRAM inferences to work. The independent read and write port BRAM, independently clocked or not, is the easiest to get right to the point of being nearly impossible to mess up.

If you want to interface DRAMs, I suggest you start by looking up the datasheets for your board's DRAMs. At the very least, it will give you an idea of how kludgey a DRAM interface can be. The road to a fast and stable DRAM interface is full of bumpy quirks.

Reply to
Daniel S.

Yeah, true. Well, my supervisor was asking me to start from at least an existing design and begin learning at least how a DRAM controller works. But no matter, I still can't figure out how to even go about "pasting" it. It really seems complicated, with all the addresses, data and then the control signals.

Heh, oops, I guess I really haven't spent a significant amount of time on FPGAs yet. Well, sometimes I really wonder how designers know which features, and how many of each feature, will be used, and why the code they write will use exactly that feature in the FPGA chip. Like, if I use an array for storing some data, will block RAMs be used? And if I use *, will multipliers be used? Sorry if this is just too inane to answer; it's just that for synthesis, everything seems so automatic. Well, I do know about the language templates, but they look really different.

Hmm, I guess it looks good to perform some of the reduction tricks you mentioned. I had a question on hand though. Let's say I have the following instantiated modules under my wrapper module:

pvs_wrapper
  - h_seqlut_inst    - dpbram - rtl (dp_bram.vhd)
  - v_seqlut_inst    - dpbram - rtl (dp_bram.vhd)
  - h_coeff_0_0_inst - dpbram - rtl (dp_bram.vhd)
  - h_coeff_0_1_inst - dpbram - rtl (dp_bram.vhd)
  - h_coeff_1_0_inst - dpbram - rtl (dp_bram.vhd)
  - h_coeff_1_1_inst - dpbram - rtl (dp_bram.vhd)
  - h_coeff_2_0_inst - dpbram - rtl (dp_bram.vhd)
  - h_coeff_2_1_inst - dpbram - rtl (dp_bram.vhd)
  - v_coeff_0_inst   - dpbram - rtl (dp_bram.vhd)
  - v_coeff_1_inst   - dpbram - rtl (dp_bram.vhd)
  - v_coeff_2_inst   - dpbram - rtl (dp_bram.vhd)
  - pvs_top_structural (pvs_top.vhd)
  - ... (you can ignore pvs_top and onwards, as those are the original design modules)

Now, each dp_bram.vhd has the following code: mem_array(conv_integer('0' & wr_addr)) <= din;

Hmm, can this really be worse off? Sure, FIFO BRAMs will be needed, but surely fewer than without the DRAM, right? If not, what purpose would the DRAM serve? (Which is to store data.)

Reply to
Ken Soon

The DRAM control/data/etc. signals are one thing; making the DRAM work exactly the way you want it to is quite another - you need to familiarize yourself with DRAM internals... learn what row activation and precharge do, why they are necessary, how they can affect your design, and how you can work around these delays by doing pipelined burst transfers. There are a bunch of other quirks that can be exploited or have to be avoided; the ones enumerated are simply the more fundamental ones IMO... and do not forget those auto-refresh cycles.

These things are automatic only to the extent that the HDL coder follows some limitations. For SRAMs/ROMs/registers to get mapped onto BRAMs, the synthesis tools must be able to reduce the access/data logic down to something supported by the hardware. For a BRAM, this means the logic must be reducible to no more than two read+write+address+clock sets. Depending on the target device, there may be additional restrictions such as read-write policies: write-first, read-before-write or no-change.
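For illustration, here is a generic sketch (hypothetical entity and signal names, not the thread's actual dp_bram code) of a simple dual-port RAM whose access logic reduces to one write set and one read set, which XST-era tools can map onto a single block RAM:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bram_sketch is
  generic (
    data_width : integer := 8;
    addr_width : integer := 11   -- 2048 deep; enough for 1920 entries
  );
  port (
    clk     : in  std_logic;
    wr_en   : in  std_logic;
    wr_addr : in  std_logic_vector(addr_width - 1 downto 0);
    din     : in  std_logic_vector(data_width - 1 downto 0);
    rd_addr : in  std_logic_vector(addr_width - 1 downto 0);
    dout    : out std_logic_vector(data_width - 1 downto 0)
  );
end entity;

architecture rtl of bram_sketch is
  type mem_t is array (0 to 2**addr_width - 1)
    of std_logic_vector(data_width - 1 downto 0);
  signal mem : mem_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if wr_en = '1' then
        mem(to_integer(unsigned(wr_addr))) <= din;
      end if;
      -- Registered (synchronous) read: this is what allows the array
      -- to be mapped onto a BRAM instead of distributed LUT RAM.
      dout <= mem(to_integer(unsigned(rd_addr)));
    end if;
  end process;
end architecture;
```

The key detail is the synchronous read: an asynchronous read (outside the clocked process) would force the tools to fall back to distributed LUT memory.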

Multipliers also have their share of quirks, particularly if you want to do pipelining.

Look at your synthesis reports and pay attention to each BRAM's inference data and port mappings. Look for memories that are under 8 kbits and are not using both read and write ports - these may be mergeable if they use the same clocks.

1) FIR filters are often symmetric: the nth tap (n = 0..N) has the same coefficient as the (N-n)th one... it is unlikely that this optimization has not already been done if applicable to your filter, but double-checking is cheap.

2) If your coefficient tables (ROMs?) use under half a BRAM and only one port, you should be able to merge two tables into one BRAM by using both ports for reading: map one address to "'0' & addrA" and the other to "'1' & addrB".

3) Examine the tables to find redundancies and equivalences; it may be possible to multiplex accesses to the coefficient tables.
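As a sketch of the second trick (hypothetical entity and signal names, not from the thread's design): two half-size read-only tables share one memory array, the lower half holding table 0 and the upper half table 1, and the two BRAM ports each serve one table:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity packed_rom is
  generic (addr_width : integer := 9; data_width : integer := 9);
  port (
    clk    : in  std_logic;
    addr_a : in  std_logic_vector(addr_width - 1 downto 0); -- table 0
    addr_b : in  std_logic_vector(addr_width - 1 downto 0); -- table 1
    dout_a : out std_logic_vector(data_width - 1 downto 0);
    dout_b : out std_logic_vector(data_width - 1 downto 0)
  );
end entity;

architecture rtl of packed_rom is
  type mem_t is array (0 to 2**(addr_width + 1) - 1)
    of std_logic_vector(data_width - 1 downto 0);
  -- Lower half = table 0, upper half = table 1. All-zero placeholder
  -- contents; the real coefficient data would be set here.
  signal mem : mem_t := (others => (others => '0'));
begin
  -- Port A reads the '0' & addr_a (lower) half.
  process (clk)
  begin
    if rising_edge(clk) then
      dout_a <= mem(to_integer(unsigned('0' & addr_a)));
    end if;
  end process;

  -- Port B reads the '1' & addr_b (upper) half.
  process (clk)
  begin
    if rising_edge(clk) then
      dout_b <= mem(to_integer(unsigned('1' & addr_b)));
    end if;
  end process;
end architecture;
```

Two tables that each fit in half a BRAM then cost one BRAM instead of two.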

How much data do you need to put in the DRAM? 60 BRAMs x 2 KB each = 120 KB max., assuming you intended to put everything in the DRAM. The question you really need to ask yourself is: can you afford the glue logic?

Try ripping the memory controller out of any project with a DRAM controller you may have handy, do an unconstrained implementation run (use a slow clock like 50 MHz and the memory controller's top as your synthesis project's top) and see how resource-hungry the memory controller you have is.

Reply to
Daniel S.

Yeah, I kind of learnt a little about DRAM refreshing in school, and it was difficult. Hmm, I seriously need some ultra-basic material on how to use DRAM; any such books or websites or whatever?

Meaning the structure of the HDL code is written such that synthesis will infer what kind of device to use. (Amazing... write-first or write-later also has an effect on the device being used...) Oh, the XST guide offered a lot of help on this.

Wow, the synthesis report is so cool. It tells me a lot of information, like which modules contain their respective warnings, the devices inferred from each module (adders, subtractors, etc.) and an analysis of different values for the data types. Thumbs up!

Anyway, I got this information on the BRAMs:

57 RAMs:
  RAMB16_S2_S2   : 2
  RAMB16_S36_S36 : 18
  RAMB16_S4_S4   : 25
  RAMB16_S9_S9   : 12
All are 16 kbit RAMs with different port widths for A and B (as indicated by 16_Sx_Sy).



The nth tap has the same coefficient as the (N-n)th one. Hmm, I have only an H tap and a V tap; I don't see a series of taps being used, though. Just for info's sake, I have already tried reducing the H taps to 4 while keeping the V taps at the default 4 to fit the multipliers and also reduce a few BRAMs being used. But I can't find anything anywhere about the coefficients of the taps.


Nah, no chance. From my final synthesis report, all of them are using dual-port RAMs. Anyway, the code was already written for dual-port BRAM, so there shouldn't be any reason my coefficient tables would be using under half a BRAM, yeah?

Hmm... erm... I'm not sure how to go about this method. Anyway, regarding the tables, I guess the tables of coefficients have not yet been used as input in the design (thus I felt kind of puzzled by your 1st and 3rd methods). Right now, I'm just synthesizing designs, some of which will process the tables later on. So what the tables contain is the issue here. All of the sequential lookup and coefficient tables use the same dp_bram; the wrapper instantiating them then ports different values into them.


Oh yeah, well, I'm not sure exactly how much data is needed, but going by how many BRAMs are used, with each BRAM being 16 kbit: 57 x 16k = 912k. Anyway, the board I will be using has four DDR SDRAM memory chips (512 Mbit). And since that provides so much storage, surely my chip should be able to handle or afford the glue logic.

Hmm, oh yeah, I have used 4570 out of 14752 slices. Now, I read that I can use distributed RAM in the remaining slices. I'm thinking it is not enough to reduce the BRAMs to the point where there is no overflow, but it would certainly reduce them to an extent. I went to check out the language templates for using distributed RAM, but... they all support only 1-bit data storage with a many-bit address... (what use can this be?). Hmm, but then I read that these distributed RAMs can be combined to form wider data storage (some sort of combination), but then again, I don't know how to go about doing it. I tried to find out more and got these two methods:

1) Use the "HDL Coding Techniques" in your XST User Guide.

2) Specify constraints (see your Constraints Guide), or "instantiate" library parts that force the logic to be created a certain way (see your Libraries Guide). For example, the RAM_STYLE constraint lets you force either block RAM or distributed RAM.

Woah, that RAM_STYLE constraint looks simple to use :) I tried that, and not even a single change happened. :( So I looked at 1) and found that this constraint was already being used in my code; thus, I changed it. However, synthesis then seems to continue forever.

Currently, I'm using 57 out of 36 BRAMs (after lowering the H taps).

57 - 36 = 21 BRAMs => 344,064 bits of RAM (21 x 16 kbit) over budget...
Reply to
Ken Soon

I'll repeat myself: read some DRAM specs; you can start with the ones on your board - most DRAM manufacturers do a reasonably thorough job of describing how DRAMs work.

The memory access policy will not change the "kind of device"; all BRAMs are the same, but not all FPGAs implement all access policies. Devices prior to the Virtex-4 do not support read-before-write, while write-before-read is supported by all Xilinx devices I know of.

Synthesis and implementation logs are your friends; remember to inspect them thoroughly and you will run a lower risk of getting chastised for posting newbie questions... I can usually find answers to about 90% of my would-be questions in there.

Any RAM uses at least one BRAM, and on modern Xilinx devices all BRAMs are 16 kbits, so any sub-16 kbit RAM you infer will consume one 16 kbit BRAM even if you use only 256 bits.

This part of the report only tells you the port widths of the different BRAMs used by your design; now you need to hunt down each of these instances and see how many addresses each has to determine how many kbits each actually uses.

If you are having a hard time determining which BRAMs are used where, you should look at the "Macro Statistics" instead: the Macro Stats report memories with their actual inference parameters instead of the final report's raw BRAM usage.

Macro Statistics
# Block RAMs                     : 17
  256x16-bit dual-port block RAM : 1
  256x72-bit dual-port block RAM : 4
  512x32-bit dual-port block RAM : 8
  512x64-bit dual-port block RAM : 4

Macro statistics let me easily find out the size of all the memories present in my design so I can cross-check against the final report to determine whether or not the synthesis tools have mapped everything as expected... and here, I do not remember what the 256x16 RAM is for, so I'll have to investigate where it came from next time I work on the project I pasted this from.

I am not psychic... without knowing exactly how the BRAMs are being used, I cannot tell whether packing is applicable to your case. If your coefficient table BRAMs use the same inference template and are duplicated to provide multiple coefficients from the same table but are never written to (effectively used as ROMs), you can create a dual-read-port template and remove half of those BRAMs. If your coefficients need to be programmable, this trick is still applicable, but you will have to implement the trickier "true dual-port BRAM" template and manage writes somehow. At this point, I think this is probably your best avenue.

If RAMs are left alone (no pragmas, no attributes, no force options in tool settings), synthesis tools will automatically map memories to available resources: they will try to map all large-ish memories to BRAMs until all BRAMs are used and then start using distributed (LUT) memory for smaller memories.

It takes forever because Map/PAR is unable to complete routing when all memory gets forced into LUT memory. Remove (comment out) this attribute altogether to let the synthesis tools decide which memories should be dumped in BRAMs and which should use distributed memory. Alternatively, you could add a generic port to the template to specify block or distributed on a per-instance basis.

Note: large distributed memories will become slow unless you add multiple pipelining registers on their outputs, so you should be careful when using them.

10k free slices * 2 LUTs per slice * 16 bits per LUT = 320 kbits of available distributed RAM = no fit, assuming all your BRAMs are fully used. The fact that synthesis did not flat-out say that there was no fit means some of your BRAMs are definitely not completely used.
Reply to
Daniel S.

Yup, I saw this:

# RAMs                                             : 24
  16x64-bit dual-port distributed RAM              : 6
  1920x12-bit dual-port block RAM                  : 9
  1920x12-bit registered dual-port distributed RAM : 3
  4096x36-bit dual-port block RAM                  : 1
  4096x9-bit dual-port block RAM                   : 2
  8x64-bit dual-port distributed RAM               : 3

Since then, I have been playing around with my code, and I can identify which instances in the code are using which kind of RAM.


Currently, I have tried changing some of the instances to use distributed RAM by forcing the RAM_STYLE constraint to pipe_distributed. Going down the list of instances that use block RAM, when I changed it for my 6 horizontal and 3 vertical coefficient instances, voila, it immediately dropped to 39 out of 36 block RAMs! Hmm, strange that it dropped so much. Next, I tried changing some more instances to distributed RAM; however, I have to be careful to keep a balance so as not to overshoot the LUTs I have along with the block RAMs.

Then I worked on some line buffer modules under my top module, and after synthesis everything was well under the resource limits. However, the problem came when I tried to implement it. The user constraint file belonging to the trial synthesis had a timing constraint, and my design's timing was twice over this constraint. I used the timing analyzer, cross-probed the problem, and could see that the path looked quite long. So now I tried to work on this by using some of the optimization options in ISE: under the map properties, I selected map effort level "high". The runtime took really long, and in the end I got this message:

The router has detected a very high timing score (5245937) for this design. It is extremely unlikely the router will be able to meet your timing requirements. To prevent excessive run time the router will change strategy. The router will now work to completely route this design but not to improve timing. This behavior will allow you to use the Static Timing Report and FPGA Editor to isolate the paths with timing problems. The cause of this behavior is either overly difficult constraints, or issues with the implementation or synthesis of logic in the critical timing path. If you would prefer the router continue trying to meet timing and you are willing to accept a long run time set the option "-xe c" to override the present behavior.

I'm thinking of just trying to meet the timing, but when or where can I set the option "-xe c"? I don't see any DOS command line for me anywhere...

Reply to
Ken Soon

This is not strange: block RAMs have 36-bit-wide ports at most. Since you have some very small x64 memories that were previously forced into BRAMs, they ended up costing two BRAMs each. With three 8x64 and six 16x64 RAMs, that is 18 BRAMs recovered right there.

Distributed RAM is slow unless you give it many output register stages to redistribute: each LUT can provide 16 bits, and these are patched together with muxes to provide larger memories. Your address signals will also have huge fanout, which further contributes to the slowness. Since your 1920x12 distributed RAM probably only absorbed one register, the very long paths you are seeing run from the address bits, down through part of the output mux tree to the absorbed FFs, and then from those FFs through the remaining output muxes to the destination FFs.

Do not bother with increasing the PAR effort; it will do you no good. You need to either put that 1920x12 RAM in BRAMs or add register stages that synthesis will redistribute within the distributed memory to improve your timing score. Start by adding two register levels on your 1920x12 distributed memory's output, and your score will most likely drop from over 5M to possibly under 200k. Add extra registers until your timings are met or improvements stall. After this, you will need to realign your processing pipeline to account for the delays on this large distributed memory.

BTW, what was your LUT and slice-FF usage on that last attempt?

Reply to
Daniel S.

OK, that's cool. I guess this is indeed an invaluable point to take away for efficient usage of my chip in the future.

Yeah, it indeed had high fan-out - over a hundred!


Yeah, I put that in BRAMs and used another instance (the vertical sequential table) for distributed RAM. Hmm, so I can add register levels? How do I go about that? I tried selecting register duplication in both the synthesis and implement design options, but to no avail. So do you mean using FPGA Editor?

Talking about FPGA Editor, I was wondering about moving the problematic source CLBs closer to the destination so as to cut down the delay. But there are quite a few implications due to the other wires connected to those CLBs.

Ah, forgot to save... Well, I did another run with the instances of the vertical sequential table and the horizontal and vertical coefficient tables using distributed RAM. (Before this change, I remember the block RAM usage was 33 out of 36 with the line buffer instance using distributed RAM, and the 4-input LUTs were at 84%.)

Number of Slices:              6718 out of 14752   45%
Number of Slice Flip Flops:    9007 out of 29504   30%
Number of 4 input LUTs:       13229 out of 29504   44%
  Number used as logic:           7010
  Number used as Shift registers:  459
  Number used as RAMs:            5760
Number of IOs:                  322
Number of bonded IOBs:          316 out of 376     84%
Number of BRAMs:                 36 out of 36     100%
Number of MULT18X18SIOs:         36 out of 36     100%
Number of GCLKs:                  1 out of 24       4%

I figured the vertical sequential table would be the better choice for tackling the problem after forcing it to use distributed RAM.

Reply to
Ken Soon

Adding output registers is simple...

process (clk)
begin
  if rising_edge(clk) then
    memout_d1 <= memout;
    memout_d2 <= memout_d1;
    memout_d3 <= memout_d2;
  end if;
end process;

1920/16 = 120 LUTs to mux each bit, which means 7 layers of 2:1 muxes. LUTs can do 2:1, slices do 2:1 and CLBs do 2:1, so you need to go through two full CLBs and one LUT to do this 120:1 mux. I think you will be fine if you add three output registers in the RAM's output path. Due to the high fan-out on the address bits, an extra register there should also help. With all this, you will get the data four cycles after the request.

After you implement the extra registers for the distributed RAMs, your FF usage should increase by about 2000. With resources currently under 50%, it should considerably improve your PAR results without any other fancy footwork. If parts of your design are mostly self-contained, you could floorplan them to reduce the amount of time PAR spends guessing about the optimal layout.

Reply to
Daniel S.

Hmm... I don't suppose you mean this way?:

architecture rtl of dp2_bram is
  type mem_array_type is array (0 to (mem_size - 1)) of std_logic_vector((data_width - 1) downto 0);
  signal mem_array : mem_array_type;

  attribute ram_style : string;
  attribute ram_style of mem_array : signal is "pipe_distributed";

  signal dout2 : std_logic_vector((data_width - 1) downto 0);
  signal dout3 : std_logic_vector((data_width - 1) downto 0);
  signal din2  : std_logic_vector((data_width - 1) downto 0);
  signal din3  : std_logic_vector((data_width - 1) downto 0);

begin

process (wr_clk)
begin
  if (wr_clk'event and wr_clk = '1') then
    if (ce = '1') then
      if (wr_en = '1') then
        din2 <= din;
      end if;
    end if;
  end if;
end process;

Reply to
Ken Soon

If you delay the write data, you should also delay the write address, otherwise you will have problems. Registers on the write path are mostly there to reduce the fanout; you probably do not need more than one extra input register here (for data, address and enable, since all three need to be equally delayed).

So, your write process (using your coding style) would resemble this:

process (wr_clk)
begin
  if (wr_clk'event and wr_clk = '1') then
    -- first determine if something needs to be written on the next cycle
    -- register duplication will be applied to these if necessary
    if (ce = '1') then
      wr_en1 <= wr_en;
    else
      wr_en1 <= '0';
    end if;
    wr_addr1 <= wr_addr;
    din1     <= din;

    -- then perform the delayed write
    if (wr_en1 = '1') then
      mem_array(conv_integer('0' & wr_addr1)) <= din1;
    end if;
  end if;
end process;

When you have signals with large fan-outs and you do pipelining, you need to decouple your enable signals to keep the fan-outs on the enables in check. If you look at the two processes, you can see that I did this by combining all incoming enables to generate a single-signal enable for the following pipeline stage.
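A minimal sketch of that enable-decoupling idea (hypothetical signal names, not the code from this thread): several qualifying enables are collapsed into one registered enable, so the wide fan-in is resolved once per stage instead of at every downstream flip-flop:

```vhdl
-- Stage 1 combines all incoming enables into a single registered
-- enable; stage 2 then only ever looks at that one signal.
process (clk)
begin
  if rising_edge(clk) then
    -- hypothetical enables: clock-enable, write-enable, not stalled
    stage2_en <= ce and wr_en and (not stall);

    if stage2_en = '1' then
      data_s2 <= data_s1;  -- stage-2 pipeline register
    end if;
  end if;
end process;
```

Each pipeline stage pays one LUT of enable logic and one FF, and the enable reaching the stage-2 registers is a single, already-registered signal.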

Since you are new to FPGAs, it is normal that you are not (yet) familiar with the fundamentals of working around common design issues... but most of these you should be able to deduce by reading your static timing analysis and thinking about the simplest ways to fix the problems it reveals.

Reply to
Daniel S.


Hmm, yeah, well, I really still have a lot to learn. Well, I'm really happy that at least I have learnt something about adding register levels to solve timing problems.

And really, thanks a lot for the code; it was spot on. I guess I could never have figured out the part about the address and enable needing to be delayed myself. Hmm, come to think of it, it looks kind of odd to do reassignments of signals in the same block. From there, I found out that I needed to set register balancing => Yes in the synthesis options. After doing so, the timing score really dropped by a more significant amount.

Oh, by the way, I have since tried to add more register levels. I've tried some structures, and I deduce this coding should be logical; I wonder if there is anything wrong. The lowest I have brought the timing down to is a slack of about 1 ns.

architecture rtl of dp2_bram is
  type mem_array_type is array (0 to (mem_size - 1)) of std_logic_vector((data_width - 1) downto 0);
  signal mem_array : mem_array_type;

  attribute ram_style : string;
  attribute ram_style of mem_array : signal is "pipe_distributed";

  signal wr_en1   : std_logic;
  signal wr_addr1 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
  signal din1     : std_logic_vector((data_width - 1) downto 0);

  signal wr_en2   : std_logic;
  signal wr_addr2 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
  signal din2     : std_logic_vector((data_width - 1) downto 0);

  signal wr_en3   : std_logic;
  signal wr_addr3 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);
  signal din3     : std_logic_vector((data_width - 1) downto 0);

  signal rd_en1   : std_logic;
  signal rd_addr1 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

  signal rd_en2   : std_logic;
  signal rd_addr2 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

  signal rd_en3   : std_logic;
  signal rd_addr3 : std_logic_vector((LOG2_BASE(mem_size) - 1) downto 0);

  signal dout1 : std_logic_vector((data_width - 1) downto 0);
  signal dout2 : std_logic_vector((data_width - 1) downto 0);
  signal dout3 : std_logic_vector((data_width - 1) downto 0);

  signal ce2 : std_logic;
  signal ce3 : std_logic;

begin

process (wr_clk)
begin
  if (wr_clk'event and wr_clk = '1') then
    if (ce = '1') then
      ce2

Reply to
Ken Soon

Each added register adds a latency cycle; what you need to do is make sure that all the data comes together with matched latencies. I put the extra dout registers outside any ifs because the extra control logic for bringing the clock/read-enable further down would not have any net effect.

If you continue to work with FPGAs and relatively high-speed designs, you will find that this is often a fundamental necessity.

Balancing and the other related options are necessary to let XST move registers around when you want to use "automatic pipelining". Unless some of these options are enabled, extra registers get synthesized as plain extra registers that do nothing more than delay data. Since each extra register optimization option gives XST more freedom in redistributing FFs, synthesis will be slower.

There are limits to how many registers XST is able to move around when using automatic pipelining and it appears to vary from two to four depending on constructs and tool versions.

Reply to
Daniel S.

Yeah, well, thanks a lot for all your help. If not for it, I wouldn't have made so much progress and, most of all, learnt so much about FPGAs. Hmm, anyway, for my project I guess I'm pretty stuck now and am not able to lower the timing delay any further. Maybe that's because nobody knows exactly whether it is possible to port the design from a Virtex to a Spartan. Maybe it can be done, maybe it just cannot.

Hmm, ways I could progress further on this would (maybe) be to find out more about the sequential tables and coefficient tables and whether I could do something about the wrapper. Or I could use the DDR SDRAM (shudders...).

Lastly, another problem would be the I/O ports and how to actually implement this scaler in a practical sense.

Anyway, I appreciate your help so far. Many thanks!

Reply to
Ken Soon

Since your only problem here appears to be coming up slightly short on BRAMs for a direct re-implementation, going one step up in FPGA size would solve your problem.

If there are large duplicated constant tables stored in BRAMs that get initialized by software, you could turn them into a dual-port ROM by putting the constants in the BRAM's INIT. Actually, the write functionality could be preserved too, as long as the writes are made synchronous to either read clock.
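In inference terms, putting constants in the BRAM's INIT amounts to giving the memory array an initial value, which the bitstream then loads at configuration. A sketch (hypothetical entity name and placeholder coefficient values, not the scaler's actual tables):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity coeff_rom is
  port (
    clk  : in  std_logic;
    addr : in  std_logic_vector(3 downto 0);
    dout : out std_logic_vector(8 downto 0)
  );
end entity;

architecture rtl of coeff_rom is
  type rom_t is array (0 to 15) of std_logic_vector(8 downto 0);
  -- Placeholder values; in the real design these would be the
  -- filter coefficient constants, no software init pass needed.
  constant ROM : rom_t := (
    0      => "000010001",
    1      => "000100010",
    others => (others => '0')
  );
begin
  process (clk)
  begin
    if rising_edge(clk) then
      dout <= ROM(to_integer(unsigned(addr)));  -- registered read -> BRAM
    end if;
  end process;
end architecture;
```

A second clocked read process on the same constant would give the dual-port ROM variant; either way, no run-time initialization logic is required.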

You're welcome.

Reply to
Daniel S.

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.