Subtracting a whole array of values at once

Hi, I need a whole matrix to be subtracted from another in one clock cycle, and I need to store all the values of these matrices in either distributed or block RAM. I thought this was not possible with block RAM, as these have dedicated ports that are fixed in width, so I tried to use distributed RAM with the following Verilog code.

parameter nColumns = 100;
parameter nRows = 200;

parameter RAM_WIDTH = 8;

// matrix 1 data
reg [RAM_WIDTH-1:0] data_1[nColumns*nRows-1:0];

// matrix 2 data
reg [RAM_WIDTH-1:0] data_2[nColumns*nRows-1:0];

reg [7:0] ix = 0;
reg [7:0] iy = 0;

reg [10:0] diff = 0;

always @ (start) // start will trigger the calculation
begin
    for (ix = 0; ix < 100; ix = ix + 1)
    begin
        for (iy = 0; iy < 100; iy = iy + 1)
        begin
            diff = diff + data_1[iy] - data_2[ix];
        end
    end
end

ISE WebPACK completed stage one of synthesis for this code after around 10 minutes, but had not completed the whole synthesis even after around 3 hours.

any assistance is very much appreciated.

thank you.

Reply to
CMOS

I'm answering this assuming that it is school homework.

Try simulating this code first. It just might not do what you want it to do. For example, try the case where data_1 is all 0x7F and data_2 is all 0x81. Calculate the expected value of diff, and see what the simulation produces.
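As a rough illustration of that experiment, here is a simulation-only sketch that runs the original loop on the suggested data. The testbench name and the 32-bit reference accumulator `wide` are additions for comparison, not part of the original code. Each term is 0x7F - 0x81 = -2 and the loops run 100 x 100 = 10000 times, so the true sum is -20000, which an 11-bit unsigned reg (range 0 to 2047) cannot hold.

```verilog
// Simulation-only sketch; tb_diff_overflow and "wide" are hypothetical names.
module tb_diff_overflow;
  parameter nColumns  = 100;
  parameter nRows     = 200;
  parameter RAM_WIDTH = 8;

  reg [RAM_WIDTH-1:0] data_1 [nColumns*nRows-1:0];
  reg [RAM_WIDTH-1:0] data_2 [nColumns*nRows-1:0];

  reg [7:0]  ix, iy;
  reg [10:0] diff;   // 11 bits, as in the original post
  integer    wide;   // 32-bit signed reference, added for comparison
  integer    i;

  initial begin
    // Fill both matrices with the suggested test values.
    for (i = 0; i < nColumns*nRows; i = i + 1) begin
      data_1[i] = 8'h7F;
      data_2[i] = 8'h81;
    end

    diff = 0;
    wide = 0;

    // Same loop bounds and indexing as the original always block.
    for (ix = 0; ix < 100; ix = ix + 1)
      for (iy = 0; iy < 100; iy = iy + 1) begin
        diff = diff + data_1[iy] - data_2[ix];
        wide = wide + data_1[iy] - data_2[ix];
      end

    // The two values differ because diff wraps modulo 2048.
    $display("11-bit diff = %0d, 32-bit reference = %0d", diff, wide);
    $finish;
  end
endmodule
```

Note also that the loop indexes data_1 with iy and data_2 with ix, so it does not pair up corresponding elements of the two matrices; with constant data that does not show, which is one reason a small 2 x 2 case with distinct values is worth simulating too.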

You might want to try a smaller example. Try to get a 2 X 2 matrix correct first.

You might want to look at the synthesis templates. RAMs need to be coded in fairly specific ways to be recognized. Code that is too different from the template may not produce the expected result. To produce distributed RAM:

parameter RAM_WIDTH = ; parameter RAM_ADDR_BITS = ;

reg [RAM_WIDTH-1:0] [(2**RAM_ADDR_BITS)-1:0];

wire [RAM_WIDTH-1:0] ;

[RAM_ADDR_BITS-1:0] ; [RAM_WIDTH-1:0] ;

always @(posedge ) if () []
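For reference, a filled-in version of that kind of template might look like the sketch below. Only RAM_WIDTH and RAM_ADDR_BITS come from the template; the module name, port names and the 64-entry depth are made up for illustration, and whether the tools actually map this to distributed RAM depends on the tool version and settings.

```verilog
// Minimal sketch of a single-port RAM with asynchronous read.
// dist_ram, clk, we, addr, din and dout are hypothetical names.
module dist_ram #(
  parameter RAM_WIDTH     = 8,
  parameter RAM_ADDR_BITS = 6    // 64 entries, an arbitrary choice
) (
  input  wire                     clk,
  input  wire                     we,
  input  wire [RAM_ADDR_BITS-1:0] addr,
  input  wire [RAM_WIDTH-1:0]     din,
  output wire [RAM_WIDTH-1:0]     dout
);
  reg [RAM_WIDTH-1:0] mem [(2**RAM_ADDR_BITS)-1:0];

  // Synchronous write: at most one word per clock.
  always @(posedge clk)
    if (we)
      mem[addr] <= din;

  // Asynchronous read of a single addressed word. Block RAM needs a
  // registered read, so this style normally ends up in LUT RAM.
  assign dout = mem[addr];
endmodule
```

The point to notice is that only one word is written and one word is read at a time. A loop that touches every element in the same cycle cannot be mapped onto this structure.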

Reply to
Phil Hays

hi,

Thanks for the reply. Actually it's not school homework. I'm doing a project in which I need to subtract one image from another. This has to be done several times to get the output. To reduce the calculation time, I need to do the subtraction in one clock cycle. In order to do that I need concurrent access to all pixels of both images. I'm not sure how this could be achieved with the available block or distributed RAM.

thank you.

Reply to
CMOS

The only way to subtract all values at once (!) is to use registers, not distributed memory or BlockRAM. You will also need one subtractor for each pair of image pixels.

Redefine what you consider necessary.

Do you have access to all those pixels at once the clock cycle after the subtract?

If a new image is overwriting the first image in the clock after the subtract is supposed to happen, read the previous value and take the subtraction the same time the next value is being written. Cycle through all the values you need to subtract *as you read them* rather than requiring the subtraction happen all at once.

You can do a bazillion subtracts of width n if you have 3n bazillion registers and n(+1?) bazillion LUTs.

If you HAVE to have memory, you do not HAVE to have single-cycle subtracts.

Think about the hardware.

- John_H

CMOS wrote:

Reply to
John_H

Hi CMOS, here's the solution you want:

Assume the two input matrices are stored in two separate RAMs (e.g. BlockRAMs), dual-ported if necessary.

Connect a subtractor circuit to the DataOut of these RAMs. The output of that circuit is identical to the output of your Result-RAM.

Now the result is virtually existent. If anyone has a doubt, just make a readout of the Result-RAM. The result will be there. :-)
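In Verilog, that arrangement could be sketched roughly as follows. All names, the 1024-word depth and the shared read address are assumptions for illustration; the RAMs are written with registered reads so that they can map to BlockRAM.

```verilog
// Sketch of two RAMs with a subtractor on their data outputs.
// ram_diff and all port names are hypothetical.
module ram_diff #(
  parameter WIDTH     = 8,
  parameter ADDR_BITS = 10       // 1024 words per RAM, an arbitrary choice
) (
  input  wire                  clk,
  input  wire                  we_1,
  input  wire [ADDR_BITS-1:0]  waddr_1,
  input  wire [WIDTH-1:0]      wdata_1,
  input  wire                  we_2,
  input  wire [ADDR_BITS-1:0]  waddr_2,
  input  wire [WIDTH-1:0]      wdata_2,
  input  wire [ADDR_BITS-1:0]  raddr,   // "Result-RAM" read address
  output wire signed [WIDTH:0] result   // one extra bit for the sign
);
  reg [WIDTH-1:0] ram_1 [(2**ADDR_BITS)-1:0];
  reg [WIDTH-1:0] ram_2 [(2**ADDR_BITS)-1:0];
  reg [WIDTH-1:0] q_1, q_2;

  always @(posedge clk) begin
    if (we_1) ram_1[waddr_1] <= wdata_1;
    if (we_2) ram_2[waddr_2] <= wdata_2;
    // Registered reads from the same address in both RAMs.
    q_1 <= ram_1[raddr];
    q_2 <= ram_2[raddr];
  end

  // The subtractor sits on the RAM outputs, so reading address N
  // returns element N of the difference one clock later.
  assign result = {1'b0, q_1} - {1'b0, q_2};
endmodule
```

No difference array is ever stored. Each result is computed at the moment it is read, one word per clock.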

Have a nice synthesis
Eilert

Reply to
backhus

Are you saying that the above mechanism will compute a difference of two arrays in one clock cycle? If so then, I am not following... maybe you can explain a bit more.

Reply to
fpgabuilder

Reply to
Peter Alfke

Sounds like he's trying to do MAD for motion compensation.

Reply to
Pete Fraser

Reply to
Marlboro

I think he meant '1 refresh cycle' instead of '1 clock cycle'.

Reply to
Marlboro

That kind of difference is too huge not to properly disclose. If that is the case, the original poster REALLY needs to clarify the expectations. Bottom line: the needs, as stated, are not realistic. The true needs must be explained to entertain solutions.

It takes a huge number of cycles to accumulate pixels and a huge number to read the pixels back to make decisions. Asking for one "clock cycle" between these lengthy events is unrealistic. A specific number of clock cycles for a refresh cycle is more realistic but must be specified to bound the problem.

Reply to
John_H

Hi fpgabuilder, the above circuit is even faster, it needs no extra Clock cycle.

The point is that CMOS didn't mention how the data is written to the RAM and how the result will be read out. Show me the RAM that can be written or read in bulk, that is, "all data at once". There is no such device. So what my circuit does is present a Result-Databus over which the subtraction result of RAM1 and RAM2 can be read out. The results are virtually existent in that "Result-RAM" instantly after the last word of data has been written into the Input-RAMs. All you need to do is read them out.

If one uses DP-RAMs as mentioned, the results can even be read out while the data is being written to the input RAMs.

Even more... If you can manage your data flow in such a way that the first results are read out one clock cycle after the two input values have been written, you can use time multiplexing mechanisms to reduce the RAM size. In fact the RAMs collapse into single registers and are capable of storing arrays of any size, just one value at a time. But it works.

I know, it sounds kind of tricky, but a lot of designers are using circuits like this.

Have a nice synthesis
Eilert

Reply to
backhus

Thank you for all the replies. I tried to post a reply earlier but it didn't work.

The subtraction is just one operation. Actually what I need is to do a convolution of two images. I didn't think about doing it as the data arrives, as it seems very complicated. The project I'm working on is not a commercial one. I need to prove a concept. So I don't mind using the whole FPGA (Xilinx Spartan-3, 400K version) just to implement this, as long as I get the result of the convolution very fast.

thank you.

Reply to
CMOS

I don't think you mean that. I think you mean the difference of two images.

I promise you, it is MUCH easier than trying to do it on two complete images. You need the old image in memory, and you probably need a second store for the difference image. For each pixel, you perform

difference_image[row][col] := new_pixel - old_image[row][col];
old_image[row][col] := new_pixel;

before going on to increment the column address. Note that this requires two memory areas, and one read and two writes for each pixel. You'll also need to budget for the read accesses that some CPU or whatever must make to the result.
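A minimal Verilog sketch of that per-pixel update is shown below. It makes some assumptions that go beyond the post: the pixels arrive one per clock with a flat address instead of row and column, the difference is sent out as a stream rather than written to a second memory, and all module and port names are invented.

```verilog
// Sketch of differencing against a stored image as pixels arrive.
// frame_diff and all port names are hypothetical.
module frame_diff #(
  parameter WIDTH     = 8,
  parameter ADDR_BITS = 10       // 1024 pixels, enough for a 100 x 10 image
) (
  input  wire                  clk,
  input  wire                  pixel_valid,
  input  wire [ADDR_BITS-1:0]  pixel_addr,
  input  wire [WIDTH-1:0]      new_pixel,
  output reg                   diff_valid,
  output reg  [ADDR_BITS-1:0]  diff_addr,
  output reg  signed [WIDTH:0] diff
);
  reg [WIDTH-1:0] old_image [(2**ADDR_BITS)-1:0];

  always @(posedge clk) begin
    diff_valid <= pixel_valid;
    diff_addr  <= pixel_addr;
    if (pixel_valid) begin
      // Nonblocking assignments: the subtraction sees the OLD stored
      // pixel, then the same location is overwritten with the new one.
      diff                  <= {1'b0, new_pixel} - {1'b0, old_image[pixel_addr]};
      old_image[pixel_addr] <= new_pixel;
    end
  end
endmodule
```

In simulation the first frame produces unknown (x) differences because old_image starts uninitialized. From the second frame on, each difference is available one clock after its pixel arrives.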

As others have pointed out, it's hard to get it any faster than this: the difference image is constructed in real-time whilst the second of your two fields is being acquired.

Do you really mean CONVOLUTION? That's a spatial filtering operation, in the usual meaning. Taking the difference of two successive images is a temporal filtering operation - I suppose it's a convolution in the time domain with the kernel (+1, -1) .....

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
Reply to
Jonathan Bromley

Yes, I mean it. It's convolution. I'm doing it to calculate the motion vector by comparing two consecutive images.

Reply to
CMOS

OK, when you said "the subtraction is just one operation" you meant that it is one of many operations that need to be performed. Sorry I misunderstood.

I reckon the subtraction is trivial by comparison with the feature-detecting convolution operations, on various scales, that you probably need to do. Or the DCT. Or the correlation. Why is the subtraction such a big deal?

--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
Reply to
Jonathan Bromley

Much bigger deal.

Good luck. You'll probably be faster doing a 2D FFT on both images, multiplying them, then doing the inverse transform.

One clock? Hah!

Reply to
Pete Fraser

Seems like a natural way to do it. I do it all the time. Actually, I would not even use the internal RAM unless I needed a spatial operation as it seems is the original poster's problem.

I wasn't sure if you were talking about applying your solution to all the array elements simultaneously.

Best, Sanjay

Reply to
fpgabuilder

fpgabuilder wrote:

Hi Sanjay, I see you got my point. :-)

In fact my posting was a little joke to pull CMOS back to reality by presenting a simple but cunning solution using big words like "virtual" etc. :-)

Even his convolution can be done that way. Just needs an extra accumulating adder, no big deal.
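As a sketch of the "subtractor plus accumulating adder" idea, the module below sums absolute differences over a stream of pixel pairs. Using absolute differences is an assumption (it matches the MAD guess made earlier in the thread); for a correlation-style sum the difference would be replaced by a product and the accumulator widened. Names and widths are illustrative only.

```verilog
// Sketch of a streaming sum of absolute differences.
// stream_sad and all port names are hypothetical.
module stream_sad #(
  parameter WIDTH    = 8,
  parameter ACC_BITS = 18        // 1000 pixels * 255 max = 255000 < 2**18
) (
  input  wire                clk,
  input  wire                clear,   // pulse at the start of an image pair
  input  wire                valid,   // one pixel pair per valid clock
  input  wire [WIDTH-1:0]    a,
  input  wire [WIDTH-1:0]    b,
  output reg  [ACC_BITS-1:0] acc
);
  // Absolute difference of two unsigned pixels.
  wire [WIDTH-1:0] abs_diff = (a > b) ? (a - b) : (b - a);

  always @(posedge clk)
    if (clear)
      acc <= 0;
    else if (valid)
      acc <= acc + abs_diff;
endmodule
```

One subtractor, one adder and one register handle an array of any size, at the cost of one clock per element.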

I really had to smile when I read things like:

parameter nColumns = 100;
parameter nRows = 200;
parameter RAM_WIDTH = 8;

and later:

"I didn't think about doing it as the data arrives, as it seems very complicated. The project I'm working on is not a commercial one. I need to prove a concept."

Well.. a simple calculation shows that he needs 100*200*8*2 FF to store the Input data. That's 320000 FFs! In the Xilinx Architecture that's like 160000 Slices or 80000 CLBs. Even a Virtex 5 LX330 can't handle that.

And even if he reduces his array to a size that fits into his real existing chip, the resulting combinatorial circuit to do the calculation fully in parallel will be so complex that the resulting delay will be somewhere in the millisecond range. Great proof of concept. Brainless nonsense that is.

Reply to
backhus

I'm not trying to prove a concept regarding SIMULTANEOUS SUBTRACTION using FPGAs.

It's a bit more complicated, and this subtraction thing (and the whole Verilog code) is just a tool to prove it. That's why I don't bother about the best, most cost-efficient way of doing it, as long as I can get results fast. The code I mentioned is just an example. The actual matrix is at most 100 x 10, which allows me to do this in a real chip.

And I don't mind waiting 100 milliseconds for the calculation, as the current calculation takes 3 seconds to complete. (This is using direct convolution, without doing an FFT.)

thank you.

Reply to
CMOS
