video buffering scheme, nonsequential access (no spatial locality)

I am doing some embedded video processing, where I store an incoming frame of video; then, based on some calculations in another part of the system, I warp that buffered frame. Currently, when the frame goes into the buffer (an off-FPGA SDRAM chip), it is simply written one pixel at a time, in row-major order.

The problem is that I will not be accessing it that way. I may want to do some arbitrary image rotation, which means the first pixel I want to access is not the first one I put in the buffer; it might actually be the last one. If I do full-page reads, or even burst reads, I will fetch a bunch of pixels that I don't need to determine the output pixel value. If I just do single reads, that wastes a bunch of clock cycles setting up the SDRAM: telling it which row to activate and which column to read from, and then, after the read is done, issuing the precharge command to close the row. There is a high degree of inefficiency to this; it takes 5, maybe 10 clock cycles just to retrieve one pixel value.
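To put rough numbers on that overhead, here is a minimal cost model in C (the timing values are assumptions for illustration, not from any particular datasheet):

    #include <stdio.h>

    /* Assumed SDR SDRAM timing in clock cycles; purely illustrative,
       check the datasheet of the real part. */
    #define T_RCD 2   /* ACTIVATE to READ/WRITE */
    #define T_CL  2   /* READ to first data word (CAS latency) */
    #define T_RP  2   /* PRECHARGE to next ACTIVATE */

    int main(void)
    {
        /* Isolated access: ACTIVATE, READ, one data word, PRECHARGE. */
        int single = T_RCD + T_CL + 1 + T_RP;

        /* N-word burst from one activation: the same row overhead is
           amortized over the whole burst. */
        int n = 8;
        double per_pixel = (double)(T_RCD + T_CL + n + T_RP) / n;

        printf("isolated read: %d cycles/pixel\n", single);        /* 7    */
        printf("%d-word burst: %.2f cycles/pixel\n", n, per_pixel); /* 1.75 */
        return 0;
    }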

Does anyone know a good way to organize a frame buffer to be more friendly (and more optimal) to nonsequential access (like the kind we might need if we wanted to warp the input image via some linear/nonlinear transformation)?

Reply to
wallge

Well, there won't be a scheme that fits every possible transform... (if there were, that would mean the SDRAM was as flexible as SRAM...)

Can't you narrow down the type of access you want to do a little?

Reply to
Sylvain Munaut

I've done something similar in the past. In my project I was doing small-angle rotation, so I knew ahead of time the maximum line-to-line skew of the pixels that became a vertical line in the output image, and it was small (like 1 pixel). When I started the project, however, I had the idea that the best way to handle the general case of rotation would be to build a cache memory in the FPGA. The parts I was using at the time (XCV50's) were a bit small to implement a decent cache, but I would think newer parts could do this quite handily.

Also important is using the minimum burst size in the SDRAM, to reduce your cache-line access time.

HTH, Gabor

Reply to
Gabor

Hello,

It all depends on your needs, of course, but block-style ordering can help relieve the problem a bit by breaking the 1D orientation of raster-scan order.

For instance, you can pack pixels in groups of 16, each group representing a 4x4 square of your image. When retrieving the data, you get data from both dimensions, which has much better spatial locality than a line of 16 pixels. This may help you quite a bit.
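A minimal sketch of that packing in C (the image width is an assumption for the example):

    #include <stdint.h>

    #define IMG_W 1024u   /* assumed image width, a multiple of 4 */
    #define TILE  4u      /* 4x4-pixel tiles, 16 pixels each */

    /* Map pixel (x, y) to a linear buffer address in which each 4x4
       tile occupies 16 consecutive locations, so one short burst
       fetches a 2D neighborhood instead of a 1D strip. */
    uint32_t tiled_addr(uint32_t x, uint32_t y)
    {
        uint32_t tile_index = (y / TILE) * (IMG_W / TILE) + (x / TILE);
        return tile_index * TILE * TILE + (y % TILE) * TILE + (x % TILE);
    }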

Peano-style or quadtree-style walking of the image could also be investigated, but my memory of it is that it's quite a bit more complicated...

JB

Reply to
jbnote

If you are doing truly arbitrary warping, isn't it the case that no single organisation can be optimal for all warps?

Could you do some kind of caching scheme where you read an entire DRAM row in at a time, and "hope it comes in handy" later?

Failing that, can you use SSRAM for your frame buffer?

Or, can you parallelise your task so that it operates on (e.g.) 4 wildly different areas of the input data at a time, which means you can use the banking mechanism of the DRAMs to hide the latency?

Those are my initial thoughts (whilst waiting for a very loooooong simulation to run :-)

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
Reply to
Martin Thompson

Sounds like a RAM. If it didn't fit in FPGA block RAM, I would use an external device.

-- Mike Treseler

Reply to
Mike Treseler

Another option (depending on your mapping) would be to do it in two passes. There's a transpose in the middle, so it would probably be best to do that in small sections through an on-chip transpose buffer, before writing it out to the intermediate store.
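A minimal sketch of that blocked transpose in C (the 16x16 block size is an assumption standing in for a BRAM-sized buffer; w and h are assumed to be multiples of the block size):

    #include <stdint.h>

    #define BLK 16u   /* assumed transpose-buffer size (e.g. one BRAM) */

    /* Transpose a w x h image through a small BLK x BLK buffer so that
       both the reads from src and the writes to dst are sequential
       runs (burst-friendly). */
    void transpose_blocked(const uint8_t *src, uint8_t *dst,
                           uint32_t w, uint32_t h)
    {
        uint8_t buf[BLK][BLK];

        for (uint32_t by = 0; by < h; by += BLK)
            for (uint32_t bx = 0; bx < w; bx += BLK) {
                for (uint32_t y = 0; y < BLK; y++)       /* fill buffer   */
                    for (uint32_t x = 0; x < BLK; x++)
                        buf[y][x] = src[(by + y) * w + (bx + x)];
                for (uint32_t x = 0; x < BLK; x++)       /* drain rotated */
                    for (uint32_t y = 0; y < BLK; y++)
                        dst[(bx + x) * h + (by + y)] = buf[y][x];
            }
    }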

Have you thought about what order of filtering you need?

Check out Digital Image Warping by Wolberg, or one of Alvy Ray Smith's scan line ordering papers.

Reply to
Pete Fraser

I should have been more specific in my question.

I have to use a small (64 Mbit) mobile SDRAM; I can't choose a different storage element for the system (other than *some* FPGA buffering, though not a full frame).

I have heard some discussion of the way graphics accelerator boards do memory transactions, storing pixels in blocks of neighboring pixels (instead of organizing them row-major). In other words, the spatial locality in the SDRAM buffer might look like:

Image pixels:

    N2 N3 N4
    N1 P  N5
    N8 N7 N6

Memory organization:

    ADDR     DATA
    0x0000   P
    0x0001   N1
    0x0002   N2
    0x0003   N3
    0x0004   N4
    0x0005   N5
    0x0006   N6
    0x0007   N7
    0x0008   N8

Here P is the central pixel of interest, and the N's are its neighbors. We organize the pixels in the SDRAM buffer not by rows, but by regions of interest. This way, if we are doing some kind of image warp and want more bang for the buck in terms of read latency, we are more likely to reuse pixels in the neighborhood of the currently accessed pixel than if they were arranged in row- or column-major order (consider the case where we wanted to rotate an image by 47.2 degrees from input to output).

Has anyone seen something like this or know of any resources online with regard to memory buffer organization schemes for graphics or image processing?

Reply to
wallge

Have you thought about what order of filtering you'll need to use?

Reply to
Pete Fraser


Yes you are.

Yes it is.

And that's a filtering operation. So the maximum kernel size is 4 x 4, though you might use 2 x 2. The kernel size could have a substantial bearing on the traffic to/from on-chip RAM.

I'm still not sure of your limitations on off-chip RAM. Do you have a buffer on the input or the output (or both)? Do you have enough bandwidth for an intermediate buffer for a two-pass operation?

Reply to
Pete Fraser

Can you write out the FIR filter coeffs for a bilinear interpolation "filter kernel"? How about a bicubic interpolator filter kernel; what are its filter coeffs?

Arguing semantics was not the purpose of my post.

I will probably wind up doing bilinear interpolation or "filtering", which means I need 4 pixels of the input frame to determine 1 pixel of the output warped frame.

By the way, what is the frequency response of the bilinear interpolation "filter"?
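For what it's worth, here is a minimal bilinear sketch in C showing the four taps and their weights (8-bit grayscale, row-major; bounds handling is the caller's problem in this sketch):

    #include <stdint.h>

    /* Bilinear sample at (x + fx, y + fy) with 0 <= fx, fy < 1.
       The four taps are weighted (1-fx)(1-fy), fx(1-fy), (1-fx)fy
       and fx*fy; the weights come straight from the fractional
       position and always sum to 1.  Caller must keep x+1 and y+1
       inside the image. */
    uint8_t bilinear(const uint8_t *img, uint32_t stride,
                     uint32_t x, uint32_t y, float fx, float fy)
    {
        const uint8_t *p = img + y * stride + x;
        float top = (1.0f - fx) * p[0]      + fx * p[1];
        float bot = (1.0f - fx) * p[stride] + fx * p[stride + 1];
        return (uint8_t)((1.0f - fy) * top + fy * bot + 0.5f);
    }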


Reply to
wallge

I'm happy to, but we're getting away from FPGA stuff, so let's take that offline. Let me know how many phases you need, and the coefficient format you'd like. I usually use a minor 4x4 variation on cubic, but it's all set up in Mathematica, so I could do plain cubic also.

So you don't really need coefficient tables for this. You can just use the fractional phase directly.

It depends on the position of the output pixel relative to the input pixels, but for a central output pixel the frequency response would be cosinusoidal.
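To make that concrete (a standard DTFT evaluation, for the half-pixel phase, where the bilinear kernel reduces to the two-tap average [1/2, 1/2]):

    H(e^{jw}) = 1/2 + (1/2) e^{-jw} = e^{-jw/2} cos(w/2)

so the magnitude response is |H(f)| = cos(pi*f/fs), rolling off to zero at Nyquist; hence "cosinusoidal".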

Getting back to FPGA stuff though, what are your off-chip RAM bandwidth limitations, and could you consider a two-pass approach?

Reply to
Pete Fraser

A fairly simple technique, reasonably well known among video system designers, is what is sometimes called tiling the image into the ((DDR)S)DRAM columns. E.g., assume a 1Kx1K image with vertical and horizontal address bits (V9..V0) and (H9..H0), and a DRAM with row and column address bits (R9..R0) and (C9..C0). Do _not_ use the straight mapping:

    (V9..V0) -> (R9..R0)  and  (H9..H0) -> (C9..C0)

Instead, map the H/V LSBs into the DRAM column address, and the H/V MSBs into the DRAM row address:

    (V4..V0,H4..H0) -> (C9..C0)  and  (V9..V5,H9..H5) -> (R9..R0)

When warping, the image sample addresses are pipelined out to the DRAM, with time designed into the pipeline to examine the addresses for DRAM row boundary crossings, stalling only when it's necessary to re-RAS. Stalls then occur only when the sampling area overlaps the edge of a tile, instead of with every 2x2 or 3x3 fetch. (You posed your question so well, I'll bet this already occurred to you.)
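A sketch of that bit mapping in software terms (C used just to show the bit slicing; in the FPGA this is pure wiring):

    #include <stdint.h>

    /* The bit-slice mapping above for the 1Kx1K example: each DRAM
       row holds one 32x32-pixel tile, so a small 2D fetch usually
       stays within a single open row. */
    void tiled_map(uint32_t h, uint32_t v,        /* H9..H0, V9..V0 */
                   uint32_t *row, uint32_t *col)  /* R9..R0, C9..C0 */
    {
        *col = ((v & 0x1Fu) << 5) | (h & 0x1Fu);  /* (V4..V0, H4..H0) */
        *row = (v & 0x3E0u) | (h >> 5);           /* (V9..V5, H9..H5) */
    }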

Caching can also be used to bypass the external RAM access pipeline when the required pixels are already in the FPGA. There are lots of different caching techniques; I haven't looked at that in a while.

Block processing is a kind of variant of caching, reading a tile from external DRAM into BRAM, warping from that BRAM into another BRAM, then sending the results back out, but border calculations get messy for completely arbitrary warps.

HTH Just John

Reply to
JustJohn

I am not sure what you mean by a two-pass approach. The max (theoretical) bandwidth I have available to/from the SDRAM is about 16 bits * 100 MHz = 1.6 Gbit/s.

That is of course not achievable in practice, even if I only did full-page reads and writes, since there is overhead associated with each; I also have to refresh periodically.

My pixel bit width could be brought down to 8 bits. That way I could store 2 pixels per address if need be.


Reply to
wallge

You may be missing an important feature of SDRAM. You don't need to use full-page reads or writes to keep data streaming at 100% of the available bandwidth (if you don't change direction), or very nearly 100% (if you switch between read and write infrequently). This is due to the ability to set up a block operation on one bank while another bank is transferring data. When I use SDRAM for relatively random access like this, I like to think of the minimum data unit as one minimal burst (two words in a single-data-rate SDRAM) to each of the four banks. Any number of these data units can be strung one after another with no break in the data flow. So if you wanted to internally buffer a square section of the image in internal blockRAM, the width of the minimum block (allowing 100% data rate) would only be 16 8-bit pixels or 8 16-bit pixels in your case. If that area covers the required computational core (4 x 4?) for several output pixels at a time, you can reduce overall bandwidth. This was the point of suggesting an internal cache memory.
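As a sketch of how such an address map might look (one possible split, not the only one: the bank bits sit just above the burst bit, so successive minimal bursts rotate through all four banks; the field widths are assumptions):

    #include <stdint.h>

    /* Assumed geometry: SDR SDRAM, burst length 2, 4 banks, 10
       column bits.  Consecutive 2-word bursts rotate through the
       banks, so the controller can ACTIVATE the next bank while
       the current one is transferring data. */
    void split_addr(uint32_t word_addr, uint32_t *bank,
                    uint32_t *row, uint32_t *col)
    {
        uint32_t word_in_burst = word_addr & 0x1u;
        *bank = (word_addr >> 1) & 0x3u;           /* new bank per burst */
        *col  = (((word_addr >> 3) & 0x1FFu) << 1) | word_in_burst;
        *row  = word_addr >> 12;                   /* bits above bank+col */
    }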

HTH, Gabor

Reply to
Gabor

Hi Gabor,

I've missed this too. What happens at the end of the burst? Do you issue another "CAS", or otherwise the burst will stop (unless it's full page)? Anyway, if it works like that, then we can save a bunch of RAS-to-CAS cycles and the data transfer can be seamless.

I didn't know it was possible to activate another bank while one bank is being read/written; please clarify this for me, it must be a big miss on my part :)

Thanks,

Reply to
Marlboro

Don't forget that for video apps, you often don't need to refresh, as you are reading and writing the SDRAM rows in a regular fashion, which means you can guarantee that each row gets touched often enough.

Indeed for some video applications, like output framebuffers, all you need to do is ensure that you read the row out for display soon enough after the write, which is often easy to achieve.

Cheers, Martin

--
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
Reply to
Martin Thompson

Gabor,

Are you saying that I don't need to activate/precharge the bank when switching to another? I am kind of unclear on this. When do activate and precharge commands need to be issued? I thought when switching to a new row or bank you had to precharge (close) the previously active one, then activate the new row/bank before actually reading from or writing to it. Where am I going wrong here?

Also, on the notion that I don't need to refresh since I am doing video buffering: I am actually buffering multiple frames of video and reading them out several frames later. In other words, there may be a significant fraction of a second (say 1/8 to 1/4 s) of delay between writing data into a particular page of memory and actually reading it back out. Is this too much time to expect my pixel data to still be valid without refreshing?


Reply to
wallge

Each bank is either closed or open on a given row. As long as your accesses go to an already-open row, you needn't precharge. Beware, though: you must close the bank after tRAS(max), which is ~70 us.

AFAICT, people here have been suggesting data layout schemes that will increase your likelihood of hitting an open row.

Yes, that's too much. tREFI (the average periodic refresh interval) is ~7 us.
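For scale (assuming the common 64 ms / 8192-row retention spec; check your datasheet):

    tREFI = 64 ms / 8192 rows ~ 7.8 us per row

and a 1/8 to 1/4 s gap is two to four times the entire 64 ms retention window, so you will have to keep refreshing (or deliberately touch every row often enough).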

Tommy

Reply to
Tommy Thorn
