Faster image rotation

| I need to rotate a picture clockwise 90 degrees. | | Conceptually, my picture is represented by a matrix | (W columns, H rows) of pixels, with each pixel | represented by 3 octets. | | pixel original_picture[W][H]; | pixel rotated_picture[H][W]; | | At the implementation level, I'm dealing with 1D integer arrays. | | int src[W*H*3]; | int dest[W*H*3]; | | Conceptually, the rotation operation comes down to | | rotated_picture[j][H-1-i] = original_picture[i][j] | | My actual code is | | for (i = 0; i < H; ++i) | for (j = 0; j < W; ++j) | memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

This is not going to be fast because:-

  1. you are calling a subroutine 1,913,616 times to move 9 or 48 bits each time.
  2. You are multiplying by 3 twice for each move.
  3. You are using integer arrays.

  1. Best to use ordinary assignment.

  1. Eliminate *3 by changing to adds.
  2. Use byte arrays (but I'm not clear about what you say. You say that you have octets. Now 3 octets will be 9 bits, not 3 integers equal to 48 bits.)

This is how it would look in PL/I:-

declare (src(m,n), dest(n,m)) char (3); do i = 1 to m; dest(*, m-i+1) = src(i,*); end;

If the color information really is octets, then the declaration is changed to bit (9) aligned.

Reply to
robin
Loading thread data ...

This is basically a matrix transpose (with the difference that one goes down instead of up). There are cache friendly matrix transpose algorithms that should also speed this one up.

Many C compilers implement memcpy() inline. I suppose you believe that PL/I does a subroutine call for the MOD function?

Most can also figure that one out.

-- glen

Reply to
glen herrmannsfeldt

When I did my first digital picture frame project, I had a standardized "put_image" routine that took parameters for width, height, destination pixel step and destination row step. The p-code was something like:

for (i=3D0;i

Reply to
larwe

Oops, make that dest +=3D, not src +=3D

Reply to
larwe

Robin, will you /please/ stop blithering on about things you don't understand?! Buy a Data Processing dictionary, for God's sake!

The rest of this is addressed to the original poster: I don't understand why you're using int variables for octets; they should be char. For the rest, I'd do the following:

typedef struct __Pixel { unsigned char red, green, blue; } Pixel;

Pixel src[W][H]; Pixel dest[H][W];

for (int i = 0; i < W; ++i) for (int j = 0; j < H; ++i) { dest[j][i].red = src[i][j].red; dest[j][i].green = src[i][j].green; dest[j][i].blue = src[i][j].blue; }

I'd also try:

for (int i = 0; i < W; ++i) for (int j = 0; j < H; ++i) memcpy(dest[j][i], src[i][j], sizeof (Pixel));

and see which one is faster; it will depend on the individual compiler. (Make sure you test at the same optimization level you plan to use.)

--
John W Kennedy
"There are those who argue that everything breaks even in this old dump 
 Click to see the full signature
Reply to
John W Kennedy

...

And perhaps dest[j][i] = src[i][j];

But in practice W,H might only be known at runtime, making the code rather different. Depending on the exact format of the image data, there might also be padding bytes (nothing to do with C's struct padding), for example at the end of each row.

--
Bartc
Reply to
bartc

formatting link

You're right. It was a typo.

Are you sure the above is a description of a rotation? :-) Clockwise or counter-clockwise?

The target system is CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4 RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory allocator returns non-cached memory. (I found no way to request large contiguous buffers in cached memory.) All cache-specific optimizations thus became irrelevant.

On this system, a load from non-cached memory has a latency of ~45 cycles, thus the only optimization that made sense was to load 32-bit words instead of octets. I configured libjpeg to output 32-bit pixels instead of 24-bit pixels.

Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H) { int i, j; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */ }

void rotate_left(uint32 *A, uint32 *B, int W, int H) { int i, j; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */ }

gcc-4.2.4 -O2 was smart enough to strength-reduce the index computation for both arrays.

00000000 : 0: 86 2f mov.l r8,@-r15 2: 15 47 cmp/pl r7 4: 96 2f mov.l r9,@-r15 6: 63 68 mov r6,r8 8: 15 8f bf.s 36 a: 73 61 mov r7,r1 c: 08 47 shll2 r7 e: 83 69 mov r8,r9 10: fc 77 add #-4,r7 12: 13 66 mov r1,r6 14: 7c 35 add r7,r5 16: 08 49 shll2 r9 18: 04 77 add #4,r7 1a: 15 48 cmp/pl r8 1c: 07 8b bf 2e 1e: 43 60 mov r4,r0 20: 53 63 mov r5,r3 22: 83 62 mov r8,r2 24: 06 61 mov.l @r0+,r1 26: 10 42 dt r2 28: 12 23 mov.l r1,@r3 2a: fb 8f bf.s 24 2c: 7c 33 add r7,r3 2e: 10 46 dt r6 30: fc 75 add #-4,r5 32: f2 8f bf.s 1a 34: 9c 34 add r9,r4 36: f6 69 mov.l @r15+,r9 38: 0b 00 rts 3a: f6 68 mov.l @r15+,r8 3c: 09 00 nop 3e: 09 00 nop

The loop kernel is

24: 06 61 mov.l @r0+,r1 26: 10 42 dt r2 28: 12 23 mov.l r1,@r3 2a: fb 8f bf.s 24 2c: 7c 33 add r7,r3

Thanks to all for your suggestions (especially Terje).

Regards.

Reply to
Noob

[Not referring to this specific code, but just following up.]

Why can't modern CPU's optimize the heck out of the relatively simple code that a compiler might produce for a block copy? They have all of the information they need - the addresses, the length, the alignments, the position relative to page boundaries, cache lines, write buffers, etc.

Compilers often look at large chunks of code to figure out what they are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs have transistors to burn now, why can't they look for patterns that can be executed faster? Detect block copies, and turn them into streaming fetches and stores, limited only by memory speeds. Don't cache the data, don't purge any existing nonconflicting write buffers, etc. Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.

--
Experience should guide us, not rule us.

Chris Gray
Reply to
Chris Gray

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.