Faster image rotation

Noob · 2010-04-07T10:17:12+00:00

[ NB: CROSS-POSTED to ][ Followup-To: set to ]Hello,I imagine the following problem has been efficiently solvedover a million times in the past.I need to rotate a picture clockwise 90 degrees.Conceptually, my picture is represented by a matrix(W columns, H rows) of pixels, with each pixelrepresented by 3 octets.pixel original_picture[W][H];pixel rotated_picture[H][W];At the implementation level, I'm dealing with 1D integer arrays.int src[W*H*3];int dest[W*H*3];Conceptually, the rotation operation comes down to rotated_picture[j][H-1-i] = original_picture[i][j]My actual code is for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);Consider as an example,W = 1644H = 1164size = 5.74 MBOn my target platform, the rotation takes 1.65 seconds.I'm trying to get this down under half a second.I'm using the GNU tool chain. For some weird reason, wecompile everything -O0. The first thing I'll try is crankgcc's optimization level.I'm hoping gcc can perform some strength reduction, as theindex calculation seems to be taking a non-negligible fractionof the total run-time.Changing the loop to for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; }brings a 5% improvement.I thought changing the memcpy call to 3 explicit assignmentsmight help, but it actually slowed things down.Perhaps I need to perform some tiling magic... I don't thinkgcc performs automatic tiling?Comments and insight would be greatly appreciated.Regards.

R

robin 16 years ago

This is not going to be fast because:-

you are calling a subroutine 1,913,616 times to move 9 or 48 bits each time.
You are multiplying by 3 twice for each move.
You are using integer arrays.

Best to use ordinary assignment.

Eliminate *3 by changing to adds.
Use byte arrays (but I'm not clear about what you say. You say that you have octets. Now 3 octets will be 9 bits, not 3 integers equal to 48 bits.)

This is how it would look in PL/I:-

declare (src(m,n), dest(n,m)) char (3); do i = 1 to m; dest(*, m-i+1) = src(i,*); end;

If the color information really is octets, then the declaration is changed to bit (9) aligned.

Vote

G

glen herrmannsfeldt 16 years ago

This is basically a matrix transpose (with the difference that one goes down instead of up). There are cache friendly matrix transpose algorithms that should also speed this one up.

Many C compilers implement memcpy() inline. I suppose you believe that PL/I does a subroutine call for the MOD function?

Most can also figure that one out.

-- glen

Vote

L

larwe 16 years ago

When I did my first digital picture frame project, I had a standardized "put_image" routine that took parameters for width, height, destination pixel step and destination row step. The p-code was something like:

for (i=3D0;i

Vote

L

larwe 16 years ago

Oops, make that dest +=3D, not src +=3D

Vote

J

John W Kennedy 16 years ago

Robin, will you /please/ stop blithering on about things you don't understand?! Buy a Data Processing dictionary, for God's sake!

The rest of this is addressed to the original poster: I don't understand why you're using int variables for octets; they should be char. For the rest, I'd do the following:

typedef struct __Pixel { unsigned char red, green, blue; } Pixel;

Pixel src[W][H]; Pixel dest[H][W];

for (int i = 0; i < W; ++i) for (int j = 0; j < H; ++i) { dest[j][i].red = src[i][j].red; dest[j][i].green = src[i][j].green; dest[j][i].blue = src[i][j].blue; }

I'd also try:

for (int i = 0; i < W; ++i) for (int j = 0; j < H; ++i) memcpy(dest[j][i], src[i][j], sizeof (Pixel));

and see which one is faster; it will depend on the individual compiler. (Make sure you test at the same optimization level you plan to use.)

John W Kennedy "There are those who argue that everything breaks even in this old dump of a world of ours. I suppose these ginks who argue that way hold that because the rich man gets ice in the summer and the poor man gets it in the winter things are breaking even for both. Maybe so, but I'll swear I can't see it that way." -- The last words of Bat Masterson

Vote

B

bartc 16 years ago

...

And perhaps dest[j][i] = src[i][j];

But in practice W,H might only be known at runtime, making the code rather different. Depending on the exact format of the image data, there might also be padding bytes (nothing to do with C's struct padding), for example at the end of each row.

Bartc

Vote

N

Noob 16 years ago

formatting link

You're right. It was a typo.

Are you sure the above is a description of a rotation? :-) Clockwise or counter-clockwise?

The target system is CPU: 266 MHz, dual issue, 5-stage integer pipeline, SH-4 RAM: Two 64-MB, 200-MHz, DDR1 SDRAM modules (on separate memory buses)

After much testing, it dawned on me that the system's memory allocator returns non-cached memory. (I found no way to request large contiguous buffers in cached memory.) All cache-specific optimizations thus became irrelevant.

On this system, a load from non-cached memory has a latency of ~45 cycles, thus the only optimization that made sense was to load 32-bit words instead of octets. I configured libjpeg to output 32-bit pixels instead of 24-bit pixels.

Then I got away with trivial code:

void rotate_right(uint32 *A, uint32 *B, int W, int H) { int i, j; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) B[H*j + H-1-i] = A[W*i + j]; /* B[j][H-1-i] = A[i][j] */ }

void rotate_left(uint32 *A, uint32 *B, int W, int H) { int i, j; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) B[H*(W-1-j) + i] = A[W*i + j]; /* B[W-1-j][i] = A[i][j] */ }

gcc-4.2.4 -O2 was smart enough to strength-reduce the index computation for both arrays.

00000000 : 0: 86 2f mov.l r8,@-r15 2: 15 47 cmp/pl r7 4: 96 2f mov.l r9,@-r15 6: 63 68 mov r6,r8 8: 15 8f bf.s 36 a: 73 61 mov r7,r1 c: 08 47 shll2 r7 e: 83 69 mov r8,r9 10: fc 77 add #-4,r7 12: 13 66 mov r1,r6 14: 7c 35 add r7,r5 16: 08 49 shll2 r9 18: 04 77 add #4,r7 1a: 15 48 cmp/pl r8 1c: 07 8b bf 2e 1e: 43 60 mov r4,r0 20: 53 63 mov r5,r3 22: 83 62 mov r8,r2 24: 06 61 mov.l @r0+,r1 26: 10 42 dt r2 28: 12 23 mov.l r1,@r3 2a: fb 8f bf.s 24 2c: 7c 33 add r7,r3 2e: 10 46 dt r6 30: fc 75 add #-4,r5 32: f2 8f bf.s 1a 34: 9c 34 add r9,r4 36: f6 69 mov.l @r15+,r9 38: 0b 00 rts 3a: f6 68 mov.l @r15+,r8 3c: 09 00 nop 3e: 09 00 nop

The loop kernel is

24: 06 61 mov.l @r0+,r1 26: 10 42 dt r2 28: 12 23 mov.l r1,@r3 2a: fb 8f bf.s 24 2c: 7c 33 add r7,r3

Thanks to all for your suggestions (especially Terje).

Regards.

Vote

C

Chris Gray 16 years ago

[Not referring to this specific code, but just following up.]

Why can't modern CPU's optimize the heck out of the relatively simple code that a compiler might produce for a block copy? They have all of the information they need - the addresses, the length, the alignments, the position relative to page boundaries, cache lines, write buffers, etc.

Compilers often look at large chunks of code to figure out what they are doing (e.g. Sun's "heroic optimizations" of a few years ago). CPUs have transistors to burn now, why can't they look for patterns that can be executed faster? Detect block copies, and turn them into streaming fetches and stores, limited only by memory speeds. Don't cache the data, don't purge any existing nonconflicting write buffers, etc. Is the latency of detecting the situation too large?

Lots of code does a lot of copying - there could be a real benefit.

Experience should guide us, not rule us. Chris Gray

Vote

Faster image rotation

Join the Discussion

Didn't find your answer?