Faster image rotation

[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] [ Followup-To: set to comp.programming ]

Hello,

I imagine the following problem has been efficiently solved over a million times in the past.

I need to rotate a picture clockwise 90 degrees.

Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets.

pixel original_picture[W][H]; pixel rotated_picture[H][W];

At the implementation level, I'm dealing with 1D integer arrays.

int src[W*H*3]; int dest[W*H*3];

Conceptually, the rotation operation comes down to

rotated_picture[j][H-1-i] = original_picture[i][j]

My actual code is

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

Consider as an example, W = 1644 H = 1164 size = 5.74 MB

On my target platform, the rotation takes 1.65 seconds.

I'm trying to get this down under half a second.

I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level.

I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time.

Changing the loop to

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; }

brings a 5% improvement.

I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down.

Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling?

Comments and insight would be greatly appreciated.

Regards.

Reply to
Noob
Loading thread data ...

I made a mistake. I meant to write uint8_t src[W*H*3]; uint8_t dest[W*H*3];

You're right. I meant uint8_t not int.

I'm not sure I understand what you mean.

I was trying to describe the algorithm at a slightly higher level.

pixel original_picture[W][H]; pixel rotated_picture[H][W]; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) rotated_picture[j][H-1-i] = original_picture[i][j]

There is perhaps a better (clearer) way to express the algorithm?

In what sense?

Methods? Do you mean the functions?

Strength reduction is typically the kind of task where a compiler should excel.

formatting link

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { unsigned char *C = B+(H*j+H-i-1)*3; C[0] = A[0]; C[1] = A[1]; C[2] = A[2]; A += 3; }

was slower than

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(B+(H*j+H-i-1)*3, A, 3); A += 3; }

on my platform.

Thanks for the tips.

Regards.

Reply to
Noob

I haven't looked at it in detail but that immediately caught my eye. Surely:

memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int));

--
Andrew Smallshaw
andrews@sdf.lonestar.org
Reply to
Andrew Smallshaw

from that group ]

Makes more snse

....

Too many calculations done every iteration to create an offset which should be the same difference between each stage on every iteration of the loop.

No the method/algorithm used, adding functions on every iteration of the loop adds overhead for something like this.

I think it would have trouble changing the loop style see later.

Won't stand much chance of changing something written in an usuual way. There is not much it could reduce there usefully.

Still a complex calculation EVERY loop iteration, for any particular sizes being used 'H * j' is equivalent to adding H at the end of each loop iteration, whilst every time this inner loop is started 'H - i - 1' is the same for the duration of the inner loop.

Only do actual parts of calculations that really change inside a loop use variables to store partial results that don't change.

Also I assume B is the fixed start point of the array, whilst C is the running pointer.

You can rewrite '(H*j+H-i-1)*3' as

3*(H*j) + 3*(H-i-1)

The first half is equivalent to adding 3 * H every loop iteration and the second half is fixed for duration of inner loop.

Consider this which needs testing is off the top of my head

// B = base of o/p array unsigned long Hstride, Hoffset; unsigned char *C;

C = B; Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop Hoffset = 0; // Hoffset coding could be improved for (i = 0; i < H; ++i, C += Hoffset ) { Hoffset = (H-i-1)*3; for (j = 0; j < W; ++j, C += Hstride ) { *C++ = *A++; *C++ = *A++; *C++ = *A++; } }

Everytime you use array[ index ] the complier creates an arithmetic pointer manipulation to calculate the actual pointer to the desired location.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Reply to
Paul Carpenter

I suspect some compilers may do this 'under the hood', but you might want to consider adding a temporary variable to cut down on repeated multiplies:

for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-1-i)*3, src+(W_times_i+j)*3, 3); } }

This should eliminate W - 1 multiplies.

Given that there is a certain amount of overhead involved in calling memcpy (especially for such a small number of bytes), you might want to simply replace the call with three explicit assignments:

for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { dest_indx = (H*j+H-1-i)*3; src_indx = (W_times_i+j)*3; dest[dest_indx++] = src[src_indx++]; dest[dest_indx++] = src[src_indx++]; dest[dest_indx] = src[src_indx]; } }

You might also try i++ and j++ in your for loops - if your processor's hardware supports post-increment better than pre-increment, your compiler may be aware of that fact.

hope that helps

Joe Power

Reply to
Joseph Power

from that group ]

Is that really true? It seems like the sort of thing that -O3 ought to be able to take care of.

--
Rob Gaddi, Highland Technology
Email address is currently out of order
Reply to
Rob Gaddi

It depends. Compilers can optimize simple index arithmetics into manipulation with pointers. However they are goofing with increased level of loop nesting and more complicated index expressions. For that matter, pointer arithmetics is usually faster.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

Reply to
Vladimir Vassilevsky

from that group ]

Look back at the OP's code and the pointers are incremented between iterations of the loop then used as arrays, how is it going to optimise a moving target other than making a copy and adding sizeof bytes to the temporary pointer.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Reply to
Paul Carpenter

Are you sure you will always want to rotate by some multiple of 90 deg?

Are you sure you will *always* want to rotate CW by exactly 90 deg?

Your original data definition was cleaner.

Why are you taking the memcpy function invocation hit here? You're just moving "3 bytes (octets)".

Create two pointers:

pixel *input, *output;

Initialize the input pointer to "some corner" (top left is convenient). Initialize the output pointer to the corner that is 90 degrees CW from this point.

Move the input pointer across a row (or down a column -- depending on which corner you chose as a starting point) and move the output pointer down a column (or across a row, etc.).

Then, just copy from *input to *output.

Of course, using a three byte type makes this cumbersome. I assume you don't want to pack those three bytes into a long int (time/space efficiency concerns). So, you'll have to break it down to a set of three "byte operations".

It might help if you told us which processor you are targeting.

Reply to
D Yuniskis

It depends on the compiler. Most C90 and later compilers do a pretty good job optimizing array indexing and are not so easily confused by high dimensionality or sub-array blocking. In many cases you will actually confuse them and reduce performance if you try to do your own pointer arithmetic.

That said, old versions of GCC (certainly anything prior to 3.x) were notoriously bad at optimizing even 3-D array indexing. If your chip requires using one of these old compilers, then you certainly should do your own pointer arithmetic.

George

Reply to
George Neuner

Noob skrev:

Modern Hardware like the AT91SAM9M10 will do this in a H/W accelerator.

Best Regards Ulf Samuelsson

Reply to
Ulf Samuelsson

original

.------------------->X | | | | | | \|/ Y

rotated 90 degrees clockwise

y
Reply to
Skybuck Flying

I was specifically asking how to do it in software.

I find this spamvertisement very inappropriate.

Does your employer condone this activity?

Reply to
Noob

Well, the fact that some devices feel the need to do this in hardware is relevent. It means that either the processor does have the grunt to do the job well itself or that there is significant performance to be had from methods that are not easy possible in software (e.g. dedicated parallel hardware). If the second then it hints that some cleverness in software will be needed to do well.

Anyway, there is nothing about USENET that requires replies to you questions to be for you only. If the reply is related, and the can reasonably expect there to be readers for which the reply is relevent, then I don't have a big problem. Some of this is about motivation. If the motivation of the response was to shoehorn there product or services into the discussion rather than to be genuinely helpful then that I surely don't like. Now, Ulf has a history in this group and I feel that history has been strongly positive though somewhat focused on Atmel products (since that's what he knows best).

Peter

Reply to
Peter Dickerson

I'm displaying JPEG photographs. Some users want to rotate the pictures (landscape to portrait). AFAICS, all I need is 90° either way.

Probably. And I also added a bug, because I meant uint8_t src[W*H*3]; uint8_t dest[W*H*3];

Right.

I've done this, and it is still too slow for my taste :-(

It brings the run-time from 1.6 seconds down to 1.1 seconds (for my 1644 x 1164 example).

I didn't see how to ask libjpeg to output 32-bit pixel values instead of 24-bit RGB values.

The data sheet states

SH-4 32-bit super-scalar RISC CPU o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU o 5-stage pipeline, delayed branch support o floating point unit, matrix operation support o debug port, interrupt controller

Regards.

Reply to
Noob

As a matter of fact, I was told that the blitter on this platform (STx7109)

*does* support image rotation in hardware, but that the capability had not been exposed by the API. Strange, wouldn't you say?

I do understand that.

I agree that it might be appropriate to mention that some platforms provide hardware support for some operations. What I found rather spammy was the explicit mention of one of his employer's platform.

I will concede that I may have over-reacted ;-)

Regards.

Reply to
Noob

Seriously, consider doing this in assembler. I have done some fax image processing in the past and the difference between a HLL implementation and an assemler implementation of the run length decoding of a PCX image was astonishing.

Converting it from 24 bit to 32 bit first, might even give you a performance gain, especially since your image dimensions seem to be a multiple of 4.

Meindert

Reply to
Meindert Sprang

Now that you've mentioned libjpeg: it supports lossless transformations, including 90° rotations. See transupp.h.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(remove the obvious prefix to reply by mail)
Reply to
Boudewijn Dijkstra

Pictures are not unconditionally rotated. They are rotated when the user asks for it, which is AFTER the picture has been decoded, which might take 2-3 seconds.

It would take longer to rotate the JPEG representation + decode new picture than to rotate the bitmap of the decoded picture

Reply to
Noob

The most important number is cache line size, which you missed. If your image is 1,024 lines tall, that will completely thrash the cache, resulting in 3 bytes copied per cache line load/spill.

If you copy 16x16 tiles you can get a 10x speedup.

CopyTile16(source, dest, x, y, width, height)...

You can also try 8x8 and 4x4, smaller loops can be faster due to all the args fitting in memory, and the loop getting unrolled. but the difference will be ~25% which is hardly worth the time to code.

If you do the following you can get a 2x speedup, it looks like more code, but will generate less, and the results will be pipelined correctly.

{ unsigned char *C = B+(H*j+H-i-1)*3; temp0 = A[0]; temp1 = A[1]; temp2 = A[2]; C[0] = temp0; C[1] = temp1; C[2] = temp2; A += 3; }

Do not use *C++ = *A++;

The is no need to flame the guy selling hardware, even though this is a bandwidth limited problem that will not be helped by more hardware.

Brett

Reply to
Brett Davis

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.