Faster image rotation

- N
- Noob
  
  Contact options for registered users
posted
14 years ago

Wed, Apr 7, 2010 10:17 AM

[ NB: CROSS-POSTED to comp.arch, comp.arch.embedded, comp.programming ] [ Followup-To: set to comp.programming ]

Hello,

I imagine the following problem has been efficiently solved over a million times in the past.

I need to rotate a picture clockwise 90 degrees.

Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets.

pixel original_picture[W][H]; pixel rotated_picture[H][W];

At the implementation level, I'm dealing with 1D integer arrays.

int src[W*H*3]; int dest[W*H*3];

Conceptually, the rotation operation comes down to

rotated_picture[j][H-1-i] = original_picture[i][j]

My actual code is

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);

Consider as an example, W = 1644 H = 1164 size = 5.74 MB

On my target platform, the rotation takes 1.65 seconds.

I'm trying to get this down under half a second.

I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level.

I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time.

Changing the loop to

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; }

brings a 5% improvement.

I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down.

Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling?

Comments and insight would be greatly appreciated.

Regards.

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 12:50 PM

I made a mistake. I meant to write uint8_t src[W*H*3]; uint8_t dest[W*H*3];

You're right. I meant uint8_t not int.

I'm not sure I understand what you mean.

I was trying to describe the algorithm at a slightly higher level.

pixel original_picture[W][H]; pixel rotated_picture[H][W]; for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) rotated_picture[j][H-1-i] = original_picture[i][j]

There is perhaps a better (clearer) way to express the algorithm?

In what sense?

Methods? Do you mean the functions?

Strength reduction is typically the kind of task where a compiler should excel.

formatting link

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { unsigned char *C = B+(H*j+H-i-1)*3; C[0] = A[0]; C[1] = A[1]; C[2] = A[2]; A += 3; }

was slower than

for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(B+(H*j+H-i-1)*3, A, 3); A += 3; }

on my platform.

Thanks for the tips.

Regards.

- A
- Andrew Smallshaw
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 2:40 PM

I haven't looked at it in detail but that immediately caught my eye. Surely:

memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3 * sizeof(int));

--
Andrew Smallshaw
andrews@sdf.lonestar.org

- P
- Paul Carpenter
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 3:35 PM

from that group ]

Makes more snse

....

Too many calculations done every iteration to create an offset which should be the same difference between each stage on every iteration of the loop.

No the method/algorithm used, adding functions on every iteration of the loop adds overhead for something like this.

I think it would have trouble changing the loop style see later.

Won't stand much chance of changing something written in an usuual way. There is not much it could reduce there usefully.

Still a complex calculation EVERY loop iteration, for any particular sizes being used 'H * j' is equivalent to adding H at the end of each loop iteration, whilst every time this inner loop is started 'H - i - 1' is the same for the duration of the inner loop.

Only do actual parts of calculations that really change inside a loop use variables to store partial results that don't change.

Also I assume B is the fixed start point of the array, whilst C is the running pointer.

You can rewrite '(H*j+H-i-1)*3' as

3*(H*j) + 3*(H-i-1)

The first half is equivalent to adding 3 * H every loop iteration and the second half is fixed for duration of inner loop.

Consider this which needs testing is off the top of my head

// B = base of o/p array unsigned long Hstride, Hoffset; unsigned char *C;

C = B; Hstride = (3 * H) - 3; // -3 to offset the three ++ in inner loop Hoffset = 0; // Hoffset coding could be improved for (i = 0; i < H; ++i, C += Hoffset ) { Hoffset = (H-i-1)*3; for (j = 0; j < W; ++j, C += Hstride ) { *C++ = *A++; *C++ = *A++; *C++ = *A++; } }

Everytime you use array[ index ] the complier creates an arithmetic pointer manipulation to calculate the actual pointer to the desired location.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate

- J
- Joseph Power
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 4:07 PM

I suspect some compilers may do this 'under the hood', but you might want to consider adding a temporary variable to cut down on repeated multiplies:

for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-1-i)*3, src+(W_times_i+j)*3, 3); } }

This should eliminate W - 1 multiplies.

Given that there is a certain amount of overhead involved in calling memcpy (especially for such a small number of bytes), you might want to simply replace the call with three explicit assignments:

for (i = 0; i < H; ++i) { W_times_i = W * i; for (j = 0; j < W; ++j) { dest_indx = (H*j+H-1-i)*3; src_indx = (W_times_i+j)*3; dest[dest_indx++] = src[src_indx++]; dest[dest_indx++] = src[src_indx++]; dest[dest_indx] = src[src_indx]; } }

You might also try i++ and j++ in your for loops - if your processor's hardware supports post-increment better than pre-increment, your compiler may be aware of that fact.

hope that helps

Joe Power

- R
- Rob Gaddi
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 4:23 PM

from that group ]

Is that really true? It seems like the sort of thing that -O3 ought to be able to take care of.

--
Rob Gaddi, Highland Technology
Email address is currently out of order

- V
- Vladimir Vassilevsky
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 5:05 PM

It depends. Compilers can optimize simple index arithmetics into manipulation with pointers. However they are goofing with increased level of loop nesting and more complicated index expressions. For that matter, pointer arithmetics is usually faster.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

- P
- Paul Carpenter
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 5:10 PM

from that group ]

Look back at the OP's code and the pointers are incremented between iterations of the loop then used as arrays, how is it going to optimise a moving target other than making a copy and adding sizeof bytes to the temporary pointer.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate

- D
- D Yuniskis
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 9:14 PM

Are you sure you will always want to rotate by some multiple of 90 deg?

Are you sure you will *always* want to rotate CW by exactly 90 deg?

Your original data definition was cleaner.

Why are you taking the memcpy function invocation hit here? You're just moving "3 bytes (octets)".

Create two pointers:

pixel *input, *output;

Initialize the input pointer to "some corner" (top left is convenient). Initialize the output pointer to the corner that is 90 degrees CW from this point.

Move the input pointer across a row (or down a column -- depending on which corner you chose as a starting point) and move the output pointer down a column (or across a row, etc.).

Then, just copy from *input to *output.

Of course, using a three byte type makes this cumbersome. I assume you don't want to pack those three bytes into a long int (time/space efficiency concerns). So, you'll have to break it down to a set of three "byte operations".

It might help if you told us which processor you are targeting.

- G
- George Neuner
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Wed, Apr 7, 2010 11:35 PM

It depends on the compiler. Most C90 and later compilers do a pretty good job optimizing array indexing and are not so easily confused by high dimensionality or sub-array blocking. In many cases you will actually confuse them and reduce performance if you try to do your own pointer arithmetic.

That said, old versions of GCC (certainly anything prior to 3.x) were notoriously bad at optimizing even 3-D array indexing. If your chip requires using one of these old compilers, then you certainly should do your own pointer arithmetic.

George

- U
- Ulf Samuelsson
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 12:56 AM

Noob skrev:

Modern Hardware like the AT91SAM9M10 will do this in a H/W accelerator.

Best Regards Ulf Samuelsson

- S
- Skybuck Flying
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 5:42 AM

original

.------------------->X | | | | | | \|/ Y

rotated 90 degrees clockwise

y

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 8:56 AM

I was specifically asking how to do it in software.

I find this spamvertisement very inappropriate.

Does your employer condone this activity?

- P
- Peter Dickerson
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 9:30 AM

Well, the fact that some devices feel the need to do this in hardware is relevent. It means that either the processor does have the grunt to do the job well itself or that there is significant performance to be had from methods that are not easy possible in software (e.g. dedicated parallel hardware). If the second then it hints that some cleverness in software will be needed to do well.

Anyway, there is nothing about USENET that requires replies to you questions to be for you only. If the reply is related, and the can reasonably expect there to be readers for which the reply is relevent, then I don't have a big problem. Some of this is about motivation. If the motivation of the response was to shoehorn there product or services into the discussion rather than to be genuinely helpful then that I surely don't like. Now, Ulf has a history in this group and I feel that history has been strongly positive though somewhat focused on Atmel products (since that's what he knows best).

Peter

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 9:43 AM

I'm displaying JPEG photographs. Some users want to rotate the pictures (landscape to portrait). AFAICS, all I need is 90° either way.

Probably. And I also added a bug, because I meant uint8_t src[W*H*3]; uint8_t dest[W*H*3];

Right.

I've done this, and it is still too slow for my taste :-(

It brings the run-time from 1.6 seconds down to 1.1 seconds (for my 1644 x 1164 example).

I didn't see how to ask libjpeg to output 32-bit pixel values instead of 24-bit RGB values.

The data sheet states

SH-4 32-bit super-scalar RISC CPU o 266 MHz, 2-way set associative 16-Kbyte ICache, 32-Kbyte DCache, MMU o 5-stage pipeline, delayed branch support o floating point unit, matrix operation support o debug port, interrupt controller

Regards.

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 10:12 AM

As a matter of fact, I was told that the blitter on this platform (STx7109)

*does* support image rotation in hardware, but that the capability had not been exposed by the API. Strange, wouldn't you say?

I do understand that.

I agree that it might be appropriate to mention that some platforms provide hardware support for some operations. What I found rather spammy was the explicit mention of one of his employer's platform.

I will concede that I may have over-reacted ;-)

Regards.

- M
- Meindert Sprang
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 10:51 AM

Seriously, consider doing this in assembler. I have done some fax image processing in the past and the difference between a HLL implementation and an assemler implementation of the run length decoding of a PCX image was astonishing.

Converting it from 24 bit to 32 bit first, might even give you a performance gain, especially since your image dimensions seem to be a multiple of 4.

Meindert

- B
- Boudewijn Dijkstra
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 12:53 PM

Now that you've mentioned libjpeg: it supports lossless transformations, including 90° rotations. See transupp.h.

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(remove the obvious prefix to reply by mail)

- N
- Noob
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Thu, Apr 8, 2010 1:40 PM

Pictures are not unconditionally rotated. They are rotated when the user asks for it, which is AFTER the picture has been decoded, which might take 2-3 seconds.

It would take longer to rotate the JPEG representation + decode new picture than to rotate the bitmap of the decoded picture

- B
- Brett Davis
  
  Contact options for registered users
Vote on answer
posted
14 years ago

Sun, Apr 11, 2010 12:30 AM

The most important number is cache line size, which you missed. If your image is 1,024 lines tall, that will completely thrash the cache, resulting in 3 bytes copied per cache line load/spill.

If you copy 16x16 tiles you can get a 10x speedup.

CopyTile16(source, dest, x, y, width, height)...

You can also try 8x8 and 4x4, smaller loops can be faster due to all the args fitting in memory, and the loop getting unrolled. but the difference will be ~25% which is hardly worth the time to code.

If you do the following you can get a 2x speedup, it looks like more code, but will generate less, and the results will be pipelined correctly.

{ unsigned char *C = B+(H*j+H-i-1)*3; temp0 = A[0]; temp1 = A[1]; temp2 = A[2]; C[0] = temp0; C[1] = temp1; C[2] = temp2; A += 3; }

Do not use *C++ = *A++;

The is no need to flame the guy selling hardware, even though this is a bandwidth limited problem that will not be helped by more hardware.

Brett