Hello,
I imagine the following problem has been efficiently solved over a million times in the past.
I need to rotate a picture clockwise 90 degrees.
Conceptually, my picture is represented by a matrix (W columns, H rows) of pixels, with each pixel represented by 3 octets.
pixel original_picture[W][H]; pixel rotated_picture[H][W];
At the implementation level, I'm dealing with 1D integer arrays.
int src[W*H*3]; int dest[W*H*3];
Conceptually, the rotation operation comes down to
rotated_picture[j][H-1-i] = original_picture[i][j]
My actual code is
for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) memcpy(dest+(H*j+H-1-i)*3, src+(W*i+j)*3, 3);
Consider as an example, W = 1644 H = 1164 size = 5.74 MB
On my target platform, the rotation takes 1.65 seconds.
I'm trying to get this down under half a second.
I'm using the GNU tool chain. For some weird reason, we compile everything -O0. The first thing I'll try is crank gcc's optimization level.
I'm hoping gcc can perform some strength reduction, as the index calculation seems to be taking a non-negligible fraction of the total run-time.
Changing the loop to
for (i = 0; i < H; ++i) for (j = 0; j < W; ++j) { memcpy(dest+(H*j+H-i-1)*3, src, 3); src += 3; }
brings a 5% improvement.
I thought changing the memcpy call to 3 explicit assignments might help, but it actually slowed things down.
Perhaps I need to perform some tiling magic... I don't think gcc performs automatic tiling?
Comments and insight would be greatly appreciated.
Regards.