Suggestions on optimizing driver code ?

Hi, We're implementing a video sensor driver for an OmniVision

6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17 kernel on an Intel XScale PXA255(200MHz). We're stuck at a data retrieval speed of approximately 440 microseconds/row. Can any one give a pointer on how we can speed up the frame grabbing process. We found out that 280 microseconds/row of that time is being taken by the incrementation of the read clock, so advice on reduction of that time would be great. The code to grab a row is below.

Thanks in advance, Yaman.

for (i=0; ibitarray[i][0] = ((tmp2 & 0x80000) >> 19) | ((tmp1 & 0x400) >> 9) | ((tmp2 & 0x20000) >> 15) | ((tmp1 & 0x4000000) >> 23) | ((tmp2 & 0x20) >> 1) | ((tmp2 & 0x4) > 24) | ((tmp1 & 0x10000000) >> 21); // skip a pixel GPCR1 = 0X4000; GPSR1 = 0X4000;

GPCR1 = 0X4000; GPSR1 = 0X4000; // read fifo data from GPIO registers tmp2 = GPLR2; tmp1 = GPLR1; tmp0 = GPLR0; ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19) | ((tmp1 & 0x400) >> 9) | ((tmp2 & 0x20000) >> 15) | ((tmp1 & 0x4000000) >> 23) | ((tmp2 & 0x20) >> 1) | ((tmp2 & 0x4) > 24) | ((tmp1 & 0x10000000) >> 21); // skip another pixel GPCR1 = 0X4000; GPSR1 = 0X4000; }

Reply to
yamanc
Loading thread data ...

...

try optimizing the invariants aout of your loops

Reply to
A. Melinte

I'm _guessing_ those are volatile variables mapped to HW registers so that those lines are not invariants that can be removed from the loop.

I could be wrong...

--
Grant Edwards                   grante             Yow!  I've been WRITING
                                  at               to SOPHIA LOREN every 45
                               visi.com            MINUTES since JANUARY 1ST!!
Reply to
Grant Edwards

Reply to
yamanc

First let's look at what bits are used in which variables:

tmp0 uses 0x40000000 tmp1 uses 0x14000400 tmp2 uses 0x000a0024

// let's assign byte pointers to the variables so we can pick out individual // bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)

unsigned char *t0 = (unsigned char *)&tmp0; unsigned char *t1 = (unsigned char *)&tmp1; unsigned char *t2 = (unsigned char *)&tmp2;

// now let's build lookup tables for the various bit patterns (this should // done at initialization time)

unsigned char lut0[65]; unsigned char lut1[21]; unsigned char lut2[5]; unsigned char lut3[10]; unsigned char lut4[37];

// initialize lut0 to deal with t0[0]

lut0[0x00] = 0x00; lut0[0x40] = 0x40;

// initialize lut1 to deal with t1[0]

lut1[0x00] = 0x00; lut1[0x04] = 0x08; lut1[0x10] = 0x80; lut1[0x14] = 0x88;

// initialize lut2 to deal with t1[2]

lut2[0x00] = 0x00; lut2[0x04] = 0x02;

// initialize lut3 to deal with t2[1]

lut3[0x00] = 0x00; lut3[0x02] = 0x04; lut3[0x08] = 0x01; lut3[0x0a] = 0x05;

// initialize lut4 to deal with t2[3]

lut4[0x00] = 0x00; lut4[0x04] = 0x10; lut4[0x20] = 0x20; lut4[0x24] = 0x30;

Now the code in the loop only has to do this:

ov6620_dev0->bitarray[i][0] = lut0[t0[0]&0x40] | lut1[t1[0]&0x14] | lut2[t1[2]&0x04] | lut3[t2[1]&0x0a] | lut4[t2[3]&0x24];

Of course, lut0 isn't really necessary since the index matches the data, so we can optimize that out:

ov6620_dev0->bitarray[i][0] = t0[0] | lut1[t1[0]&0x14] | lut2[t1[2]&0x04] | lut3[t2[1]&0x0a] | lut4[t2[3]&0x24];

That's gotta be quicker than doing all that shifting, no?!?

(please double check the values, but I think you get the idea??)

Patrick ========= For LAN/WAN Protocol Analysis, check out PacketView Pro! ========= Patrick Klos Email: snipped-for-privacy@klos.com Klos Technologies, Inc. Web:

formatting link
============================================================================

Reply to
Patrick Klos

Assuming any-endian is always wrong.

Possibly. Shifting by mutliples of 8 usually generates exact same code as indexing off of a byte pointer.

In my experience, tricks like that rarely provide noticable benefit. Modern compilers are pretty good.

--
Grant Edwards                   grante             Yow!  I OWN six pink
                                  at               HIPPOS!!
                               visi.com
Reply to
Grant Edwards

One question to ask is what is it that makes it take so long to increment the FIFO read clock? The device claims it'll do 50 MHz.

Is your hardware design cast in stone? Of course the hardware guys will always tell you it is, but if the bit assignments of GPLR2, GPLR1, and GPLR0 can be juggled to match what you want to end up with in bitarray after all that ugly masking and shifting, you'll get some cycles back. You're only pulling out a byte - can that byte's bits be grouped into one contiguous bitfield within one GPLRx register? Now you're down to one read, one shift(at most), and a type cast.

Is it possible to use a FIFO that doesn't require silly babysitting? A FIFO read should auto advance. Can the FIFO be memory mapped and hardware added to auto advance the read pointer?

Back to the software: Have the compiler dump the assembly and see how many instructions are needed to index into ov6620->bitarray[i][0]. You might consider a pointer that references bitarray[0][0] initially and is incremented within the loop.

Good luck, Dave

Reply to
Dave Littell

Grant,

I know you do a lot of posting here, and I'm sure it's appreciated. I see no need to shoot down my assumption. I was simply stating that so anyone looking at the code would understand which bytes of the 32-bit variable I was extracting. If the OP doesn't have his processor in big-endian mode, he'll know to take that into account when trying the code. At least I took the time to come up with an answer and state my assumptions.

Sometimes. And only in some compilers. And only with optimzations enabled. One can't always rely on the compiler to optimize our code - sometimes it's best to optimize it ourselves so WE KNOW what is going on.

Also, not all the shifting is done in multiples of 8 bits.

Well, I guess your experience does not match my experience. It happens...

Now who's making assumptions?? Let's let the OP try it and find out what difference he sees??

========= For LAN/WAN Protocol Analysis, check out PacketView Pro! ========= Patrick Klos Email: snipped-for-privacy@klos.com Klos Technologies, Inc. Web:

formatting link
============================================================================

Reply to
Patrick Klos

My understanding is that the PXA255 is very slow accessing I/O regs - a friend measured ~400 ns per access (using 400MHz part). So this sort of bit fiddling port bits will be slow. You want all your input bits on the same port to start... It looks like 2 us/byte min.

Peter

Reply to
Peter Dickerson

Greetings,

The lookup table method didn't really introduce a significant change, though thanks for the advice. In the new design we'll pack all the video output to the same register so that things will be faster, and probably go on without using the FIFO hardware. Dave, you have any suggestions for a better FIFO just in case we might decide to use a better one?

Cheers, Yaman

Reply to
yamanc

I'm sorry if I gave the impression that I thought the AverLogic was bad. I have no experience with that particular device. If you guys like it well enough, I'd definitely suggest that it be memory mapped with some control logic that will auto-tickle the read clock when the processor runs the read cycle. This approach will very likely yield much better performance that the GPIO-based mechanism.

You may have other hardware constraints, but staying with the GPIO-driven approach will never be as fast as you might like. Ideally, you'd be able to use a DMA-based design, but that would require hardware assets you may not have available.

Good luck, Dave

Reply to
Dave Littell

Most optimising compilers should be able to do a more than adequate job

on the code below. The array lookup code suggested might run just a fraction slower but you will be better sticking with the bitwise code.

One suggestion might be to look at the address generation code for ov6220->... but it will probably also be reasonable and an alternative scheme may not prove that much faster either.

On balance the slow times you are reporting will almost certainly be due to the I/Os. If this is impacting your overall elapsed times maybe you can set the device into a more lossy frame grab?

dmctek.googlepages.com

-----------------------

snipped-for-privacy@gmail.com wrote:

Reply to
dmctek.googlepages.com

ElectronDepot website is not affiliated with any of the manufacturers or service providers discussed here. All logos and trade names are the property of their respective owners.