Suggestions on optimizing driver code ?

Question

Hi,We're implementing a video sensor driver for an OmniVision6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17kernel on an Intel XScale PXA255(200MHz). We're stuck at a dataretrieval speed of approximately 440 microseconds/row. Can any onegive a pointer on how we can speed up the frame grabbing process. Wefound out that 280 microseconds/row of that time is being taken by theincrementation of the read clock, so advice on reduction of that timewould be great. The code to grab a row is below.Thanks in advance,Yaman. for (i=0; ibitarray[i][0] = ((tmp2 & 0x80000) >> 19)     | ((tmp1 & 0x400) >> 9)     | ((tmp2 & 0x20000) >> 15)     | ((tmp1 & 0x4000000) >> 23)     | ((tmp2 & 0x20) >> 1)     | ((tmp2 & 0x4) > 24)     | ((tmp1 & 0x10000000) >> 21);   // skip a pixel   GPCR1 = 0X4000;   GPSR1 = 0X4000;   GPCR1 = 0X4000;   GPSR1 = 0X4000;   // read fifo data from GPIO registers   tmp2 = GPLR2;   tmp1 = GPLR1;   tmp0 = GPLR0;   ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)     | ((tmp1 & 0x400) >> 9)     | ((tmp2 & 0x20000) >> 15)     | ((tmp1 & 0x4000000) >> 23)     | ((tmp2 & 0x20) >> 1)     | ((tmp2 & 0x4) > 24)     | ((tmp1 & 0x10000000) >> 21);   // skip another pixel   GPCR1 = 0X4000;   GPSR1 = 0X4000; }

A. Melinte · Accepted Answer

...try optimizing the invariants aout of your loops

Grant Edwards · Answer

I'm _guessing_ those are volatile variables mapped to HWregisters so that those lines are not invariants that can beremoved from the loop.I could be wrong...-- Grant Edwards                   grante             Yow!  I've been WRITING                                  at               to SOPHIA LOREN every 45                                           MINUTES since JANUARY 1ST!!

yamanc · Answer

You're definately right on this one. We have to reset the read clock tomove the read po

Patrick Klos · Answer

First let's look at what bits are used in which variables:    tmp0 uses 0x40000000    tmp1 uses 0x14000400    tmp2 uses 0x000a0024// let's assign byte pointers to the variables so we can pick out individual// bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)    unsigned char *t0 = (unsigned char *)&tmp0;    unsigned char *t1 = (unsigned char *)&tmp1;    unsigned char *t2 = (unsigned char *)&tmp2;// now let's build lookup tables for the various bit patterns (this should// done at initialization time)    unsigned char lut0[65];    unsigned char lut1[21];    unsigned char lut2[5];    unsigned char lut3[10];    unsigned char lut4[37];// initialize lut0 to deal with t0[0]    lut0[0x00] = 0x00;    lut0[0x40] = 0x40;// initialize lut1 to deal with t1[0]    lut1[0x00] = 0x00;    lut1[0x04] = 0x08;    lut1[0x10] = 0x80;    lut1[0x14] = 0x88;// initialize lut2 to deal with t1[2]    lut2[0x00] = 0x00;    lut2[0x04] = 0x02;// initialize lut3 to deal with t2[1]    lut3[0x00] =...

Grant Edwards · Answer

Assuming any-endian is always wrong.Possibly. Shifting by mutliples of 8 usually generates exactsame code as indexing off of a byte pointer.In my experience, tricks like that rarely provide noticablebenefit.  Modern compilers are pretty good.-- Grant Edwards                   grante             Yow!  I OWN six pink                                  at               HIPPOS!!

Dave Littell · Answer

One question to ask is what is it that makes it take so long to increment the FIFO read clock? The device claims it'll do 50 MHz.

Is your hardware design cast in stone? Of course the hardware guys will always tell you it is, but if the bit assignments of GPLR2, GPLR1, and GPLR0 can be juggled to match what you want to end up with in bitarray after all that ugly masking and shifting, you'll get some cycles back. You're only pulling out a byte - can that byte's bits be grouped into one contiguous bitfield within one GPLRx register? Now you're down to one read, one shift(at most), and a type cast.

Is it possible to use a FIFO that doesn't require silly babysitting? A FIFO read should auto advance. Can the FIFO be memory mapped and hardware added to auto advance the read pointer?

Back to the software: Have the compiler dump the assembly and see how many instructions are needed to index into ov6620->bitarray[i][0]. You might consider a pointer that references bitarray[0][0] initially and is incremented within the loop.

Good luck, Dave

Patrick Klos · Answer

Grant,I know you do a lot of posting here, and I'm sure it's appreciated.  I seeno need to shoot down my assumption.  I was simply stating that so anyonelooking at the code would understand which bytes of the 32-bit variable Iwas extracting.  If the OP doesn't have his processor in big-endian mode,he'll know to take that into account when trying the code.  At least I took the time to come up with an answer and state my assumptions.Sometimes.  And only in some compilers.  And only with optimzationsenabled.  One can't always rely on the compiler to optimize our code -sometimes it's best to optimize it ourselves so WE KNOW what is goingon.Also, not all the shifting is done in multiples of 8 bits.Well, I guess your experience does not match my experience.  It happens...Now who's making assumptions??  Let's let the OP try it and find out what difference he sees??========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========    Patrick Klos                           Email:    ...

Peter Dickerson · Answer

My understanding is that the PXA255 is very slow accessing I/O regs - a friend measured ~400 ns per access (using 400MHz part). So this sort of bit fiddling port bits will be slow. You want all your input bits on the same port to start... It looks like 2 us/byte min.

Peter

yamanc · Answer

Greetings,The lookup table method didn't really introduce a significant change,though thanks for the advice. In the new design we'll pack all thevideo output to the same register so that things will be faster, andprobably go on without using the FIFO hardware. Dave, you have anysuggestions for a better FIFO just in case we might decide to use abetter one?Cheers,Yaman

Dave Littell · Answer

I'm sorry if I gave the impression that I thought the AverLogic was bad. I have no experience with that particular device. If you guys like it well enough, I'd definitely suggest that it be memory mapped with some control logic that will auto-tickle the read clock when the processor runs the read cycle. This approach will very likely yield much better performance that the GPIO-based mechanism.

You may have other hardware constraints, but staying with the GPIO-driven approach will never be as fast as you might like. Ideally, you'd be able to use a DMA-based design, but that would require hardware assets you may not have available.

Good luck, Dave

dmctek.googlepages.com · Answer

Most optimising compilers should be able to do a more than adequate job

on the code below. The array lookup code suggested might run just a fraction slower but you will be better sticking with the bitwise code.

One suggestion might be to look at the address generation code for ov6220->... but it will probably also be reasonable and an alternative scheme may not prove that much faster either.

On balance the slow times you are reporting will almost certainly be due to the I/Os. If this is impacting your overall elapsed times maybe you can set the device into a more lossy frame grab?

dmctek.googlepages.com

-----------------------

snipped-for-privacy@gmail.com wrote:

Suggestions on optimizing driver code ?

Join the Discussion

Didn't find your answer?