Suggestions on optimizing driver code ?

Do you have a question? Post it now! No Registration Necessary

Translate This Thread From English to

Threaded View
Hi,
We're implementing a video sensor driver for an OmniVision
6620(connected via AverLogic AL422b fifo memory) that runs a 2.6.17
kernel on an Intel XScale PXA255(200MHz). We're stuck at a data
retrieval speed of approximately 440 microseconds/row. Can any one
give a pointer on how we can speed up the frame grabbing process. We
found out that 280 microseconds/row of that time is being taken by the
incrementation of the read clock, so advice on reduction of that time
would be great. The code to grab a row is below.

Thanks in advance,
Yaman.

 for (i=0; i<176; i++) {
   // increment read clock
   GPCR1 = 0X4000;
   GPSR1 = 0X4000;
   // read fifo data from GPIO registers
   tmp2 = GPLR2;
   tmp1 = GPLR1;
   tmp0 = GPLR0;
   ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
     | ((tmp1 & 0x400) >> 9)
     | ((tmp2 & 0x20000) >> 15)
     | ((tmp1 & 0x4000000) >> 23)
     | ((tmp2 & 0x20) >> 1)
     | ((tmp2 & 0x4) << 3)
     | ((tmp0 & 0x40000000) >> 24)
     | ((tmp1 & 0x10000000) >> 21);
   // skip a pixel
   GPCR1 = 0X4000;
   GPSR1 = 0X4000;

   GPCR1 = 0X4000;
   GPSR1 = 0X4000;
   // read fifo data from GPIO registers
   tmp2 = GPLR2;
   tmp1 = GPLR1;
   tmp0 = GPLR0;
   ov6620_dev->bitarray[i][0] = ((tmp2 & 0x80000) >> 19)
     | ((tmp1 & 0x400) >> 9)
     | ((tmp2 & 0x20000) >> 15)
     | ((tmp1 & 0x4000000) >> 23)
     | ((tmp2 & 0x20) >> 1)
     | ((tmp2 & 0x4) << 3)
     | ((tmp0 & 0x40000000) >> 24)
     | ((tmp1 & 0x10000000) >> 21);
   // skip another pixel
   GPCR1 = 0X4000;
   GPSR1 = 0X4000;
 }


Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it
...

try optimizing the invariants aout of your loops




Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

I'm _guessing_ those are volatile variables mapped to HW
registers so that those lines are not invariants that can be
removed from the loop.

I could be wrong...

--
Grant Edwards                   grante             Yow!  I've been WRITING
                                  at               to SOPHIA LOREN every 45
We've slightly trimmed the long signature. Click to see the full one.
Re: Suggestions on optimizing driver code ?
You're definately right on this one. We have to reset the read clock to
move the read pointer in the fifo.
Quoted text here. Click to load it


Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

First let's look at what bits are used in which variables:

    tmp0 uses 0x40000000
    tmp1 uses 0x14000400
    tmp2 uses 0x000a0024

// let's assign byte pointers to the variables so we can pick out individual
// bytes as needed (I'm assuming big-endian - correct me if I'm wrong?)

    unsigned char *t0 = (unsigned char *)&tmp0;
    unsigned char *t1 = (unsigned char *)&tmp1;
    unsigned char *t2 = (unsigned char *)&tmp2;

// now let's build lookup tables for the various bit patterns (this should
// done at initialization time)

    unsigned char lut0[65];
    unsigned char lut1[21];
    unsigned char lut2[5];
    unsigned char lut3[10];
    unsigned char lut4[37];

// initialize lut0 to deal with t0[0]

    lut0[0x00] = 0x00;
    lut0[0x40] = 0x40;

// initialize lut1 to deal with t1[0]

    lut1[0x00] = 0x00;
    lut1[0x04] = 0x08;
    lut1[0x10] = 0x80;
    lut1[0x14] = 0x88;

// initialize lut2 to deal with t1[2]

    lut2[0x00] = 0x00;
    lut2[0x04] = 0x02;

// initialize lut3 to deal with t2[1]

    lut3[0x00] = 0x00;
    lut3[0x02] = 0x04;
    lut3[0x08] = 0x01;
    lut3[0x0a] = 0x05;

// initialize lut4 to deal with t2[3]

    lut4[0x00] = 0x00;
    lut4[0x04] = 0x10;
    lut4[0x20] = 0x20;
    lut4[0x24] = 0x30;

Now the code in the loop only has to do this:

    ov6620_dev0->bitarray[i][0] =
        lut0[t0[0]&0x40] |
                lut1[t1[0]&0x14] |
                lut2[t1[2]&0x04] |
                lut3[t2[1]&0x0a] |
                lut4[t2[3]&0x24];

Of course, lut0 isn't really necessary since the index matches the data,
so we can optimize that out:

    ov6620_dev0->bitarray[i][0] =
        t0[0] |
                lut1[t1[0]&0x14] |
                lut2[t1[2]&0x04] |
                lut3[t2[1]&0x0a] |
                lut4[t2[3]&0x24];

That's gotta be quicker than doing all that shifting, no?!?

(please double check the values, but I think you get the idea??)

Patrick
========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
    Patrick Klos                           Email: snipped-for-privacy@klos.com
    Klos Technologies, Inc.                Web:   http://www.klos.com /
============================================================================

Re: Suggestions on optimizing driver code ?

Quoted text here. Click to load it

Assuming any-endian is always wrong.

Quoted text here. Click to load it

Possibly. Shifting by mutliples of 8 usually generates exact
same code as indexing off of a byte pointer.

In my experience, tricks like that rarely provide noticable
benefit.  Modern compilers are pretty good.

--
Grant Edwards                   grante             Yow!  I OWN six pink
                                  at               HIPPOS!!
We've slightly trimmed the long signature. Click to see the full one.
Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

Grant,

I know you do a lot of posting here, and I'm sure it's appreciated.  I see
no need to shoot down my assumption.  I was simply stating that so anyone
looking at the code would understand which bytes of the 32-bit variable I
was extracting.  If the OP doesn't have his processor in big-endian mode,
he'll know to take that into account when trying the code.  At least I took
the time to come up with an answer and state my assumptions.

Quoted text here. Click to load it

Sometimes.  And only in some compilers.  And only with optimzations
enabled.  One can't always rely on the compiler to optimize our code -
sometimes it's best to optimize it ourselves so WE KNOW what is going
on.

Also, not all the shifting is done in multiples of 8 bits.

Quoted text here. Click to load it

Well, I guess your experience does not match my experience.  It happens...

Quoted text here. Click to load it

Now who's making assumptions??  Let's let the OP try it and find out what
difference he sees??

========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
    Patrick Klos                           Email: snipped-for-privacy@klos.com
    Klos Technologies, Inc.                Web:   http://www.klos.com /
============================================================================

Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

My understanding is that the PXA255 is very slow accessing I/O regs - a
friend measured ~400 ns per access (using 400MHz part). So this sort of bit
fiddling port bits will be slow. You want all your input bits on the same
port to start... It looks like 2 us/byte min.

Peter



Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

One question to ask is what is it that makes it take so long to
increment the FIFO read clock?  The device claims it'll do 50 MHz.

Is your hardware design cast in stone?  Of course the hardware guys will
always tell you it is, but if the bit assignments of GPLR2, GPLR1, and
GPLR0 can be juggled to match what you want to end up with in bitarray
after all that ugly masking and shifting, you'll get some cycles back.
You're only pulling out a byte - can that byte's bits be grouped into
one contiguous bitfield within one GPLRx register?  Now you're down to
one read, one shift(at most), and a type cast.

Is it possible to use a FIFO that doesn't require silly babysitting?  A
FIFO read should auto advance.  Can the FIFO be memory mapped and
hardware added to auto advance the read pointer?

Back to the software: Have the compiler dump the assembly and see how
many instructions are needed to index into ov6620->bitarray[i][0].  You
might consider a pointer that references bitarray[0][0] initially and is
incremented within the loop.


Good luck,
Dave

Quoted text here. Click to load it

Re: Suggestions on optimizing driver code ?
Greetings,

The lookup table method didn't really introduce a significant change,
though thanks for the advice. In the new design we'll pack all the
video output to the same register so that things will be faster, and
probably go on without using the FIFO hardware. Dave, you have any
suggestions for a better FIFO just in case we might decide to use a
better one?

Cheers,
Yaman


Re: Suggestions on optimizing driver code ?
Quoted text here. Click to load it

I'm sorry if I gave the impression that I thought the AverLogic was bad.
I have no experience with that particular device.  If you guys like it
well enough, I'd definitely suggest that it be memory mapped with some
control logic that will auto-tickle the read clock when the processor
runs the read cycle.  This approach will very likely yield much better
performance that the GPIO-based mechanism.

You may have other hardware constraints, but staying with the
GPIO-driven approach will never be as fast as you might like.  Ideally,
you'd be able to use a DMA-based design, but that would require hardware
assets you may not have available.


Good luck,
Dave

Re: Suggestions on optimizing driver code ?


Most optimising compilers should be able to do a more than adequate job

on the code below. The array lookup code suggested might run just a
fraction slower but you will be better sticking with the bitwise code.

One suggestion might be to look at the address generation code for
ov6220->... but it will probably also be reasonable and an alternative
scheme may not prove that much faster either.

On balance the slow times you are reporting will almost certainly be
due to
the I/Os. If this is impacting your overall elapsed times maybe you can
set the device into a more lossy frame grab?


dmctek.googlepages.com

-----------------------


snipped-for-privacy@gmail.com wrote:

Quoted text here. Click to load it


Site Timeline