Pulse counter using LPC1768 proving to be very challenging

A decent warning for those not fluent in assembly. Not for those of us who know it cold.

C code is almost _never_ as good as hand-written assembly code, and unless you are doing math expressions (which is NOT a good idea in an interrupt routine) it is not much different from writing in C. The same hardware needs dealing with, and the compilers usually have "constraints" that an assembly programmer does not have, at all.

We've been down this C vs assembly path a million times here. Some points are good, but I hate broad-brush stuff. Look up the discussion we had a few years ago on a GCD algorithm. To this day, not even the best x86 compilers can come close, even when ALL of the C constraints must be fully observed by the hand assembly coder. Compilers cannot do topology inversion and they handle status bits somewhat poorly. There are many other issues that may relate to interrupts as well, where there is no syntax in C for certain semantics.

How all this applies in the ARM case I'll leave to folks better informed than me. But I really don't like it when I see "you are unlikely to be able to write good assembly" and "should be very close to ... optimal assembly." Most particularly, when discussing interrupt routines.

I will leave it there.

Jon

Reply to
Jon Kirwan

You have not said what is driving this unusual spec, nor the repeat rate, and SW alone may not be enough to reject all noise types, i.e. two 1.8 us pulses close together could pass.

So an external filter, either Schmitt + RC, or a simple state-engine in a SPLD/CPLD or HC163 counter may be needed.

You need to minimise the SW & interrupt calls by helping in HW, e.g. capturing a value on each edge but only interrupting on the trailing edge, then checking the delta-time.

Some of the new NXPs have a capture-clears-timer feature, which would be very useful on this type of problem.
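For illustration, the delta-time check described above might look like the sketch below. This is a hedged, host-testable sketch, not LPC1768 register code: `t_rise` and `t_fall` stand for the two timer capture values, and `min_ticks` is the 2 us threshold expressed in timer ticks (200 ticks at 100 MHz) -- all names are mine, not from the thread.

```c
#include <stdint.h>

/* Sketch of the trailing-edge qualification: given the timer captures
 * for the rising and falling edges of a pulse, count it only if it was
 * at least min_ticks wide (e.g. 200 ticks = 2 us at 100 MHz).
 * Unsigned subtraction keeps the delta correct across timer wrap. */
static inline int qualify_pulse(uint32_t t_rise, uint32_t t_fall,
                                uint32_t min_ticks)
{
    return (uint32_t)(t_fall - t_rise) >= min_ticks;
}
```

In the real ISR this check would run in the trailing-edge interrupt, with `t_rise` latched earlier by the rising-edge capture channel.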

Reply to
Jim Granville

There are still some inefficient compilers around, but they are mainly for the smaller processors that are hard to work with. On something like the Cortex, it's easy to generate reasonable code for short C functions. The big differences are for things like automatic use of vector or DSP functions, smarter loop unrolling, interprocedural optimisations, etc. - but they should not make a difference in a case like this.

Yes, there is always someone that thinks the UART receive interrupt routine is the best place to interpret incoming telegrams, act on them, and build up a reply...

Reply to
David Brown

Thanks for your valuable inputs. We tried toggling an IO pin inside while(1) and saw that it only generates pulses of 150 ns width. So is there something wrong with the clock configuration?

We use the LPCXpresso compiler. I'll try to post the code here for the clock init a little later.


Reply to
navman

It /sounds/ likely that there is something wrong, but I am not sure what you should expect here. Certainly for some ARM devices IO pin access is surprisingly slow. I don't know this chip, so I'll let others give more definite answers.

LPCXpresso uses gcc, which will produce solid and efficient code. But that depends on the compiler flags - if optimisation is turned off, you will get very big and slow object code.

Reply to
David Brown

Take a look at the generated assembly to see how many actual instructions are executed to implement the toggle feature. You should be able to work back from there to the effective instruction cycle time.
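The arithmetic for working back from pulse width to cycle count can be captured in a one-liner; this helper is my own illustration (the 150 ns / 100 MHz numbers are the ones from this thread).

```c
#include <stdint.h>

/* Work back from a measured pulse width to the effective number of CPU
 * cycles it represents: cycles = width_ns * f_MHz / 1000.
 * Pure arithmetic, so it can be checked on the host. */
static inline uint32_t pulse_width_cycles(uint32_t width_ns, uint32_t clk_mhz)
{
    return (width_ns * clk_mhz) / 1000u;
}
```

A 150 ns pulse at 100 MHz comes out to 15 cycles, which can then be compared against the instruction count in the disassembly.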

--
Rich Webb     Norfolk, VA
Reply to
Rich Webb

The LPC series use a special fast gpio interface, which is actually pretty good compared to older APB based GPIO interfaces.

I don't have a LPC17xx, but I just tried it on a LPC2478 which has a similar FGPIO interface (but an ARM7 core instead of Cortex-M3), doing:

while (1) {
    FIO0SET = BITMASK;
    FIO0CLR = BITMASK;
}

This results in pulses of 2 cycles high, and 5 cycles low.

In assembly, this loop is implemented as 2 stores and a branch.

Toggling the same pin with:

while (1) {
    FIO0PIN ^= BITMASK;
}

results in 9 cycles high, 9 cycles low for a load, XOR, store, and branch. All of this using gcc -O2.

15 cycles for the pulse width seems a bit high in comparison.
Reply to
Arlet Ottens

navman wrote 2011-06-07 15:16:

I know how I would implement this on an Atmel AT32UC3C.

You use the pulse input as a gate to the clock of a counter (CNT0). CNT0 will count up, while the pulse is active. When the pulse ends, the counter should be reset.

A compare register is used to determine if the signal is > 2 us. If CNT0 matches the compare register, an "event" is triggered. The event is used to clock another counter CNT1.
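The gated-counter scheme above can be simulated in plain C to see how the two counters interact. This is a host-side simulation of the logic only -- not AT32UC3C register code -- and all names are mine: `samples[]` is the pulse input sampled once per counter clock, `compare` is the compare-register value.

```c
#include <stdint.h>

/* Simulation of the gated-counter scheme: CNT0 counts timer ticks while
 * the pulse input is high and resets when it goes low; when CNT0 reaches
 * the compare value, an "event" clocks CNT1 exactly once per pulse.
 * Returns the number of pulses at least `compare` ticks wide. */
static uint32_t count_long_pulses(const uint8_t *samples, uint32_t n,
                                  uint32_t compare)
{
    uint32_t cnt0 = 0, cnt1 = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (samples[i]) {
            if (++cnt0 == compare)   /* event fires once, on the match */
                cnt1++;
        } else {
            cnt0 = 0;                /* pulse ended: reset gated counter */
        }
    }
    return cnt1;
}
```

The attraction of doing this in timer hardware, as Ulf describes, is that the CPU never sees the short pulses at all.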

Best Regards Ulf Samuelsson

Reply to
Ulf Samuelsson


ARM chips have surprisingly slow I/O. It's better than it was with the earlier devices like the LPC2106, but it's still not very good.

Leon

Reply to
Leon

That was particularly true where setting or clearing a bit required a read-modify-write sequence. Many of the Cortex M3 systems I'm working with now have separate bit set and bit clear registers which reduces the instruction count.

I just looked at a code sequence that toggles a bit to clock data from a FIFO to an LCD display. It is a partially-unrolled loop with a sequence of 16-bit bit-clear and bit-set instructions.

In C it is a sequence of:

GPIOB->BRR  = FIFO_RD_BIT | LCD_WR_BIT;
GPIOB->BSRR = FIFO_RD_BIT | LCD_WR_BIT;

GPIOB->BRR  = FIFO_RD_BIT | LCD_WR_BIT;
GPIOB->BSRR = FIFO_RD_BIT | LCD_WR_BIT;

. . .

The Thumb code generated is

STR R3, [R0]
STR R3, [R0, #4]

STR R3, [R0]
STR R3, [R0, #4]

R0 is loaded with the port base address and R3 is loaded with the bit pattern before the start of the loop. Each instruction is a single 16-bit word.

I don't think I'm going to beat that with any assembly-language optimizations. ;-) The code was generated with the IAR compiler and with optimizations level set to high.

When running on an STM32F103 with the main clock set to 64 MHz, the bit toggles at 16 MHz. This is consistent with the fact that the local peripheral bus for the general purpose IO bits is running at 1/2 the main clock rate, since that bus is rated for a maximum clock rate of 36 MHz. Updating the whole QVGA display with 2 bytes/pixel (in RGB565 format) takes about 14 ms.

It certainly helps that the engineer who designed the board put all the FIFO and LCD clocking bits on pins from the same peripheral port. If they were on different ports, it would take a separate instruction for each clock bit---doubling the number of instructions.

The Cortex M3 also implements bit banding, where each bit in a peripheral register or RAM word is assigned a memory location of its own. That means a bit test operation can be reduced to:

bitstatus = UART_RCV_StatusBitBand; // returns either 0x01 or 0x00

instead of

bitstatus = UART_Status_Register & RCV_Status_Mask; // returns either the mask bit or zero

I haven't yet had to optimize an interrupt handler to the degree that would benefit from this capability, but it could cut some instructions from a handler that required you to figure out which of a number of possible bits caused an interrupt. Writing the code to use the bit banding would require that you pre-calculate the proper bit band address for each port bit that you want to test.
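The pre-calculation Mark mentions is a fixed formula from the ARMv7-M memory map: each bit of a register in the peripheral region (the first 1 MB from 0x40000000) has a word-sized alias at 0x42000000. The helper name below is mine; the base addresses are the architectural constants.

```c
#include <stdint.h>

/* Cortex-M3 bit-band alias for the peripheral region: a bit at
 * (reg_addr, bit) is aliased to the word at
 *   0x42000000 + (reg_addr - 0x40000000) * 32 + bit * 4.
 * Only the first 1 MB of the peripheral region is bit-banded. */
#define PERIPH_BASE    0x40000000u
#define PERIPH_BB_BASE 0x42000000u

static inline uint32_t periph_bitband_addr(uint32_t reg_addr, uint32_t bit)
{
    return PERIPH_BB_BASE + ((reg_addr - PERIPH_BASE) * 32u) + (bit * 4u);
}
```

Reading `*(volatile uint32_t *)periph_bitband_addr(reg, bit)` then yields 0 or 1 directly, as in the UART status example above (the actual register address and bit number come from your part's memory map).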

Mark Borgerson

Reply to
Mark Borgerson
[much elided]

This is where having someone comfortable in software and hardware really pays off. A pure hardware dweeb would connect things "wherever" without concern for how they are going to be used/accessed. E.g., "make the schematic look pretty" (signal names in numerical order -- even if REVERSE numerical order would have made more sense!) or "make the layout easy".

I recall the sound output (CVSD) on some old arcade hardware required the processor to shift the data byte (in an accumulator) and write *one* bit of it into the CVSD, generate a clock, lather, rinse, repeat. As a result, the quality of the speech generated on those machines was piss poor -- with the processor spending 100.0% of its time doing this!

A bit more forethought on the part of the hardware designer would have made the software easier *and* more capable. I.e., write a routine to move data and see how clumsy it REALLY is!

Reply to
D Yuniskis

Just to update my findings, the LPC1768 **is** running at 100 MHz as confirmed by the CLKOUT pin. I'll have to check the disassembly and see where the problem is. But I'm still very much surprised to see 150 ns pulses to toggle a pin. An 8-bit AVR with a 16 MHz clock can do about the same, and maybe faster!


Reply to
navman

There's something fishy in the interrupt code.

I'm running 200 kHz data capture (10 bit SPI A/D) and a 12800 bits/s software UART simultaneously on a 50 MHz Cortex-M3 (TI/Stellaris LM3S818). There is still plenty of processor time left for other chores.

--

Tauno Voipio
Reply to
Tauno Voipio

This definitely sounds like a code generation issue. At 10 ns per instruction, I would expect you to be able to toggle a bit in 20 ns. The GPIO ports go directly to the CPU, according to the block diagram in the LPC1768 data sheet, so I wouldn't expect an AHB slowdown.

160 ns is about what I would expect to get into an interrupt service routine or a function call that saved a few registers.

Mark Borgerson

Reply to
Mark Borgerson

For a datapoint, I dragged out an old devboard & tossed a few lines of test code into a scratch project.

Running at 96 MHz on a "Blueboard LPC1768-H" devboard with Rowley's CrossWorks set for "Flash Release" mode and -O1 optimizations, the output pin toggles at just over 20 ns for an entire period (one uppie, one downie) (nothing fancy; just a series of FIO0SETs and CLRs wrapped in a (shudder) goto). Unoptimized ("Flash Debug" mode) the period is about 94 ns overall.

So it *can* do it. The issue is probably just finding that one line in the user's manual or that one register bit that has been overlooked. Bloody processors are so damned *literal* sometimes...

--
Rich Webb     Norfolk, VA
Reply to
Rich Webb


Some older LPCs had "slow" GPIO and "fast" GPIO methods. Might this be the case for your 1768?

Reply to
cassiope

Not quite. The interrupt latency of the Cortex-M3 is only 12 cycles in normal circumstances, or 120 ns when running at 100 MHz, using fast memory.

The fact that the ISR takes 6-8 us in the OP's case, doesn't mean there's a problem with the interrupt mechanism itself.

Reply to
Arlet Ottens

This shows the limits of interrupt processing. The GA144, with asynchronous waits for input change (up-going and down-going), should be able to switch between the two in a matter of nanoseconds. (Burning one processor for the input and a couple to do the processing, such as reading out a timer when signalled by the input processor.)

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
Reply to
Albert van der Horst

With decent handling of interrupt levels and UART at the lowest priority, this may well be the cleanest design ...

Groetjes Albert

Reply to
Albert van der Horst


I see in the .pdf at

formatting link

that the LPC18xx series has a Timer State engine, which could be ideal for this type of pre-qualification.

Of course, the LPC18xx seems to only come in big packages, so it might not be an easy shift to make. Depends how important this count+filter is?

Reply to
malcolm
