Blue Chip Technology + MagnumX?

Has anyone got any experience of Blue Chip Technology, and especially their MagnumX single board computers?

We've been investigating the ColdFire MCF5475EVB for some quite simple 5-channel GPIO capture - but at quite high data rates. We've got the 266MHz ColdFire processor to do what we want, but it just isn't quite fast enough - it's missing some of the state transitions on the channels. We're sampling at around 2 to 2.2MHz and we need more like 3.3MHz.

We were wondering whether a faster (2x?) processor would be more likely to do the job. The 733MHz MagnumX board sounds like it might, and it has 16 channels of GPIO (we only need 5 inputs at present, and that's unlikely to increase significantly). I'm aware it means moving from the ColdFire/M68K processor to an x86 processor, but unless there's a major difference in per-cycle processing, that's not a problem.

Our app is quite simple - loop until a counter expires, and on each iteration store the 1-byte pin state register and a 32-bit counter value. We can currently sample up to 60MB of data.
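
For concreteness, the loop is essentially this (a simplified sketch - the register names are placeholders, not the real MCF5475 definitions from the SDK headers):

#include <stdint.h>

/* Placeholder addresses - the real ones come from the
   Freescale/LogicPD header files for the MCF5475. */
#define GPIO_PIN_STATE  (*(volatile uint8_t  *)0x40000000)
#define SLICE_TIMER_CNT (*(volatile uint32_t *)0x40000004)

void capture(uint8_t *pins, uint32_t *times, uint32_t nsamples)
{
    uint32_t i;
    for (i = 0; i < nsamples; i++) {
        pins[i]  = GPIO_PIN_STATE;   /* 8-bit pin state register    */
        times[i] = SLICE_TIMER_CNT;  /* 32-bit free-running counter */
    }
}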

So - any experience of the MagnumX would be good - especially with respect to the need for an OS (currently we run no OS, so no overhead on the ColdFire - just a simple ELF application). DOS would not be suitable as we'd need to (easily) be able to access 60MB+ of RAM.

Any thoughts?

Thanks

David

Reply to
David Hearn

Can you not use DMA instead? And/Or perhaps an external FIFO.

It sounds like this application depends more on the achievable GPIO sampling rate than on the CPU core clock speed.

--

John Devereux
Reply to
John Devereux

Personally, I have no idea. I've never used DMA before (I'm traditionally a desktop app guy!) - so I'm not sure where I would start, or what I'd need to do. Essentially, we just want to move an 8-bit GPIO register containing the pin status and a 32-bit register containing a counter value into RAM as fast as possible - at around 3.3MHz (a sample every 300ns).

Yes, I think this also affects the speed of capture - however, we found the following (approximate, from-memory) benchmark figures:

Sample 8-bit GPIO register, copy into RAM: 2.2MHz
Sample 32-bit slice timer register, copy into RAM: 2.15MHz
Sample 8-bit GPIO + 32-bit slice timer register, copy into RAM: 1.8MHz
Sample nothing (and store nothing) but otherwise same code: 4MHz

This to me suggests that the bottleneck is either memory or reading the registers.

We've also benchmarked how fast we can toggle the GPIO when configured for output (rather than input), and I think it was a 50ns level state (so the pin output toggled every 50ns). The fastest we need to sample at is around 300ns. So unless input capture (via polling) is 6 times slower than output, I can't see that the GPIO speed is currently a problem.

Of course, when considering another architecture, we need to ensure that the GPIO speed is considered.

Thanks

David

Reply to
David Hearn

I am wondering what the hell you are using the processor for!

If you want timed capture then a proper design has hardware to *guarantee* the timing intervals of the capture (the host processor may do no more than set a register to control the timing). Feed the data into a deep FIFO and DMA/Bus Master the results into memory on the host processor.

You might even get the whole capture process into a large CPLD or FPGA with guaranteed timing; if anything, much faster rates would then be possible.

As it stands, your timing is software timing, dependent on system activity.

Using software where dedicated hardware should be used is the problem - dedicated hardware that is NOT a processor.

How are you guaranteeing the timing of capture? Especially that each sample is the same time apart?

If the memory is some form of DRAM, some refresh circuitry may kick in every now and then, messing up the timing. Let alone any other device interrupting on the board - an SBC has lots of peripherals.

Work out what your system needs to do, what needs to be done in what time, and how to guarantee that, before thinking about what processor to use.

Sounds like this has been designed the wrong way round, i.e. fitting a processor to the problem before understanding the problem.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
              GNU H8 & mailing list info
             For those web sites you hate
Reply to
Paul Carpenter


Paul Carpenter beat me to most of the points I would make. I would just wonder whether you actually need the counter - if the samples are being acquired reliably then perhaps the sample number can give you the time?

Anyway, DMA uses special hardware facilities to move data around, rather than doing it in software. Look it up in the datasheet for your ColdFire chip. You should be able to use the output of a timer to trigger a DMA transfer from a PIO port to memory, at a programmed rate.
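
In outline it would look something like this - a rough sketch only; the register names below are invented, so check the MCF5475 reference manual for the real DMA and timer programming model:

#include <stdint.h>

/* Hypothetical registers - NOT the real MCF5475 definitions.  The
   pattern: point a DMA channel at the GPIO port, give it a buffer
   and a length, and let a timer pace each transfer. */
#define DMA_SRC    (*(volatile uint32_t *)0x40000100) /* source address */
#define DMA_DST    (*(volatile uint32_t *)0x40000104) /* dest address   */
#define DMA_LEN    (*(volatile uint32_t *)0x40000108) /* transfer count */
#define DMA_CTRL   (*(volatile uint32_t *)0x4000010C) /* start/mode     */
#define TMR_PERIOD (*(volatile uint32_t *)0x40000200) /* trigger rate   */

void start_capture(uint8_t *buf, uint32_t nsamples)
{
    TMR_PERIOD = 40;           /* 133MHz / 40 = ~3.3MHz trigger rate    */
    DMA_SRC    = 0x40000000;   /* GPIO pin-state register (placeholder) */
    DMA_DST    = (uint32_t)buf;
    DMA_LEN    = nsamples;     /* one byte moved per timer trigger      */
    DMA_CTRL   = 1;            /* enable - hardware takes over here     */
}

Once started, the CPU does nothing per sample; it just waits for (or is interrupted by) the transfer-complete condition.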

Have you turned on the cache?

--

John Devereux
Reply to
John Devereux


Yes, if the samples are being taken at guaranteed periodic intervals then use of the counter is not needed.

Originally, to save RAM (we 'only' have 64MB on board), we only logged the pin state if it had changed. But this added a little processing to the loop (making it slower), and also required the counter to get timing values, as the time for each iteration depended on whether the pin state had changed: when the state was the same the loop was shorter; when it had changed it was longer.

To try and speed things up (i.e. reduce memory accesses and conditionals), we moved to just sampling the pin state without filtering out non-changes - making the code path in each iteration of the loop identical. In theory this gives the same sample interval every time, once the 'fudge' factor (the time taken per loop iteration) is calibrated.

I believe this worked well - however the sample rate was still only fractionally faster than 2MHz - not close to the (approx) 3.3MHz we'd like.

I'll look into that. The issue I found yesterday whilst looking at some generic DMA material is that the amount of code to initiate a DMA transfer is significant - much more than what currently happens in the loop. Whilst the memory transfer itself might be faster, the processing to initiate each transfer would be greater.

I guess this is what Paul meant by using a deep FIFO (I'm still learning!). This can be used to buffer up the samples such that the DMA transfer can be done on a larger block, rather than the 1 (or 5) bytes I'm currently processing each time.

Am I correct in saying that using DMA to transfer 1 or 5 bytes (if including a counter) at a time is likely to have more overhead than just doing a normal array indexed write?

We originally tried using a timer to generate interrupts and sample at a constant rate; however, we found that the sampling speed was far slower than using a simple while() loop - too slow for our needs. To further complicate matters, turning full optimisation on (using gcc for M68k) caused the interrupts to break, and full optimisation made a noticeable difference in speed with the simple while() loop.

Unfortunately I have little control over modifications to the hardware (i.e. adding FIFOs etc.). Whilst I can advise on what might be a better off-the-shelf choice, custom changes to boards are unlikely to happen at this stage (proof of concept and initial development).

Branch and instruction caches are turned on; the data cache is turned off, as I've so far been unsuccessful in getting the CACR (Cache Control Register) to mask the peripheral registers out of the cache. Turning it on means we get the same values back from the counter and pin status registers every time (as expected if they're being cached).
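
From my reading of the V4e manual, the per-region cache mode is actually set via the ACRs (Access Control Registers) rather than CACR itself. What I've been attempting is along these lines - the bit positions are from memory, so this may well be where I'm going wrong:

/* Sketch: mark the peripheral register region cache-inhibited via
   ACR0 so the data cache can stay on for ordinary RAM.  Base address
   and bit layout are illustrative - verify against the MCF547x/V4e
   reference manual. */
static void inhibit_peripheral_caching(void)
{
    unsigned long acr = (0x40UL << 24)  /* region base (example)     */
                      | (0x00UL << 16)  /* address mask: 16MB region */
                      | (1UL << 15)     /* enable this ACR           */
                      | (2UL << 5);     /* cache mode: inhibit       */
    __asm__ __volatile__ ("movec %0,%%acr0" : : "d"(acr));
}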

Turning on the branch and instruction caches made a huge difference. However, since we're writing up to 60MB to RAM (without reading it back) over a few seconds, I can't see how enabling the data cache would significantly improve performance. Most, if not all, of the local array indexes etc. - the values read and written most often - already seem to be held in registers.

I'm certainly no true embedded engineer - I'm traditionally a PC software engineer who hadn't had to worry about low-level code before. In our (small) company, I probably have the most 'embedded' experience, through doing some set-top box software development (using provided SDKs, not requiring low-level hardware interfacing). This has been my first true embedded development project; it's had a steep learning curve but has been fun. My knowledge of architectures and embedded hardware is very limited. The ColdFire board was selected by someone else (who's moved on now!), and if I can find a more suitable architecture/board then that would be good. I'm not really in a position to design any hardware - just to select something that's already out there.

I have no experience of selecting, programming or interfacing with FPGAs, but I have had it suggested before that something like this might be better suited to this task. The problem is, I'm not a hardware design guy.

I've even had a suggestion along the lines of "just use a memory controller chip and use a clock to transfer the pin status into the RAM directly" (or something like that). Sounds a great idea - but that's not something I can do with my current skills.

I appreciate all your help - but really, at the moment, we're limited to using an off-the-shelf processor board; it's just a matter of knowing which is suitable. We're at 2/3 of the speed we need, so there's not a huge amount to make up - it feels as if we should be able to do it with our current 'design'. I'm just wondering whether other hardware would give the speed increase we're looking for.

Thanks

David

Reply to
David Hearn

You only initiate the transfer once. The idea is that once you have initiated it, the DMA *hardware* continuously transfers data from the port into RAM, all by itself. (I don't know the details of *your* ColdFire DMA though.) There may be limitations on maximum DMA transfer length, and the ColdFire core may hog the memory occasionally and cause missed samples.

I think the reason we keep going on about the hardware is that a relatively simple, low-speed FIFO/microcontroller solution will in fact likely outperform your 733MHz speed demon in terms of reliably acquiring samples. But I don't know if you can get anything off the shelf.

--

John Devereux
Reply to
John Devereux

The software counter only *guarantees* that a counter variable counts; it cannot, under all circumstances, guarantee anything else.

Does the time factor between samples get logged as well as the changes? If not, the data starts to become meaningless.

What is the maximum frequency of change of ANY digital input compared to desired sampling frequency?

That is only safe if you can guarantee that NO other actions could EVER occur on your board that cause the micro or its support hardware to insert waits and change your timing.

What you have measured is an average sample rate over a period of time; we have no idea what accuracy of capture timing you are trying to achieve.

At the rates of capture you are trying to achieve I would NEVER do it by software polling.

If you want to capture many megabytes then most DMA controllers are capable of dealing with 32K and larger transfers fully under hardware timed control (if DMA is timer triggered).

I suggest you do some reading up on DMA on ANY processor and the concepts will become clear.

The large FIFO means that you know the time interval between EACH sample point, because external hardware times the writes into the FIFO. The reads from the FIFO are subject to system software and hardware latencies and, on most DMA controllers, the time taken to set up the controller to point at the next memory block. A deeper FIFO allows for longer (cumulative) latencies and longer delays (next-memory-block setup time), and also covers for sloppy code and other system events.

If you are comparing the transfer of a few bytes from memory to memory, or a SINGLE transfer to/from I/O, then YES.

For moving LARGE amounts of data to/from I/O at controlled, timed intervals, timer control (internal or external) plus DMA wins hands down over longer periods.

Judicious use of multiple DMA buffers would also allow your software to be 'compressing' the data in ONE buffer down to change events AT THE SAME TIME as the _hardware_ is filling up another buffer. The number of buffers and their size depend on:

the system in use
the length of a data run
the time to process one buffer (MAX time) in software
the capture rate (MIN time to fill a buffer)
other system activity
other things I have forgotten
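
In outline, the software side of that scheme looks like this - a rough sketch only; buffer count/size and the completion signalling depend entirely on the system:

#include <stdint.h>

#define NBUF  4
#define BUFSZ 32768  /* e.g. one 32K DMA block per buffer */

static uint8_t bufs[NBUF][BUFSZ];
static volatile int filled[NBUF];  /* set by a DMA-complete interrupt */

extern void compress_to_change_events(const uint8_t *buf, uint32_t len);

/* Software drains one buffer while the DMA hardware fills another. */
void drain_loop(void)
{
    int i = 0;
    for (;;) {
        while (!filled[i])
            ;                                   /* wait for hardware    */
        compress_to_change_events(bufs[i], BUFSZ);
        filled[i] = 0;                          /* return buffer to DMA */
        i = (i + 1) % NBUF;
    }
}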

Probably too much code in the interrupt routine, plus interrupt latencies. The timer should be DIRECTLY strobing the data capture.

Sounds more likely to be improper use of things like volatile and other software design problems.

Well, if you don't sort out the data acquisition properly BEFORE deciding on a processor, it is akin to saying

"I have this wonderful MP3 player which needs an Edison Wax Cylinder interface via USB without designing the hardware to interface it."

Proofs of concept, in my opinion, always have to show what is and what is NOT needed. If you approach the problem with the wrong tool then you will need to add something to kludge the lot together.

All you are proving at the moment is "we can throw money at buying bits - any bits - and if we keep buying bits we will force the problem to be solved".

Work out what needs to be done, how, and under what constraints first; then find the necessary hardware and software. You _APPEAR_ to have hardware and software which you are trying to force into meeting unknown constraints.

.......

Well, you need to find some form of CLOCKED digital input feeding some form of DMA transfer to some processor. Such things exist, but mainly for PCs and the like, and there are not many of them. It is a long time since I looked for such things; they are also normally relatively expensive items.

It seems to me that the system designer or whoever is in charge of the overall system design needs to review the 'proof of concept' strategy.

Someone where you work must be able to do it.

Other hardware added to your board, or even much lesser processor cards, will easily do this FUNCTION. What other hardware will do that your hardware will not is guarantee the sample-to-sample timing - and keeping it guaranteed is a matter of moving the data into memory fast enough to keep up with the sampling regime.

You appear not to have done any capture timing tests to see how regular your digital capture is from sample to sample and whether it meets any part of your spec.

All you have shown me at the moment is a bunch of data acquired at AVERAGE clock rates.

I have spent many years dealing with all sorts of data acquisition and processing, up to and including high-speed video data, and down to slow heat measurements (and slower ones). This looks to me like acquisition of data with no timing reference to make the data mean anything useful.

I suggest that you get someone involved who can see the WHOLE system issue and is ABLE to if necessary design the correct hardware to support the functions required.

Much as I would be able to do such a thing, I have several projects currently on the go, from network infrastructure to ASIC testing. I am sure there are plenty of people out there, or in your organisation, who could do this.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
              GNU H8 & mailing list info
             For those web sites you hate
Reply to
Paul Carpenter


The counter used decrements once per clock tick (at 133MHz). The difference between counter values can then be used to determine the period between samples.
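
For reference, 133MHz works out at about 7.5ns per tick, so our 300ns sample target is roughly 40 ticks. Converting a counter delta to time is then just (a quick sketch - the helper is mine, not from the SDK):

#include <stdint.h>

#define TICK_HZ 133000000UL  /* counter decrements at 133MHz */

/* The counter counts DOWN, so the earlier reading is the larger one. */
static uint32_t delta_to_ns(uint32_t earlier, uint32_t later)
{
    uint32_t ticks = earlier - later;
    return (uint32_t)(((uint64_t)ticks * 1000000000ULL) / TICK_HZ);
}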

Yes, when we did filtering of samples during the capture, the counter (ie. the time factor) was used to determine the period between samples. When we removed the filtering, and had a consistent code path for each sample iteration, the need for storing the interval was gone.

There are 5 digital inputs. Currently 4 of them are sampled perfectly well and provide the data we're after. The issue is with 1 input, which can run at around 850kHz if data is present. It's not a stable line (i.e. not an 850kHz clock) but has frequent phase changes - the maximum, though, is around 850kHz.

At present, the 2.xMHz we're sampling at isn't quite fast enough to get all the transitions. On the other 4 lines the speed is fine.

So far we've not seen any signs that this is happening, or at least, if it's happening, it's not affecting the quality of our (4 line) samples.

Using the same code in an interrupt handler (store byte in array, increment index) was taking longer than polling.

Will do.

I wasn't aware (until John pointed it out) that DMA is a set-up-and-forget system - the overhead is in the setup, not in the subsequent transfers. I can see now that for longer runs, DMA should be better.

See above about code in interrupt handler and performance issues.

With -O3 enabled, the interrupt handler would clear/acknowledge the interrupt, yet on exit the same interrupt fired again - implying the clear never actually took effect. Turning optimisation down to -O1 (with no other changes) made it work again. The register definitions (.h files) are from the Freescale/LogicPD-provided SDK.
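
This sounds like the volatile issue Paul mentioned - if the device registers aren't accessed through volatile-qualified pointers, the optimiser is free to merge, reorder, or drop the accesses. For example (the address is a placeholder):

#include <stdint.h>

/* Placeholder address - the real definition comes from the SDK. */
#define TIMER_EVENT (*(volatile uint8_t *)0x40000300)

void timer_isr(void)
{
    /* Write-1-to-clear the event flag.  Without the volatile
       qualifier the compiler may combine or move this access,
       which can leave the interrupt pending on exit. */
    TIMER_EVENT = 1;
}

void wait_for_event(void)
{
    /* Without volatile, -O3 may read the flag once and spin on the
       cached value forever; with it, every test is a real bus read. */
    while ((TIMER_EVENT & 1) == 0)
        ;
}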

There are two options we are considering to solve the problem.

1.) Sample faster (from 2.xMHz to 3.3MHz).

2.) Get the designer of the board generating the digital inputs to detect the phase changes and have that as the 5th digital input, rather than the 850kHz data we're getting. This would mean our current implementation will be suitable.

That would be me then.

We're a team of 5, and I'm the only one with *any* hardware experience - and that was over 6 years ago during my degree, in which I leaned towards the software side over the (mostly analogue and RF) hardware.

I don't disagree with this statement.

We have done this, and confirmed that 4 of the 5 digital inputs are being sampled with sufficient quality to consistently provide the data we're looking for.

Which is fine for our purpose (except the average rate is too slow for one of the inputs, which we may be able to solve with a change elsewhere).

"Looking at your requirements, our design engineers have suggested that you look at some form of Bus Mastering Data Acquisition card to get the speed you are looking for. Even then, with the 33MHz PCI bus, and a PCI read taking about 10 clock cycles per register, you are right on the limits of what the PCI bus can cope with.

I'm sorry but we cannot even think of any data acquisition card that may be suitable."

I suspect a typo in the 5Hz max - but I believe the meaning is still the same.

Looks like a change to the input board may be the best option (if it's possible), as then we'd meet our requirements and get the necessary data sampled to the resolution and quality we require.

Thanks again

D
Reply to
David Hearn

...

If you read a counter (assuming it is an atomic operation - no changes to the counter whilst reading bytes/words etc.), then depending on how synchronised the software is to the [assumed] hardware counter, you are at best +/- 1 count from the time the counter 'expired'. So the sampling has jitter on it.

If the counter 'expires' by counting down and stopping, then any number of effective counts could have occurred without decrementing the counter, so the sampling has even worse jitter on it.

...

To what accuracy of timing do you have to measure these changes? I.e. 10ms, 1ms, 100us, 10us, 1us, 100ns? This affects how good the sampling method needs to be.

That can be down to architecture and how the port pins are read as well as frequency of input changes on all the lines.

You have a single-board system with 64MB of DRAM. Dynamic RAM has to be refreshed, and depending on the type this is done automatically by hardware in the system. This refresh regime may, depending on the processor, delay all software, delay dynamic memory accesses, or extend timing at moments you are not expecting.

For example, it could delay all software just as your timer 'expires', so many effective counts occur before the software resumes, putting jitter on the sampling point. Alternatively it may delay a write to DRAM, which extends the loop time.

Also, on a board with Ethernet, USB and serial, I find it difficult to believe that you will not end up running your application under an operating system, which will have its own events happening. If not now, then when you want to do something with the acquired data.

As the processor is presumably a 32-bit processor, I would consider taking 4 samples and building a single 32-bit value in the data cache, to be written to external memory as one 32-bit operation. Instructions and data operating in cache are less susceptible to outside influences, and writing to memory will have fewer issues.
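
A sketch of that packing (the register address is a placeholder, as before):

#include <stdint.h>

#define GPIO_PIN_STATE (*(volatile uint8_t *)0x40000000) /* placeholder */

/* Pack 4 consecutive samples into one word: a single 32-bit store
   to external memory per four samples instead of four byte writes. */
void capture_packed(uint32_t *buf, uint32_t nwords)
{
    uint32_t i;
    for (i = 0; i < nwords; i++) {
        uint32_t w;
        w  = (uint32_t)GPIO_PIN_STATE << 24;
        w |= (uint32_t)GPIO_PIN_STATE << 16;
        w |= (uint32_t)GPIO_PIN_STATE << 8;
        w |= (uint32_t)GPIO_PIN_STATE;
        buf[i] = w;
    }
}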

Because of interrupt latencies and the overheads of processing interrupts. Interrupts are not meant for this kind of operating frequency.

.......

Interrupt latencies and overheads, probably due to the stacking and restoring of ALL registers, amongst other issues - which could be the OS as well.

Which suggests the optimiser removed a bit-clear followed by a bit-set of the same bit - which could still mean that some register, or a variable used to hold a copy of the register, was not declared volatile.

That depends on what accuracy you want to sample the changes to.

.......

.....

I suspect they mean 5kHz. But it could be 5Hz depending on the system and software being used to drive it under what they consider normal conditions.

Consider getting something that can pack the data into 16/32 bits; then you may find it easier to get the speed up. Especially if it is a bus-mastering device.

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
              GNU H8 & mailing list info
             For those web sites you hate
Reply to
Paul Carpenter
