c question

I'm not much of a c programmer, so maybe someone can give me some advice.

We'll be running Linux on a 600 MHz ARM processor in a Xilinx ZYNQ chip, on a microZED board. We'll have some big waveform files on an SD card, FAT32.

A simple file read loop reads a gigabyte file into RAM (just throwing away the data) at about 14 Mbytes/sec with a class 4 card, 18 with a class 10 card. I assume those reads are DMA. Interestingly, Wikipedia rates class 4 at 4 Mbytes/sec and class 10 at 10. But we're doing reads, and maybe they are talking about writes.

Once it's in ram, a program loop will copy the data into the FPGA, specifically into FIFOs that load DACs to play the waveform.

So, if one wrote the obvious C loop to dump the buffer into the FIFO, I assume that there would be some loop overhead to slow stuff down. Or maybe there's an ARM opcode that would do it fast, with no loop overhead. If it were a for loop, it could be unrolled, or semi-unrolled, which would be a loop wrapped around some reasonable number of inline moves, 256 maybe.

So, are compilers generally smart enough to optimize this? Or do they need help, second-guessing, to speed things up?

I'm thinking that a 600 MHz ARM, moving 32-bit data into an FPGA, would be fairly fast. But there is some fancy bus structure in the way, and we'll probably be crossing a clock domain.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

The main wisdom AFAICT is not to code byte-oriented stuff the way K&R suggests, but use a library routine like memcpy(). The library routines are usually much more closely tied to the hardware, and use whichever width transfer is most efficient.

The other bit of wisdom is to pay close attention to memory alignment. Misaligned access is dramatically slower on most modern architectures.
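For instance, something like this (the buffer size and the 32-byte alignment are just placeholders to experiment with, not magic numbers):

```c
#include <stdint.h>
#include <string.h>

/* Let the library pick the widest, fastest transfers instead of a
   hand-rolled byte loop. The aligned(32) attribute is a gcc extension;
   32 is a guess -- match it to your cache line / bus width. */
static uint32_t src[1024] __attribute__((aligned(32)));
static uint32_t dst[1024] __attribute__((aligned(32)));

void copy_waveform(void)
{
    memcpy(dst, src, sizeof src);
}
```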

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs

You're right about write speed. Flash (which is at the core of a card) is more complicated to write than to read.

In the ARM 32 bit instruction set, there is little to do to speed up a basic loop translated by a decent compiler:

        ldr   r0, sourceptr   @ point to data source array
        ldr   r1, fifoptr     @ point to FIFO I/O port
        ldr   r2, wordcnt     @ get count of 32-bit words

  loop: ldr   r3, [r0], #4    @ r3 = *r0++
        subs  r2, r2, #1      @ r2--
        str   r3, [r1]        @ *r1 = r3
        bne   loop            @ loop if r2 not zero yet

My guess is that the FIFO is addressed via a single port.

The loop counting overhead can be decreased by unrolling the loop, and some of the program fetch overhead can be reduced by loading several registers at once with a ldmia instruction, and dumping them one by one to the output port (which is a kind of unroll).
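In C, the same "load several, store several" idea looks roughly like this. It is only a sketch; the FIFO address is a parameter here so the routine can be tried against plain RAM first, and it assumes the word count is a multiple of 4:

```c
#include <stdint.h>
#include <stddef.h>

/* Semi-unrolled copy into a single FIFO address, four words per
   iteration -- roughly the C equivalent of an ldmia followed by
   four str instructions. Assumes n is a multiple of 4. */
void fifo_copy4(volatile uint32_t *fifo, const uint32_t *src, size_t n)
{
    while (n >= 4) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        *fifo = a;
        *fifo = b;
        *fifo = c;
        *fifo = d;
        src += 4;
        n   -= 4;
    }
}
```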

You'll lose more because of the other things run by the operating system. Some delay will also be caused by the necessary memory management (paging) hardware needed by Linux.

--

Tauno Voipio
Reply to
Tauno Voipio

I assume the real question is how fast can we feed the DACs.

The code will be in the cache. The data will be coming from RAM. In this context, RAM is horribly slow.

I suggest some experiments. Write some code and feed it to the disassembler. Check out memcpy in your library.

Write some test code that copies the data to memory (which will be cached). Time it, both overall for a big file and also per page or chunk within a big file.

You can also write some FPGA code to measure the longest gap between writes.

You are doing writes, so crossing clock domains with a FIFO should be easy.

-------

There is another branch to this problem which is to reserve as much RAM as you can get away with, load it up, then feed it to the DACs.

You can make the FPGA do DMA.

--
These are my opinions.  I hate spam.
Reply to
Hal Murray

The simplest method is usually something like memcpy(), if you are copying from one area to another. Compilers and libraries conspire to give close to optimal instructions for such functions (assuming you enable optimisations on the compiler, of course).

That won't work if you need to copy from a buffer into a single address. You will have to write that code yourself.

Make sure your target address is accessed as a volatile - otherwise the writes will be mostly eliminated by the compiler. And make sure all your addresses are properly aligned, and your transfers use the best size (probably 32 bits here).
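A minimal sketch of what that looks like. The real FIFO address would come from your driver (e.g. an mmap of the FPGA's AXI window); it is a parameter here only so the routine stands alone:

```c
#include <stdint.h>

/* The compiler must see the FIFO as volatile; otherwise it sees n
   stores to one address and keeps only the last one. With volatile,
   every 32-bit store is emitted, in order. */
void push_block(volatile uint32_t *fifo, const uint32_t *buf, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        *fifo = buf[i];
}
```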

Don't try to inline moves blindly - the factors affecting speed are complicated, and "optimising" without measuring is pretty much guaranteed to make things worse. Either accept the compiler's version with optimisations enabled, or be prepared to put in a lot of effort in measurements and examining code. It is not uncommon that a bit of loop unrolling will help, but it is rare that large unrolls (such as the 256 you mentioned) will help - the cost of instruction cache misses and wasted bandwidth reading the instructions is normally much higher than the cost of a simple, easily predicted branch. But since there is no hard and fast rule and it varies between processors, you either trust the compiler or measure things yourself.

Remember also with optimisation that you never need "as fast as possible" - you need "as fast as necessary". You are not looking for "the best", you are looking for "good enough".

If you find you are not getting the performance you want from software, also consider DMA. It is more complicated, with details varying on whether you have full control of the hardware or if you need to go via an OS, but it will often give better performance (again, no guarantees).

Reply to
David Brown

He's writing to a FIFO in a FPGA, so he can set up the address decoding to ignore enough bottom bits to let memcpy work.

I'd certainly use memcpy (to RAM) for early sanity-check timings.

Good point.

--
These are my opinions.  I hate spam.
Reply to
Hal Murray

That works if the FPGA space is memory mapped. If the architecture is "pound one memory location with sequential data" then you'll need to do some work on it yourself (but, finding the source code for memcpy is probably a good place to start!).

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com
Reply to
Tim Wescott

I'm thinking eight 16-bit DACs. That could be mapped as four longwords, and ARM can do 32-bit moves. The target is eight FIFOs, but it could be multiply mapped (repeat the apparent FIFOs a bunch of times) if that would make memcpy happier.

Past experience, mostly x86, suggests that theory doesn't help much, and the thing to do is a lot of experiments. Unscientific.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

That rings a bell. Does anyone remember Duff's device? You'll hate it, but maybe it can do something for you here.

Jeroen Belleman

Reply to
jeroen Belleman

If you're using fread(), sometimes you can get a little improvement by playing around with the size of the buffer with setvbuf(). You may also want to look into mmap().
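A sketch of the setvbuf() experiment (the 1 MB buffer and the function name are just placeholders to play with):

```c
#include <stdio.h>

/* Read a file through stdio with an enlarged buffer. setvbuf() must be
   called after fopen() but before the first read. Returns 0 on success,
   -1 on failure. */
int read_waveform(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    static char iobuf[1 << 20];          /* 1 MB -- a guess, tune it */
    setvbuf(f, iobuf, _IOFBF, sizeof iobuf);

    char chunk[4096];
    size_t total = 0, n;
    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0)
        total += n;   /* just counting; real code would push to the FIFO */

    fclose(f);
    return total > 0 ? 0 : -1;
}
```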

I'm not sure.

It's write performance. Per
formatting link
, "Speed Class and UHS Speed Class symbols indicate minimum writing performance to ensure smooth writing of streaming content such as video shooting."

Well, there will always be an increment/decrement, a test, and a branch for each iteration of the loop - that's the cost of doing business. On ARM there's a decent chance of the loop variable ending up in a register, so the increment/decrement will probably be as fast as it is possible to be.

Some people prefer loops like for(i=10;i>0;i--) instead of for(i=1;i<=10;i++), since counting down to zero means the decrement itself sets the flags the branch tests, saving a separate compare.

If you're using gcc, see the -funroll-loops and -funroll-all-loops options. There are some related options on how it treats the body of unrolled loops, listed right after those two options in the man page.

Kind of. Since they don't know what you're trying to do, they have to follow some heuristics based on things they can see, like the iteration count. gcc provides a *lot* of command-line switches to affect these kinds of heuristics.

See above. gcc (and most other Linux/Unix C compilers) have sort of a "master" optimization switch, -O, that usually runs from -O1 to -O3 or so. Each step turns on more and/or different individual optimizations. For gcc, you can see what optimizers get turned on at each step by saying things like

gcc -Q -O1 --help=optimizers
gcc -Q -O3 --help=optimizers

Other compilers may have a similar mechanism, or the list of what happens at each -O level might be in the manual somewhere. Note that -O3 (or whatever the top level is) doesn't necessarily turn *everything* on; for the x86_64 gcc I have handy, neither -funroll-loops nor -funroll-all-loops get turned on by -O1, -O2, or -O3. You have to specify these options yourself on the command line (or in the Makefile or whatever) if you want them.

If you have somebody around who is familiar with ARM assembly, you might like the -S switch to gcc, which makes it output the assembly language the compiler is going to use, instead of compiling all the way to an executable. The -fverbose-asm switch may also be useful to get a little more information in the assembly output. Tip: If you've never done this before, then do it the first time without any optimizations turned on, so you can get some practice matching up the assembly code to the C code. Once you turn on optimizations, some things will move around or disappear, which makes the assembly a little harder to follow.

General advice: first try with different compiler options and see what improvement you get. Make sure to note down the compiler options for each run. If that doesn't get you anywhere, then try rearranging the code yourself.

Matt Roberds

Reply to
mroberds

John - one approach is to write it in C with two loops, an outer and an inner loop. Have the inner loop pass a word count that would be convenient for a routine written in assembly. If the C code isn't fast enough on the first try, then write an assembly routine to replace the inner loop.
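A sketch of that structure (the 256-word inner count and the names are just for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Inner routine with a fixed, assembly-friendly word count. Start in C;
   if it proves too slow, replace just this function with hand assembly. */
void inner_copy256(volatile uint32_t *fifo, const uint32_t *src)
{
    for (int i = 0; i < 256; i++)
        *fifo = src[i];
}

void outer_copy(volatile uint32_t *fifo, const uint32_t *src, size_t words)
{
    while (words >= 256) {
        inner_copy256(fifo, src);
        src   += 256;
        words -= 256;
    }
    while (words--)            /* odd-sized tail */
        *fifo = *src++;
}
```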

Hul

John Larkin wrote:
> I'm not much of a c programmer, so maybe someone can give me some advice.

Reply to
Hul Tytus

IIRC Duff's thing was to use a bunch of named labels in a partly-unrolled loop, and jump in at the right place for the number of elements you were moving. For a big move, it won't make as much difference.
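For the curious, here's the classic shape of it, adapted to a single FIFO address. Purely a historical curiosity (as noted below in the thread, a modern compiler will usually beat it):

```c
#include <stdint.h>
#include <stddef.h>

/* Duff's device: the switch jumps into the middle of the unrolled
   do/while to handle the leftover count, then the loop runs full
   8-word passes. */
void duff_copy(volatile uint32_t *to, const uint32_t *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to = *from++;   /* fall through */
    case 7:      *to = *from++;
    case 6:      *to = *from++;
    case 5:      *to = *from++;
    case 4:      *to = *from++;
    case 3:      *to = *from++;
    case 2:      *to = *from++;
    case 1:      *to = *from++;
            } while (--n > 0);
    }
}
```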

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs

That's correct.

But "Duff's Device" is a thing of the past - if your compiler can't do a better job than manually creating a Duff's device, then get a better compiler. There was a time when such hand-optimisation made sense (and it was kind of fun at the time), but we are well past that.

Reply to
David Brown

Looking over the answers so far, it seems the elephant in the room has been mostly ignored.

You do /not/ want to pass this data as fast as possible - that will simply overflow your FIFOs. You need some sort of mechanism to stall the transfer, or to communicate FIFO space information back to the software. Then all you need is to be able to write out the data about 1% faster than the data gets taken out of the FIFOs.
Reply to
David Brown

Yup, there are two ways to do this. One is to check the FIFO EVERY iteration of the loop to make sure it is not full. This will probably slow the loop down to half speed.

Two is to check if the FIFO has, say, at least 1 K of space left, then transmit 1K as fast as you can before checking status again.
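The second policy might look something like this. The FIFO depth, the 1K threshold, and the fill-count register are all hypothetical; the register address would come from your FPGA design, and it's a pointer parameter here only so the policy can be tested against plain RAM:

```c
#include <stdint.h>
#include <stddef.h>

#define FIFO_DEPTH 4096u   /* words; made-up figure for this sketch */

/* Check the fill count once, then burst as much as fits without
   re-checking status every word. Returns the number of words written. */
size_t fifo_topoff(volatile uint32_t *fifo, const volatile uint32_t *fill,
                   const uint32_t *buf, size_t avail)
{
    size_t space = FIFO_DEPTH - *fill;
    if (space < 1024)              /* not worth the call overhead yet */
        return 0;
    size_t n = avail < space ? avail : space;
    for (size_t i = 0; i < n; i++)
        *fifo = buf[i];
    return n;
}
```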

And, of course, another question is do you really need to process every byte? Why not make the FIFO accept entire 1K blocks of data from memory? Read a whole block from the storage device, send entire block to the FIFO. When you are talking about tens of MB/second, this may be necessary.

Jon

Reply to
Jon Elson

Well, we'll have a FIFO fill count value available. When there's some chunk of room available in a FIFO, the software will top it off from the file buffer in RAM. There will be a lot of messy double buffering and file pointers and like that.

The FIFO output (DAC load) rate will be programmable, and will typically be moderate, 25 or 50 KHz maybe, but I'd like to be able to go as fast as possible.

We call this project The Wayback Machine.

formatting link

I put that pic in the proposal, along with some photos of whiteboard cartoons. Beats the hell out of PowerPoint.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

Cute, but we can always transfer biggish blocks of waveform data, like in 256 sample chunks maybe, big fixed-sized unrolled loops. The FPGA has lots of ram, so we can have big FIFOs.

The impression that I'm getting is that we should try a bunch of different loop styles and compiler settings and see what works best. So, why do they call it Computer Science?

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

Yes, but sometimes a quick experiment sheds a lot of light.

I just pulled out an old hack to measure memory bandwidth and ran it on a Raspberry Pi. memcpy gets 350 megabytes/sec reading and another 350 writing. That's with huge blocks so only the code is in the cache.

If your file system is only good for 10 megabytes/sec, that's going to be the limiting factor (unless you really botch the interface to the FIFO).

--
These are my opinions.  I hate spam.
Reply to
Hal Murray

Try the compiler settings first; they're easier.

As one of my professors liked to say, "Any field that has to put 'science' in its name isn't really a science, or at least it isn't one yet."

Part of the problem here is that you have a lot of moving parts.

The stock Linux kernel has bunches of compile-time scheduler options, and some more you can tweak at runtime. Whoever built the kernel for your board chose a certain set of them, and may have taken it upon themselves to tweak the scheduler code further.

Somebody probably wrote a kernel-space driver to interface to the FPGA FIFO. The way this driver was written can affect the speed a lot.

Your compiler, and possibly the C library it uses, will have an effect on the C side of things (reading the file, memcpy() to your FIFO). These days, the standard libraries are usually pretty good, but they can and do vary.

This is sort of out of scope of the C programming, but you probably used a tool to go from some design spec to the actual arrangement of gates in the particular FPGA chip you are using. That tool almost certainly has options for tweaking gate count/latency/throughput, some of which will affect the fastest way to get data into and out of the FPGA.

The net result is that probably nobody has exactly the same stack of software that you do, configured in the same way. So it is hard for somebody on the other end of a newsgroup to tell you the exact optimal settings for everything, which is why you are getting the "try it different ways and see" answer.

There are some things that are fairly obviously bad ideas, like using read(2) to get one byte at a time from the SD card and then using write(2) to shove one byte at a time at the FPGA, so you are getting discussions about memcpy() and so on.

Another piece of the puzzle is that in these latter days, with lots of CPU cycles and RAM and bandwidth, often the first random thing you code up is fast enough anyway. If it isn't, the next popular option is that nothing you can code up approaches 10% of the speed that you need, which means that you need to buy a faster CPU or otherwise make major design changes. Finally, you have the cases where sometimes you get 105% of the speed you need and sometimes you get 95%, so you start tweaking around the edges to make it work.

The preceding paragraph doesn't apply if you have an 8051 with three and a half bytes of RAM, and you want it to run for five years on a CR2016 - in that case you start up in "tweak the hell out of everything" mode. For "bigger" embedded systems it's quite often true, though.

Matt Roberds

Reply to
mroberds

A memcpy() _could_ - but does not have to - use some nice tricks based on the actual hardware implementation. Much depends on whether cache is available, the size of the cache line, the number of simultaneous cache lines available, and the number and size of (free) registers available.

For instance, loading a whole cache line into (long/double) registers and then storing the registers to the target in sequence. With multiple cache lines available, perform a dummy read of one byte in each cache line and then perform a bulk copy of each cache line as above.

This requires detailed understanding of the processor and some experimentation is needed.

It would be a good idea to use cache line/virtual memory page alignment.

Reply to
upsidedown
