c question

"C" and i do not get along. But it seems silly and time wasting to do TWO R/W processes: #1: A simple file read loop reads a gigabyte file into RAM; then #2: a program loop will copy the data into the FPGA. Transfer directly from the file to the FPGA; might even be more than twice as fast.

Reply to
Robert Baer

Amen! Byte-oriented transfer takes N*10^9 ops: very s l o w. Open the FPGA as the buffer and do one read; fast as all hell. If the architecture allows a 10^9 block natively, BAM! But even if it allows a measly 10^6 block, N*10^3 ops is a heck of a lot faster than N*10^9 ops.
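
To make the byte-versus-block point concrete, here is a sketch of the two extremes (both fill buf from an already-open FILE):

#include <stdio.h>

/* byte at a time: roughly one library call per byte -- slow */
size_t read_bytewise(FILE *f, unsigned char *buf, size_t len)
{
    size_t i;
    int c;
    for (i = 0; i < len && (c = fgetc(f)) != EOF; i++)
        buf[i] = (unsigned char)c;
    return i;
}

/* one call for the whole block */
size_t read_block(FILE *f, unsigned char *buf, size_t len)
{
    return fread(buf, 1, len, f);
}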

Reply to
Robert Baer

[...] on a microZED board. We'll have some big waveform files on an SD card, FAT32.

[...] away the data) at about 14 Mbytes/sec with a class 4 card, 18 with a class 10 card. I assume those reads are DMA. Interestingly, wikipedia rates class 4 at 4 Mbytes/sec and class 10 at 10. But we're doing reads, and maybe they are talking about writes.

[...] specifically into FIFOs that load DACs to play the waveform.

[...] I assume that there would be some loop overhead to slow stuff down. Or maybe there's an ARM opcode that would do it fast, with no loop overhead. If it were a for loop, it could be unfolded, or semi-unfolded, [...]

256 sample chunks maybe, big fixed-sized unrolled loops. The FPGA has lots of ram, so we can have big FIFOs.

[...] test loop styles and compiler settings and see what works best. So, why do they call it Computer Science?

The impression that you are getting is that you and your minions should take the scientific approach of running experiments and seeing how they come out.

Somebody who'd been interested enough to take in some Computer Science might not need to run quite so many experiments.

--
Bill Sloman, Sydney
Reply to
Bill Sloman

As far as I understand, you want to DMA a chunk of disk buffer to the FPGA (memory or registers?), right? So it's not so much a C question but more a Linux driver one; a starting point is to be able to mmap() the FPGA memory (in /dev/mem).

If you are able to mmap() FPGA memory into user space, chances are that DMA xfers are implemented as-is. Not sure though.
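
A minimal sketch of the /dev/mem route (FPGA_BASE and FPGA_SPAN are placeholders; the real address window comes from your FPGA design and device tree):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FPGA_BASE 0x40000000UL   /* hypothetical AXI base address */
#define FPGA_SPAN 0x10000UL      /* hypothetical 64 KB window     */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile uint32_t *fpga = mmap(NULL, FPGA_SPAN,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, FPGA_BASE);
    if (fpga == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    fpga[0] = 0xDEADBEEF;        /* write a 32-bit FPGA register */
    printf("reg0 = 0x%08x\n", (unsigned)fpga[0]);

    munmap((void *)fpga, FPGA_SPAN);
    close(fd);
    return 0;
}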

Habib.

Reply to
habib.bouaziz

Perhaps I am missing something, but why don't you load the file data directly into the DAC via DMA?

Even a simple device like the Cortex series supports fancy data transfers, with the DMA being controlled by the DAC sample rate. You only need to set up the DMA channel; no code is involved otherwise.

You can even do multiple DMAs, so if you need to process the data, a DSP engine can be inserted into the DMA path, still with no "active" code during processing and the CPU horsepower left unaffected.
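
A hypothetical register-level sketch of that idea: one DMA channel streams a sample buffer to the DAC data register, each transfer paced by the DAC's sample-rate request. All register names here (DMA_CH0_*, DAC_DATA_ADDR) are made up; consult your MCU's reference manual for the real ones.

#include <stdint.h>

#define REG32(a) (*(volatile uint32_t *)(a))

/* hypothetical addresses */
#define DMA_CH0_SRC   REG32(0x40020000)  /* source address        */
#define DMA_CH0_DST   REG32(0x40020004)  /* destination address   */
#define DMA_CH0_COUNT REG32(0x40020008)  /* transfers remaining   */
#define DMA_CH0_CTRL  REG32(0x4002000C)  /* enable / trigger sel. */
#define DAC_DATA_ADDR 0x40007008         /* DAC holding register  */

#define CTRL_ENABLE   (1u << 0)
#define CTRL_SRC_INC  (1u << 1)          /* increment source only */
#define CTRL_TRIG_DAC (4u << 4)          /* pace by DAC request   */

static void dac_dma_start(const uint16_t *samples, uint32_t n)
{
    DMA_CH0_SRC   = (uint32_t)(uintptr_t)samples;
    DMA_CH0_DST   = DAC_DATA_ADDR;       /* fixed peripheral dest */
    DMA_CH0_COUNT = n;
    DMA_CH0_CTRL  = CTRL_ENABLE | CTRL_SRC_INC | CTRL_TRIG_DAC;
    /* from here on the DAC drains the buffer with no CPU in the loop */
}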

Cheers

Klaus

Reply to
Klaus Kragelund

Experiments are essential to being "scientific". Having some realistic theory is useful, but you have to /measure/ performance to know whether you are fast enough, and what your bottlenecks really are.

The balance between theory and experiments is a bit different with software than you are used to with hardware - it is a small matter to test a dozen different variants of software to see which is faster, while hardware engineers are usually reluctant to make a dozen variations of their boards and test each one.

Reply to
David Brown

I think of software development as a craft - it is part science, and part art. The same applies to electronics development. There is a lot of science behind it, but the science and the theories don't cover all aspects of the design.

Reply to
David Brown

Yes, that would be the way to handle it (unless it is possible to hook up a FIFO ready signal to a DMA enable signal, but that would require hardware interaction between the FPGA FIFO and the processor's DMA).

1 K blocks are probably a good size to start with - not too big for the target's FIFOs, but big enough that the check for FIFO ready has little overhead.
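
A sketch of that scheme, assuming a mapped FPGA window with a hypothetical FIFO-ready status word and a single FIFO write port:

#include <stdint.h>

#define BLOCK_WORDS 256   /* 1 KB of 32-bit samples per burst */

/* hypothetical registers, set from your mmap'ed FPGA window */
volatile uint32_t *fifo_status; /* nonzero when >= 1 KB of space is free */
volatile uint32_t *fifo_data;   /* write port of the FIFO                */

static void send_block(const uint32_t *src)
{
    while (*fifo_status == 0)
        ;                        /* spin until the FIFO has room */
    for (int i = 0; i < BLOCK_WORDS; i++)
        *fifo_data = src[i];     /* one readiness check per 1 KB */
}
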
Reply to
David Brown

You should aim to add some instrumentation and statistics to your software - how often do you check the FIFOs, how many entries do you queue each time, how long does it take to fill these buffers, etc. Measurement is key.
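
A sketch of the sort of instrumentation meant here (clock_gettime is POSIX; the counters are whatever your transfer loop already tracks):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t checks, queued, max_wait_ns;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

/* call around each FIFO refill */
static void note_refill(uint64_t t_start, uint64_t entries)
{
    uint64_t dt = now_ns() - t_start;
    checks++;
    queued += entries;
    if (dt > max_wait_ns) max_wait_ns = dt;
}

static void dump_stats(void)
{
    printf("checks=%llu avg_queued=%llu max_wait=%llu ns\n",
           (unsigned long long)checks,
           (unsigned long long)(checks ? queued / checks : 0),
           (unsigned long long)max_wait_ns);
}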

Reply to
David Brown

The data rate reading from a file can be highly unpredictable - you might get 14 MB/s average, but sometimes "read the next chunk" might be delayed for tens of milliseconds or more. Once the data is in ram, it is easier to get a consistent readout speed. (Of course, this depends on the requirements of the application, and the consequences of underflow.)

Assuming that the average output speed is lower than the average file read speed (otherwise buffering most or all of the data is the only option), then it might make sense to have a couple of large buffers (say, 1 MB) and feed out from one buffer while reading the next chunk into the other buffer.
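
A minimal sketch of that ping-pong scheme (written single-threaded for clarity; in practice the read and the feed would overlap via a thread or async I/O, and feed_fpga stands for whatever your output path is):

#include <stdio.h>
#include <stdlib.h>

#define BUF_SZ (1 << 20)            /* two 1 MB buffers */

extern void feed_fpga(const unsigned char *p, size_t n);

void play_file(FILE *f)
{
    static unsigned char buf[2][BUF_SZ];
    size_t n[2];
    int cur = 0;

    n[cur] = fread(buf[cur], 1, BUF_SZ, f);      /* prime first buffer  */
    while (n[cur] > 0) {
        int nxt = cur ^ 1;
        n[nxt] = fread(buf[nxt], 1, BUF_SZ, f);  /* read next chunk...  */
        feed_fpga(buf[cur], n[cur]);             /* ...while this plays */
        cur = nxt;
    }
}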

Reply to
David Brown

Unrolling loops no longer works on CPUs that speculatively execute the most recently traversed path, and it can even slow things down by ensuring that the decoded loop body overflows the CPU's internal buffers.

It is far from guaranteed on a given platform which way of incrementing an address will be fastest! You basically have to test the alternatives.

The fastest loop code is usually the one that will fit in a single cache line (which means the start is optimally on a 16 byte boundary). If there are streaming SIMD instructions, using them is faster.
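
On an ARM core with NEON, for instance, a 16-bytes-at-a-time copy looks like the sketch below; whether it actually beats the platform's memcpy is exactly the kind of thing that has to be measured.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* copy n bytes; assumes n is a multiple of 16 and pointers are aligned */
static void copy_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);  /* one 128-bit load  */
        vst1q_u8(dst + i, v);              /* one 128-bit store */
    }
}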

Experiments are the basis of all modern scientific understanding!

You need to look first at the loop code that the compiler is generating and understand how the target CPU executes it. These days, for short loops, the code can end up either fitting in cache or not, and/or have a critical choice of branch destination. Good compilers are far more likely to get this stuff right than C hackers trying to micromanage things. As someone else has already said, making everything a 4 byte aligned integer move and using memcpy as a basis is a good way to start.

If speed at all costs is required, then there is no substitute for enabling whatever diagnostics the chip offers and looking to see which sorts of pipeline stalls or cache stalls are the main bottleneck, and then experimenting explicitly with minimalist assembler loops. You can get some significantly different results on loops with different CPU models in the Intel family, which is why they provide a target CPU flag.

--
Regards, 
Martin Brown
Reply to
Martin Brown

Experiments are the foundation of modern science.

If you knew what you were doing you would already have looked at the disassembly of your loop as generated by the optimising compiler and posted it here for comment. These days the best profile guided optimisers are sufficiently good (provided you tell it the right target CPU) that attempts by an amateur to "optimise" the code will more than likely pessimise it and result in worse performance.

The relevant footnote for ARM CPUs is on their branch prediction capabilities, which depend on the exact model chosen, e.g.

formatting link

You have to decide if you really need the fastest possible performance, which can take a while to figure out (and be quite complex to take account of different cache configurations and quirks on various platforms), or the *engineering* solution of being fast enough for the job in hand. The 80:20 rule applies here, but it is more like 90:10.

--
Regards, 
Martin Brown
Reply to
Martin Brown

John, I think one issue with SW development is psychological - getting lost in the trees and not seeing the forest. If you look at how Phil designs optics, for example, I believe he has enough empirical data and formulae (and hunches) to calculate noise and photon budgets, as he puts it.

In your case, I don't see anything about actual speeds you need. So most of the comments are speculating in a vacuum. What is your speed budget?

Among c compilers, I used to work with one that had an inline assembler capability. But I am concerned that you don't know, even to an order of magnitude, what kind of transfer speed you need. Just an observation, possibly wrong. jb

Reply to
haiticare2011

We don't need to test a dozen PCB designs, because we have the tools, including established science, to accurately predict the performance of designs. We're engineers, not scientists. We can predict the speed of analog circuits and logic designs pretty closely. Unfortunately, we can't predict the execution speed of software loops to within a factor of 4 or so.

We could when we programmed in assembly on most architectures. I had one assembler that showed the execution time of each instruction on the assembly listing.

I wouldn't mind running a test to determine mem-copy runtime, but there are so many variants (code and compiler switches) to be tested.
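
A minimal harness for that kind of test might look like the following (clock_gettime is POSIX; copy_memcpy stands in for whichever loop variant is being measured):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N (1 << 20)   /* 1 MB test block */

static unsigned char src[N], dst[N];

static double time_one(void (*copy)(void *, const void *, size_t))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 100; rep++)
        copy(dst, src, N);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (100.0 * N) / s / 1e6;    /* MB/s */
}

static void copy_memcpy(void *d, const void *s, size_t n) { memcpy(d, s, n); }

int main(void)
{
    printf("memcpy: %.1f MB/s\n", time_one(copy_memcpy));
    /* add copy_bytewise, copy_words, etc. and compare */
    return 0;
}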

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

Well, science is supposed to lead toward quantitatively accurate prediction of how things behave. F = M * A, reusable things like that.

As engineers, we want to be able to use established science to build things whose behavior is expected. Building a bridge should not be an experiment.

Hardly predictive!

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

But I'm an engineer, not a scientist.

This one is not my loop; it's being done by my embedded-system programmer and my FPGA guy, who between them have all the (complex) tool chains to compile c code and FPGA designs into something that can be booted and run on the Zynq.

What my job is, is to write specifications for product performance (i.e., make promises to potential customers and persuade them to buy stuff) and to goad the software and FPGA guys into making estimates of realtime performance that I can trust. I can predict the electronic system speeds perfectly, in that I can invent specs that we are essentially guaranteed to meet. The software transfer rates are unknown, and can be optimized only by lots of mostly blind, and tedious, experiment.

There's no hard spec; the better the specs I offer, the more we can sell. So I need to push things a bit. I know we can transfer 25K samples/sec to 8 DACs, but can we push 500K? And run Ethernet concurrently?
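
For scale, assuming 16-bit samples (the sample width isn't stated here): 500 Ksamples/sec x 8 channels x 2 bytes = 8 Mbytes/sec, which is under the 14-18 Mbytes/sec SD-card read rates measured above, but not by a huge margin once file-system and loop overhead are counted.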

Dedicating one ARM core to the data transfer program would immensely improve predictability, but apparently that's not easy, either.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

What I like to do, when I can, is bring things out to oscilloscopes. When there's an FPGA available, as in this case, that's fairly easy. We can, say, look at FIFO writes and see what happens when we do some Ethernet stuff off to the side.
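
One cheap way to support that from the software side (a sketch; debug_reg is a hypothetical FPGA register wired to a test pin) is to bracket the transfer with writes the scope can trigger on:

#include <stdint.h>

extern volatile uint32_t *debug_reg;   /* hypothetical: drives an FPGA test pin */

static void traced_burst(volatile uint32_t *fifo_data,
                         const uint32_t *src, int n)
{
    *debug_reg = 1;              /* pin high: burst begins (scope trigger) */
    for (int i = 0; i < n; i++)
        *fifo_data = src[i];
    *debug_reg = 0;              /* pin low: pulse width = burst duration */
}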

One of the problems with x86 is that it's increasingly a closed system, without even a classic memory bus or parallel ports. You get DRAM, USB, and PCIe ports, nothing realtime. The new generation of ARM+FPGA SoCs, from Xilinx and Altera, is much easier to snoop in real time.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

Ditto.

That might work. We'd have to make the FIFO inputs look like big chunks of memory space, which could be done. But they can't be any bigger than the FPGA ram that's available for building the FIFOs. An intermediate DRAM buffer could be much bigger, which would allow bigger file transfers and less file-system overhead.

More unpredictable complex things to try!

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

We've already mmap'd FPGA space, on another project. It only took two guys (and some help from Avnet) about two weeks to get a c program to write a register in the FPGA. At least that part is over.

Depends on transfer sizes. We could do big file reads into DRAM but much smaller reads directly into the FPGA, because the FPGA doesn't have a gigabyte of ram available.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin

For one customer, in one case, he has waveform files that were sampled at 25 kHz, I think 4 channels. But I want to define a product to service other customers and apps, and I need to come up with specs to sell. 1 MHz x 8 channels would be easy on the hardware side, but predicting the software performance is messy. I could launch a month of experiments, lots of c stuff and FPGA builds, and it looks like I'll have to.

As fast as is reliably deliverable!

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation
Reply to
John Larkin
