c question

- H
- haiticare2011
  
posted
9 years ago

Mon, May 12, 2014 4:24 PM

Larkin -

OK, but what are the boundary conditions, e.g., what exactly does the C code have to do? Store "Nyquist" data points? And - you are doing this offline from memory - is this RT later? I would like to see all the operations described, because I'm not sure of the algorithm. Sounds like a simple memory transfer and that's it? The speeds you mention sound easily within C speeds - but again I don't know all the data manipulations you need. And - don't forget - you can always throw hardware at it to get some extra speed. Your immediate need is a very slow system to satisfy a customer? Why not just code for them and see what the C code will do (after some bandwidth (speed) budget figuring)? It's important to get out of experiment mode and into engineering mode. j (I always give advice on areas I'm guilty of. :)

- L
- ldfishel
  
posted
9 years ago

Mon, May 12, 2014 4:26 PM


Sorry, I got tired of reading the thread, but this was what I was looking for. Forgive me if this question has already been asked and answered.

Your SD card is 100 times faster than this and your processor is easily 10 times faster still. Why is speed even a question unless your processor will actually be doing something else 99% of the time? (Adjust by some small factor if you're talking kHz SAMPLES with each sample being multiple bytes.)

Whatever method simplifies your code should be more than sufficient, unless, as someone suggested (maybe you), you're actually outputting directly to sound hardware and HAVE to go SLOW enough for it to keep up...

I haven't had to worry about going FAST enough for sound hardware in 25 years...but I could easily be missing something... That was on a machine that weighed 10 pounds, and yours is probably the size of a credit card or smaller, but it's 300 times faster...

- J
- John Larkin
  
posted
9 years ago

Mon, May 12, 2014 4:34 PM

The main case will be that we have a big, maybe gigabytes, waveform file on an SD card, preformatted whatever way is best to stuff into our DACs. We'll use the Linux file system to read the file into DRAM, C program space, and the ARM code will poke that data into the DAC FIFOs in the FPGA. Repeat until we run out of file some hours later.

We'll also have some ADCs for waveform acquisition, a separate discussion.

We can pre-manipulate the waveform files (ie, make the customer do that.)

Sure, that's what the FIFOs and fill count registers are for.

I want to write a spec for a product that we can sell. So, how fast can it play waveforms, as a guaranteed specification?

My point exactly!

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation

- T
- Tim Wescott
  
posted
9 years ago

Mon, May 12, 2014 4:53 PM

Test until you find something that works, then stop testing and start developing product.

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- J
- John Larkin
  
posted
9 years ago

Mon, May 12, 2014 5:12 PM

At 50 Ksamples/sec, 8 channels, 16 bits, that's 800 Kbytes/sec, so the SD card margin is more like 10:1 than 100.
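The arithmetic behind that 800 Kbytes/sec figure can be written as a small helper. This is only a sketch; the function name `stream_bytes_per_sec` is made up for illustration, and it assumes each sample is stored in a whole number of bytes.

```c
#include <stdint.h>

/* Hypothetical helper: bytes per second needed to keep all DAC channels fed.
 * Sample widths are rounded up to whole bytes (an assumption about packing). */
static uint64_t stream_bytes_per_sec(uint32_t samples_per_sec,
                                     uint32_t channels,
                                     uint32_t bits_per_sample)
{
    uint32_t bytes_per_sample = (bits_per_sample + 7u) / 8u;
    return (uint64_t)samples_per_sec * channels * bytes_per_sample;
}
```

Calling `stream_bytes_per_sec(50000, 8, 16)` gives 800000, i.e. the 800 Kbytes/sec quoted above; a 10:1 margin would then correspond to a sustained card read rate of roughly 8 Mbytes/sec.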

But the faster it is, the more potential customers I have. So I want to make it as fast as a reasonable effort allows.

Slow is no problem. We can manage keeping the FIFOs loaded at low speeds.

It's not a sound card. It will be an FPGA and some 16-bit DACs and output amps. It needs to be precise and DC coupled and have instrumentation-type drive.

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com

- H
- Habib Bouaziz-Viallet
  
posted
9 years ago

Mon, May 12, 2014 5:52 PM

Hello,

Today I spent an hour reading the Xilinx Zynq SoC specs, and it's really a "beast killer"!

Just reading this

formatting link

The Linux kernel implements a DMA mechanism for this SoC, i.e. it gives the user an API for the PL330 DMA engine. I don't know if it helps your project.

BR, Habib.

- J
- John Larkin
  
posted
9 years ago

Mon, May 12, 2014 7:27 PM

The ZYNQ, and the corresponding Altera part, are fabulous. You get two ARMs and a lot of FPGA on the same chip, which means you don't waste 60+ balls and a lot of nanoseconds connecting a uP chip to an FPGA chip.

And the port dilemma is fixed. Parallel ports and memory busses are rare among serious processors, and the FPGA side of the SoC chips brings them back.

There was a rumor of an x86 SoC, an Intel core and an FPGA on one chip. Not that I'd want it.

ZYNQ is apparently optimized for DMA, from the FPGA fabric into CPU memory space. That would be fast, but also a lot of engineering. We might be pushed to do that some day, if a customer needs it. I'm thinking that, for now, a dual-core 600 MHz can move tens of Mbytes/second in software.

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com

- L
- Lasse Langwadt Christensen
  
posted
9 years ago

Mon, May 12, 2014 10:19 PM

On Monday, 12 May 2014 at 21:27:09 UTC+2, John Larkin wrote:

The tests I've seen say something like 27 to 90 Mbytes/sec from DDR to a PL block RAM using memcpy, depending on burst size.

For really high performance you'd want the PL to access the memory directly; using something like the DataMover or a DMA controller is probably easiest.

I am working on a Zynq design, and soon I'll have to figure out how to read and write DDR from the PL.

-Lasse

- J
- John Larkin
  
posted
9 years ago

Tue, May 13, 2014 12:32 AM

A real number! That's promising! Especially if the SD card read turns out to be DMA. We can shuffle a bunch of buffers and keep everybody busy.

If you have trouble getting the ARM to talk to registers in the FPGA, we've worked through that. A guy at Avnet helped a lot. We used a consultant for our microZED/ZYNQ FPGA design, and he's available, too.

--

John Larkin         Highland Technology, Inc 

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com

- L
- Lasse Langwadt Christensen
  
posted
9 years ago

Tue, May 13, 2014 12:42 AM

On Tuesday, 13 May 2014 at 02:32:06 UTC+2, John Larkin wrote:

With a bit of help from Google I got the ARM talking to the FPGA pretty quickly; getting all the tools up and running and building Linux took a bit more time. You really need a Linux box, and I hadn't used one in a while either.

I just got the license for the bus-functional-model, so now I'll have to figure out how to make the FPGA access DDR via the master ports

-Lasse

- J
- John Devereux
  
posted
9 years ago

Tue, May 13, 2014 6:30 AM

Hundreds of MB/sec I would say, unless limited by the source and destination of course. I would just write a simple loop and not worry about it until you get some actual timing measurements. Unless you have weird hardware you will already be aligning the I/O devices on 32-bit boundaries. If the SD card is being read as bytes you might want to ensure the data of interest ends up 32-bit aligned too, e.g. an array of uint32_t. Or even 16-byte aligned as someone suggested. There are compiler statements for that, e.g.

__attribute__((aligned(16))) uint32_t data[1024];

But objects are automatically aligned to their data sizes anyway.
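For what it's worth, the same alignment request can be written portably in C11 and checked at run time. A minimal sketch follows; the buffer name `dac_buf` and the helper `is_aligned` are made up for illustration, and a C11 compiler is assumed.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 spelling of the GCC attribute above: a 16-byte-aligned word buffer. */
static alignas(16) uint32_t dac_buf[1024];

/* Returns nonzero if pointer p sits on a multiple of `boundary` bytes. */
static int is_aligned(const void *p, uintptr_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}
```

Here `is_aligned(dac_buf, 16)` should hold because of the `alignas`, while a plain `uint32_t` array is only guaranteed the alignment of `uint32_t` itself.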

Really memory-to-memory and cpu core stuff is generally very fast, the bottleneck is usually the I/O device as data is translated onto peripheral buses and wait states inserted etc.

--

John Devereux

- M
- Martin Brown
  
posted
9 years ago

Tue, May 13, 2014 7:13 AM

[snip]

Actually, building a novel bridge design is in part an experiment, and sometimes even with top architects it still goes wrong. The London Millennium footbridge suffered a spectacular pedestrian footfall lockstep resonance causing the bridge to sway alarmingly.

formatting link

The reason it is no longer easy to predict is because there are so many go faster stripes in the *processor* that can interact. Cache sizes for the execution engine and avoiding pipeline stalls are critical.

Most of the clever optimising compilers do all this for you, but for a very simple load save loop hand optimised code can be quicker and certain magic lengths quicker still. But why don't you try a few of the more obvious loop strategies to see for yourself?
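Two of the obvious candidates might look like the sketch below: a plain word loop and the same loop unrolled by four. The function names are made up, and `fifo_port` stands in for whatever memory-mapped FIFO write address the FPGA design exposes (an assumption, not something specified in this thread). Which one wins has to be timed on the real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one 32-bit store to the (hypothetical) FIFO port per word. */
static void fifo_write_simple(volatile uint32_t *fifo_port,
                              const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        *fifo_port = src[i];
}

/* Same transfer unrolled by four, with a tail loop for the leftovers. */
static void fifo_write_unrolled4(volatile uint32_t *fifo_port,
                                 const uint32_t *src, size_t words)
{
    size_t i = 0;
    for (; i + 4 <= words; i += 4) {
        *fifo_port = src[i];
        *fifo_port = src[i + 1];
        *fifo_port = src[i + 2];
        *fifo_port = src[i + 3];
    }
    for (; i < words; i++)
        *fifo_port = src[i];
}
```

The `volatile` qualifier keeps the compiler from merging or dropping the stores, since every write to the port is meant to push one word into the FIFO.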

I think the right way to do this would be at device driver level and have the FIFO signal an interrupt when it has still enough bytes in the queue to ensure that no matter what else is going on the CPU will always have time to service it and put more data into the FIFO.

--
Regards, 
Martin Brown

- M
- Martin Brown
  
posted
9 years ago

Tue, May 13, 2014 7:26 AM

You futz with the numbers; that isn't the same thing.

OK. So once again the problem is with your coding monkeys. Pay peanuts?

Anyone with even half a clue can start from the output of a decent optimising compiler and substitute the various obvious loop strategies to see which is the optimum. Free hint: On ARM the branch predictor heuristic is that a backwards conditional branch is always taken.

It is fairly pointless trying different C coding since most compilers will generate pretty much the same optimised code when given a smallish functionally equivalent loop. Older compilers might not do.

That depends more on how carefully you design the FIFO. The obvious way to do it would be to have a FIFO length such that it can signal when it still has enough samples in so that the worst case realtime response of the OS cannot result in a data underflow to the DACs. Look at how they do it on DVD writers and try to avoid falling foul of the patents.
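As a rough sketch of that sizing rule: the "needs more data" threshold has to be at least the number of words the DACs drain during the worst-case response time. The function name and the latency figures below are hypothetical; the real worst-case latency would have to be measured.

```c
#include <stdint.h>

/* Minimum number of words that must still be in the FIFO when it raises
 * its "needs data" signal, so it cannot run dry before software responds.
 * worst_latency_us is an assumed, measured worst-case response time. */
static uint64_t fifo_min_watermark(uint32_t samples_per_sec,
                                   uint32_t channels,
                                   uint32_t worst_latency_us)
{
    uint64_t words_per_sec = (uint64_t)samples_per_sec * channels;
    /* Round up: a fraction of a word still needs a whole word in the FIFO. */
    return (words_per_sec * worst_latency_us + 999999u) / 1000000u;
}
```

For example, at 50 Ksamples/sec on 8 channels with an assumed 1 ms worst-case latency, `fifo_min_watermark(50000, 8, 1000)` comes to 400 words; the FIFO itself needs to be deeper than that by whatever is refilled per service.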

You should be doing the data transfer to the FIFO as a device driver so that the FIFO can interrupt with high priority when it needs more data. The data fill should be so fast that it can be safely done in an ISR.

How long is your FIFO?

--
Regards, 
Martin Brown

- H
- habib.bouaziz
  
posted
9 years ago

Tue, May 13, 2014 7:47 AM


Yes indeed! The Xilinx Zynq is a great piece of work, very clever for high-bandwidth designs (e.g. video over IP).

ARM's business model is not shared by Intel, and I doubt it will be in the future.

It's always a good strategy to work through the many hardware/software tricks and traps before promoting such a thing (the Zynq SoC) for client projects. You mention tens of Mbytes/s in software ... that may be sufficient for numerous applications.

BR, Habib.

- U
- upsidedown
  
posted
9 years ago

Tue, May 13, 2014 9:38 AM

Not to mention some novel airport terminal buildings (Paris?) which simply collapsed due to the wind load. Imagine a suitcase "forgotten" close to the critical structure going caboom :-)

Absolutely.

In the past, there were quite distinct processor families; these days you get quite different performance from the same processor with different pin counts, cache organization, etc.

The FPGA FIFO definitely needs a flow control mechanism to stop reading the flash while the DAC is not capable of consuming more data.
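One simple form of that flow control, assuming the FPGA exposes a fill-count register next to the FIFO data port (the register names, layout and `fifo_top_up` itself are made up for this sketch), is to write only as many words as there is free space:

```c
#include <stddef.h>
#include <stdint.h>

/* Push at most as many words as the FIFO has room for, and report how many
 * were written so the caller can hold back the rest of the file data.
 * data_port and fill_count are hypothetical memory-mapped registers. */
static size_t fifo_top_up(volatile uint32_t *data_port,
                          const volatile uint32_t *fill_count,
                          uint32_t depth,
                          const uint32_t *src, size_t words)
{
    uint32_t used = *fill_count;              /* read the fill level once */
    size_t room = (used < depth) ? (size_t)(depth - used) : 0;
    size_t n = (words < room) ? words : room;

    for (size_t i = 0; i < n; i++)
        *data_port = src[i];
    return n;
}
```

The caller advances its read pointer by the returned count and only fetches more from the card when the buffer runs low, so the flash read rate ends up paced by the DAC's consumption.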

- H
- haiticare2011
  
posted
9 years ago

Tue, May 13, 2014 1:49 PM

A non sequitur:

I've sometimes had the idea, from looking at your circuit board images, that you should blow them up and sell them to the tourists in Sausalito. :) They are very eye-catching. I can suggest titles like: "Which Way Out of Here?" "Maze of My Mind" "Organized Thinking" etc. You could even invent a therapy for space-cases, called "Total Linear Make-Over." Those of you who have lived in the Bay Area will appreciate the realism of my comments. Post that pic again sometime, will ya?

- D
- David Brown
  
posted
9 years ago

Tue, May 13, 2014 2:47 PM

My point is that if you could build and test a dozen PCB designs as quickly and easily as you can write and test a dozen software loops, you would not need to buy and use the rather expensive PCB simulation tools. You use the tools precisely because you cannot test and measure the real thing in a practical way.

With software, it is usually simple to try it and see. Of course, it is also possible to get simulators for processors - some are very complicated and accurate, just like with PCB simulators, and some are very expensive, just like with PCB simulators.

That's true for simpler microcontrollers, where you know exactly how long each instruction takes, and exactly how long it takes to access memory. As the devices get more complicated, and you get pipelining, stalls, memory wait states, caches, etc., the prediction process gets harder and harder until it becomes impractical.

You probably only have to test a few, then you'll quickly get an idea of where you are going.

- D
- David Brown
  
posted
9 years ago

Tue, May 13, 2014 2:49 PM

Start developing the product? This is software we are talking about - if the test works, ship it! You can always send an update by email after the system is deployed. :-)

- J
- John Larkin
  
posted
9 years ago

Tue, May 13, 2014 3:05 PM

Which pic?

Yeah, they don't call PCB designs "artwork" without reason. One of our hallways at work has big plots of PCB layouts on the wall, framed.

Schematics can be art too, but fewer people appreciate that.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation

- J
- John Larkin
  
posted
9 years ago

Tue, May 13, 2014 3:18 PM

The only tools that I use to sim circuits and PCBs are free, like LT Spice and Appcad. Maybe half of the boards that I design get no simulation at all. We do use a moderately expensive ($5K) schematic+layout package, but it does no simulation.

I could use old-fashioned math to do linear sims, and nomographs and such for trace impedances and things, but the computer stuff is faster.

There's not much point in simulating an ARM program when you can run it on an ARM. But there's very little way to predict software performance other than to run it yourself.

Like with Linux, if I run a tight loop to toggle a port bit, I can measure the rate with a scope, and more importantly, I can see when the OS and drivers steal the CPU from me, how often and for how long. That sort of thing is hard to find online, pretty much undocumented. There might be some infrequent garbage collection or some system thing that I miss that might really clobber my loop. And if I want to optimize that for a multicore machine, good luck. Programmers tend to not think in microseconds, and RTOS's have their own problems.
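A software-only stand-in for the scope measurement (a sketch, not what was actually done here) is to spin in a tight loop reading a monotonic clock and record the largest gap between consecutive readings; long gaps show where the OS took the CPU away. It assumes a POSIX system with `clock_gettime`, and the function names are made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from start to end. */
static int64_t timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Spin for `iterations` clock reads and return the worst gap seen between
 * two consecutive reads; large values indicate the loop was preempted. */
static int64_t worst_gap_ns(unsigned long iterations)
{
    struct timespec prev, now;
    int64_t worst = 0;

    clock_gettime(CLOCK_MONOTONIC, &prev);
    for (unsigned long i = 0; i < iterations; i++) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        int64_t gap = timespec_diff_ns(prev, now);
        if (gap > worst)
            worst = gap;
        prev = now;
    }
    return worst;
}
```

Run for long enough (minutes, not milliseconds), the worst gap gives a first estimate of how much FIFO depth is needed to ride through the interruptions; the scope remains the better check because it also sees stalls in the clock read itself.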

I do need to worry about the code suspension thing, so we can keep the FIFOs full. Ideally, we'll run the critical loop exclusively on one ARM core, if we can figure that out.

I'll post some numbers if we discover anything interesting.

--

John Larkin                  Highland Technology Inc 
www.highlandtechnology.com   jlarkin at highlandtechnology dot com    

Precision electronic instrumentation