Hidden latencies and delays for a running program?

Hi all,

I've been a SW developer, but one question I've never addressed is: What OS latencies and CPU delays are there in a compiled, running program? Is there any simple way to minimize them?

I am thinking of a simple C program that reads data off a PCI card and then writes it to storage such as a PCIe SSD drive. I understand there will be various hardware latencies and delays in the data input.

But what happens while the assembled program is executing? Does the OS "butt in" and context-switch / multi-task during execution of a continuous compiled program? If so, how does one shut that off?

I've read about this somewhere, but never paid attention to it.

Thanks in advance, jb

Reply to
haiticare2011

Lots. At the CPU level alone: variable instruction timing, cache misses, pipeline stalls, etc. At the OS level: swapping and page faults, contention for machine resources by other tasks, etc.

If you have absolute deadlines ("hard real time") then it's complicated and there are books written about it.

Some OS's offer real-time scheduling, which basically means you can give an absolute priority to your real-time task, so no other tasks can run until the priority task has released the CPU.
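
On a POSIX system that might look something like the sketch below -- minimal, and assuming SCHED_FIFO is available and your process has permission to use it (the priority value is arbitrary):

#include <sched.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 80 };  /* arbitrary "high" priority */

    /* Lock pages in RAM so page faults don't add latency later. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Ask for a fixed-priority, run-until-you-block policy. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");

    /* ... time-critical work goes here: lower-priority tasks won't
       run on this CPU until we block or yield ... */
    return 0;
}

Note that a runaway SCHED_FIFO task can starve everything else on that CPU, which is exactly the "unresponsive system" risk mentioned elsewhere in this thread.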

Reply to
Paul Rubin

Yes, it does, and you should not attempt to prevent it, as you may make the whole system totally unresponsive.

There is little difference between a compiled C program and an assembly program performing the same algorithm.

The write to the SSD drive is far from simple, if you have a file system on the card. Also, the SSD may have an internal controller which needs time slots for its own purposes. Examples are SD (camera) cards and USB sticks.

--

Tauno Voipio
Reply to
Tauno Voipio

That, of course, depends on the choice of processor ("CPU delays") and the choice/characteristics of the OS you are using (if any).

CPU's often include instruction pipelines, I/D caches, and (instruction) scheduling algorithms that can cause what you *think* is happening (i.e., by examining the assembly language code that is actually executing) to differ from what is *actually* happening (i.e., by examining the CPU's *state*, dynamically).

Add a second (or fourth) core and things get even messier!

OS's range from *nothing* (e.g., running your code in a big loop) to those with virtual memory subsystems, dynamic scheduling algorithms, preemption, resource reservations, deadline handlers, etc.

Of course, if it's *your* hardware (and OS choice), you can opt to bypass all of those mechanisms by *carefully* designing your "system" to run at the highest hardware priority available. In essence, claiming the CPU for your exclusive use.

Again, that depends on the choice of processor and the actual code that gets executed (recall, what you *write* can be rewritten by an aggressive compiler so you need to look at what the actual instruction stream will be). You can, of course, mix and match your tools to the tasks best suited. E.g., if there are timing constraints and relationships that must be observed in accessing the PCI card, code that in ASM. If the OS already knows how to *talk* to the SSD (assuming you are using a supported file system and not just writing to the raw device), then just pass the results of the ASM routine to a higher level routine that allows the OS to do the actual write.

Of course, you have to be sure your *average* throughput meets the needs of the data source. Often, that means an elastic store, somewhere, so your ASM routine can *always* be invoked to get the next batch of data even if the OS hasn't caught up with the *last* batch of data. Make this store easily resizable and then measure to see just how much gets consumed (max) in your worst case scenario.
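
A rough sketch of such an elastic store follows (sizes, names and the single-byte interface are purely illustrative; a real one would move blocks and would need protection against concurrent access):

#include <stddef.h>
#include <stdint.h>

#define BUF_SLOTS 4096               /* make this easy to change and re-measure */

static uint8_t buf[BUF_SLOTS];
static size_t  head, tail;           /* producer writes at head, consumer reads at tail */
static size_t  fill, high_water;     /* current depth and worst case observed */

int buf_put(uint8_t b)               /* called from the fast acquisition path */
{
    if (fill == BUF_SLOTS)
        return -1;                   /* overflow: data would be lost -- handle it! */
    buf[head] = b;
    head = (head + 1) % BUF_SLOTS;
    if (++fill > high_water)
        high_water = fill;           /* record worst-case usage for sizing */
    return 0;
}

int buf_get(uint8_t *b)              /* called from the slower consumer (e.g., the SSD writer) */
{
    if (fill == 0)
        return -1;                   /* empty */
    *b = buf[tail];
    tail = (tail + 1) % BUF_SLOTS;
    fill--;
    return 0;
}

Run your worst-case scenario, then inspect high_water to see how deep the store actually got.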

[Hint, if you are using a COTS OS, you probably will never be able to get *published* data to allow you to make these computations a priori. And, if the OS will support a variety of unconstrained *other* applications, all bets are off -- unless you can constrain them to suit your requirements!]

Again, depends on the OS and how you've installed your "program". E.g., if you have ensured that your code always runs at highest privilege, then the OS waits for *you* (which could bodge other applications that are expecting the OS to "be fair").

If, OTOH, you are just a userland application, then your code could "pause" for INDEFINITE periods of time: milliseconds to *days* (exaggeration).

All the "writing in ASM" buys you is the ability to see what the sequence of opcodes available to the CPU will be. Writing in a HLL hides that detail from you (though you can often tell your compiler to show it to you) *and* limits your ability to make arbitrary changes to that sequence (because the compiler has liberties to alter what you've told it -- in "compatible ways").

Much effort goes into system designs to *free* people from having to think about these sorts of details. But, when you are dealing with hardware, there are often other constraints that force you to work around/through those abstractions.

Typically (i.e., even in a custom OS/MTOS/RTOS) a high(er) priority task deals with events that have timeliness constraints. E.g., fetching packets off a network interface (if you "miss" one, it either is lost forever *or* you have to request/wait for its retransmission -- a loss of efficiency... especially if you are likely to miss *that* one, too!).

The data acquired (or *delivered* -- when pumping a data sink), is then buffered and a lower priority (though this might still be a relatively high priority... based on the overall needs of the system) task removes data from that buffer and "consumes" it.

Note that this *adds* latency to the overall task. And, allows that latency to exhibit a greater degree of variability (based on how much of the elastic store gets consumed -- or not -- over the course of execution). So, if you expect a close temporal relationship between "input" and "output", you have to address this with other mechanisms (e.g., if you wanted something to happen AS SOON AS -- or, some predictable, constant time thereafter -- an input event was detected, the variability in this approach is directly reflected in that "output")

Of course, if it can't be consumed as fast as it is sourced, then your system is too slow for the task you've set for it!

"Why not just do the output in the same high priority task as the input?"

What if the SSD (in your case) is not *ready* for more input at the *moment* your new input comes along? Perhaps the SSD is doing internal housekeeping? Do you twiddle your thumbs in that HIGH PRIORITY task *waiting* for it to be ready? How long can you twiddle before your *next* input comes along AND GETS *MISSED*?

OS's (particularly full-fledged RTOS's) can provide varying degrees of support to remove some of the details of this task management. E.g., it may provide support for shared circular buffers. Or, allow buffers to be dynamically memory-mapped to recipient tasks (to eliminate bcopy()'s). Signaling between the producer and consumer can be *part* of the OS (instead of forcing you to spin-wait on a flag). Deadline handlers can be created (by you) that the OS can then invoke *if* the associated task fails to meet its agreed upon deadline (e.g., what happens if you *can't* get back to look at the PCI card before the next data arrives? or, if you can't pull the data out of the buffer before the buffer *fills*/overflows? Do you *break*? Or, do you gracefully recover?)
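
E.g., the "signal instead of spin-wait" idea, sketched here with POSIX semaphores (an RTOS would have its own primitives -- message queues, event flags, etc. -- so treat this only as an illustration):

#include <semaphore.h>

static sem_t data_ready;          /* initialized once with sem_init(&data_ready, 0, 0) */

/* Producer side -- e.g., the high-priority acquisition task: */
void producer_publish(void)
{
    /* ... place data into the shared buffer ... */
    sem_post(&data_ready);        /* wake the consumer; no busy-waiting */
}

/* Consumer side -- e.g., the (lower-priority) SSD-writing task: */
void consumer_loop(void)
{
    for (;;) {
        sem_wait(&data_ready);    /* sleep until the producer signals */
        /* ... remove data from the shared buffer and write it out ... */
    }
}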

Best piece of advice: figure out how *not* to have timing constraints on your task. And, if unavoidable, figure out how best to handle their violation: "hard" constraints can be handled easiest -- you simply stop working on them once you're "late"! ("Sorry, the ship has already sailed!"). "Soft" requires far more thought and effort -- it assumes there is still *value* to achieving the goal -- albeit *late*. ("But, if you charter a speedboat, you could probably catch up to that ship and arrange to board her AT SEA -- or in the next port. Yeah, that's a more expensive proposition but that's what happens when you miss your deadline!").

Any more *specific* answer requires far more specifics about your execution environment (processor, hardware involved, choice of OS, etc.)

HTH,

--don

Reply to
Don Y

Oh, I remember now, you had the other post about some kind of data logging application. As others said, it sounds like you don't really have a strict latency bound as long as you don't lose data: given enough RAM to buffer stuff while I/O is blocked, the probability of loss is low enough that the failure possibilities are dominated by the reliability of the hardware.

Anyway my guess is that the main source of delays may be the SSD itself. Those have unpredictable delays as they sometimes have to reorganize the data internally, which on some units can take a VERY long time on rare occasions. If you use an "enterprise" SSD, the vendors try harder to control those delays, including by overprovisioning the device so that the reorganization can happen using the extra capacity in the background. For that reason the enterprise SSD's cost more.

Reply to
Paul Rubin

I worked on a real-time PC in which we had installed a board. It ran NT with a real-time extension. The first pass of my board had a bug which hung the bus transfer, and the *entire* machine hung. Wow! The only way out was a hardware reset.

JB seems to have a lot to learn about real time systems. The part I don't quite get is why the PC side has to be real time. If he uses a separate MCU board to capture the ADC data (the important real time part of the problem) it can then send the data to a PC, not in "real time", just with a throughput that exceeds the data rate. Adequate buffering on the MCU card will assure no loss of data. Then the PC can store the data on any media it wishes. Sounds simple enough to me but I don't get why he continues to flog this horse.

--

Rick
Reply to
rickman

Maybe the PHB has ordered him to make the PC a real-time capturing system. Anyway, he'll have a stiff climb up the learning steps.

--

-TV
Reply to
Tauno Voipio

PHB? Do you mean powers that be? He has been asking about embedded, but seems to think he has to put the entire system on the embedded device. I don't want to give the guy grief, but it sounds like he is not familiar enough with embedded design to even know if his task can use it effectively or not. He seems to reject a lot of suggestions before he understands them. I'm also very unclear on what data rate he really needs from the front end to the storage.

--

Rick
Reply to
rickman

Sorry - Pointy-Haired Boss, from Dilbert.

--

-TV
Reply to
Tauno Voipio

The OP is very unclear about the data rate he needs (he alternates over several orders of magnitude), and has no idea at all about the sample size. The worrying thing is that he does not seem to consider this a problem, and does not realise that this project needs a lot of thought and planning, then a lot of research and prototyping, before he can start looking at implementation and development.

He also has virtually no idea about the technologies for implementing the system. He has some fixed pre-conceived ideas that he won't change no matter what people tell him - he believes USB latency will cause trouble, he believes SSD is the greatest invention since sliced bread, he believes assembly programming will be more "real time" than C programming.

The guy may be a good SW developer for all I know, but he is clearly far out of his depth with this project. I don't know if this is his own fault, or that of a PHB, but he desperately needs help here (of a kind that we cannot give him) before he wastes lots of time and money.

Reply to
David Brown

From this and your other posts I think you are trying to make a data acquisition system which will store up to 10Mbyte/s on a PC hard drive. You've got three ways (at least) to get the data into the PC: USB, Ethernet and PCI. USB and Ethernet are relatively easy and work with any kind of PC and won't need fancy driver level code - so will probably work with any OS. Ethernet is the most simple from the PC software point of view.

10Mbyte/s is wire speed maxed out for 100Mb Ethernet, so you'll struggle if you try to use a typical micro's on-chip MAC. You can get ARM based micros with high speed USB. If I were doing this (and I have, many times) I'd use an FPGA to control the ADC, buffer the data and drive Ethernet via an off-chip Gigabit PHY. You will need to buffer the data from the ADC, and unless you are very clever with the host computer you'll need a decent sized buffer for the data. How big depends on so many variables that it's very risky to guess - you'll need to check, but I would start with enough to store 500 ms worth of data (5M bytes in your case, so use a 32Mbyte or so SDRAM).

In order to control the Ethernet interface you'll need to be quite confident with VHDL or Verilog or use a soft micro on the FPGA and get into a different kind of mess.

If all your experience is with software you might do better with a micro with built-in high-speed USB, but you'll need one which supports external SDRAM at the same time, and your data throughput will be challenging.

PCI has all the problems of USB and Ethernet interfaces and a lot of additional ones as well - don't go that way unless there is a really good reason for it.

Unless you need a lot of these I suggest you just buy something, and of course if you want a good design done you could always email me :-)

Michael Kellett

Reply to
MK

Rick, if this is as trivial as you say, then there would be more examples of how to do it that work. But there aren't. There is little consensus on how to achieve good data throughput. Solutions range all over the place, and few work. For example, there is "StarterWare," a low-overhead OS for ARM from TI. But if you read the forums, much of the documentation is incorrect and unworkable.

Now, you recommend an "MCU board." Now we're getting somewhere. Do you have any actual examples of this working? Which MCU? How was the bus to the PC configured? Since you say I have a lot to learn, teach me with your concrete system example.

JB

Reply to
haiticare2011

Thanks for the compliments. :) I'm convinced the rank-and-file developers out there don't have their ducks in a row on this one, either. Judging by the BBB developers' attempts, it's still the Wild West. :)

Reply to
haiticare2011

If that is all you need, what do you need an OS for ?

Just use an ISR (Interrupt Service Routine) for reading your input card (such as an ADC) and another ISR for writing the data to the SSD drive (write-complete interrupt).

The main program then consists of initializing those two interrupt service routines, followed by an eternal loop whose body is a (low-power) wait-for-interrupt instruction.
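
In outline, something like the following (the handler names, the register access and the wait-for-interrupt routine are placeholders; the real identifiers depend on the MCU and toolchain):

#include <stdint.h>

extern void wait_for_interrupt(void);   /* placeholder for the CPU's WFI/HLT instruction */

static volatile uint16_t sample;        /* latest reading from the input card */
static volatile int      sample_ready;

void input_isr(void)                    /* fires when the input card (ADC) has data */
{
    sample = 0;                         /* placeholder: read the data register here */
    sample_ready = 1;
}

void write_complete_isr(void)           /* fires when the previous SSD write finishes */
{
    if (sample_ready) {
        /* start the next write from the buffered data */
        sample_ready = 0;
    }
}

int main(void)
{
    /* install and enable the two interrupt handlers (hardware-specific) */
    for (;;)
        wait_for_interrupt();           /* low-power sleep until the next interrupt */
}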

Reply to
upsidedown

No, I have not built your system for you already. In the other thread I have given you lots of material for you to work with. On the other hand you have not given us a set of requirements to work from. When I get the requirements I will consider if I want to take on the job. :)

--

Rick
Reply to
rickman

Um, are you requiring an "example" to be of the form "Application Note 1234: Using the C Language Under <your OS> to Copy Data from a PCI Card in a PC(?) to an SSD in the Same PC without any Constraints on Timeliness using Free Tools"? If that's the case, I can save you a lot of time...

Which *specifically* "don't work"? And, do they not work because of omissions on YOUR part? If not, please identify *why* they "don't work". E.g., the solution I provided *does* work as I have used it on dozens of projects. If you can't see how use on a non-PC applies, then I can cite my 9-track tape driver that runs on a PC... not a PCI card (ISA) and not an SSD (IDE) but if you can't work "in the abstract", you'll never work in the *specific*!

Then "Starter Ware" requires more of you than you are able to provide. Fine. Pick something else.

You probably can't use Limbo, either -- due to your unspecified timing constraints, hardware interface, file formats, filesystem choice, etc.

Jaluna would intimidate you with its build environment.

RTEMS might not provide the (unspecified) user interface you need.

QNX costs money.

You probably can't write on bare iron...

etc.

Hey, maybe the Linux folks can entertain your queries! I'm sure there's a newsgroup/forum for that!

In all seriousness, until *you* know (meaning "can put in unambiguous quantifiable terms") what your complete set of criteria are, you're just going to be squeezing balloons -- always chasing, never achieving.

Good luck!

Reply to
Don Y

Actually, the failure of the ARM community to achieve any serious I/O is embarrassingly apparent and does not require any bureaucratic structure to see it. The GPIO data rate was coaxed into the MHz range, but with great difficulty. It is natively in the low kHz range. General material is offered, which evaporates under scrutiny...

Reply to
haiticare2011

800/1600 BPI or one of those new-fangled 6250 BPI tape drives ? :-)

(PS: note above smiley)

StarterWare is not an OS; it's a support library for bare-metal programming.

Are you confusing the GPIO rates achievable under Linux with those achievable in a bare-metal environment?

IIRC, those enhanced Linux speeds involve writing the GPIO lines directly via memory mapped I/O rather than through a driver call for each I/O manipulation.

If it's the memory mapped option under Linux you are talking about, then I don't see that as "difficult".

I would like to say that while I am a programmer as part of my day job, my embedded work is purely a hobby. However, the questions others here have asked you are among the questions I would already have asked myself before posting here.

Doing embedded work requires a certain mindset and the ability to pull together data from various sources. The questions you have been asked are good questions and are designed to make you think about the problem and what hardware/timing constraints are required to solve the problem.

Doing research (and knowing how to do that research) is a part of any serious embedded project and it's not something you can avoid.

Simon.

PS: As a good natured comment, I wonder if I should start applying for embedded jobs. :-) Sometimes, I think that as a hobbyist I seem to know more about this world than those paid to do it for a living. :-)

PPS: The above PS doesn't apply to the OP. I get the feeling he's being forced into something by his boss that he's not really comfortable doing and has not been trained on. I hope he's begun to understand more about the issues involved as a result of the various feedback here and can educate his boss about the issues involved.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley

My transport is 800/1600/3200 (oddball). The I/F card for which I wrote the driver is little more than a few latches and level translators; so, it's effectively "bit-banging" the interface (and, given it's ISA, the bus speed/cycle time makes it a real "challenge" to keep the interface satiated).

["data" transfers can benefit from DMA but most other controller actions -- tape positioning, etc. -- need a lot of hand-holding]

Then, presumably, it is pretty "thin"? Should be relatively easy to see what it *is* doing and figure out what it *should* be doing?

What's the "ARM community". SA's could *easily* toggle I/O's at MHz rates. The FIRQ would even allow you to do it *outside* a "tight loop" (e.g., pseudo DMA -- but *without* DMA hardware!)

You're looking at something else. E.g., you can run a PC's *parallel* port (which has the ISA bus between it and the CPU -- low bandwidth) for PLIP and achieve data rates in excess of 75KB/s (which means you're toggling pins at ~200KHz)

Agreed.

And, are they from user-land *through* an intermediary?

This is actually true of *any* engineering endeavor.

When I first looked at the NRL ruleset for text-to-phoneme conversion, I was tickled to find several "free" implementations of the original algorithm. This wasn't surprising -- the algorithm was well documented and the rules published.

What *was* surprising was that virtually every (C) implementation was technically flawed! Their authors had failed to understand how SNOBOL -- the language in which the original implementation was crafted -- applied operators. Instead, they adopted more "modern" rules and, silently, altered the algorithm's performance.

They *thought* they knew what the algorithm was doing without actually *understanding* the published description. And, given the complexity of the ruleset -- and a lack of a set of test cases -- I suspect if they got *anything* that "sounded" like natural speech out of the algorithm, they ASSUMED it was working!

And, don't get me started on the flaws in the available Klatt synthesizer implementations!

Do the research, *understand* what it means, *then* tackle the problem at hand!

IME, folks who do embedded work are either hardware guys who started writing code to prove their hardware works -- and then got "drafted" into *doing* the code (I know of a Fortune 500 company that had a *technician* writing the code for a large embedded project "because he tinkered with software at home"; the PHB was a self-confident BASIC programmer so he was *sure* he understood these issues... ) but, without a formal software education, don't really understand how to *design* the software (---> buggy code);

Or, they are software folks who know squat about hardware and, as a result, ill-equipped to understand what *can* (and does) go wrong and, therefore, write buggy code.

I'm not sure he "gets it". E.g., even a naive exposure to a CPU/MCU datasheet should make it *painfully* clear that "kilohertz" toggle rates suggests "something else" is going on ("Gee, what?").

It seems like he is expecting the equivalent of "finding a qsort() algorithm, published" -- that, instead, addresses *his* particular problem. And, seems unable/unwilling to see that moving bytes off a magnetic tape head and into memory (which can, obviously, then be moved onto disk -- by just specifying the disk device as the target) is "the same problem" he is facing.

Or, pulling bytes in/out of a UART, NIC, etc.

There really are very *few* "problems"... just lots of applications that map *onto* that problem set! ;-) (i.e., it is the apps that makes engineering interesting -- not the *problems*!)

Off for my pro bono work...

Reply to
Don Y

With the TI datasheet in your hand, it's very easy to see what is going on.

This is the same example code we were talking about recently which TI had placed under export control and which I later found on GitHub (_after_ finding out the MMU answers the hard way. :-))

I'm not 100% sure because I don't use Linux to directly manipulate GPIO lines; if using Linux, I tend to use a dedicated frontend MCU to get the realtime guarantees.

However, AIUI under Linux you use mmap to map in the GPIO registers and then manipulate them directly.
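
Roughly like the sketch below -- the physical base address and register offsets are placeholders, so check them against your SoC's datasheet before believing a word of it:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO_BASE  0x4804C000u      /* placeholder: physical address of a GPIO bank */
#define GPIO_SET   (0x194 / 4)      /* placeholder: "set output" register offset, in words */
#define GPIO_CLEAR (0x190 / 4)      /* placeholder: "clear output" register offset, in words */
#define PIN_MASK   (1u << 21)       /* placeholder: the pin to toggle */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    volatile uint32_t *gpio = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, GPIO_BASE);
    if (gpio == MAP_FAILED)
        return 1;

    for (;;) {                      /* toggle as fast as user space allows */
        gpio[GPIO_SET]   = PIN_MASK;
        gpio[GPIO_CLEAR] = PIN_MASK;
    }
}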

Major oops here. That _should_ say "...more about this world than *some* *of* those paid to do it for a living." I'm NOT trying to claim I know more about this stuff than the professional c.a.e regulars around here. :-)

I came to the embedded world as a software person, but I also design and build my own circuits (although they are veroboard based :-)) so I have developed some understanding of the hardware side of things.

I'm much stronger on the digital side of things than the analogue/analog side of things however.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world
Reply to
Simon Clubley
