Task level processing vs. Interrupt level processing

Hi,

I have just started working on low-level programming in embedded systems. I understand some basic things and am trying to familiarize myself with some low-level programming aspects. Below are some of the questions I have:

How expensive are task switches on a 400 MHz processor (an MPC82xx-based processor)?

How much time does this processor take to run one assembly instruction?

Can I implement polling with a period measured in microseconds? If I implement the polling in task- (or thread-) level code, what are the common problems I would face? If not at the task level, is there anything I can do in the hardware configuration to request a timed interrupt?

How do I configure interrupts so that an FPGA can raise one when there is some data for the software to read?

Thanks, Eswar.

Reply to
NewToFPGA

As expensive as your RTOS makes them. This should be part of the RTOS documentation.

That depends on your processor, and should be in the processor documentation, or is at least something that you can benchmark. Generally the execution time will vary with instruction, and for processors with pipelines the execution time will depend on the instructions that precede and follow the instruction in question, which makes it very difficult to predict how long it will take to execute.

That depends on your environment. If you have a 400MHz processor, probably -- but if you're polling once every 1us you'll find that you'll use a lot of clock ticks just for the polling.

That depends on your processor, and should be in its documentation. Does it have hardware timers? Can the timers throw interrupts?

You read the processor documentation, and maybe some applications notes, and you figure it out.

I'm not trying to be snide here -- every processor has the World's Most Clever way of turning on interrupts, and every processor designer thinks that all the rest are idiots -- so techniques vary.

Usually you have to set (or clear) a global interrupt mask, and set (or clear) an interrupt mask for the specific interrupt you want to enable. You'll also have to tell the processor where the ISR is, unless your processor vectors to fixed locations. On many microcontrollers, each pin can do approximately one bazillion different things, so you also have to configure the pin correctly as an interrupt input.

Finally, you have to spend a week or two struggling with the one important part that got left out of the manual, or is in the manual for some seemingly unrelated part of the processor. Usually this involves flipping the default value of one frigging little bit in an obscure register someplace, but sometimes it requires completely rewriting all your code.

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

It probably varies a lot between operating systems. For DPS on an 824x this would be somewhere between 1 and 5 µs, depending on whether the task which exits uses the FPU (so all 32 64-bit registers are saved), whether the task which is given control uses the FPU (so all 32 FPU regs have to be restored), and some other, less influential factors. Another factor would be memory speed - whether the 824x has a 64- or a 32-bit data path; I have only had a 64-bit memory path system here. IRQ latency is a whole lot better, of course - the IRQ stays masked for just a few cycles while putting the CPU in a recoverable state.

While being really low-latency and tiny-footprint, DPS is not what you would typically associate with an RTOS; it is a full-blown OS by any standards.

Notice that the above is written by the author and owner of DPS.

Dimiter

------------------------------------------------------
Dimiter Popoff
Transgalactic Instruments
------------------------------------------------------

Reply to
Didi

This also depends on the particular hardware and the execution instant. Whether the memory for the context storage is cached or not, whether a page fault happens or not, whether the cache or SDRAM bank hits or misses - all of that creates a lot of variation.

Vladimir Vassilevsky DSP and Mixed Signal Consultant


Reply to
Vladimir Vassilevsky

For our proprietary RTOS and a Blackfin CPU, the interrupt latency is under 200 ns. The task switching time depends on many factors (number of tasks, semaphores, messages, etc.) and is generally on the order of microseconds. It could be done better than that; however, the goal was portability and convenience rather than performance.

I am curious to know what "a tiny footprint, full-blown OS by any standards" means. Our RTOS takes about 20K for the core only; any practical configuration is likely to take over 40K. Still, this is a small RTOS, composed as a library with support for a very minimal set of basic services.


Reply to
Vladimir Vassilevsky

Things are in the same ballpark range for the PPC running DPS, obviously depending on clock frequency and perhaps memory speed. At 400 MHz it should be perhaps half that or so.

It means an OS with multiple tasks, multiple windows, many hundreds of system calls to utilize these, a filesystem of course, a TCP/IP stack, a (pretty unique) inherent mechanism for object maintenance, and probably many things I cannot think of now. All the above takes less than 1M on a PPC; think 1/3 of that on a CPU32 (I stopped developing the CPU32 version a while ago, though). It has more than enough, so if one needs to write an application one does not have to do much, if anything, but the application. The minimum you can boot with - while having the scheduler and filesystem and about half of all calls - is something like 100K on the PPC, and < 30K on a CPU32.

A more or less representative view - running some applications on top of DPS, perhaps another few hundred K - is at [link]. It is an old screenshot (>5 years), but it will give you an idea.

I hope this year I will get around to making a less platform-dependent package available. I would be doing this at a much higher priority if there were any PPC-based documented hardware in the PS3/XBOX price range, which is not the case.

Dimiter



Reply to
Didi

If I don't look at it from a performance point of view, what is the maximum number of ticks that I can have on a 400 MHz processor? Is it 400,000,000 ticks per second (or 2.5 nanoseconds per tick)?

If I have a periodic task which wakes up every 25 microseconds, what is the overhead of the timer itself? How do we find that out?

Any general good reference book or online documentation that talks about processors in general?

Reply to
NewToFPGA

This is explained in some detail in the 603e core databook, which you will find on the Freescale website in the 824x section (or the one for the G2 core, which is pretty much the same with some enhancements on some implementations). They specify "up to 3 instructions per clock cycle", which means the core can, in one cycle, do an integer instruction and an FPU instruction and fold a prefetched branch. Obviously the branch cannot be fetched in the same cycle, since the data path to the cache is only 64 bits. Thinking 1 clock per instruction is pretty safe as long as you run off cache; you need to calculate external delays separately yourself, depending on your hard- and software.

Dimiter


Reply to
Didi

That's very impressive. Although you have an interesting notion of tininess :)

What is your paradigm for the following problem: passing an object from one task to another?

Let's say the first task is preparing a block of data. The second task is sleeping. When the block is ready, it has to be passed to the second task, and that task has to be awakened. Who owns the memory occupied by the data block? If the memory is dynamic, who allocates and releases it? If the memory is static, how does the first task know when the second task does not need the object any more? Do you support the object transfer mechanism at OS level, or is it left to the application?


Reply to
Vladimir Vassilevsky

Thanks for directing me to this manual. There is a lot of info in this. I am going to read it in the next couple of days...

How many instructions are there in C code like "int i = 100; int j = i;"? Again, any reference to look at these details will also be appreciated.

Reply to
NewToFPGA

In DPS memory is allocated dynamically. At the lowest level, a task can either allocate pieces in a registered manner (so if the task gets killed "by force" the pieces will be deallocated) or in a non-registered way, where the allocated piece will stay allocated. Then tasks have the option to put in their history record (the same record which contains the registrations for allocated pieces) one or more addresses in their program section which will be called upon kill by the system, along with some parameters passed via that same record. And then one has the option to allocate a registered piece of memory to a third party, i.e. task A allocates it but it gets registered on behalf of task B.

There is a variety of intertask communication facilities, starting with the common data section groups of tasks share, through the inter-task signalling mechanism, to the (highest level) object-specific ways. The latter also offer higher-level facilities for memory allocate/deallocate which turned out to be very convenient.

Oh well, I guess my notion of tiny can only get more interesting if I go on :-). But I meant "tiny" in an apples-to-apples way of comparison; say, a running OS with a filesystem and about 300 calls in 100K of PPC program code is tiny... Now if you turn the VM on, with page tables and all - which I normally do - things get a lot less tiny, and if you add the other 300+ calls for the graphics, window maintenance, etc., it can still be called tiny if compared apples-to-apples.

Dimiter


Reply to
Didi

You should be able to talk the compiler into generating an assembly output list and look at it. Different compilers would likely produce different sequences.

But for understanding things at the level you want to, C (or Basic or Pascal or whatever HLL) is not the right place to look. You need to understand how things work in machine code; then you can choose a higher-level language in order to hide the machine level from you. Right now you are trying to understand the machine level, though, and hiding it from yourself does not seem like a good idea :-). You could read for a while the PPC programming environments book (or some such title); you can locate it on the Freescale site as well. It is bulky, but pretty straightforward and easy to understand - it should make useful reading, I suppose.

Dimiter


Reply to
Didi

As others have pointed out, the execution time of a single instruction is not constant, but depends on the context (pipeline state, cache state, maybe other processor state). So you have to consider entire code snippets, for example functions or even whole threads.

The aiT static execution-time analysis tool from AbsInt can compute bounds on the worst-case execution time for PPC code, for some PPC models (exactly which models are supported I don't know). It takes into account pipeline and cache effects using a very detailed hardware model. It covers all execution paths by static analysis and abstract interpretation. But it's not cheap.

HTH

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
       .      @       .
Reply to
Niklas Holsti

Actually, for the query case, almost all machines will produce at most:

mvi  regno, 100     ; Move the immediate value to reg no
stoi regno, baserg  ; Store that via the address in baserg
inc  baserg, sz     ; By sz, i.e. the size of an int

and something else has set up baserg, etc. The "int j = i" will be almost the same, except that it will start by replacing "mvi regno, 100" with:

movm regno, value   ; Load content of mem loc'n 'value' into regno

and the details of how those assembly instructions are constructed, manipulated, etc. will vary from machine to machine. But the idea is quite consistent.

Assembly language is different from a higher-level language in that the instructions perform known actions, and the assembly language writer has to combine those actions to get the desired effect. In the higher-level language, he just writes the effect, and other software (the compiler, usually) selects the assembly sequence.

--
 [mail]: Chuck F (cbfalconer at maineline dot net) 
Reply to
CBFalconer
