C++ threads versus PThreads for embedded Linux on ARM micro

We're starting an embedded Linux C++ project with an ARM micro and using GCC V7. Can anyone suggest pros and cons of using C++ threads versus pthreads (POSIX threads)?

Reply to
graeme.prentice

C++ threads are always a wrapper around an underlying library. So if you are using C++ on Linux, the C++ threads /are/ pthreads. These are the points I can think of for preferring C++ threads:

  • You have a nice class/template library with C++ threads, instead of a C function interface.
  • You have RAII classes for locks and other synchronisation objects.

  • You have consistency with other C++ thread systems.

  • Your compiler may understand that your code is threaded.

- You need at least C++11 (but that has huge advantages anyway, compared to older C++).

- It is marginally more fiddly if you need the underlying thread details for features not supported by the C++ thread library.

Reply to
David Brown

That's great, thanks.

Reply to
graeme.prentice

Also, ask yourself if you really need threads in the first place. Depending on what you're doing, you may be better off with multiple processes. That gets rid of a lot of lock and race hazards, and if the processes can communicate through sockets, that improves scalability by making it easier for you to distribute your program across multiple machines if you run out of cpu cores on your original machine.

Reply to
Paul Rubin

Thanks for the suggestion. The micro is an ARM9 LPC3250 SOM (we're forced to use this at the moment) which I believe is single-core (it's hard to find out for some reason), but it could easily change in future. Based on a previous project, race conditions and deadlocks are a major headache, so I'm hoping the core data will be written to by one thread only, maybe with lock-free queues. The CPU data cache is 32KB and it's probably "write through". We would have to do some performance tests to see whether multiple processes and sockets are viable.

Reply to
graeme.prentice

There are certainly tasks that are better handled as multiple processes rather than multiple threads. (But note that it is not an either/or choice - often the best solution uses both.)

No, it does not - it merely changes them. If your separate threads of execution need to synchronise, communicate, or agree about shared resources, then there is no theoretical difference in the types of hazards, races, or other such problems whether you use multiple threads or multiple processes. The details change, and the types of synchronisation objects used can change, but they do not go away. Some may be handled by the OS rather than the application, however - for example, a pipe between processes will let you communicate without worrying about locks for the underlying shared data structure, at the cost of being a lot less efficient than shared memory between threads.

Multiple processes have higher resource costs, and they make it a lot harder to use tools such as "-fsanitize=thread" to find problems. On the other hand, they make it easier to break the problem down into separate tasks that are handled independently and tested independently. That helps if you have different developers - or even different programming languages.

True.

This can also be useful during development when you might have some of the bits running on your target system, and other bits running on your host computer (perhaps under a debugger).

That doesn't matter for the choice of threads, processes or both.

Multiple processes are slower than multiple threads, and sockets are much slower than in-process queues. But the sockets are more flexible. You might find you want an abstraction that can use either method as a backend, and change during different stages of development.

Reply to
David Brown

For number of cores on your system, if you have Linux running on the target, have a look at /proc/cpuinfo:

tauno@pi2:~ $ cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 5 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xc07
CPU revision    : 5

processor       : 1
model name      : ARMv7 Processor rev 5 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xc07
CPU revision    : 5

processor       : 2
model name      : ARMv7 Processor rev 5 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xc07
CPU revision    : 5

processor       : 3
model name      : ARMv7 Processor rev 5 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xc07
CPU revision    : 5

Hardware        : BCM2835
Revision        : a01041
Serial          : 0000000064d34ba1

--

The above is from a Raspberry Pi 2.
Reply to
Tauno Voipio

I'd suggest thinking about a design for how you'd measure which works in context for you. If I run pthreads on a big Linux machine it'll be different from running them in a VM. Similarly, it'll be different on a RasPi 3 sized ARM computer.

--
Les Cargill
Reply to
Les Cargill

3250 is single-core. Happens to be a part we use a lot around here, although we always go bare-metal rather than running Linux. Cache is programmable through the page-table as to whether it's write-through or not.

The reason you're having trouble determining much of this information is that NXP bought large chunks of that chip wholesale from ARM without anyone there actually understanding it. So the NXP documentation is spotty and occasionally wrong (let me tell you of our I2C-based woes).

There's a document available directly from ARM, ARM DDI 0198E, that is specifically the ARM926EJ-S Technical Reference Manual. Getting into the details on the 3250 is nearly impossible without it.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com 
Email address domain is currently out of order.  See above to fix.
Reply to
Rob Gaddi

You can also communicate among processes through shared memory (e.g. mmap).

To look at it another way: processes require explicit sharing, threads share implicitly.

On an embedded system, the heavier cost of process switching may be important.

Reply to
mac

Traditional threads, whichever way you package them (as C++ threads, p-threads or any other thread library), typically correspond to the "shared-state concurrency and blocking" approach. This approach is known to be problematic, and many experts in concurrent programming recommend drastically limiting both sharing and blocking according to the following three best practices:

  1. Keep data isolated and bound to threads. Threads should hide (encapsulate) their private data and other resources, and not share them with the rest of the system.

  2. Communicate among threads asynchronously via messages (event objects). Using asynchronous events keeps the threads running truly independently, without any further blocking on each other.

  3. Threads should spend their lifetime responding to incoming events, so their mainline should consist of an event-loop that handles events one at a time (to completion), thus avoiding any concurrency hazards within a thread itself.

The set of these best practices is collectively known as the Active Object design pattern (a.k.a. Actor). While this pattern can be applied manually on top of traditional threads, a better way is to use an Active Object framework.

The main difference is that when you use "naked" threads, you write the main body of the application (such as the thread routines for all your tasks) and you call various thread-library services (e.g., a semaphore or a time delay). When you use a framework, you reuse the overall architecture and write the code that it calls. This leads to inversion of control, which allows the framework to automatically enforce the best practices of concurrent programming. In contrast, "naked" threads let you do anything and offer no help or automation for the best practices.

Reply to
StateMachineCOM

Thanks. What is an "event object"? What is the best way to pass data asynchronously using a queue on Linux? I've read that lock-free data structures are easy to get wrong and best avoided, and that the C++ thread library doesn't have any lock-free data structures - mainly because there are too many variations to have a generalized data structure.

Can we use the "libcds" library and be confident that it will work correctly?

formatting link


Reply to
graeme.prentice

Just a message that you pass from one thread to another.

I'd probably just use std::deque with a lock.

You might look at seastar-project.org for some inspiration.

Reply to
Paul Rubin


I can tell you how this is done in the QP/C++ framework, which I've designed and refined for almost two decades now. But before I can get to the technical details, I need to make full disclosure that QP is a dual-licensed (open-source/commercial) product of my company (see

formatting link
so I do have a commercial interest in promoting it.

So, now going back to your question, "event objects" are messages that threads send to each other via event queues. But a naive implementation of copying messages to and from the queues is expensive and hurts real-time performance. So, in the QP framework, the events are allocated from fixed-size pools and only pointers to events are kept in the event queues. The framework maintains copy-by-value semantics as much as possible, while event objects are really shared under the hood. The framework also automatically recycles events that have been processed.

Specifically in the POSIX port of QP/C++, which has been available for over 15 years now, each active object runs in its own p-thread. These threads are organized as event-loops (according to the best practices I listed in my previous post), so they block in only one place: when the event queue is empty. The queue internally uses a p-thread mutex and a condition variable to implement blocking on an empty queue and signaling the queue. But the application programmer does not need to know any of this, because the main point is that the framework does the heavy lifting of thread-safe asynchronous event exchange. The application threads (active objects) only process events one at a time (to completion), and they don't need to worry about any low-level mechanisms like mutexes or condition variables.

The design also allows you to avoid sharing anything (except events) among the threads, which is another best practice of concurrent programming. This means that you don't need to use any synchronization objects. In this sense, the RAII benefits of the synchronization mechanisms in the C++ threads don't matter.

There is of course much more to an active object framework like QP/C++ than I can capture here. For example, the framework supports Hierarchical State Machines to implement the internal behavior of active objects. There is also a free modeling tool (QM), with which you can design your HSMs graphically and generate production code automatically. But all of this requires a paradigm shift from traditional sequential programming with blocking to event-driven programming without blocking or sharing. To learn more, you might read about the key concepts here:

formatting link

Miro Samek state-machine.com

Reply to
StateMachineCOM

Encapsulation is always a good principle, but don't take it too far. If two parts of the system need to share data of significant size, then you want shared data, not "messages" or other synchronisation mechanisms. (You use the messages or other synchronisation to communicate metadata - such as who owns the real data space at any given time - but not the data itself.)

Blocking is fine with threads. If you have a single core cpu - or more threads than cores - then blocking is often more efficient than attempting to continue. After all, if thread A is asking thread B to do something (via a message, actor call, or whatever) then thread B can't get started in doing the work A wants until A has taken a break. It's cheaper to have a voluntary break (yield, or blocking call) than to wait for a scheduling change.

So use blocking calls whenever they fit naturally in the progression of the code - and non-blocking calls whenever /that/ is the more natural fit. Don't make the mistake of thinking that one is inherently "better" or necessarily more efficient - /measure/ the /real/ effects if efficiency is vital.

Actor designs can certainly have their advantages - equally certainly, they are not the best design for all uses. Whenever someone says "this is the best way to do it", it's unlikely to be that simple - and whenever they say so without knowing exact details of the problem at hand, they are almost certainly wrong.

Reply to
David Brown

Lock-free data structures can range from very simple to very difficult, and there can be huge differences depending on the details of the structure. A single-writer, single-reader fixed size queue is /easy/ - it's just two atomic counters for "head" and "tail" and an array, with a little care about memory ordering. For single core embedded processors, it's usually sufficient to just use "volatile" - for bigger systems, C++11 or C11 atomics handle the details.

On the other hand, a queue that can have variable size, and more than one reader or writer, quickly gets really complicated to handle lock-free, and often it is much simpler, safer and cheaper to use a lock. On the third hand, if you have multiple cores you might want lock-free again for scalability.

There is no simple answer here, and much depends on the details of exactly what you are wanting. As long as you ask general questions, you'll only get general answers.

Reply to
David Brown

@David Brown: Absolutely, if you stick to the traditional sequential programming paradigm with shared-state concurrency and blocking threads, the three best practices I listed in my previous post can all be questioned, relaxed, and ultimately dismissed.

That's because they represent a different, event-driven ("reactive") programming paradigm. The distinction is important, because the two programming paradigms do NOT mix well, certainly not inside the same thread. So it is important to always realize which paradigm you are using in which thread, to avoid confusion and mixing the two.

To back up this point, I'd like to recommend the article "Managing Concurrency in Complex Embedded Systems" by David Cummings

formatting link
The author presents general guiding principles for structuring threads, which he found particularly useful and which he applied in the NASA Mars rovers and other mission-critical systems. The paper starts with a description of the general thread structure, which can be immediately recognized as the event-loop. The bulk of the paper then focuses on discussing several scenarios in which designers might be tempted to apply thread BLOCKING, followed by explanations of why blocking is always a BAD idea. Again, I repeat that this conclusion applies to the "event-driven" thread structure, which the author started with.

Reply to
StateMachineCOM

I haven't read the link yet (I will do so), but I do agree that blocking is a very bad idea in an event-driven thread.

Reply to
David Brown

On 01/08/2018 00:16, StateMachineCOM wrote:

Thanks for this reference... it is very, *very* instructive material.

Reply to
pozz

It seems Cummings has reinvented the wheel :-).

Those principles were used already in the 1970's to implement real time systems under RSX-11 on PDP-11. Later on these principles were also used on real time systems under RMX-80 for 8080 and similar kernels.

Reply to
upsidedown
