I'm looking for information (good articles, books, websites) concentrating on SW architecture themes related to Embedded platforms.
Primarily, SW architecture discussions seem almost solely devoted to the PC world and object-oriented languages. I do find that some of these topics also apply to the embedded world; however, I'm missing coverage of issues like those in the list below.
- Multi-processor communication
- Multi-processor system partitioning
- Protocol handling
- POST (Power On Self Test)
- Distributing system events (i.e. power up)
- ISO network model
- Debug/trace strategies
- Error handling
- How to benchmark sufficiently
- Event-driven systems
- ...
Which books do you use when tackling problems such as those mentioned? Which issues should be added to the list?
You missed the most important one: - Designing demonstrably bug-free software and systems
The desktop world has a level of quality *vastly* inferior to that demanded of embedded work, where "crash" == "broken", and maybe "lawsuits" and "closure of company". The mindset is quite, quite different. (And where it's not, it should be.)
This is not just a question of a chapter in a book. It really is a mindset, an attitude. It's the difference between "engineering" and "messing about with an erector set".
To add just a little: robustness is always the key attribute. *Design* for zero errors; debugging is something that you do when you fail, so aim not to. The hack/debug approach must be seen for what it is: amateurish tinkering.
Hmmm. If I add any more, I'd be in danger of writing a book. (Or possibly a sermon.)
Now there's a thought (the book, not the sermon).... ;)
Steve, interesting, but I think we should accept that with the tools/languages we use, 'bug-free' isn't going to happen, except on trivial projects.
I agree totally here
I don't accept that. We should build mechanisms into our systems in order to 'see' our systems in work. If you or someone in your team does happen to 'fail' then do you have the right tools/procedures for fixing that error? Logging tools, printing debug lines to stdout etc all have their usefulness and should be designed in from the ground up.
Of course you can always over-design a system. The more non-functional modules you include, the more complex the system becomes and the more chances you have to fail. Like everything, it's a balancing act.
I would be interested in the techniques people use to debug their systems. Printing to a serial port is OK if you have a spare port, but it can also be CPU intensive if the developers overdo it with the debug lines.
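One concrete form of "designed in from the ground up" debug output is a level-gated macro that disappears entirely in shipped builds. This is only a minimal sketch: the names are illustrative, and a RAM buffer stands in for whatever serial port or debug channel a real system would use.

```c
#include <stdio.h>
#include <string.h>

/* Debug output that is designed in from the start but compiles away
 * completely when DEBUG_LEVEL is 0, so production builds pay no CPU
 * or code-size cost. dbg_buf stands in for a real serial port. */
#ifndef DEBUG_LEVEL
#define DEBUG_LEVEL 1
#endif

static char dbg_buf[128];   /* stand-in for the debug channel */

#if DEBUG_LEVEL > 0
#define DBG(level, ...) \
    do { if ((level) <= DEBUG_LEVEL) \
        snprintf(dbg_buf, sizeof dbg_buf, __VA_ARGS__); } while (0)
#else
#define DBG(level, ...) do { } while (0)
#endif
```

Gating on a level rather than a simple on/off switch addresses exactly the "developers overdo it" problem: the noisy lines can stay in the source at a high level and only be compiled in when needed.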
Did you mean multi-processors in the same box or distributed across a network? There is a world of difference in managing the two. For "in the same box" multi-processing, the subject is minimally covered in "Hardware/Software Design of Digital Systems" by REH Bywater (Prentice Hall, ISBN 0-13-383950-8). Whilst it is an old book, it still has some very valid points on the topic. For an "across a network" multi-processor system, you should be looking for texts on "distributed processing".
This is in Jack Ganssle's book "The Art of Designing Embedded Systems" (Newnes ISBN 0-7506-9869-1). He has quite a lot to say on the topic.
These have had articles in magazines such as Embedded Systems Engineering and Embedded Systems Programming. I recently ditched a number of quite old copies of these so you may now have to do an index search on the respective websites.
This has been covered here in this newsgroup quite adequately in quite a long thread on the topic about three years ago. Use Google to search for the thread. The search string should be obvious.
Another book that will help you out is "Front Panel" by Niall D. Murphy (R&D Books ISBN 0-87930-528-2). This is more about the relationship between users and the equipment but by understanding these issues you will end up with better, more usable, systems.
The main thing is to read as widely as you can manage and learn to use Google well. You will find most of what you need on-line.
Paul E. Bennett ....................
"printing debug lines to stdout" suggests you're thinking about larger systems, e.g. Linux appliances such as network equipment. The majority of embedded systems work is what I would call "deeply embedded" control systems where there is no stdout, or screen or even Ethernet. An LED or two to indicate state is a luxury!
Not really. It's all about system test. Deeply embedded systems often have a quite finite set of inputs and responses and can often be exhaustively tested, including their various failure scenarios.
Typically an event log in a serial EEPROM is sufficient to log anomalies.
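An event log like the one suggested above might look something like this sketch: a small ring of timestamped event codes, where the oldest entry is overwritten once the ring is full. The `eeprom[]` array is just a stand-in for a real serial EEPROM driver; all names here are illustrative.

```c
#include <stdint.h>

/* A tiny anomaly log: a ring of timestamped event codes. In a real
 * system the writes would go through a serial EEPROM driver; here a
 * plain array stands in for the device. */
#define LOG_ENTRIES 16

struct log_entry {
    uint32_t timestamp;   /* e.g. seconds of up-time */
    uint16_t code;        /* application-defined anomaly code */
};

static struct log_entry eeprom[LOG_ENTRIES];  /* stand-in for the device */
static uint8_t log_head;

void log_event(uint32_t now, uint16_t code)
{
    eeprom[log_head].timestamp = now;
    eeprom[log_head].code = code;
    /* wrap: the oldest entry is overwritten once the ring is full */
    log_head = (uint8_t)((log_head + 1) % LOG_ENTRIES);
}
```

Keeping the ring small and the entries fixed-size means the log survives on a few hundred bytes of EEPROM and can be read back after a field return.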
"In the same box" multi-processing. I'll check out the book you mention. This is a really hard topic: partitioning the system in such a way that each feature/function/task is allocated to the correct processor. It is especially difficult when one processor must use the resources of the
I do regularly read embedded.com but don't feel that they always go into sufficient detail.
I have heard of this book previously but dismissed it as a UI book. Am I wrong?
Which of course I do when I know what I'm looking for. However it is difficult to stumble across new ideas/concepts in this manner.
Let's agree to disagree on that one. Scanning your reply again, I see that we are discussing different systems.
I don't want to get into a debate about what is an embedded system and I'll accept your 'deeply embedded' term but using a couple of LEDs is a debugging technique if a simple one.
We have some products like you describe above and other products with a custom OS, no MMI etc. Really what I meant by stdout is a serial port or a couple of spare I/O pins for bit banging purposes. However our systems have multiple tasks communicating with various external hardware components as opposed to a small one task control system.
So I think my comment above remains valid but for these 'types' of embedded systems and not your 'deeply embedded' systems.
Arh. System test! Yes, we should test, and I take your point for 'deeply embedded' systems, but why not take a more pragmatic view? Why not build mechanisms into the system, especially for the types of systems I describe above? Or is everyone happy with the amount of testing done on their products?
Paul, Like above I think we are discussing different systems here. Your comments are interesting and made me think but please consider that not all embedded systems are single task control systems. I perhaps should have made that more clear in my original post.
You obviously have never worked on aerospace/automotive/medical systems then.
"Sorry the brake on your car did not work, this is a feature that is updated in Service Pack 25" ...
Not all small systems are single task, some small systems have OS in them (network appliances etc..). Having just delivered a small ASIC tester with multiple 'tasks' running under a simple scheduler.
"Deeply embedded" could be a calculator, DVD, microwave, or mobile phone, which does not have a standard OS interface to see what the system is doing. It could be many other things, from airbag control to aviation control systems. Basically running a set of tasks and NOTHING else.
For the main markets above, system test MUST be done, as any other method requires testing against some form of model, and often the model is wrong or not compliant with reality in some aspect(s).
Bit-banging could be system intensive; serial I/O under interrupts should not be. That is why interrupts are there. I often have a whole control ability over serial I/O to call lower functions for test and debug as well as system monitoring; most customers like it for logging and the like.
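The interrupt-driven serial output described above can be sketched roughly as follows: foreground code queues bytes into a ring buffer at trivial cost, and a TX-empty interrupt drains it in the background. This is a host-runnable sketch, not any particular UART driver; `uart_hw_write()` and the register it touches are stand-ins for real hardware.

```c
#include <stdint.h>
#include <stdbool.h>

/* Interrupt-driven serial TX: the foreground queues bytes cheaply,
 * and the TX-empty ISR drains the ring. uart_last_byte stands in for
 * the real transmit data register. */
#define TXBUF_SIZE 64

static volatile uint8_t txbuf[TXBUF_SIZE];
static volatile uint8_t tx_head, tx_tail;
static uint8_t uart_last_byte;           /* stand-in for the TX register */

static void uart_hw_write(uint8_t b) { uart_last_byte = b; }

bool uart_putc(uint8_t b)                /* called from foreground code */
{
    uint8_t next = (uint8_t)((tx_head + 1) % TXBUF_SIZE);
    if (next == tx_tail)
        return false;                    /* buffer full: drop, never block */
    txbuf[tx_head] = b;
    tx_head = next;
    return true;
}

void uart_tx_isr(void)                   /* hooked to the TX-empty interrupt */
{
    if (tx_tail != tx_head) {
        uart_hw_write(txbuf[tx_tail]);
        tx_tail = (uint8_t)((tx_tail + 1) % TXBUF_SIZE);
    }
}
```

Note the deliberate policy in `uart_putc()`: when the buffer is full the debug byte is dropped rather than blocking the foreground, which is what keeps logging from becoming CPU intensive even when developers overdo the debug lines.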
Not all embedded systems are monster multi-task systems or one-task systems; many lie in between and have many 'tasks' that may not be noticeable to an OS-accustomed person.
Paul Carpenter | email@example.com
No. Absolutely, categorically, emphatically not. The "complex software must be buggy" myth is one I absolutely, categorically, emphatically refuse to accept.
I deliver products, of any level of complexity, when they're bug-free. Not before.
"Trivial projects" only? No. That's an example of the Wrong Attitude. Sorry, but you failed ;).
The thing I find weird is that designing reliable systems (with "thou shalt not crash" in mind) takes less time, mainly because it cuts out the whole asymptotic "reducing bugs to nearly but not quite zero - until we run out of time" phase. This seems to be a well-kept secret. I find this weird. All branches of engineering I've been involved in have been driven by economics, except this one. (I think I know the reason, and it ain't pretty.)
What you're describing is debugging, which has its use during development (mostly against typos). But again - a properly designed system won't need debugging (other than the typos). Period. "Proper design", however, is rare - and rather beautiful.
Achieving this is not always easy, but *should* be what we all aspire to. Acceptance of failure is far too common in s/w design, and simply unacceptable in embedded work.
Bollocks. To be blunt.
A complex system is a collection of simple parts, or of complex parts consisting of simple elements. It's the interactions between these parts that confuse us. Which is dumb. We should be designing in terms of interactions and simple things. Complexity doesn't exist - unless you don't understand the problem, or the solution offered.
As for over-designing - no, again. Equating "it shall not crash" with "overdesigning" is precisely the kind of attitude I'm up against on a regular basis, and one I frankly despair of. Paraphrasing what I said earlier, acceptance of failure is a malignancy.
I mostly use a single LED. Otherwise, I endeavour to ensure that the effects of a simple failure are catastrophic - which means I won't miss it. Seriously - I try hard to ensure things don't get overlooked.
OTOH, I rarely use an OS (an OS has its place, but only when one can qualify the reliability of, vs the services offered by, the OS). I have a distrust of most (not all) 3rd-party code for reasons I've probably already made clear ;). Also because I believe in KISS.
Just to clarify my viewpoint: my idea of a simple project is probably the one I've just completed: a Eurocard controlling a backplaneful of other CPU-laden modules, using a simple-ish serial protocol, but adding TCP/IP socket-based control features on top of the original RS232 overall control-based protocol. There were several classical tasks: backplane, several layers of TCP/IP and socket management, two async serial tasks, error logging, non-volatile memory management, front panel, etc, etc, all of which ran concurrently as independent tasks. No OS - just a simple roundrobin and some discipline.
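The "no OS - just a simple round-robin and some discipline" structure described above might be sketched like this. All the task names and counters here are illustrative stand-ins; the point is only the shape: each task is a function that runs quickly to completion, keeps its own state, and is called in turn forever.

```c
/* A minimal cooperative round-robin: no OS, no pre-emption. Each task
 * must return promptly; the "discipline" is that no task ever blocks. */
typedef void (*task_fn)(void);

static int backplane_polls, serial_polls;   /* illustrative task state */

static void backplane_task(void) { backplane_polls++; }
static void serial_task(void)    { serial_polls++; }

static task_fn tasks[] = { backplane_task, serial_task };

void scheduler_run_once(void)    /* one pass; a real main() loops forever */
{
    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
        tasks[i]();
}
```

Because task switches happen only at function boundaries, the asynchrony problems of a pre-emptive multitasker (touched on later in the thread) simply never arise.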
My idea of a complex project was perhaps one I did a few years back - a multi-processor system, with up to 32 control panels (each with at least one CPU), all controlling a rack controlling a crosspoint audio system (and yet more CPUs) - effectively an audio routing system. A fandango of asynchronous events. It was actually a talkback (intercom) system used by TV broadcast crews during live broadcasts, and was used by 4 international crews (including the BBC) during the '98 World Cup. During the competition itself, I was on call to all the crews. And it was all live TV. I slept well at nights, knowing it would just work. And again: still no OS. Just a bunch of inviolable rules. (And France won, which was nice.)
Just two examples that spring to mind. Plenty more where they came from. Trivial projects, eh?
I really do despair. Complexity just needs managing (i.e. decomposing) properly. That's it. An OS is one tool amongst many. It's all down to good engineering (in the same sense that I was taught good soldering many years ago), and choosing appropriate tools - *and* the mindset I spoke of earlier. A mindset that DOES NOT ACCEPT failure.
I'll have to write that sermon^h^h^h^h^h^h^hbook. I do feel very passionately about this. I've come across *so* many people who figure complex software is hard - yet complex hardware is routine. What's the difference? (Hint: synchronous design.... a pre-emptive multitasker makes this *harder* due to the asynchrony of task switching...)
You could declare code bug-free if it exactly matches the specification. However, this specification must specify all possible inputs and all expected outputs. This is simple for any combinational logic, but as soon as the specification contains internal states and time-dependent behaviour, things get tricky.
But the main question is: is the specification correct? It is very nice if the software exactly implements a faulty specification, but the end product is still useless for the original requirement.
What about the first Ariane 5 failure?
The flight control software was originally created for Ariane 4 which had limited flight dynamics input signal ranges. The software was written so that with any expected input values, the internal calculations could not overflow.
The same software was used on Ariane 5, with different flight dynamics and larger input value ranges, without requalifying the software - the first serious problem. During the first Ariane 5 flight, the software received larger input values than originally expected on Ariane 4, which caused an internal overflow; this was not handled correctly, and the rocket went out of control and had to be destroyed.
Was that original flight software bug-free? If it was known when it was written that no overflow could occur with the specified inputs, why was the overflow exception still enabled? And if you have an exception enabled, why not provide an exception handler? Was the original Ariane 4 software bug-free?
Of course the main blame is for not requalifying the software for the different input values of Ariane 5, but is it the only thing to blame?
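The Ariane failure came down to an unchecked narrowing conversion (a wide flight-dynamics value squeezed into a 16-bit integer). A defensive sketch of the alternative, in C for simplicity (the actual code was Ada, and the source value was wider than the `int32_t` used here), is to narrow only through an explicit range check that saturates and flags, rather than raising an unhandled exception:

```c
#include <stdint.h>
#include <stdbool.h>

/* Checked narrowing: saturate out-of-range values and report the
 * condition to the caller, instead of overflowing or trapping. */
bool narrow_to_i16(int32_t in, int16_t *out)
{
    if (in > INT16_MAX) { *out = INT16_MAX; return false; }
    if (in < INT16_MIN) { *out = INT16_MIN; return false; }
    *out = (int16_t)in;
    return true;
}
```

Whether saturating is the right recovery is itself a specification question - which is exactly the point made above: the code can perfectly implement a spec that is wrong for the new vehicle.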
Not all products destined for the markets outlined above are safety critical. Repeating the same old "What if Microsoft designed automotive SW" gag is wearing a bit thin, especially looking at the work they are currently involved in with Fiat. Also, even though I haven't worked in those areas, are my comments any less valid?
I work on projects with something like 40 SW engineers. These engineers range from experienced people with 20 years' experience to completely new colleagues based offshore. Multi-site development is a real challenge, and the biggest obstacle is, as always, communication. We receive requirements from the customers and other departments in our organization very late. All these things mean that the SW development process is not always 100% followed. If it was, then we would not deliver product. I'm all for discussing how things 'should be done', and I also see that SW as a discipline has a long way to go. Unfortunately the real world sometimes gets in the way.
I accept that these 'deeply embedded' systems can have multiple tasks!
But you don't need a standard OS interface to see what the system is doing. That was really the point of the OP. I was (and still am) interested in collecting new ideas/techniques in order to debug a system. However, I've had to defend the concept of debugging a system. Seemingly I'm the only one who has this need.
Yes, system test is done, and I of course accept that for safety-critical applications (i.e. DO-178) this is done very well, but in other areas it is not done well.
I gave an example (i.e. bit-banging) of a system without all the resources that an "OS-accustomed person" might expect, and you suggest "That is why interrupts are there." What if our 'deeply embedded' system does not have the interrupts required?
Agreed. I was wrong in a previous post to state 'trivial' and 'single task'. I've been brought up on it, fair enough, but I'm also not just an "OS-accustomed person."
Super. Can you describe in a few words the complexity of these systems? How many engineers are in the project team, when in the development process do you receive final, complete requirements, and is the time frame of the project realistic? Do customers not mind you finding all the bugs before delivering?
Yeah, OK. Trivial wasn't the right wording.
I think we need to be careful what is the environment these projects are being completed under. There are many external factors which conspire to frustrate a development team but this sounds like something which you don't suffer from.
No one should set out with the mind-set that they will deliver a buggy product, of course not. However, the products which I work on cannot be 100% adequately tested, due to external events (user or external components) and differing state machines in the product. What if one of your products was returned because it was suspected of being faulty? How would you debug the device in this case? What hooks do you build in from the ground up in order to help you investigate? Let's be clear here: I'm not suggesting for a second that you do have bugs. You've made it clear that you don't deliver bugs; however, you still have this device that you have to investigate.
Hang on a second. I was not suggesting that one must over-design in order to have an 'it shall not crash' type of product. I only mentioned that in order to offer more and more debug/trace features, more SW is required, and thus the complexity might/will go up. I forget which book it is, but it states that you should always be careful in the 2nd system you design not to over-design to compensate for the problems which arose in the 1st system.
Of course and I agree but a single LED does not scale up well to systems with a task scheduler and time-dependent protocols. More sophisticated tools are required.
This sounds like excusing poor project management and requirements capture. If the software development process can't manage late changes, then the process itself is flawed. The processes are not just there to allow marketing to boast about ISO9000 and CMM levels; they are there to prevent mistakes being made and to identify those that do occur before they get into the field. This applies to all software designs, whether embedded or not. However, it does seem that more care is taken over embedded systems.
I think if the real world gets in the way of writing and testing good software then pretty quickly the real world will stop buying the defective products it generates. The last minute panic to accommodate requirement changes and management whims is exactly the stage in the project where most errors are introduced and, for that reason, is the place where the most rigid process discipline should be exercised. If you are prepared to let the quality control slip at this stage then you must balance it against the risk of the product failing in the field and accept that the bug probably won't be in well documented code because that is the first thing to go when the procedure is relaxed.
Multi-site development adds complications to all the development stages but with sensible use of design tools and competent management then it needn't be a headache. I've worked with global teams where we all checked in and out code from a central store and e-mails were always answered promptly. The requirements had been well thought out at the beginning and late changes were incorporated at the system documentation level before being flowed down to the implementors. The biggest problem was time difference but once you knew that any e-mail would be answered in under 8 hours then it wasn't a major bind. Emergencies could be dealt with by phone but happen rarely if proper planning is done.
It was NOT a Microsoft jibe, but a comment on how those industries work compared to a lot of stuff that is shipped. A lot of the products may not be safety critical, but a lot of the parts are tested to lower levels of the same standards. It is amazing, though, how much of it is regarded as safety critical at various levels; examples being dashboard displays or central locking, which are not top-level safety critical, but are at a lower level because they stop the driver being aware of problems (even their speed) or stop them exiting the vehicle in an emergency.
Some aspects of airline entertainment systems are NOT safety critical but the system test and other procedures have to be done to ensure they do not impact on other systems that are safety critical.
If you design expecting bugs (typos or design), then you will get them. If you design from the ground up to ensure things cannot happen and problems are trapped to give predictable outcomes, life becomes easier.
More time spent at the front end, saves time at the back end.
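One concrete form of "ensure things cannot happen and problems are trapped to give predictable outcomes" is a state machine whose default case forces a known-safe state instead of carrying on with corrupt state. This is only an illustrative sketch; the states and events are hypothetical.

```c
/* A defensive state machine: any unrecognised (corrupt) state value
 * is trapped into a latched fault state rather than ignored. */
enum state { ST_IDLE, ST_RUN, ST_FAULT };

enum state step(enum state s, int start_event)
{
    switch (s) {
    case ST_IDLE:  return start_event ? ST_RUN : ST_IDLE;
    case ST_RUN:   return ST_RUN;
    case ST_FAULT: return ST_FAULT;     /* stay latched until reset */
    default:       return ST_FAULT;     /* corrupt state: fail predictably */
    }
}
```

The default case should be unreachable; the point is that if memory corruption or a design slip ever makes it reachable, the outcome is predictable and visible rather than undefined.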
Sounds like a badly organised structure, and a lack of good requirements and specification to start from. Yes, changes occur, but if the process is then "throw anything in" you will get "anything out", MAKING more debugging. More than likely the problems are internal (to your organisation), sitting on changes to be passed down; I have too often seen it in other companies. Even changes need a design process, to be sure they will go in correctly and to assess their impact on all parts of the system.
Remember the most basic computer saying: GIGO!
Most of which explains why a lot of 'gadgets' I see have bad software design. Any device or embedded project that locks up or needs a restart to sort a problem has been badly designed and from today's experience at one customer site even Wireless Access Points are in that category.
Hmm, it is done in other areas; I just did one yesterday, and will have another next week for an ASIC tester. There will be other system tests as the spec is changed (as EXPECTED) for a new ASIC. Most problems are down to original models and other deviations.
I have done system tests on Medical equipment that was NOT safety critical.
Sorry, it is a very small number of devices used in embedded work that have little or no interrupts, even timers to control bit banging. It is even rarer for serial I/O not to have interrupts. In only one of my many designs were all the timers on a controller used up, but one was used for a scheduler, which if nothing else gives you a clock for low-speed bit banging of data; for that matter, bit banging state information at the start/end of tasks gives you a lot of information for profiling on a scope/analyser.
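The scope-profiling trick described above (bit-banging state information at task start/end) costs a couple of instructions per task. As a sketch, with a plain variable standing in for a memory-mapped GPIO port (the names are illustrative):

```c
#include <stdint.h>

/* Set a spare I/O pin at task entry, clear it at exit, and watch the
 * timing on a scope or logic analyser. gpio_port stands in for the
 * real memory-mapped output register. */
static volatile uint8_t gpio_port;       /* stand-in for the register */

#define TASK_PIN(n)   ((uint8_t)(1u << (n)))
#define TASK_ENTER(n) (gpio_port |= TASK_PIN(n))
#define TASK_EXIT(n)  (gpio_port &= (uint8_t)~TASK_PIN(n))

void sample_task(void)
{
    TASK_ENTER(0);
    /* ... task body ... */
    TASK_EXIT(0);
}
```

With one pin per task, a logic analyser shows execution order, task duration, and jitter directly, with no serial port and no CPU cost worth mentioning.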
If ALL the system resources are in use, then the design is MOST likely on the edge of problems already, such that adding any debugging facilities will cause problems in themselves.
Paul Carpenter | firstname.lastname@example.org
Not really, no. I accept that this is a common view, but it's not one I agree with.
As I've said, I consider large projects to be a collection of small projects. The way those fit together is of course rather important ;). But - given correct design, there is no reason that a large project should be any more buggy than the sum of the small projects (not the exponential law that's usually quoted). And those small projects should be bug-free. If putting them together results in bugs, there's a problem with the design.
Also as I've said (many times), the key skill is managing complexity. We can do this well in hardware and mechanical design; why not in software?
I guess this is the main point at which you and I diverge in opinion.
The original statement by Usenet Groups was that "'bug-free' isn't going to happen, except on trivial projects". I would suggest that any system whose inputs can be driven to all possible combinations while its internal states are in all of their possible conditions fits under "trivial".
Yes, process and design are important, but there is no process that will completely eliminate the possibility that the programmers make a mistake and fail to think of some condition and write the code to handle it (again, aside from the trivial). So we're back to relying on testing to find those conditions or prove they don't exist. And how do you know that the test conditions you've thought of didn't miss something that might actually occur in the real world in 20 years? No one is perfect. And following a process, however good, isn't going to stop people from making mistakes.
We're a little late. Didn't IBM mathematically prove that you can't prove that a program is bug-free back in the 1950s?