books for embedded software development

- Alessandro Basili

Mon, Dec 12, 2011 8:01 PM

Hi everyone,

I just started to (re)design the software for an embedded application on a very old DSP (ADSP-21020, 32-bit floating point) and I am looking for some good books on embedded software development, since I want to get it right from the beginning rather than chase problems later on.

The application is a Star Tracker which is using a CCD aiming at stars while the DSP processes and compresses the images for transmission to ground.

Thanks,

Al

--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

- Dave Nadler

Mon, Dec 12, 2011 9:45 PM

"Better Embedded System Software" - Philip Koopman
"Test-Driven Development for Embedded C" - Grenning

Two recent keepers. Enjoy, Best Regards, Dave

- Paul E. Bennett

Tue, Dec 13, 2011 7:29 PM

Phil's book is an excellent suggestion. Really good for any development situation.

I would also add into the pile:

The Art of Designing Embedded Systems, Jack Ganssle, ISBN 0-7506-9869-1

Handbook of Walkthroughs, Inspections, and Technical Reviews: Evaluating Programs, Projects, and Products, Daniel P. Freedman and Gerald M. Weinberg, ISBN 0-932633-19-6

I included the latter because the review aspect can become a project saviour if performed properly throughout the development.

--
********************************************************************
Paul E. Bennett...............
Forth based HIDECS Consultancy
Mob: +44 (0)7811-639972
Tel: +44 (0)1235-510979
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

- Don Y

Tue, Dec 13, 2011 10:05 PM

Perhaps the most important and effective thing you can do is to enumerate the reasons *why* you are redesigning the device (assuming at least one iteration exists already and has been deployed, etc.). E.g., component availability, feature creep, etc.

Then, add to that list any "issues" that have been uncovered but not yet (satisfactorily) addressed in the conceptual design of the device. E.g., any "mysterious/anomalous" behaviors that may have been noticed (and possibly "resolved themselves" *or* were handled by resetting the device).

IMO, you'll get more *practical* "return" for the time invested this way than by rethinking an implementation methodology (which you might do wrong).

Is the device space-based? E.g., can you make significant improvements in power consumption (or other resource limitations)?

- Alessandro Basili

Wed, Dec 14, 2011 4:47 PM

The current software is:

not fulfilling the specification; the software is supposed to provide compressed data for stellar fields and it simply does not.

"unstable"; after a few hours of operation in non-compressed mode the software hangs and a hardware reset is needed.

structureless; after a code review it is clear that debugging it would be more costly than redesigning it from scratch.

full of logical flaws; synchronization problems are the most common mistakes, but interrupt service routines are also excessively and needlessly long, essentially preventing any timing analysis.

overly complicated in the commanding interface; the software tries to handle a command queue with no control over the queue itself (queue reset, queue status).

lacking control over the CCD; there's a register implemented in the hardware which gives control over the CCD functionalities, but it has been ignored in the current implementation.

reinventing the wheel with basic functions; the C-runtime library provided by AD is completely ignored and a long list of functions has been implemented, apparently for no reason.

not utilizing the available bandwidth; there's a serial port through which the DSP can write its data to an output buffer, but the bandwidth available is reduced to essentially 256/1920 due to the handshake protocol implemented (where on the other side no one is really performing any handshake). This limit poses a very hard constraint on the science data, to the point where the information is not enough to reconstruct pointing direction with the accuracy needed.

not maintained anymore; the previous software developer left and a new team took over. We tried to recover from the ashes all that we could from the previous work, but apparently a huge management flaw lost control over the schedule and deliverables, so that we are now facing a tough choice: where to start from?

not designed for testing; there are a lot of functions that are not observable and there's no logging mechanism in the code for tracing either. It's what I usually call a "plug and pray" system.

I think there are other items that would call for a redesign, such as lack of documentation, lack of a revision control system, lack of test campaigns, lack of tools to work with the software, ...

I agree that redesign is not the solution in itself; if you want, it is just a means to get it working. We normally follow an "iterative and incremental" model, in order not to invest too much time in the "wrong design" and to leave room for adjustment along the way. Unfortunately the previous team didn't have any guideline to follow and spent 99% of their time in development and 1% in "hope it works". To their credit they didn't have any support from any external test team and they inevitably fell short of feedback as well.

It is space-based and it is currently flying. Actually the goal here is to *have* a design, which, it seems to me, the current software lacks completely. I can imagine it's hard to explain how we got to this point, and I also believe it would be interesting to analyze what happened to get us here in order not to repeat the same mistakes, but unfortunately the situation is what it is and something has to be done if we ever want to correlate our photon reconstruction capabilities with pointing information.

Actually I tried to get the chance to have some good reference to foster discussion in the team as well. I believe that following some simple and well described approach can enable us to bend it later on to our particular needs (which might not be clear at this point).

- Paul E. Bennett

Wed, Dec 14, 2011 6:51 PM

Whichever route Alessandro takes (the start from scratch route or the improve current design route as Don suggested) the books will be good reminders of the proper conduct of the effort involved.

Start with a good and thorough review of the specification document. Take it apart and test every statement of requirement it contains. Vagueness or assumed facets should be transformed into clear, concise and testable requirements specifications. Do not start any design until this review is complete and you have a fully tested specification document. Up to 80% of the bugs for a system can be eliminated at this point by robust reviewing. The point about writing requirements specifications is that they have to be clear, concise and testable statements of what is required in the system by way of functionality, interfaces, performance and maintainability. The specification document should be free of assumptions but if assumptions are necessary these should be very clearly identified as such and the basis for the assumptions clearly described. However, it would be better to eliminate them altogether.

Phil Koopman's book and the Freedman and Weinberg book on reviews will stand you in good stead for a lot of what you have to resolve. There are a number of others I could recommend but you have to get some work done. Make sure that your management read Phil's book. In my review of this book I stated that it should be "impossibly open at all chapters at once on the developer's desk". You will find much in it that will help you through your current problems.

In addition to the books, you should also spend a bit of time to visit the design of your development process. Know what documentation you will need to build as evidence that the development was performed correctly. Do not forget to build in a mechanism for management of the inevitable changes.

Finally, you spoke of scrapping the current software and starting again. More often than not this can be a good thing to do. However, be sure not to make the same mistakes the second time around.

--
********************************************************************
Paul E. Bennett...............
Forth based HIDECS Consultancy
Mob: +44 (0)7811-639972
Tel: +44 (0)1235-510979
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk..
********************************************************************

- Steve B

Wed, Dec 14, 2011 8:01 PM

I'm a little curious about this. Are you planning to uplink new software to the already flying hardware, or redesigning for the next attempt? If it's the latter, I'd maybe reconsider the hardware as well, though you may not have time for that.

Was any ground testing done previously? I'm just wondering if the radiation environment might be a contributing factor to your failure. I've seen/heard lots of success stories of COTS parts working just fine for the most part on cube sats and such, even without much in the way of rad tolerance measures in the design, but you probably have to do frequent full resets.

- Don Y

Wed, Dec 14, 2011 8:22 PM

Excellent! You know where/why you're starting! (Unfortunately, it looks like you've inherited a real *mess* :< )

Is there a *real* specification? Or, just a general "goal"? I.e., is there anything to test against?

(speaking with zero knowledge of the application) This is suggestive of memory (management) problems, counter overflows or "deadly embrace". (I mention these simply to get a feel for what "services" your application can benefit from)

(sigh) That is probably the case. Even (just a) disciplined developer would impart *some* structure to his/her code. And, from your other comments, it seems like this was an "ad hoc" team effort so any attempt at structure was lost in the noise.

Synchronization problems can be avoided with a disciplined design. This would also help identify cases of priority inversion.

Long ISRs are usually a consequence of a lack of structure. No mechanism to systematically pass data/events between foreground and background so the "processing" finds its way into the ISR.
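One common way to provide such a mechanism is a single-producer, single-consumer ring buffer: the ISR only stores the incoming byte and returns, and the background loop does the real processing. The following is a minimal sketch under stated assumptions, not the project's code: the names `rx_isr`, `rx_pop`, `rx_overrun_count` and the buffer size are hypothetical, and it is written as host-testable C11, so the fixed-width types and `static_assert` would likely need adapting for an old toolchain such as g21k.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u  /* hypothetical size; must be a power of two */

static_assert((RX_BUF_SIZE & (RX_BUF_SIZE - 1u)) == 0u,
              "RX_BUF_SIZE must be a power of two");

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint32_t rx_head;      /* written only by the ISR */
static volatile uint32_t rx_tail;      /* written only by the main loop */
static volatile uint32_t rx_overruns;  /* bytes dropped because the buffer was full */

/* Called from the (hypothetical) serial receive interrupt: store and return. */
void rx_isr(uint8_t byte)
{
    uint32_t next = (rx_head + 1u) & (RX_BUF_SIZE - 1u);
    if (next == rx_tail) {
        rx_overruns++;  /* buffer full: drop the byte but leave a trace */
        return;
    }
    rx_buf[rx_head] = byte;
    rx_head = next;
}

/* Called from the background loop; returns false when nothing is pending. */
bool rx_pop(uint8_t *out)
{
    assert(out != NULL);
    if (rx_tail == rx_head) {
        return false;
    }
    *out = rx_buf[rx_tail];
    rx_tail = (rx_tail + 1u) & (RX_BUF_SIZE - 1u);
    return true;
}

/* Lets the background code (or telemetry) see how many bytes were lost. */
uint32_t rx_overrun_count(void)
{
    return rx_overruns;
}
```

Because one slot is always left empty to distinguish "full" from "empty", this buffer holds at most 63 bytes. The sketch also assumes that a 32-bit index store is atomic with respect to the interrupt, which needs to be checked on the real target.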

Logical flaws should have been easily identifiable from the specification (assuming it is serving its proper role).

This also seems like it should have been apparent in the specification (else how does the commanding agent know what rules *it* must live by??)

Sorry, I don't understand the role it plays (in the hardware or system itself). On the surface, it seems to suggest: "How the hell could the device *function* if it has no control over its transducer(s)?"

Are you *sure* that the library is disused OUT OF IGNORANCE? (from your other comments, this seems like it is probably the case) But, be aware that sometimes vendor supplied libraries are provided as check-off items and often not suited to a particular environment (e.g., reentrancy).

Huh? I assume you mean the "other party" has no *need* for the handshaking (so it is wasted overhead)? Is the "handshaking" intended as a pacing mechanism or as an acknowledgement/verification mechanism?

Meaning you can't get data often enough to point the satellite (or your instrument therein) *at* the correct target?

Understood. This is actually a common event. Especially for projects that aren't particularly "well designed/implemented" (it's too easy/tempting to quietly "slip away" when no one is watching)

All these are "strongly desired" -- especially with the stakes as they are (I suspect it is very difficult/costly to get instruments flying!)

I tend to favor heavily front-loaded processes -- putting lots of effort into nailing down all the details in a specification which can then be followed, almost blindly. But this requires folks who are good at challenging assumptions to be able to foresee the things that can go wrong -- and fortify the specification against them. Frankly, I don't know how else to develop especially in an environment where you have no physical control over the "what if's" (what if the spacecraft is pointed the wrong way? what if communications are interrupted at this point? etc.)

In addition to serving as a map that "implementors" (coders?) can "just follow", a good spec gives you a contract that you can test against. And, for teams large enough to support a division of functions, lets the "test team" start designing test platforms in which the *desired* functionality can be verified as well as *stressed* in ways that might not have been imagined. This can speed up deployment (if all goes well) and/or bring problems in the design (or staff!) out into the open, early enough that you can address them ("Sheesh! Bob writes crappy code! Maybe we should think of having Tom take over some of his duties?" or "Ooops! This compression algorithm takes far too long to execute. Perhaps we should aim for lower compression rates in favor of speedier results?")

Worthwhile for *management's* use -- but doesn't do enough to sort out the mess in front of you.

That can be a good approach -- depending on the dynamics of your particular team. What you want to avoid is the distraction of people focussing on "arguing with the examples/guidelines" instead of learning from them and modifying them to fit *your* needs.

[think about how "coding guidelines" end up diverting energy into arguing about silly details -- instead of recognizing the *need* for SOME SORT OF 'standard']

I like to approach designs by enumerating the "things it must do" from a functional perspective. I.e., "activities": control the transducer, capture data from the transducer, process that data, transmit that data... (an oversimplification of your device -- but I don't know enough about it to comment in detail). Note these are all verbs -- active.

Then, identify the communications between these "activities". And, the resource requirements of each.

[this is all informal, "shirt-cuff" at this point]

This gives me an idea of how finely I can partition the design. The resource requirements tell me what constraints exist in terms of how much can happen concurrently. E.g., do I have enough memory/CPU to *collect* (new) data while processing (previous) data AND transmitting (old) data? If not, what value judgement can I make to best use the resources that I *do* have to maximize the functionality of the device? (e.g., if I have lots of memory but very little CPU, I might prefer to gather as much raw data *now* -- while some observable event is happening -- and worry about processing and transmitting it *later*)

This gives my first rough partitioning of tasks/threads/processes and "memory regions". It also shows where the data is flowing and any other communication paths (even those that are *implicit*). Synchronization needs then become obvious. And, performance bottlenecks can be evaluated with an eye towards how the design can be changed to improve that.

E.g., if an (earth-based) command station has to *review*/analyze data from the device before it can reposition/target it, then the time from collection thru processing and transmission is a critical path limiting how quickly the *overall* system (satellite plus ground control) can react. If the times associated with any of those tasks are long-ish, you can rethink those aspects of the design with an eye towards short-cutting them. So, perhaps providing a mechanism to transmit *unprocessed* data (if the processing activity was the bottleneck) or collect over an abbreviated time window (if the collection activity was the bottleneck).

Once the activities and communications are identified, I look to see what services I want from an OS -- and the resources available for it. IMO, choice of services goes a *LONG* way to imposing structure on an application! And, it acts as an implicit "language" by which the application can communicate with other developers about what it's doing at any particular place.

E.g., do you need fat pipes for your communications? Are you better off passing pointers to memory regions to reduce bcopy()'s? Can you tolerate shared instances of data? Or do you *need* private instances? How finely must you resolve time? What are the longest intervals that you need be concerned with?

I would also set aside some resources for (one or more) "black boxes". These can be invaluable in post-mortems. Ideally, if you have a ground-based prototype that you can modify, consider adding additional memory to it (even if it is "secondary" memory) for these black boxes. Having *lots* of data can really help identify what is happening when things turn ugly. (much easier than trying to reproduce a particular experiment -- which might not be possible! -- so you can reinstrument it!) Litter your code with invariant assertions so you see every "can't happen" when it *does* happen! :>
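A "black box" combined with invariant checks can be as small as a circular log that an assertion-like macro writes into when a condition fails. The sketch below is only an illustration under assumptions: the entry layout, the depth, and the names `bb_record`, `BB_INVARIANT`, `bb_total` and `bb_latest` are all hypothetical, and it is written as host-testable C11 rather than for any particular DSP toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BB_ENTRIES 128u  /* hypothetical depth; must be a power of two */

static_assert((BB_ENTRIES & (BB_ENTRIES - 1u)) == 0u,
              "BB_ENTRIES must be a power of two");

typedef struct {
    uint32_t timestamp;  /* e.g. a free-running tick counter */
    uint16_t event_id;   /* which check or event fired */
    uint16_t line;       /* source line of the check */
    uint32_t data;       /* one word of context */
} bb_entry_t;

static bb_entry_t bb_log[BB_ENTRIES];
static uint32_t   bb_count;  /* total number of events ever recorded */

/* Store one entry, overwriting the oldest once the log is full. */
void bb_record(uint32_t now, uint16_t event_id, uint16_t line, uint32_t data)
{
    bb_entry_t *e = &bb_log[bb_count & (BB_ENTRIES - 1u)];
    e->timestamp = now;
    e->event_id  = event_id;
    e->line      = line;
    e->data      = data;
    bb_count++;
}

/* Invariant check that records a violation instead of halting the system. */
#define BB_INVARIANT(now, id, cond, data)                              \
    do {                                                               \
        if (!(cond)) {                                                 \
            bb_record((now), (id), (uint16_t)__LINE__, (data));        \
        }                                                              \
    } while (0)

/* Total events recorded since start-up (may exceed BB_ENTRIES). */
uint32_t bb_total(void)
{
    return bb_count;
}

/* Most recent entry, or NULL if nothing has been recorded yet. */
const bb_entry_t *bb_latest(void)
{
    if (bb_count == 0u) {
        return NULL;
    }
    return &bb_log[(bb_count - 1u) & (BB_ENTRIES - 1u)];
}
```

Recording rather than aborting is a design choice for a system that cannot be stopped in a debugger; the log still has to be placed in memory that survives a reset and dumped over telemetry, which this sketch does not address.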

Finally, testing is critical. The goal *should* be to "break" the device. Really! And then look at the conditions in which it did "break" and see how those relate to your actual deployment ("Well, the system crapped out when the 40KHz signal from the switching power supply was coupled to the NMI pin...")

Mark Borgerson (sp?) would be a good contact -- he posts here often (and, IIRC, designs data collection devices for submersible deployments... similar issues to what you face, I suspect.)

Paul Bennett (I see a post from him elsewhere in this thread) will be a good resource regarding "getting it right, on paper, *first*" -- since mistakes are probably costly for you (in terms of access to the actual device as well as lost opportunity for your "customers")

There are others who can offer good advice, as well. The actual nature of your project is what prompted me to suggest these two folks...

- Walter Banks

Wed, Dec 14, 2011 8:41 PM

Alessandro

I have worked on some major software disasters and one question that should be asked early is, "What works?"

Is what works self contained enough for retention as a separate module?

Software like anything else is subject to the same rules of reliability that anything is. That conceptually is useful in breaking apart the problem.

In general that means that software components should be isolated except through well defined interfaces. Breaking large modules into several component parts will often improve overall reliability of the software significantly, especially if all components are not called every time the module is called. (I have some interesting proof for this if anyone is interested)

Samek's state machine work is interesting but has some problems as applications grow into real application size. In most state machine based applications system timing starts to become a problem and various bandaid solutions start to be applied, usually trading complexity for small timing fixes. Eventually the bandaid solutions will start to interfere with system dependencies and overall reliability will begin to drop.

Regards,

Walter Banks Byte Craft Limited

- Don Y

Wed, Dec 14, 2011 8:52 PM

[much elided as we already agree on general methodology]

I don't know about the "80" figure (from the sorts of bugs I have encountered in RELEASED code, I would have imagined the figure to be even HIGHER!) but I find this to be very true. Even the act of aggressively writing test cases against a specification will often turn up lots of things that weren't considered.

E.g., I go to great lengths to try to design data representations so that "nasty" values CAN'T exist. -- so that someone can't fabricate a set of inputs that I'm not prepared to handle.
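One well-known instance of a representation in which "nasty" values cannot exist is a binary angle: the full range of an unsigned 16-bit integer is mapped onto one turn, so every bit pattern is a valid angle and wrap-around at 360 degrees comes for free. This is a sketch with hypothetical names (`bam16_t`, `bam_add`, `bam_to_degrees`), offered as an illustration of the idea rather than as anything used in the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Binary angle: 0..65535 maps onto [0, 360) degrees, so no value is invalid. */
typedef uint16_t bam16_t;

static_assert((bam16_t)-1 == 65535u, "bam16_t must be exactly 16 bits wide");

/* Adding two angles: truncation to 16 bits *is* the wrap at 360 degrees. */
bam16_t bam_add(bam16_t a, bam16_t b)
{
    return (bam16_t)(a + b);
}

/* Convert to degrees for display or logging only. */
double bam_to_degrees(bam16_t a)
{
    return (double)a * (360.0 / 65536.0);
}
```

For example, `bam_add(0xC000, 0x8000)` (270 degrees plus 180 degrees) yields `0x4000`, which is 90 degrees; there is no input for which the caller has to range-check the result.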

It's hard to eliminate all assumptions. E.g., I invariably assume that the next instruction executed WILL be the one that is intended to be executed :>

But, people seem to find it very hard to ask themselves, "What am I *assuming*, here?" Too often, assumptions are *so* fundamental that they blend into the landscape. If, instead, you approach it in an EQUIVALENT manner as "What am I RELYING ON, here?", it tends to result in a more apprehensive approach to that exercise. I.e., as if there *is* some vulnerability and you are tasked with *finding* it! (looking for "assumptions" seems to be less "threatening" -- and, perhaps, less *motivating*)

Once you identify the assumptions ("reliances"/dependencies?), things get easier. But, you still have to be incredibly honest (cynical?) in how you assess them.

If you dismiss an assumption as "safe", is it *really*? What PREVENTS those things that "can't happen" from actually happening? If you can't prevent it, do you at least try to *detect* it (as a safeguard)? If it TRULY can't happen, then you should feel SUPREMELY CONFIDENT adding this line of code to your product:

if (cant_happen) { give_away_all_assets(mine); self.shoot() }

:-/

- Alessandro Basili

Wed, Dec 14, 2011 9:41 PM

[...]

We have capabilities to uplink the software, some of the core software is on PROM though, so we keep what we have there (and by the way, this was a wise choice). The hardware has a lot of items that could have been done differently, but likely hardware is difficult - read impossible - to change and it defines very precise constraints.

The hardware was tested with a "test software" running, which checked the hardware functionalities, like memory access, registers and such, but the main software was too far behind schedule to meet the test campaign. To the credit of the whole team, we should say that at a higher level the management completely overlooked the schedule problems they were having, and only after the launch of the experiment did they realize "hey we have a star tracker, let's use it!".

Except for two items that were space rated, all the rest (~600 dsp units, ~20 microcontrollers and tons of fuse-logic fpgas) have been chosen for their tolerance to radiation after doing tests on particle accelerators. That actually means that the rate - cross section - is low enough to not adversely affect operations. We have a built-in test that performs a check over the program memory and calculates the CRC; any deviation from what we expect is handled with a reboot of the node. We have not yet looked at a distribution of these events, but the rate is ~1/2 per day.
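A built-in test of this kind can be sketched as a CRC over the memory image compared against a stored reference. The post does not say which CRC is used, so the example below assumes the common reflected CRC-32 (polynomial 0xEDB88320) purely as a stand-in; `crc32_bitwise`, `image_is_intact` and their parameters are hypothetical names, and the code is host-testable C rather than DSP code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise reflected CRC-32 (assumed here; the actual CRC in use is not stated). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0u);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; apply the polynomial when a 1 bit falls out. */
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* True when the image still matches the CRC computed at build/upload time. */
bool image_is_intact(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_bitwise(image, len) == expected_crc;
}
```

In this sketch the caller decides what to do on a mismatch (the post describes rebooting the node); a table-driven CRC would be faster but costs 1 KB of memory that is itself exposed to upsets.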

- Tim Wescott

Wed, Dec 14, 2011 10:04 PM

Did you leave out the semicolon on purpose, so it would fail to compile and save you from shooting yourself?

--
My liberal friends think I'm a conservative kook.
My conservative friends think I'm a liberal kook.
Why am I not happy that they have found common ground?

Tim Wescott, Communications, Control, Circuits & Software
http://www.wescottdesign.com

- Don Y

Wed, Dec 14, 2011 10:21 PM

Defective keyboard (notice how many characters are "missing" in my posts, lately -- despite a conscious effort to go back and fill in any that I notice :< ). End-of-year equipment shuffling means machine I normally use for posting is being replaced (and I'm hammering on a "poor substitute" in the meantime :< )

(sigh)

- Steve B

Wed, Dec 14, 2011 10:30 PM

Interesting. This is off the topic of the thread, but I think a star tracker will be quite difficult to get tuned and working after the fact. Not impossible, but having the optical and mechanical calibration and integration done right would be essential. So I bet it would make for a very interesting and challenging task.

Sounds quite good then. I guessed from the cern.ch domain on your email that whoever you're working with must have access to lots of radiation data or facilities.

- Duane Mattern

Wed, Dec 14, 2011 11:56 PM

Walter, Samek's state machine book was a real eye opener for me, but if you think that it will break down with "larger" applications, can you suggest an alternative approach for handling multi-level state machines? I've used IBM's native UML code generator (iLogic Rational) and it was a bit verbose but worked well for the small application that I was testing it on. I would not want to stop using the UML hierarchical statecharts, so I need some mechanism by which to generate code from them. If I don't have an IDE that does the UML code generation for me, then I'll use Samek's approach until I find something better.

Duane Mattern Sampled Systems, LLC

- Alessandro Basili

Fri, Dec 16, 2011 2:13 AM

On 12/14/2011 7:51 PM, Paul E. Bennett wrote: [...]

Those suggested are both being ordered at this point (unfortunately CERN Library does not have them - sigh!)

One of the big problems in the environment I work in is that usually the requirements look like "as fast as possible", "as precise as possible" or "as small as possible". In this kind of environment it is often hard even to get agreement that specs are needed (sounds sad, but that is often the case).

I'm currently in the situation where I'm arguing against the claimed accuracy the instrument *should have* to meet the "expectations". Essentially the star tracker is needed to provide position in order to verify where the 'photon detector' is aiming. The position accuracy is rated to a few arcsec, but unfortunately the 'photon detector' has a pointing accuracy of ~0.1 deg (360 arcsec), i.e. two to three orders of magnitude coarser. That means we are trying to build a Ferrari when a tricycle is more than enough. So far I have not brought these arguments up to the decision makers, but it is odd to me that everyone I ask seems to realize that such a precision is not needed at all.

Often the idea of "squeezing everything out of the hardware" is kind of insane, since it most probably results in overcomplicated software for no benefit at all (ever used a bazooka to kill a fly?).

I share your point fully: assumptions too often are tagged as "normal", "usual", "common". Example: the guy who implemented the fpga to handle the serial port didn't use a FIFO - so the software has to make sure there's no overrun in the receiver - which "normally" one would have thought of. Unfortunately the guy was either not aware of the existence of the FIFO per se or simply didn't think it was needed because it was not specified in any requirement. Now, how is it "normal" to assume that a serial port has a FIFO? With what confidence can I rely on the decisions made in the implementation phase? Assumptions are very subtle sometimes, since "common practice" is not so common after all.

On top of it there's a herd out there which simply refuses to write down specs just to avoid having to explain the reasons and motivation behind a choice (which, by the way, most of the time does not have any reason behind it).

At this point part of my effort goes into documenting the concepts with big block diagrams. Too often I see people stuck in the details while lacking the overall picture. And I do believe that simple big block diagrams are often very effective in conveying the concepts; they provide a "place" to point a finger at and discuss, instead of lengthy discussions waving hands and mimicking functionality with a whole set of gestures that are often hard to even reproduce. I find that writing concepts in the form of big block diagrams is extremely helpful and often eliminates the need to use colloquial language, which is too often incorrect, most of the time not understood by non-natives and embarrassingly full of synonyms which only distract attention (a "signal" in the first sentence becomes an "impulse" in the second one and an "event" in the third!).

Unfortunately I'm a fervent opponent of ebooks (well, really only of DRM, which is often how they are delivered), so I cannot even print it single-sided and distribute it around. I will proceed with a normal printed copy to be shared ;-)

As part of this process I'm trying to evaluate the usage of the g21k compiler under GNU/Linux. The previous team was too much focused on the software itself and didn't even think about creating an "environment" around it.

I'm used to writing software on GNU/Linux systems and, by following some very simple concepts, my development cycle is usually very clear and easy to manage (to the point that it is increasingly followed by people from different groups within the collaboration). I want to profit from that approach here as well, which is why I spent quite a bit of time getting g21k to work together with the assembler and the linker under GNU/Linux. The only missing piece is the archiver (the /ar/ utility does not support the COFF format!) but I can live without it and link directly with the objects I need. I managed to compile the ADI C-runtime library from source, but to be sure that the whole chain is working I will need to build a "test" program that I know has been proven to work and load it on the hardware. If anyone around here has any experience with that I would be more than happy to follow advice/suggestions/comments.

That is why I'm trying to write down what actually happened in terms of management and design. My plans are to leave a plan for people to follow, either with or without me on board and even though it is easier to say than done, at least I believe I understand its importance.

- Alessandro Basili

Fri, Dec 16, 2011 2:43 AM

On 12/14/2011 9:52 PM, Don Y wrote: [...]

Could you make an example here?

How would you go with the assumption that the compiler of your application works? How do you check that?

That is quite an interesting point. In my previous life (five years ago!) I was building hardware for the same detector and the development cycle surely included a static timing analysis on the FPGAs, but we didn't _assume_ that was enough; rather, we decided not to _rely_ on the output of that analysis, put the electronics in the thermal chamber and did a fully functional test. Those tests not only spotted a few bad components (infant mortality!), but also gave us grounds to believe that in all possible thermal conditions the hardware behaved the way we expected.

Those tests were part of our acceptance tests for the flight electronics and we are currently benefiting a lot from it!

What if the cant_happen variable sits in a memory which has a bit flip? Would you then protect the variable with a CRC and, instead of a cant_happen variable, have a cant_happen() function which retrieves the variable, calculates its CRC and compares it with the stored CRC?
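The idea raised in this question can be sketched as a small "guarded" variable that carries its own check word and is only ever read through an accessor. This is an illustration under assumptions, not an answer given in the thread: the type `guarded_u32_t`, the functions `guarded_set` and `guarded_get`, and the choice of CRC-16/CCITT (polynomial 0x1021, initial value 0xFFFF) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t value;  /* the protected datum */
    uint16_t crc;    /* check word computed when the value was written */
} guarded_u32_t;

/* CRC-16/CCITT over the four bytes of v, most significant byte first. */
static uint16_t crc16_ccitt_u32(uint32_t v)
{
    uint16_t crc = 0xFFFFu;
    for (int byte = 3; byte >= 0; byte--) {
        crc ^= (uint16_t)(((v >> (8 * byte)) & 0xFFu) << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Write the value together with its check word. */
void guarded_set(guarded_u32_t *g, uint32_t value)
{
    assert(g != NULL);
    g->value = value;
    g->crc   = crc16_ccitt_u32(value);
}

/* Read the value; returns false if value and check word no longer agree. */
bool guarded_get(const guarded_u32_t *g, uint32_t *out)
{
    assert(g != NULL && out != NULL);
    if (crc16_ccitt_u32(g->value) != g->crc) {
        return false;  /* corruption detected: caller must not trust the value */
    }
    *out = g->value;
    return true;
}
```

This detects a flipped bit in either the value or the check word, but it cannot repair it, and the checking code itself still sits in memory that can be upset, so it narrows the problem rather than removing it.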

- Alessandro Basili

Fri, Dec 16, 2011 5:49 AM

And indeed I'm working on that as well, at least to have numbers right. I believe now the expectations are way beyond the current hardware design.

The DSP program has a bootstrap code which loads a very minimal utility program that we call the "loader". The loader can execute a few commands such as "write flash" and "read flash" (which we use to upload new software) and "jump to main", where "main" is our main application.

After a few hours running in main, it looks like the hardware goes back to the loader, as if there had been a hard RESET that bootstrapped the loader. The main program never intentionally jumps back to the loader, which is why this looks a bit strange.

We experienced the same behavior on the ground, so the magic bit flip is harder to sell!

In the developer's defense, I must admit he had neither experience nor a mentor. I must say, though, that I believe a good amount of self-criticism often compensates for a lack of knowledge, once you realize you don't know much after all.

I am actually very sensitive to this problem. I do believe that for this kind of application the complexity of multi-tasking or multi-threading is not necessary and a simple hierarchical state machine may get the job done, but since I have to service the serial port in a timely fashion I'm not quite sure I could control the timing of the FSM.

I personally believe that interrupts should set flags and nothing more. That way the synchronization is handled entirely at the FSM level and I don't have to chase funny combinations of interrupts occurring at various moments (how would I test that???).
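As a rough sketch of the "interrupts only set flags" approach described above: the handler names, the flag names and the three states are all hypothetical, and hooking the handlers to real interrupt vectors is left out because it is toolchain specific.

```c
#include <stdbool.h>

/* Flags written only by interrupt handlers, read and cleared by the main loop. */
static volatile bool uart_rx_flag;
static volatile bool timer_flag;

/* Hypothetical ISRs: they record the event and return immediately. */
void uart_rx_isr(void) { uart_rx_flag = true; }
void timer_isr(void)   { timer_flag = true; }

typedef enum { ST_IDLE, ST_HANDLE_COMMAND, ST_ACQUIRE } state_t;

/* One step of the state machine; all decisions happen here, not in the ISRs. */
state_t fsm_step(state_t s)
{
    switch (s) {
    case ST_IDLE:
        /* The serial port is polled first so that it has priority. */
        if (uart_rx_flag) { uart_rx_flag = false; return ST_HANDLE_COMMAND; }
        if (timer_flag)   { timer_flag = false;   return ST_ACQUIRE; }
        return ST_IDLE;
    case ST_HANDLE_COMMAND:  /* placeholder: parse and reply to a command */
    case ST_ACQUIRE:         /* placeholder: trigger an image acquisition */
    default:
        return ST_IDLE;
    }
}
```

The main loop would simply call `fsm_step()` forever. Two caveats apply: a boolean flag cannot count, so a second interrupt arriving before the flag is polled is silently merged with the first, and the worst-case time of one pass through the loop bounds how quickly the serial port is serviced.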

The specs stated that every command should get a reply, but unfortunately here too it was not clear when the reply would come, since some processes are slower than others and so on. Given that the software had a "command queue", it would have been possible to ignore the reply to the first command and keep sending the others.

During the software implementation, commands were added whenever the developer needed one more to accomplish what he wanted to do. The result is a bunch of commands, each with its own parameter interface and no clear indication of how each of them is processed.

My bad, I should have elaborated on this a little more. The CCD is actually only the sensor; an FPGA controls how the chip takes the picture and samples the data, and an ADC then converts it into a digital picture. This FPGA is controlled via a register that the DSP can read and write. So if I want to integrate the light over a longer time, I can send a command to the FPGA by simply writing a few bits in this register.

We are now trying to work out what the register looks like (by reading the VHDL); of course no documentation provides any detail about the register.

I understand that, but I would start at least from reading the source code of the basic functions. They might be a check-off item, but I believe they are worth using as a first approximation.

I believe it was intended as a pacing mechanism, since nobody is verifying anything on the "other party". But the message format didn't allow more than 256 bytes, effectively limiting the outgoing data rate to 1920 bytes/sec.

We have a GPS onboard which is continuously sending data, and the type of messages it sends may differ according to its configuration. In this case there's no pacing, but it fully utilizes the bandwidth.

We reconstruct pointing on the ground, i.e. every picture comes with a timestamp and the N brightest stars in the field of view. The bigger N is, the better the algorithm on the ground recognizes the stellar field. Moreover, the higher the sampling frequency, the higher the accuracy (less need for interpolation). But all these factors increase the volume of data that needs to be transferred.

Regarding testing, I had in mind to add a tracing mechanism (a sort of printf) that would fill a log with useful information that can be dumped regularly or on request. The implementation shouldn't add too much overhead, and I believe that if used with care it can give great insight into the flow. As an example, it could log how much time is spent in each function.
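One way such a tracing mechanism might look is a fixed-size ring buffer of small binary records, which is much cheaper than formatting text on the target. This is only a sketch: the entry layout, the capacity of 64 and the function names are assumptions, and the timestamp is whatever free-running counter the hardware offers.

```c
#include <stdint.h>

#define TRACE_CAPACITY 64u   /* illustrative size; must be a power of two */

typedef struct {
    uint32_t timestamp;  /* e.g. a free-running timer count */
    uint16_t event_id;   /* which function or event was hit */
    uint16_t arg;        /* small payload, meaning depends on event_id */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_CAPACITY];
static uint32_t trace_head;  /* entries ever written (32-bit wrap ignored here) */

/* Record one entry; the oldest entry is overwritten when the buffer is full. */
void trace_log(uint32_t timestamp, uint16_t event_id, uint16_t arg)
{
    trace_entry_t *e = &trace_buf[trace_head & (TRACE_CAPACITY - 1u)];
    e->timestamp = timestamp;
    e->event_id = event_id;
    e->arg = arg;
    trace_head++;
}

/* Number of valid entries currently held. */
uint32_t trace_count(void)
{
    return trace_head < TRACE_CAPACITY ? trace_head : TRACE_CAPACITY;
}

/* Fetch the i-th oldest entry (0 = oldest); returns 0 if i is out of range. */
int trace_get(uint32_t i, trace_entry_t *out)
{
    uint32_t n = trace_count();
    if (i >= n)
        return 0;
    *out = trace_buf[(trace_head - n + i) & (TRACE_CAPACITY - 1u)];
    return 1;
}
```

Logging one entry on function entry and one on exit, then subtracting the two timestamps when the log is dumped, would give the time spent in each function. If `trace_log()` is ever called from an interrupt as well as from the main loop, the update of `trace_head` needs protecting.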

The shuttle flight to launch our experiment costs 500M$. The cost of the experiment is estimated at 2B$.

I understand your point and I'm not denying the fact that investing in a well defined set of specs and a good design pays off later on. I also believe you have to factor in the personal background each member of the team has. It is very hard to change the way people work, after all we human beings are a fundamentally lazy animal ;-)

This is why we actually prefer an iterative and incremental approach: early testing makes us go back, refine the specs and adjust our aim along the way. A waterfall model may cause problems if the specs are not thoroughly checked and at the same time are set in stone.

[...]

Gee, that's another thing that threw me off. I don't blame people who have a different coding style, as long as they have one. Mixing lower and upper case seems to be one of people's favorite pastimes. It seems to me they let FATE decide the case of the next letter in the word.... arghhh!

Personally I believe that block diagrams fulfill the need pretty well, if some timing is needed then a waveform like drawing with a cause-effect relationship between signals may help a lot to understand the flow.

What I've always seriously doubted is a flow chart of the program. They rarely match what the program is doing (also because it would be nice to see how to include your interrupts in a flow chart) and often give the impression that once you have it done the software is "automatically generated". I personally have never seen a flow chart which corresponds 1:1 to the program; maybe it is just my lack of experience.

Believe it or not, the memory map of the board was the first document I wrote, and it is still incomplete (not yet sure about a few FPGA registers!). This simple document allowed me to understand how the memory is intended to be used.

I think synchronization gets really complex whenever you get into the multi-threading business and/or have to service multiple interrupts. Given the old technology and, luckily, very little OS support (I haven't found any), I was aiming for a very simple, procedural design, which I believe would be much easier to test and to bring in line with the specs.

To back up this motivation a bit more: I just finished writing an extremely simple program that toggles a flag from the timer counter interrupt. The end result is that I failed to get the period I wanted, and moreover it is clear that interrupts are lost from time to time.

Since in this last case I was kicking the dog with this flag, I actually couldn't care less if I lost an interrupt, as long as the period is short enough to keep the dog quiet. But I got discouraged by a post on comp.dsp which stated: "This is embedded ABC basics: don't kick a dog in the interrupts." but gave no reason.

Now my point is: how much time should I invest in making it work rather than exploring a totally different path? If I had infinite time I would probably try to make this stupid interrupt work the way I expect, but these details may delay the project a lot, if not irreversibly.

This reinforces my personal opinion that having the design in block diagram form would make this kind of "shift" easier to see. Adding processing times to the functions may give a lot more detail and help avoid or bypass bottlenecks.

Here I'm a strong supporter of statically allocated memory, unless it is not enough. Dynamically allocated memory is one of the things most programmers end up getting wrong, often without even noticing. I have lost count of the times I saw a malloc in the middle of a function that was never free'd at the end.

Shared memory is also something that unless needed should be avoided IMO. What is the advantage to have a memory shared when there's enough available?

What do you mean post-mortem?

I agree, and this is why I believe I need to add a logging capability, in order to see what happened after the fact. I never thought about changing the hardware, since I've always believed that adding hardware keeps me from building the tools and maybe the software structure. A good example is the emulator. I see people increasingly developing with the emulator; then they have to unplug it and hand the product over, and they are left without any tool to assess its state/functionality, since they always relied on the emulator and cannot work without it.

Luckily here we have some experience in "breaking" things, but joking aside, I like the idea of testing not to check that it works but to check when it *does not* work. A simple example in that regard: we have a list of commands on board the main computer that cannot exceed 256 items. Since everybody knew that 256 was the limit, no one ever tried to send more, until it happened by mistake and the main application had to be rebooted, due to a problem on the 256th item! Testing in order to break things actually builds up the reliability of software which otherwise looks fragile, depending as it does on everything else working fine.
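A sketch of what "testing in order to break" could look like for the 256-item example above. The real cause of the reboot is not stated in this thread; an 8-bit counter wrapping to zero is just one plausible failure mode, assumed here for illustration, and `cmd_list_t` and its functions are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMD_LIST_MAX 256u

typedef struct {
    uint16_t items[CMD_LIST_MAX];
    uint16_t count;  /* wide enough to hold 256; an 8-bit counter would wrap to 0 */
} cmd_list_t;

/* Start with an empty list. */
void cmd_list_init(cmd_list_t *l)
{
    l->count = 0;
}

/* Append a command; refuses (returns false) instead of overflowing. */
bool cmd_list_add(cmd_list_t *l, uint16_t cmd)
{
    if (l->count >= CMD_LIST_MAX)
        return false;
    l->items[l->count++] = cmd;
    return true;
}
```

The interesting test cases are exactly the ones nobody sends in normal operation: item number 256, item number 257, and an add on a list that is already full. Each should be rejected or accepted cleanly, never crash.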

- A
- Alessandro Basili
  
posted
12 years ago

Fri, Dec 16, 2011 8:02 AM

Well, until the software can reliably take images it will be hard to do anything you mentioned. I don't quite understand what you mean by mechanical calibration.

Most of the testing was done off site, mostly at GSI, with heavy nuclei at energies ranging from 100 to 1000 MeV/nucleon. At CERN there are several facilities which monitor the level of radiation, mostly ionization dose, but they are of course sensitive to SEE as well. There's a great deal of effort, now that the machine is working, to reassess the status of the electronics, given the fluxes are much higher than anticipated [reference needed...].

- D
- Don Y
  
posted
12 years ago

Fri, Dec 16, 2011 8:52 AM

Hi Alessandro,

Using unsigned's for counts (can you have a negative number of items?). Using relative measurements instead of absolutes (e.g., "worksurface is 23 inches from reference; position of actuator is 3.2 inches from edge of worksurface" contrast with "worksurface is at 23.0, actuator is at 22.5 -- oops!")

Buy a compiler that has passed a validation suite. And hope your code doesn't stress it in some bizarre way :>

In recent years, MIN and MAX numbers seem to have disappeared from datasheets. Everything is "typ" and at "ambient". Worst case design practices seem to be a thing of the past. :< And, if you try to do *anything* "out of the ordinary", you're often "on your own"!

(IMO) You have to adopt a similar cynicism when it comes to software. "Will every client follow these rules? What happens if they don't?"

I try to formalize my contracts with assertions at the top of each function *proving* that the caller has "followed the rules" and that what I am about to do in the function is safe. E.g.,

    ASSERT(count > 0)
    average = total / count

I should be able to remove those ASSERT()s with no change in functionality -- they should never be tripped. (This also gives developers an unambiguous explanation of what the interface *does* guarantee.)
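For reference, the fragment above can be written as compilable C using the standard `assert()` macro in place of the ASSERT() shorthand. The function name and the unsigned types are illustrative choices, not taken from anyone's actual code.

```c
#include <assert.h>

/* Contract: count must be greater than zero; the caller guarantees it. */
unsigned long average(unsigned long total, unsigned long count)
{
    assert(count > 0);  /* documents and checks the precondition */
    return total / count;
}
```

Defining `NDEBUG` before including `<assert.h>` compiles the check out, which is the "remove those ASSERT()s with no change in functionality" property: a correct caller never notices the difference.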

What if the register that it gets loaded into (from UNCORRUPTED memory) gets whacked by a particle? :>

What if a "Jump on Non Zero" opcode gets corrupted into a "Jump on Zero" as it is fetched from memory? What if some other UNRELATED piece of code gets corrupted resulting in a *jump* directly to the self.shoot() code fragment?

As I qualified Paul's comments about getting rid of assumptions... there are always some assumptions that you can't get away from.

If "doing something" has disproportionate consequences, then you should try to determine the predicate condition in two different INDEPENDENT ways -- hoping that BOTH can't be wrong (bug) or corrupted at the same time. The same sort of approach can be applied to the associated hardware.
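One common way to approximate this in software is to require two separately set keys before a critical action is allowed. The sketch below is only an illustration of that pattern, not a description of any real system: the key values, the function names and the idea of a "command path" and a "verification path" are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define KEY_REQUEST 0x5AA5C33Cu
#define KEY_CONFIRM 0xA55A3CC3u  /* bitwise complement of KEY_REQUEST */

static volatile uint32_t request_key;  /* set by the command path */
static volatile uint32_t confirm_key;  /* set by an independent check */

/* Hypothetical hooks, called from two unrelated parts of the program. */
void on_command_received(void)   { request_key = KEY_REQUEST; }
void on_condition_verified(void) { confirm_key = KEY_CONFIRM; }

/* True only if both paths agree; always disarms so each permission is one-shot. */
bool critical_action_permitted(void)
{
    uint32_t req = request_key;
    uint32_t con = confirm_key;
    bool ok = (req == KEY_REQUEST) &&
              (con == KEY_CONFIRM) &&
              ((req ^ con) == 0xFFFFFFFFu);
    request_key = 0;
    confirm_key = 0;
    return ok;
}
```

A single bit flip in either key, or a stray jump that reaches the check without passing through both hooks, leaves the action disabled. A jump that lands after the check is still not covered, which is the kind of residual assumption mentioned above.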