Thoughts on developing an

- Geato

Fri, Feb 20, 2015 8:18 PM

Hi,

New member here so hopefully I am in the correct group.

We have many controller boards in the field running the NXP LPC1769 processor. Randomly, maybe a year or even 2 years down the road, the processor crashes and re-flashing the firmware brings it back to life. We know it is a transient doing this but while we are chasing that problem, a stopgap fix would be to come up with an auto re-flasher of sorts.

I am thinking of a small PCB that plugs onto the existing JTAG connector and has a firmware image stored on a uSD card. Something (non-processor hopefully), perhaps a CPLD, powers up the uSD and transfers the image to the NXP. The hardware watchdog will initiate the transfer.

Does this sound like it is possible to do at a high level? I am trying to minimize the reliance on additional firmware like bootloaders or standalone JTAG programmers. The latter is physically too big and pricey as I would need about a thousand of these.

Cheers....


- langwadt

Sat, Feb 21, 2015 12:23 AM

On Friday, February 20, 2015 at 21:18:51 UTC+1, Geato wrote:

Can't you get to the UART and boot pin, so you can use the built-in bootloader?

Many years ago I booted and flashed an ARM7 via JTAG for a test system. It was basically some parallel-port JTAG code from a PC app, ported to an MCU.

Talking JTAG isn't complicated; figuring out what to tell the chip can be. But if you can figure that out, any old MCU with enough flash should be able to do what you want.

-Lasse

- Don Y

Sat, Feb 21, 2015 1:21 AM

Have you verified that the old/existing firmware image *is* corrupted at this point? I.e., that the device won't restart "normally" after such a crash (cycle power)?

As your existing system (apparently) can't self-flash, you must be dispatching "staff" to perform this reflash? Do they do anything besides blindly reflashing the device?

Is it cheaper/easier to just *replace* defective devices (which gives you a chance to do a post-mortem on the device(s) that have failed)?

(I.e., this sounds like the "reinstall Windows" solution-to-all-problems)

So, you assume the ONLY time the watchdog kicks in is when this "crash" happens? I.e., there are NEVER cases where the watchdog kicks in, resets the processor and execution resumes CORRECTLY (without needing a reflash)? Do you track "reset" events anywhere so you can determine *if* this is the case?
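
A rough sketch of what tracking reset events could look like, written as a host-side Python model rather than firmware. It assumes the part is an LPC17xx and that its reset-source register (RSID) flags power-on, external, watchdog and brown-out resets in bits 0 to 3; check the user manual for the exact part. The function names are made up for illustration.

```python
from collections import Counter

# Assumed bit layout of the LPC17xx RSID (Reset Source Identification)
# register. Verify against the user manual before relying on it.
RSID_BITS = {
    0: "power-on",
    1: "external-pin",
    2: "watchdog",
    3: "brown-out",
}


def decode_reset_causes(rsid):
    """Return the reset causes flagged in one raw RSID value."""
    return [name for bit, name in RSID_BITS.items() if rsid & (1 << bit)]


def tally_resets(rsid_samples):
    """Count how often each cause appears across logged RSID values."""
    counts = Counter()
    for rsid in rsid_samples:
        counts.update(decode_reset_causes(rsid))
    return counts
```

If the tally shows watchdog resets that were followed by a normal restart, the watchdog alone is not a reliable trigger for a reflash.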

Does your run-time *ever* attempt to write to that flash in normal operation?

Is power cycled often/frequently in your environment?

Is anyone tracking the frequency of these crashes in your deployed population so you can begin to identify if there is a common pattern (power-on-hours, power cycles, manufacturing date code, etc.)? I.e., can you *predict* when this event is likely to happen (or, NOT happen)? Does reflashing cause the device to be "reliable" for "another 2 years"? Or, once reflashed, do crashes occur with greater frequency?

What are the consequences to the user/application when this crash occurs?

The "right" solution is to figure out what the actual cause is. If "can't happen" *is* happening, then some assumption has been violated (which can result in *other* problems that haven't yet been visible).

What will you do if (when?) a device just sits in a tight crash-reflash loop, indefinitely? Will the user be able to determine that this is actually happening (big red light)? Will *you* be able to determine how often any particular "reflasher" has been triggered? I.e., are you sure your fix won't just *change* the problem's manifestation?
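
One way to keep a reflasher out of an endless crash-reflash loop is to cap the attempts and latch a visible fault. A minimal sketch in Python, as a model of the control logic only; `reflash`, `device_ok` and the limit of three attempts are hypothetical placeholders, not anything from the OP's design.

```python
MAX_REFLASH_ATTEMPTS = 3  # hypothetical limit, chosen only for illustration


def recover(reflash, device_ok, max_attempts=MAX_REFLASH_ATTEMPTS):
    """Reflash at most max_attempts times and report how it ended.

    reflash   -- callable that performs one reflash attempt
    device_ok -- callable that returns True once the device runs correctly
    """
    for attempt in range(1, max_attempts + 1):
        reflash()
        if device_ok():
            return ("recovered", attempt)
    # Give up: the caller should light the "big red light" and stop retrying.
    return ("fault-latched", max_attempts)
```

The returned attempt count is also the number you would want to log, so that someone can later see how often a given reflasher has been triggered.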

- Paul E Bennett

Sat, Feb 21, 2015 8:23 AM

In addition to Don's points, you might ask yourself what happens if the problem you are experiencing happens also to your re-flashing device (as that is likely to have FLASH also). You may end up loading a corrupt image in the wrong locations.

You really need to understand the problem in much better detail. Does the unit design have vulnerabilities to electrical noise, brown-outs, RF interference, High Energy Transients, Higher Frequency Interrupts than it can deal with?

I am not sure how much protection Don puts in his circuitry but I expend quite some effort to make sure that the processors in my products are quite well protected from a whole raft of transient interference. I also have checking in place to know when I am facing problems and need to report the fact. Then, my systems are usually expected to run a couple of decades with little or no maintenance effort in high dependability applications.

So, back to Don's point. Have you done an analysis of the failures that lead to the perceived need for re-flashing? Have you traced the impetus for such failures? You might want to discuss the problem with NXP as well.

--
******************************************************************** 
Paul E. Bennett IEng MIET..... 
Forth based HIDECS Consultancy............. 
Mob: +44 (0)7811-639972 
Tel: +44 TBA (due to  re-location) 
Going Forth Safely ..... EBA. www.electric-boat-association.org.uk.. 
********************************************************************

- Don Y

Sat, Feb 21, 2015 9:52 PM

Ha! I hadn't considered that! (though if the same soul designed both devices, it only stands to reason!) Rather, I was more concerned (above) with the reflasher failing to flash the (original) device due to a problem *in* the original device. E.g., perhaps when "staff" reflash the device, they have it powered from a more stable power source, implement more robust "tests" that the flash "took", etc. A "dumb box" could easily fail to achieve any of these "differences" leading to a less reliable reflash... followed by another crash (perhaps for some *other* reason than the original problem!) and another reflash followed by...

When "can't happen" *does*, you really need to step back and figure out what's wrong with your assumptions. Have you overlooked something? Has something *changed* unexpectedly?? Do you even *know* what your assumptions *are*?

Dismissing these sorts of events as "flukes" is a sign of poor engineering (when do you begin to consider a "fluke" a "genuine bug" to be acted upon??)

It puzzles me that ALL devices don't have BlackBoxes /de rigueur/. Even *volatile* implementations are very feasible and invaluable (IMO) for these sorts of situations! It's not like it's an "expensive" mechanism (development, time *or* space)
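
To give a feel for how small such a black box can be, here is a minimal sketch of a fixed-size event recorder in which new entries overwrite the oldest. It is a Python model of the idea, not firmware; the class name, the event codes and the capacity are invented for illustration. In a real device this would more likely be a fixed array in RAM that is left untouched across a warm reset, which is an assumption about the toolchain and startup code.

```python
from collections import deque


class BlackBox:
    """Fixed-size event recorder: the newest entries overwrite the oldest."""

    def __init__(self, capacity=8):
        # deque with maxlen silently drops the oldest entry when full
        self._events = deque(maxlen=capacity)
        self._seq = 0

    def record(self, code, detail=0):
        """Store one event with a running sequence number."""
        self._seq += 1
        self._events.append((self._seq, code, detail))

    def dump(self):
        """Return the surviving events, oldest first."""
        return list(self._events)
```

The sequence number survives wrap-around, so a dump shows both the last few events and how many events were recorded in total.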

The OP seems to have decided a Band-Aid is the quickest way to "solve" this problem. That seems unlikely (though we've not seen all the particulars re: his design/application/environment).

Ask oneself: what *should* I do differently to ensure the NEXT design doesn't suffer from the same problem? I suspect the "right" answer is NOT "design a reflasher in with the INITIAL design!"

And, as you've said, "what do I do when the reflasher fails?"

- Tim Wescott

Sun, Feb 22, 2015 3:10 AM

Can you set protect bits on the flash, either permanently or (assuming that you have to re-program from time to time) unlockable?

It sounds like you're allowing the processor to write to program memory, which is just wrong. If you have valid flash writes (i.e., if you have program and non-volatile data in flash), consider hard-coding the flash write routines to fail if they're told to write someplace they're not supposed to.
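
A minimal sketch of that kind of guard, modelled in Python with a `bytearray` standing in for the flash. The address window and the names `DATA_START`, `DATA_END` and `checked_flash_write` are hypothetical; a real implementation would wrap the part's own flash programming routine and use the actual memory map.

```python
# Hypothetical memory map: only this window may be written at run time.
DATA_START = 0x00078000
DATA_END = 0x00080000  # exclusive


class FlashWriteError(Exception):
    """Raised when a write falls outside the permitted data window."""


def checked_flash_write(flash, addr, data):
    """Write data at addr only if the whole range lies in the data window."""
    end = addr + len(data)
    if addr < DATA_START or end > DATA_END:
        raise FlashWriteError(
            f"write 0x{addr:08X}..0x{end:08X} is outside the data window"
        )
    flash[addr:end] = data
```

The check covers the end of the range as well as the start, so a write that begins inside the window but runs past it is rejected too.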

--

Tim Wescott 
Wescott Design Services 
http://www.wescottdesign.com

- Simon Clubley

Sun, Feb 22, 2015 1:26 PM

If you are concerned about that, have the build procedures which generate the image to be flashed in the first place also generate an MD5 or similar hash of the generated image at the same time.

As part of your post-flash verify pass, you can then download the image which was actually flashed and generate its MD5. Comparing the two hashes will tell you if the image was flashed correctly (unless you manage to generate a hash collision :-)).
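
The two steps above can be sketched in a few lines of Python using the standard `hashlib` module. The function names are invented for illustration, and how the image is read back from the device is left out entirely.

```python
import hashlib


def build_time_hash(image: bytes) -> str:
    """Run by the build procedure: hash the image that is about to ship."""
    return hashlib.md5(image).hexdigest()


def verify_flash(read_back: bytes, expected_hash: str) -> bool:
    """Run after flashing: hash what was read back and compare."""
    return hashlib.md5(read_back).hexdigest() == expected_hash
```

Swapping `hashlib.md5` for `hashlib.sha256` changes nothing else in the sketch.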

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world

- Paul E Bennett

Sun, Feb 22, 2015 2:10 PM

As Don Y, Tim Wescott and I have suggested, there is something that is fundamentally wrong with the installed systems. Rather than designing, building and installing thousands of re-flashers they should explore the root cause of the problem more thoroughly.

It is obvious that the Flash is being trashed somehow. Finding out what and why would be the best use of their time. If they have to change the design perhaps they can build in the protection measures to prevent such recurrences.


- Tauno Voipio

Sun, Feb 22, 2015 6:02 PM

A brownout detector reset chip could be a good investment.

--

-TV

- Don Y

Sun, Feb 22, 2015 7:18 PM

I'm not sure that would give a conclusive result.

First, the OP hasn't confirmed that the image even *appears* to have been corrupted (i.e., altered). All he's said is that reflashing FIXES the "problem". I.e., he is (apparently) assuming that the flash has been corrupted -- as that is what reflashing *purports* to "fix".

There may, indeed, be something (?) that has happened to the system that his reflashing ACTIVITY/procedure is "fixing" OTHER THAN "CORRECTING" THE CONTENTS OF THE FLASH.

E.g., imagine a device that is powered *on* 24/7/365 and only has power cycled as a side-effect of the reflashing process. The contents of the flash may, in fact, be intact and it is the cycling of power that is "fixing" the ACTUAL problem.

[I am not claiming this is the case. Rather, indicating that the OP's "diagnosis" is unsubstantiated: is the firmware image ACTUALLY corrupt? *How*/where? Do all afflicted devices exhibit the same problem in the same *way*/place? etc.]

Second, how you obtain that checksum/hash -- even a literal byte-by-byte comparison -- may not reflect the operating conditions of the device in its failed state. E.g., using JTAG to pull the bytes from the device will obviously *not* occur at "opcode-fetch speed". Nor will the memory access patterns mimic those that occur in normal operation.

Etc.

The OP first needs to prove to himself that reflashing *could* be a remedy -- by indicating that the contents HAVE, in fact, been altered between the time the device was manufactured and the time the "crash" (and proposed reflash) occurred.

E.g., imagine examining the flash's contents and finding it *intact*! Yet, still noting that the reflash "fixes" the problem! This poses a different problem than finding the contents have been *altered*...

While the OP may, in fact, have done these things, I'm just asking for confirmation and an elaboration as to *how* he came to the conclusion that a reflasher "makes sense" (even as a PTF). It's sort of like someone who "debugs" code by making "arbitrary" changes and waiting to DISCOVER which of them (appears to) yield the correct results. While you *may* find a change that appears to work, unless you can PROVE that it *should* work (by understanding the real problem), you may have just CHANGED the problem...

- Don Y

Sun, Feb 22, 2015 7:44 PM

As I said to Simon (upthread), I am not convinced that the flash HAS been trashed! The only "evidence" may be entirely coincidental. That's why I'd like to hear (from the OP) what he did to verify the flash's integrity (or, lack thereof).

"Reinstall Windows" is just too simplistic an approach to a problem (and, like most of those cases where "windows was reinstalled", it often doesn't prevent the problem from re-occurring! Because the PROBLEM hasn't been identified and solved).

Scientific method: construct a hypothesis; then construct an experiment (test) to validate or invalidate that hypothesis. *THEN*, come to conclusions (or, a refined hypothesis). OP seems to have just found something that APPEARS to work (? no idea how WELL!) and settled on that.

Sunday lunch: Finestkind!

- Simon Clubley

Mon, Feb 23, 2015 1:13 AM

Hello Don (and Paul),

I was addressing Don's interesting and specific comment about how you detect, in general, a faulty flash image caused by a malfunctioning reflasher. I wasn't offering a general suggestion for the OP.

The beauty of a build time hash is that even if a faulty reflasher corrupts the in-memory image _before_ burning it, the hash will detect that, whereas comparing the burnt image against the corrupt in-memory image will not.
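
A small Python model of that difference, under the assumption that the reflasher damages its RAM copy and then burns it faithfully. `faulty_reflash` and the sample image are made up for the demonstration.

```python
import hashlib


def faulty_reflash(good_image: bytes):
    """Model a reflasher that corrupts its RAM copy before burning it."""
    ram_copy = bytearray(good_image)
    ram_copy[0] ^= 0xFF          # simulated corruption inside the reflasher
    flash = bytes(ram_copy)      # the burn itself is performed faithfully
    return ram_copy, flash


good_image = b"firmware image contents"
build_hash = hashlib.md5(good_image).hexdigest()  # produced at build time

ram_copy, flash = faulty_reflash(good_image)

# Check 1: read back and compare against the reflasher's own RAM copy.
readback_matches_ram = flash == bytes(ram_copy)

# Check 2: hash the read-back image and compare with the build-time hash.
hash_matches_build = hashlib.md5(flash).hexdigest() == build_hash
```

Here the read-back comparison passes even though the wrong image was burnt, while the build-time hash comparison fails.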

However, based on the thread so far, I agree the OP has a more basic problem which is the real cause and is the one which needs solving.

Simon.


- Jack

Mon, Feb 23, 2015 7:22 AM

and also do some check on the non-volatile data in flash in case it becomes corrupt...

Bye Jack

--
Yoda of Borg am I! Assimilated shall you be! Futile resistance is, hmm?

- Don Y

Mon, Feb 23, 2015 4:46 PM

Good point! OTOH, do you build a re-reflasher to verify the hash stored in the reflasher hasn't been corrupted? I.e., reflasher's hash gets mangled. It CORRECTLY reflashes the device in question. Then, computes the hash of that image (from/in the device) and notices that it is not in agreement with the stored hash -- so, it (erroneously) decides the reflash didn't "take" and repeats the process... :-/

(which, of course, will *still* fail -- because the *hash* is corrupt!)

I think the OP hasn't even (clearly) identified the *symptoms*, let alone the *problem*! (i.e., *is* the image intact or not? if it *is*, then why are you reflashing it??)

- Simon Clubley

Tue, Feb 24, 2015 1:35 PM

:-)

I learnt a long time ago that not every problem can be solved by technical means; sometimes a technical solution becomes a management solution instead.

In this hypothetical case, the build time hash has allowed it to be established that either the image or the hash itself is getting corrupted by the reflasher. In either case, the end result is the same - the reflasher is faulty and cannot be trusted.

At this point, the reflasher should be pulled out of service and dumped on the bench of whoever created it. This person should be told "this reflasher is faulty and this hash is the proof. Fix it."

If they still can't do that then that's when you either go to their manager with your hash proof and/or put a quote for your design services on their desk. :-)

Indeed. And just to repeat this; I am not suggesting the OP go down the reflasher route. I am just thinking about how to detect/solve the specific question Don posed.

Simon.


- Don Y

Tue, Feb 24, 2015 4:15 PM

Yes -- but notice how we're now talking about a problem with a REFLASHER! The *original* problem is hiding (unsolved) behind a (potentially) newly created one! :-/

(OP) Understand the problem first. It *may* be that the most practical solution ends up being a reflasher (ick!). E.g., Hubble's defective mirror was best solved as it was -- instead of *replacing* the entire mirror (which would have been the "ideal" solution).

But, know *why* this solution is the best instead of just throwing it up as a quick fix!

[I'm off to one of my pro bono gigs...]