Approach to Finding the Root Cause of Failures

ure that is rare, intermittent or obscure?

failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

t for not being fooled by the results of your test

n a weighting factor of 1 to anything until you know you have the problem solved

draw an opposite conclusion when you repeat a test than what you concluded after the first test.

one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate they are concerned in a meeting), you will never get a clear path to troubleshoot the problem in your own way.

roblematic.

the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

elp you find contradictions in your thinking.

I'm mostly talking about circuits I designed. But I do say the exact same things to myself. :^)

George H.

George Herold

I meant myself.

--
John Larkin         Highland Technology, Inc 

Science teaches us to doubt. 
jlarkin

Cognitive behaviour therapy would probably suggest that you should settle for "What did I miss here?".

Calling yourself an idiot - no matter how correctly - doesn't set up a good frame of mind.

--
Bill Sloman, Sydney
Bill Sloman

Wrong. Finding one-off field failures is not the same as failure analysis at the factory, where a fault affects an entire production run. I did it at Microdyne, where it was often traced to out-of-spec components, or the OEM had changed their manufacturing process. Another cause was purchasing substituting unauthorized components. Like when they switched suppliers of variable inductors without asking for samples and verifying the new coils. Their excuse was "5% is always better than 10%, isn't it?" The SRF was 25 to 35% lower on the 5% parts, so we had an entire run of boards that had insufficient bandwidth. You won't find problems like that as a service tech.

Michael Terrell

I had a smiley, but I have seen more than a few systems' reliability improved by adding a bigger capacitor. There is a rule in software development that "almost all programming can be viewed as an exercise in caching". (Yes, it is an exaggeration - but there's a grain of truth in it.) Capacitors are the hardware equivalent of software caches.

Mind you, I have seen problems with too-big capacitors too. I remember long ago trying to find why a card communicated fine (at 9600 baud RS-232) with some computers but not others. Looking with a scope, the RS-232 signals were lovely triangle waves - someone had added 100 nF capacitors to the lines to reduce the noise...
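A rough back-of-envelope check of why 100 nF turns a 9600 baud signal into triangle waves, assuming the line driver behaves like a current-limited source. The 10 mA drive current and the 20 V swing are illustrative assumptions, not values from the post; only the baud rate and the capacitance come from it.

```python
def bit_time_s(baud: float) -> float:
    """Duration of one bit at the given baud rate."""
    return 1.0 / baud


def swing_time_s(cap_f: float, drive_current_a: float, swing_v: float) -> float:
    """Time for a current-limited driver to slew a capacitor through swing_v.

    From dV/dt = I / C, the time for a full swing is t = C * V / I.
    """
    return cap_f * swing_v / drive_current_a


# Values from the post: 9600 baud, 100 nF added to each line.
BAUD = 9600
CAP_F = 100e-9

# Assumed for illustration only (not from the post):
DRIVE_CURRENT_A = 10e-3  # driver output current limit
SWING_V = 20.0           # for example -10 V to +10 V

t_bit = bit_time_s(BAUD)
t_swing = swing_time_s(CAP_F, DRIVE_CURRENT_A, SWING_V)

print(f"bit time:   {t_bit * 1e6:.0f} us")
print(f"swing time: {t_swing * 1e6:.0f} us")
if t_swing > t_bit:
    print("the line cannot complete a full swing within one bit")
else:
    print("the line settles within one bit")
```

Under these assumptions the full swing takes longer than a bit period, so whether a given receiver still decodes the data depends on its thresholds, which would fit the "some computers but not others" symptom.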

David Brown

David Brown wrote:

I have a trusted engineer friend who once said that most failures occur at power up or power down. He always left his computers, at work and at home, up all the time.

Old net and system admin guys usually like keeping systems up and running at all times too. The big computer rooms of the sixties would lose thousands an hour in insurance if the room temperature rose above a preset level like

DecadentLinuxUserNumeroUno

Yeah, I've gone overboard too. The board I'm making now has a 150 uF tant on the 12 volt line because I have no real specs on the power source and there are multiple boards it's used on as a daughtercard anyway. The orig

. So I used the biggest part I could find, not knowing what else might be on that power rail glitching away. Turns out 150 uF x 8 daughtercards was a bit much for the supply at power up! Fortunately the chip they used had a cap you could change to set the ramp speed, and once it was dialed back it worked fine.
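To put numbers on why slowing the ramp helped, here is a small sketch of the average charging current for a linear ramp, I = C * dV/dt. The 150 uF, eight cards and 12 V come from the post; the ramp times are hypothetical, since the post does not give them.

```python
def inrush_current_a(cap_per_board_f: float, boards: int,
                     rail_v: float, ramp_time_s: float) -> float:
    """Average current needed to charge the total capacitance over a linear ramp."""
    total_cap_f = cap_per_board_f * boards
    return total_cap_f * rail_v / ramp_time_s


# From the post: 150 uF tantalum per daughtercard, 8 cards, 12 V rail.
CAP_F = 150e-6
BOARDS = 8
RAIL_V = 12.0

# Hypothetical ramp times, chosen only to show the trend.
for ramp_ms in (0.1, 1.0, 10.0):
    amps = inrush_current_a(CAP_F, BOARDS, RAIL_V, ramp_ms * 1e-3)
    print(f"ramp {ramp_ms:5.1f} ms -> about {amps:6.2f} A")
```

The charging current scales inversely with ramp time, so a tenfold slower ramp cuts the demand on the supply by a factor of ten.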

After that the only problem was ham-fisted installers who shove the boards into the rack misaligned, scraping these tall caps right off the card!

--
  Rick C. 

Rick C

Rank the tests you have by their ability to cut down the area where the fault must lie. In my field McCabe's CCI metric is quite good for that.

If the code complexity index is too high there is a very good chance that the code doesn't actually work correctly.
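As a rough illustration, the sketch below approximates McCabe's cyclomatic complexity for Python source by counting decision points with the standard `ast` module. It assumes that "CCI" refers to this metric, and it is a simplification (comprehension conditions and `match` statements are ignored, for instance), not a replacement for a real metrics tool.

```python
import ast

# Node types treated as one extra decision point each.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra short-circuit paths.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(x):
    if x < 0 and x != -1:
        return "neg"
    for i in range(x):
        if i % 2:
            return "odd"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))
```

Running this over each function in a module and sorting by the result gives a crude list of where to look first.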

I disagree with this at least in part - you should make a list of things which ought to be true and a list of invariants that you expect to remain true if things are operating correctly.

Always worry if a test can pass or fail apparently at random.
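One way to avoid being fooled by an intermittent test is to estimate how many clean runs it takes before a pass means anything. The sketch below assumes each run fails independently with a fixed probability, which real faults often violate; the 5% figures are made up for illustration.

```python
import math


def chance_of_clean_run(fail_rate: float, trials: int) -> float:
    """Probability that a still-present fault stays hidden for 'trials' runs."""
    return (1.0 - fail_rate) ** trials


def trials_needed(fail_rate: float, max_fooled_prob: float) -> int:
    """Smallest number of consecutive passes such that the chance of the
    fault hiding through all of them is at most max_fooled_prob."""
    return math.ceil(math.log(max_fooled_prob) / math.log(1.0 - fail_rate))


# Illustrative numbers only: a fault that shows up in 5% of runs.
FAIL_RATE = 0.05

print(f"20 passes still leave a {chance_of_clean_run(FAIL_RATE, 20):.0%} "
      "chance the fault is just hiding")
print(f"passes needed for under 5% doubt: {trials_needed(FAIL_RATE, 0.05)}")
```

The rarer the failure, the more repeats a "fix" needs before it deserves much weight.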

Here I disagree massively. A well chosen helpful outsider can sometimes help you break a problem even if they are not all that skilled in the art. Explaining to a junior who isn't afraid to ask apparently dumb questions can sometimes allow you to see your own mistaken assumptions. With practice verbally explaining it to an empty chair can also work since it runs the problem through a different part of the brain.

The sooner you catch a fault, the less it costs.

Explaining your reasoning to a relatively junior engineer (or if it is a really tough problem an engineer of the same rank or higher and who thinks differently to how you do) can be very powerful.

--
Regards, 
Martin Brown
Martin Brown

He is right.

I keep my PCs on all the time. Even a Windows machine can run for months without a restart if treated with due care and kindness. But it's not just about risk of failure - I usually have so many projects open at a time on different workspaces (on the Linux systems) that it is a big effort and a waste of time to restart the thing.

David Brown

Ham-fisted installers are always a problem! We made a number of systems that were used in farming industries, and we'd get boards back for service that were an incredible mess, with electronics fried and connectors and sockets broken. The hand-scribbled failure report would say things like "the socket was the wrong size - I had to use a hammer to get the plug in". Round plugs and square holes were no hindrance to these guys.

David Brown

that is rare, intermittent or obscure?

lure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

r not being fooled by the results of your test

weighting factor of 1 to anything until you know you have the problem solved

To clarify, I meant assigning weighting factors to the conclusions you make as you run through various tests.

an opposite conclusion when you repeat a test than what you concluded after the first test.

Again, to clarify, it is not that the test randomly changes the result; it is that there is some subtle missing element in the test that you missed the first time, and that subtlety produces the opposite result.

hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate they are concerned in a meeting), you will never get a clear path to troubleshoot the problem in your own way.

ematic.

Most of the time we do not get to choose who helps us. There are meetings with a room full of ideas. The intentions are all good... but the road to hell is paved with good intentions.

design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

ad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

you find contradictions in your thinking.

blocher

The toughest issue I had to find was a power-up issue. It turned out that the memory part manufacturer had a bug in their handshake codes at power up, and occasionally it threw a bad code, which set the DSP to the wrong clock speed, which in turn corrupted the NVRAM... the unit bricked (although recoverable at the factory with a complete reprogram). There was a cryptic note in the data sheet; when we finally realized that the note seemed to rhyme with our problem, we contacted the manufacturer. They then gave us the complete story, which was that all date codes prior to a particular time were susceptible to the problem and date codes after were fixed.

I would have loved to hear the debate about how to put that note in the data sheet. Frankly, they knew that if they were totally candid, then the part was not valid, so they wanted to mask it, but, I guess, some engineer was screaming about how bad this was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our customer was mad that they had bricked units in their airplane, but when we presented them the problem, it was not our fault and we had been tenacious in finding the problem. And nobody looks bad for designing the thing wrong.

blocher

Also, there was one obscure LED on the board that gave an indication that the boot load had finished. Had that LED not been on the board, I do not think we would have ever found the problem. Normally at power up the LED turned on, then turned off when everything finished initializing. In this case the LED stuck on, so we knew it was a power-on issue. Still a real bugger to find.

blocher

Do a binary search when you can. Keep cutting the solution space in half.
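A minimal sketch of that idea: given an ordered list of candidates (revisions, build stages, points along a signal chain) and a test that says whether a candidate is bad, bisection finds the first bad one in about log2(N) tests. It assumes everything before the fault tests good and everything after tests bad, which is not always true for intermittent failures. The revision numbers and the `is_bad` check are made up for the example.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item, or len(items) if none is bad.

    Assumes items are ordered so that all good items come before all bad ones.
    """
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid        # fault is at mid or earlier
        else:
            lo = mid + 1    # fault is after mid
    return lo


# Hypothetical example: 100 revisions, fault introduced at revision 37.
revisions = list(range(100))
calls = 0


def is_bad(rev: int) -> bool:
    """Stand-in for a real test run; counts how often it is invoked."""
    global calls
    calls += 1
    return rev >= 37


culprit = first_bad(revisions, is_bad)
print(f"first bad revision: {revisions[culprit]}, found in {calls} tests")
```

Each test halves the remaining search space, which is why it pays to pick tests that split the suspects roughly evenly.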

--
John Larkin         Highland Technology, Inc 

Science teaches us to doubt. 
jlarkin

Let's fire all those engineers and replace them with guitar repairmen.

--
John Larkin         Highland Technology, Inc 

Science teaches us to doubt. 
jlarkin

Yeah, you're on a roll. Good work.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
Phil Hobbs

Grin.. sorry. Humor doesn't work well when not face to face.

GH

George Herold

Negative PSRR is usually horrible in "single supply" op amps, because, duh, they expect you to use a single positive supply. ;)

Yup.

Well, children, anyway. ;)

Dunno. I first saw it in an audio amp project in a magazine, circa 1977. The LED + NPN emitter-follower voltage reference, I saw in an article of Walt Jung's at about the same time.

We should revisit that "how many two-transistor circuits are there?" thread at some point.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
Phil Hobbs

Yup. At IBM Watson we used to shut the whole place down over Labor Day weekend. It always took a couple of days to get the silicon fab line back up, because things like corroded connections and worn-out motors tend to fail at inrush.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
Phil Hobbs

Having the appropriate number of blinky LEDs is key. Sometimes when I run short of pins, I'll have the housekeeping loop output a state code from a UART. That's super helpful in keeping track of state machines and so on.
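A host-side sketch of the state-code idea, with Python standing in for firmware: the housekeeping loop hands the current state to a reporter, which writes a one-byte code to the port. The state names, code values, and the choice to send only on a change are all assumptions made for illustration; `io.BytesIO` stands in for a real UART.

```python
import io
from enum import IntEnum


class State(IntEnum):
    """Hypothetical state codes; one byte each."""
    BOOT = 0x01
    IDLE = 0x02
    MEASURE = 0x03
    FAULT = 0x7F


class StateReporter:
    """Writes one byte to the port whenever the reported state changes."""

    def __init__(self, port):
        self._port = port
        self._last = None

    def report(self, state: State) -> None:
        if state != self._last:
            self._port.write(bytes([state]))
            self._last = state


port = io.BytesIO()  # stand-in for a real UART
reporter = StateReporter(port)

# Simulated housekeeping loop passes: repeated states are not re-sent.
for state in (State.BOOT, State.BOOT, State.IDLE,
              State.MEASURE, State.MEASURE, State.IDLE):
    reporter.report(state)

print(port.getvalue().hex(" "))
```

Sending only on transitions keeps the log short enough to read off a terminal, at the cost of losing the information about how long each state lasted.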

I sometimes do that too, but only when the project is under version control, which most are. (GitHub/GitLab private repos are good for projects where nobody else would know what it is. Not so much for the crown jewels.)

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
Phil Hobbs
