Approach to Finding the Root Cause of Failures

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure analysis than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different from design.

Here is my reminder list when doing root cause studies:

  1. Never root for a particular outcome when performing a test. Root for not being fooled by the results of your test.

  2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved.

  3. Expect to have to do certain tests over again, and expect that when you repeat a test you will sometimes draw the opposite conclusion from the one you drew after the first test.

  4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate concern in a meeting) you will never get a clear path to troubleshoot the problem in your own way. Help is a two-edged sword. It is important but can sometimes be problematic.

  5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. If you ignore it, it is going to come back and get you later.

  6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause work you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

  7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.
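One way to make the "weighting factors" item concrete is a simple Bayes-style update. This is only a sketch: the hypothesis names, the starting weights and the likelihood numbers are all made up for illustration, and `update_weights` is a hypothetical helper, not anything from the thread.

```python
def update_weights(weights, likelihoods):
    """Scale each hypothesis weight by how well it predicts the latest
    test result, then renormalize so the weights sum to 1."""
    scaled = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("result is impossible under every hypothesis")
    return {h: w / total for h, w in scaled.items()}

# Hypothetical starting beliefs: nothing gets a weight of 0 or 1.
weights = {"grounding": 0.5, "firmware": 0.3, "connector": 0.2}

# Hypothetical test result: how likely is what we saw under each cause?
likelihoods = {"grounding": 0.9, "firmware": 0.2, "connector": 0.5}

weights = update_weights(weights, likelihoods)
for cause, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{cause:10s} {w:.2f}")
```

As long as no likelihood is entered as exactly 0 or 1, no weight ever reaches 0 or 1 either, so a repeated test that contradicts the first one can still move the weights back.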

OK - enough for now......

Reply to
blocher

On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4, snipped-for-privacy@columbus.rr.com wrote:

Oh yeah.... If it is RF related there is a >50% chance it is grounding related.

Reply to
blocher

Also - the FPGA guys and the SW guys will only acknowledge a problem when it is laid out under their nose. It is never their fault :-)

Reply to
blocher

Yeah, I'd call this trouble shooting. The most important thing IMHO is not to make assumptions about the cause early on. This is hard because we all look for an 'answer' first and then try to test it. So get as much data on the problem as you can. Then make a list of all the possible things it might be, and a list of possible tests. (Then go to sleep or do something else, and maybe some other ideas will form in your brain.)

Finding intermittent problems is the worst. And it's sometimes useful trying to make it fail more often.
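A rough sketch of why making it fail more often helps: if each trial fails independently with the same probability (a big assumption for a real intermittent), the number of clean runs needed before a fix is believable drops quickly as the failure rate goes up. The function name and the 2% and 20% rates are illustrative only.

```python
import math

def passes_needed(fail_prob, confidence=0.95):
    """Consecutive passing trials needed so that, if the bug were still
    present at fail_prob per trial, a clean streak this long would occur
    with probability at most (1 - confidence)."""
    if not 0 < fail_prob < 1:
        raise ValueError("fail_prob must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - fail_prob) ** n <= 1 - confidence for the smallest integer n.
    return math.ceil(math.log(1 - confidence) / math.log(1 - fail_prob))

rare = passes_needed(0.02)      # hypothetical: fails 1 time in 50
provoked = passes_needed(0.20)  # hypothetical: stressed until it fails 1 in 5
print(rare, provoked)
```

With these made-up numbers the stressed setup needs roughly a tenth as many clean runs to reach the same confidence, which is the practical payoff of provoking the failure.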

George H.

Reply to
George Herold

That's a good one. Don't dismiss a weird observation just because it goes away. People are emotionally primed to do that.

--

John Larkin         Highland Technology, Inc 

Science teaches us to doubt. 

  Claude Bernard
Reply to
jlarkin

I missed that one.... Working hard to replicate the failure requires more energy than everything else, because in the end if you cannot replicate it you probably (exceptions to every rule) do not know for sure what it is.

Reply to
blocher

This is talking about problems that are deeper than troubleshooting. Isolating broken parts is troubleshooting. This is finding the hidden corner cases in a design that typically are not seen until hundreds of units are in the field finding those corner cases.

Reply to
blocher

That's because it's usually a hardware fault - and it can be solved by using a bigger capacitor :-)

Reply to
David Brown

Very. Making it worse is as good as making it better.

Also, as you're going through it, fix every problem that you find. Surprisingly often that'll also fix the mysterious one. Long ago, I had a sensitive front end which had a horrible offset voltage problem.

It was a low cost optical head tracker for computers, and used modulated IR LEDs to illuminate your forehead, and three pairs of photodiodes positioned behind a shadow mask to get XYZ position of the bright patch. The PDs were chopped at 100 kHz, and each channel had an MC1496 to do the synchronous detection. One channel had a horrible offset voltage problem.

Everything I did seemed to make it worse. Turned out to be the 100 kHz getting in from the noisy supply via a 1-pole cap multiplier ripple (180 degrees lag from two poles) and stray capacitance to the noisy supply (90 degrees lag from the filter - 90 degrees lead from the stray capacitance). Both contributions were in phase with the LO, and just about exactly the same size. Fixing the supply ripple revealed just how bad the stray contribution was: I had a single 1-mm pad over a slightly-noisy supply pour. A BFC fixed both.

One more: problems never "just go away". Even if it's a rare EMI condition, like Joerg's example of the radar EMI in the other thread, the EMI vulnerability didn't go away when they closed the aluminum blinds.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC / Hobbs ElectroOptics 
Optics, Electro-optics, Photonics, Analog Electronics 
Briarcliff Manor NY 10510 

http://electrooptical.net 
http://hobbs-eo.com
Reply to
Phil Hobbs

Hmm OK. I designate two types of problem solving.

1.) Your (prototype) gizmo is not working. I call this de-bugging. The problem could be somewhere in the gizmo, or you may have made a fundamental error in your idea. Those are the hardest types of problems.

2.) You've got several working units but this one from production has a problem not seen before. I call that trouble shooting... it's easier because you've got working units, so you know it can't be a fundamental problem. It could still be a design problem. Like you didn't spec the spread in cap ESR on the voltage regulator and the odd high or low ESR cap causes your voltage regulator to oscillate.

Trouble shooting is by far where I've spent most of my 'problem solving' time.

I guess there is some intermediate case, where your prototype gizmo is (mostly) working, but there's a glitch or something not understood on the edge cases. You could call that trouble shooting or de-bugging.

I used to work for a small company, not too many units made per year, and would half joke that our customers were our beta testers.

I'm not sure having a 'simple' broken component makes things any easier. I remember ripping up this whole circuit piece by piece, to finally discover that a toggle switch had ~1 megohm of resistance when open. (It drove me crazy for a few days.)

George H.

Reply to
George Herold

Oh, two 'layers' of the not-working 'onion' have about the same magnitude but opposite signs... that's insidious.

Well, unless the sample is changing with time. I had these new Rb cells from a reputable supplier. Dang things had some signs of residual gas in them. I made some guesstimate of the amount of gas (Ramsey, "Molecular Beams"). The supplier had some test whereby he could tell the gas was below some level. We went back and forth for a few days, and finally we agreed I'd send one back for him to test again. A week or so later I got the cell back from him, having checked out fine the second time. When I tested it again (after the few weeks) it was fine. And the other 9 cells (that I'd kept) were also all fine now.

George H.

Reply to
George Herold

It's not about attitude, really, but about the PARTS that compose the problematic item.

You really want an analysis, a breakdown of all of the elements of the apparatus. Have you ever considered the internal mechanical construction of the batteries? Loose connections can be internal to a dry cell. Or, thermal sensitivity of wiring (because of thermocouple effects)? So, pretend you have X-ray vision, and consider all the parts, even if YOU didn't handle them except as subassemblies. Importance can attach to the plating on a washer, or a choice of glue, or an historic supply-chain shift.

It might be a contaminant in the chemistry of a 'pure' material. The tale is told of a failure of carrier lifetime at Fairchild, which was traced to the introduction of Lemon-Fresh Joy detergent.

Reply to
whit3rd

Add 3) It fails on some customers' site, but not elsewhere.

Now, is it because the customers' equipment is at fault or the spec is inadequate (whatever that might mean)?

Reply to
Tom Gardner

On Tuesday, March 31, 2020 at 11:40:42 AM UTC-4, snipped-for-privacy@columbus.rr.com wrote:

Sure, why should they waste time chasing hardware problems which they can't duplicate in their simulations? It's hard to prove a problem *isn't* in the SW, so they expect to see proof that it *is* in the SW. It's the only rational way to handle it.

I recall once spending days adding debug features to an FPGA and the ah-ha moment in the lab when I said, "It's almost as if it isn't being initialized". Sure enough, that was the problem. I didn't have to spend as much time in the lab after that.

--

  Rick C. 

  - Get 1,000 miles of free Supercharging 
  - Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

Yeah, someone told me the failure rate for telephone equipment is one in some hugely large number, so a failure at that level is near impossible to find. This requires a whole different mindset to the design and test process. Essentially you have to prove that every part of your design works rather than testing for a failure.

Kinda like medical equipment.

--

  Rick C. 

  + Get 1,000 miles of free Supercharging 
  + Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

FPGA people bench test pretty hard, so want serious explanations of why things went wrong... which they seldom do. Programmers seem to accept that there will be bugs.

--

John Larkin         Highland Technology, Inc 
picosecond timing   precision measurement  

jlarkin att highlandtechnology dott com 
http://www.highlandtechnology.com
Reply to
John Larkin

You laugh, but I once used a telephony part that had a PSRR of 0 dB which I had missed. (Who expects 0 dB?) On the customer's work bench they were getting noise in the audio that turned out to be from the DSP power consumption. They were using clip leads to provide power to the UUT and the on-board capacitance wasn't enough to mitigate it. We told them to use better power connections and also used a larger cap.

0 dB of PSRR??? How can you even do that exactly??? CP Clare, what a piece of work they are. The other CP Clare part had a problem that virtually made it unusable, but they didn't point it out in the data sheet. I wonder if they actually use engineers or if they just let high school kids design their ICs?

This was my first project as an independent engineer and I never forgot the lessons I learned on that. The other big ones were to not do your own procurement and NEVER trust a disti delivery date.
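For anyone wondering what 0 dB of PSRR means in numbers, here is a quick sketch using the usual definition (ripple at the output = supply ripple divided by 10^(PSRR/20)). The 50 mV supply ripple is a made-up figure, not a measurement from that project.

```python
def ripple_at_output(supply_ripple_v, psrr_db):
    """Supply ripple that reaches the output, for a PSRR given in dB
    (a positive value means rejection)."""
    return supply_ripple_v * 10 ** (-psrr_db / 20)

supply_ripple = 0.050  # hypothetical 50 mV of ripple from DSP load current

for psrr in (0, 20, 60):
    out_mv = ripple_at_output(supply_ripple, psrr) * 1e3
    print(f"PSRR {psrr:2d} dB -> {out_mv:.3f} mV at the output")
```

At 0 dB the part passes supply ripple straight through, so the only remaining defenses are a stiffer supply connection and more bulk capacitance, which is what the fix above amounted to.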

--

  Rick C. 

  -- Get 1,000 miles of free Supercharging 
  -- Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

Reply to
ABLE1

Whoops!! 2nd try!!

With all the above being typed and read I have a much simpler way to look at the problem.

Just use the "Not Method of Troubleshooting".

The Not Method goes like this.

Now I am sure someone will find fault with my method, well Ok then!! Some days the Not's just have to be adjusted.

Have a good day!!

Les

Reply to
ABLE1

Yeah, I'd still call that trouble shooting 'cause you know it works most places. Dealing with customer problems is a whole 'nother ball of wax.

1.) they are customers
2.) they might be (experimental) idiots
3.) they might have a 'real' problem.

It's a delicate dance.

George H.

Reply to
George Herold
