Approach to Finding the Root Cause of Failures

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure analysis than I was when I was doing more design work.  In many ways I think it is more challenging than design work.  It takes a mindset that is different from design.

Here is my reminder list when doing root cause studies:

1. Never root for a particular outcome when performing a test.  Root for not being fooled by the results of your test.

2. Assign weighting factors to everything you believe.  Never assign a weighting factor of 1 to anything until you know you have the problem solved.

3. Expect to have to do certain tests over again, and expect that when you repeat a test you will sometimes draw the opposite conclusion from the one you drew after the first test.

4. Taking guidance from "helpful" outsiders is challenging.  On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate concern in a meeting) you will never get a clear path to troubleshoot the problem in your own way.  Help is a two-edged sword.  It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing.  Left alone, it is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad.  As a designer you can show a day's work for a day's pay.  In root cause work you feel like you have accomplished nothing for a long time.  Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

7. Look for contradictions in your thinking.  Use other people to help you find contradictions in your thinking.
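
One way to make rule 2 concrete is a Bayes-style bookkeeping of suspect causes. The sketch below is only an illustration and not the poster's method: the hypothesis names, the numbers, and the `update_weights` helper are all made up for the example. Likelihoods are clamped away from 0 and 1 so that no single test can push any weight to exactly 0 or 1.

```python
def update_weights(weights, likelihoods, floor=0.01):
    """Re-weight competing root-cause hypotheses after one test result.

    weights:     dict of hypothesis -> current belief (positive, sums to 1)
    likelihoods: dict of hypothesis -> estimated P(observed result | hypothesis)
    floor:       likelihoods are clamped to [floor, 1 - floor] so that a single
                 test can never drive a weight all the way to 0 or 1
    """
    clamped = {h: min(max(likelihoods[h], floor), 1.0 - floor) for h in weights}
    unnormalized = {h: weights[h] * clamped[h] for h in weights}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypothetical suspects and starting beliefs (illustrative numbers only).
beliefs = {"grounding": 0.5, "firmware": 0.3, "connector": 0.2}

# Test 1: the fault disappears with an added ground strap.
beliefs = update_weights(
    beliefs, {"grounding": 0.9, "firmware": 0.2, "connector": 0.3}
)

# Test 2: the same test repeated, and this time the fault shows up anyway (rule 3).
beliefs = update_weights(
    beliefs, {"grounding": 0.1, "firmware": 0.8, "connector": 0.7}
)

# Print the suspects from most to least likely.
for name, weight in sorted(beliefs.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {weight:.2f}")
```

The second update shows how a repeated test that contradicts the first can pull the leading suspect back down, which is much easier to accept if its weight was never allowed to reach 1.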

OK - enough for now......

Re: Approach to Finding the Root Cause of Failures
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4, snipped-for-privacy@columbus.rr.com wrote:

Oh yeah....If it is RF related there is a >50% chance it is grounding related

Re: Approach to Finding the Root Cause of Failures
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4, snipped-for-privacy@columbus.rr.com wrote:

Also - the FPGA guys and the SW guys will only acknowledge a problem when it is laid out under their nose.  It is never their fault :-)

Re: Approach to Finding the Root Cause of Failures
On 31/03/2020 17:40, snipped-for-privacy@columbus.rr.com wrote:

That's because it's usually a hardware fault - and it can be solved by  
using a bigger capacitor :-)

Re: Approach to Finding the Root Cause of Failures
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown wrote:

You laugh, but I once used a telephony part that had a PSRR of 0 dB, which I had missed.  (Who expects 0 dB?)  On the customer's work bench they were getting noise in the audio that turned out to be from the DSP power consumption.  They were using clip leads to provide power to the UUT and the on-board capacitance wasn't enough to mitigate it.  We told them to use better power connections and also used a larger cap.

0 dB of PSRR???  How can you even do that exactly???  CP Clare, what a piece of work they are.  The other CP Clare part had a problem that virtually made it unusable, but they didn't point it out in the data sheet.  I wonder if they actually use engineers or if they just let high school kids design their ICs?

This was my first project as an independent engineer and I never forgot the lessons I learned on that.  The other big ones were to not do your own procurement and NEVER trust a disti delivery date.

--  

  Rick C.

Re: Approach to Finding the Root Cause of Failures
On 2020-03-31 14:40, Rick C wrote:

Are you quoting that WRT the input or the output?  PSRR and CMRR are  
normally quoted input-referred, i.e. to find out the effect you have to  
multiply by the overall gain.

There are lots of parts that can have negative-dB PSRR as referred to  
the output.
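
The input-referred convention can be sketched numerically. The helper below is hypothetical and uses made-up numbers rather than any data sheet value; it treats the "overall gain" as a plain voltage gain and simply subtracts that gain, in dB, from the input-referred PSRR.

```python
import math


def output_referred_psrr_db(input_referred_psrr_db, gain):
    """Convert an input-referred PSRR (dB) to an output-referred figure.

    Supply ripple appears at the output multiplied by gain / 10**(PSRR/20),
    so in dB the rejection seen at the output is PSRR_in - 20*log10(gain).
    """
    return input_referred_psrr_db - 20.0 * math.log10(abs(gain))


# Illustrative numbers only, not taken from any data sheet.
for psrr_in, gain in [(60.0, 1.0), (60.0, 1000.0), (40.0, 1000.0)]:
    out = output_referred_psrr_db(psrr_in, gain)
    print(f"PSRR_in = {psrr_in:.0f} dB, gain = {gain:g}: {out:.1f} dB at the output")
```

With enough gain the output-referred number goes negative, meaning the supply ripple comes out larger than it went in.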

Cheers

Phil Hobbs

--  
Dr Philip C D Hobbs
Principal Consultant
Re: Approach to Finding the Root Cause of Failures
On Tuesday, March 31, 2020 at 4:08:59 PM UTC-4, Phil Hobbs wrote:
  
At higher frequencies, aren't there many op amps that cross 0 dB PSRR, at least for one of the rails?
(That's why God* invented the cap. multiplier.)

George H.  
*or one of his offspring.... who did do the cap mult. first?  



Re: Approach to Finding the Root Cause of Failures
On 2020-03-31 18:50, George Herold wrote:

Negative PSRR is usually horrible in "single supply" op amps, because,  
duh, they expect you to use a single positive supply. ;)


Yup.


Well, children, anyway. ;)


Dunno.  I first saw it in an audio amp project in a magazine, circa  
1977.  The LED + NPN emitter-follower voltage reference, I saw in an  
article of Walt Jung's at about the same time.

We should revisit that "how many two-transistor circuits are there?"  
thread at some point.

Cheers

Phil Hobbs

--  
Dr Philip C D Hobbs
Principal Consultant
Re: Approach to Finding the Root Cause of Failures
On Wednesday, April 1, 2020 at 12:49:09 PM UTC-4, Phil Hobbs wrote:
Sure, but I need to understand all the one-transistor circuits first.

Just thinking out loud here, but in principle you've got three configurations (which terminal is common), and then you can think about voltage or current as the input or output parameter.  I get 12 possibilities.
But maybe I'm overthinking it.
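
The count of 12 is just 3 x 2 x 2, which a short enumeration makes explicit. This is only a sketch of the counting argument; the BJT terminal names are an assumption, since the post does not say which kind of transistor is meant.

```python
from itertools import product

COMMON_TERMINALS = ("emitter", "base", "collector")  # BJT naming is an assumption
SIGNAL_KINDS = ("voltage", "current")

# One entry per (common terminal, input quantity, output quantity).
configurations = list(product(COMMON_TERMINALS, SIGNAL_KINDS, SIGNAL_KINDS))

for common, input_kind, output_kind in configurations:
    print(f"common-{common}: {input_kind} in, {output_kind} out")

print(len(configurations))  # 3 * 2 * 2 = 12
```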

George H.  


Re: Approach to Finding the Root Cause of Failures
On Tuesday, March 31, 2020 at 4:08:59 PM UTC-4, Phil Hobbs wrote:

This was a telephone line isolation interface.  One end was connected to the phone line, an isolation capacitor (high frequency chopper) crossed the isolation barrier and the other side of the chip connected to the low voltage CODEC circuit.

Not sure it matters if the spec was input or output referred since the circuit has no gain, just isolation.

We had some low level audio frequency noise on the power rail (10 mV comes to mind) which showed up in the data as an audible tone which corresponded to the processing loop of the DSP.  10 mV seems like an acceptable amount of noise in a power supply line, but I suppose normally PS noise is outside the audible range.  The noise wasn't loud, but it was present.  The fact that it came and went was what made it noticeable.

Compare that to op amps, where I typically see a large amount of PSRR in the audio range, some 50 dB and up.  The impact of 10 mV of audio noise would not be measurable in most op amp circuits.
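
To put rough numbers on that comparison, here is a small sketch using the 10 mV figure recalled above, unity gain, and the two PSRR values mentioned (0 dB and 50 dB). The function name is made up for the example.

```python
def ripple_at_output_v(supply_ripple_v, psrr_db, gain=1.0):
    """Supply ripple that reaches the output, for an input-referred PSRR in dB."""
    return supply_ripple_v * gain / (10.0 ** (psrr_db / 20.0))


SUPPLY_RIPPLE_V = 0.010  # the roughly 10 mV recalled in the post

for psrr_db in (0.0, 50.0):
    out_v = ripple_at_output_v(SUPPLY_RIPPLE_V, psrr_db)
    print(f"PSRR {psrr_db:4.0f} dB: {out_v * 1e6:.1f} uV at the output")
```

At 0 dB the full 10 mV lands in the audio path; at 50 dB it is attenuated by a factor of about 316, to roughly 32 uV.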

--  

  Rick C.

Re: Approach to Finding the Root Cause of Failures
On 31/03/2020 20:40, Rick C wrote:

I had a smiley, but I have seen more than a few systems' reliability improved by adding a bigger capacitor.  There is a rule in software development that "almost all programming can be viewed as an exercise in caching".  (Yes, it is an exaggeration - but there's a grain of truth in it.)  Capacitors are the hardware equivalent of software caches.


Mind you, I have seen problems with capacitors that were too big, too.  I remember long ago trying to find out why a card communicated fine (at 9600 baud RS-232) with some computers but not others.  Looking with a scope, the RS-232 signals were lovely triangle waves - someone had added 100 nF capacitors to the lines to reduce the noise...
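
A back-of-the-envelope check shows why 100 nF turns 9600 baud into triangle waves. The sketch assumes a current-limited driver with a 10 mA limit and a 12 V swing (-6 V to +6 V); both numbers are assumptions for illustration, not values from the post. The 2500 pF case is included because that is the commonly cited maximum load capacitance in the RS-232 standard.

```python
def bit_time_s(baud):
    """Duration of one bit at the given baud rate."""
    return 1.0 / baud


def slew_limited_edge_s(capacitance_f, swing_v, drive_current_a):
    """Time for a current-limited driver to slew a capacitive load: t = C * dV / I."""
    return capacitance_f * swing_v / drive_current_a


BAUD = 9600
SWING_V = 12.0           # assumed swing, -6 V to +6 V
DRIVE_CURRENT_A = 0.010  # assumed driver current limit

bit_s = bit_time_s(BAUD)
print(f"bit time at {BAUD} baud: {bit_s * 1e6:.0f} us")

for label, cap_f in (("2500 pF", 2500e-12), ("100 nF", 100e-9)):
    edge_s = slew_limited_edge_s(cap_f, SWING_V, DRIVE_CURRENT_A)
    print(f"{label}: edge takes about {edge_s * 1e6:.0f} us "
          f"({edge_s / bit_s:.2f} bit times)")
```

Under these assumptions a single edge with 100 nF on the line takes longer than a whole bit, so the signal never settles. Whether a given receiver still decodes such a waveform would depend on its thresholds and sampling point, which may be why some computers coped and others did not.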




Re: Approach to Finding the Root Cause of Failures

  I have a trusted engineer friend who once said that most failures occur at power up or power down.  He always left his computers, at work and at home, up all the time.

  Old net and system admin guys usually like keeping systems up and running at all times too.
The big computer rooms of the sixties would lose thousands an hour in insurance if the room temperature rose above a preset level like


Re: Approach to Finding the Root Cause of Failures
On 01/04/2020 11:17, snipped-for-privacy@decadence.org wrote:


He is right.

I keep my PCs on all the time.  Even a Windows machine can run for months without a restart if treated with due care and kindness.  But it's not just about risk of failure - I usually have so many projects open at a time on different workspaces (on the Linux systems) that it is a big effort and waste of time to restart the thing.


Re: Approach to Finding the Root Cause of Failures
On Wednesday, April 1, 2020 at 7:54:09 AM UTC-4, David Brown wrote:

The toughest issue I had to find was a power-up issue.  It turned out that the memory part manufacturer had a bug in their handshake codes at power up, and occasionally it threw a bad code, which then set the DSP to a wrong clock speed, which then resulted in the NVRAM getting corrupted... the unit bricked (although it was recoverable at the factory with a complete reprogram).  There was a cryptic note in the data sheet; when we finally realized that the note seemed to rhyme with our problem, we contacted the manufacturer.  They then gave us the complete story, which was that all date codes prior to a particular time were susceptible to the problem and date codes after were fixed.

I would have loved to hear the debate about how to put that note in the data sheet.  Frankly, they knew that if they were totally candid, then the part was not valid, so they wanted to mask it, but, I guess, some engineer was screaming about how bad this was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our customer was mad that they had bricked units in their airplane, but when we presented them the problem, it was not our fault and we had been tenacious in finding the problem.  And nobody looks bad for designing the thing wrong.

Re: Approach to Finding the Root Cause of Failures
On Wednesday, April 1, 2020 at 8:40:03 AM UTC-4, snipped-for-privacy@columbus.rr.com wrote:

Also, there was one obscure LED on the board that gave an indication that the boot load had finished.  Had that LED not been on the board, I do not think we would have ever found the problem.  Normally at power up the LED turned on, then turned off when everything finished initializing.  In this case the LED stuck on, so we knew it was a power-on issue.  Still a real bugger to find.


Re: Approach to Finding the Root Cause of Failures
On 2020-04-01 08:43, snipped-for-privacy@columbus.rr.com wrote:

Having the appropriate number of blinky LEDs is key.  Sometimes when I  
run short of pins, I'll have the housekeeping loop output a state code  
from a UART.  That's super helpful in keeping track of state machines  
and so on.
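
The idea can be sketched in a few lines. This is Python for brevity, whereas real firmware would more likely be C; the state names, the one-byte encoding, and the `io.BytesIO` stand-in for the UART are all assumptions made for the example.

```python
import io
from enum import IntEnum


class State(IntEnum):
    """Hypothetical state codes; one byte each so they are cheap to send."""
    BOOT = 0x01
    INIT_MEMORY = 0x02
    RUNNING = 0x03
    FAULT = 0x7F


def report_state(uart, state):
    """Write the current state as a single byte; uart is any object with write(bytes)."""
    uart.write(bytes([int(state)]))


def housekeeping_loop(uart, states):
    """Run one housekeeping pass per state, reporting the state on each pass."""
    for state in states:
        # ... other housekeeping work would go here ...
        report_state(uart, state)


uart = io.BytesIO()  # stand-in for a real serial port
housekeeping_loop(uart, [State.BOOT, State.INIT_MEMORY, State.RUNNING])
log = uart.getvalue()
print(log.hex(" "))
```

A captured byte stream like this gives the same kind of evidence as the stuck LED above: the last code seen shows where the unit stopped.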


I sometimes do that too, but only when the project is under version  
control, which most are.  (Github/Gitlab private repos are good for  
projects where nobody else would know what it is.  Not so much for the  
crown jewels.)

Cheers

Phil Hobbs


--  
Dr Philip C D Hobbs
Principal Consultant
Re: Approach to Finding the Root Cause of Failures


  I had a 'next step' 286 PC way back then.  It had a really cool BIOS and an LCD display on the front of the case that showed the BIOS POST progress at each step.  Once it was booted up, it showed hard drive cylinder and sector access numbers.  Like that could tell one something then.  Oh my, the 32MB drive just failed and I noticed the track it was on when it happened.  Yeah, sure... that would have been useful to know.  For those drive recovery guys.  Even then, what it reads at the moment of a crash may not coincide with where the platter failure was anyway.  So I saw no use for that part, though it was cool to see where it was hitting the drive.

   They have LED touch panels on printers.  I figured that  
motherboard makers would have status/setup panels by now.

  Hey, there is the new standard.   Was "ATX".  Now it could be  
"MATX" for "Monitored ATX", so the case makers could make provisions  
for the panels.

Re: Approach to Finding the Root Cause of Failures
On 2020-04-01 05:17, snipped-for-privacy@decadence.org wrote:

Yup.  At IBM Watson we used to shut the whole place down over Labor Day  
weekend.  It always took a couple of days to get the silicon fab line  
back up, because things like corroded connections and worn-out motors  
tend to fail at inrush.

Cheers

Phil Hobbs

--  
Dr Philip C D Hobbs
Principal Consultant
Re: Approach to Finding the Root Cause of Failures
On Wednesday, April 1, 2020 at 9:51:35 AM UTC-7, Phil Hobbs wrote:



But, replacing corroded connections and worn-out motors in threes after
startup might involve less down-time than getting the fab line
shut down three times at unscheduled times.

Re: Approach to Finding the Root Cause of Failures


  The whole fab damily?
