OT: Risk Management

NOTE: The magazine, Mechanical Engineering, refused to publish this.

Risk Management By Paid-up S&A subscriber Jay Lipeles

To the Editor:

I read with alarm the cover story in your current issue, "Risk-Informed Decision Making". The basic message, that risk can be managed, is wrong and extremely dangerous. Mechanical Engineering does a great disservice to its readership and to the engineering community at large by promoting such nonsense.

To cite a few examples: In the investigation that followed the Challenger disaster, NASA testified as to the risk assessment it had made prior to the event. The assessment was wrong. Seven astronauts died, and it cost the nation millions of dollars and several years to recover. To my mind, NASA never has recovered; witness your article. Bad thinking remains. The error in NASA's assessment (and in related risk management theories) is that inherent in the theory (and usually unstated, as in your article) is the assumption that the situation is ergodic. NASA's assessment was based on, among other things, test data taken at room temperature. The probability of failure of the O-rings was based on that data, and it was incorrect. The situation was non-ergodic. The theory was inapplicable. The analysis was wrong and disaster followed.

Long Term Capital Management (LTCM) lost $4.6 billion in 1998 and subsequently failed. Among the reasons for its failure was that it assessed its risk with an analysis similar to the one in your article, which assumed that the situation was ergodic. It was not. When Russia defaulted on its debt, the mistake was revealed and LTCM went down.

Now we are struggling through the worst recession since the Great Depression. It was initiated by the collapse of the subprime market. It goes without saying that there were a number of contributing causes. Among them was the bundling of mortgages into securities backed by risk analyses that predicted the risk to be very low. The analyses assumed that the situation was ergodic. Whoops! In all of history, was there ever a more costly mistake?

Inherent in all analyses of this type is the assumption that the situation is ergodic. It almost never is. What the authors would have us believe is that a careful, thorough risk analysis can be helpful in reducing risk. Nonsense! What such an analysis does is provide a sense that risk is being addressed when it is not. It contributes to a false sense of security.
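As a toy illustration of this point (not from the letter itself), here is a short Python sketch with made-up numbers: a failure probability estimated from historical samples, followed by a regime change the history never saw.

import random

random.seed(1)

# "Historical" regime: failures are rare (true p = 0.01).
history = [random.random() < 0.01 for _ in range(10_000)]
p_estimated = sum(history) / len(history)

# Regime change (cold O-rings, a sovereign default...): true p jumps to 0.20.
future = [random.random() < 0.20 for _ in range(10_000)]
p_actual = sum(future) / len(future)

print(f"risk model says  p ~ {p_estimated:.3f}")   # about 0.01
print(f"reality delivers p ~ {p_actual:.3f}")      # about 0.20
# The time average of past data predicts the future only if the process is
# ergodic (statistically stationary). Here it is not, and the model is off ~20x.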

The most important quality an engineer brings to the game is his judgment. Greater reliance on risk management really implies (and demands) reduced reliance on judgment, whether engineering, financial or otherwise. It is a very bad strategy, based as it is on an assumption known to be wrong. It is an extremely dangerous strategy. We should instead be relying more heavily on, and developing, the people who possess good judgment.

(investment newsletter Ferris comment: Thank you, Jay, for this intelligently crafted letter, which I believe is right on the money. Judgment is what investors need, not fancy math models that contribute, as you point out, "to a false sense of security." Well done.)

Reply to
Robert Baer

This reminds me of quality. A few decades ago there was a big emphasis on "quality," which was ensured by testing everything to detect failures and rejects, as if that made the final product more reliable. It doesn't.

Deming, and maybe someone before him, pointed out that quality is *built in* to a design--either the thing has conservative margin, or it doesn't. A 100 mW resistor dissipating 99 mW will simply never be as reliable as a 1/4 W part.
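A minimal sketch of that margin argument in Python, assuming a simple 50% power-derating rule (the rule itself is an assumption for illustration, not something from the post):

def derated_ok(p_dissipated_w, p_rated_w, derating=0.5):
    """Accept a part only if it runs at or below `derating` of its power rating."""
    return p_dissipated_w <= derating * p_rated_w

p_load = 0.099  # the 99 mW from the example above

print(derated_ok(p_load, 0.100))  # False: a 100 mW part at 99 mW has ~1% margin
print(derated_ok(p_load, 0.250))  # True: a 1/4 W part runs at ~40% of its rating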

Mil-spec parts were tested repeatedly, to death--for all the testing, mil-spec parts had higher failure rates. Why? Static damage, from handling...while testing!

"Risk management" is like testing. Just make the thing solid to begin with. That works a lot better than "risk management."

-- Cheers, James Arthur

Reply to
dagmargoodboat

Reminds me of a situation at (old) Fairchild, where, after a particular TO-3 part was proven to be reliable, final testing (in which the part is plugged into a socket) was eliminated and even forbidden, because the insertion caused micro-cracks in the glass-to-metal seal on the leads, thereby INCREASING the failure rate by a measurable degree. That was on a MIL part, and it was written into the government testing specifications(!).

Reply to
Robert Baer

About 20 years ago, a fellow from England gave a talk whose central premise was that fault trees simply do not work. Fault trees will show you all the things that will not fail; unfortunately, they do not show you what will fail. He gave examples of Bhopal, NASA, Browns Ferry and many other cases, all of which had fault trees running to thousands or millions of pages, none of which applied to the actual failure.
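A minimal Python sketch (with invented per-year probabilities) of why that is: a fault tree can only combine the basic events someone thought to list, so anything left out contributes exactly zero to the computed top-event probability.

def or_gate(*probs):   # top event occurs if any input event occurs (independent events)
    q = 1.0
    for p in probs:
        q *= (1.0 - p)
    return 1.0 - q

def and_gate(*probs):  # top event occurs only if all input events occur
    q = 1.0
    for p in probs:
        q *= p
    return q

# Enumerated basic events (per-year probabilities, made up for illustration):
pump_fails, backup_fails, sensor_fails = 1e-3, 1e-2, 1e-3

p_top = or_gate(and_gate(pump_fails, backup_fails), sensor_fails)
print(f"P(top event) ~ {p_top:.2e}")   # about 1e-3

# The blocked drain, the cold O-ring, the unmodelled common cause: their
# contribution in this tree is exactly zero, because they were never in it.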

Reply to
Robert Baer

That reminds me of testing some high-energy-density capacitors for a particular application in a Bradley fighting vehicle. The caps had some materials-related wear-out mechanisms that limited life to about 400 POH (power-on hours). The target life of qualified parts was over 300 POH, but the qualification testing regime itself ran some 200 POH. Needless to say, the "fully qualified" parts had a field failure rate several times higher than the "cowboyed" (under 24 hours of testing) parts. As far as I know, they never corrected the testing regime.
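As rough arithmetic (a deterministic wear-out model, which is an oversimplification of the real failure distribution), the problem is easy to see in a few lines of Python:

WEAROUT_POH = 400.0  # approximate wear-out life, power-on hours

def field_life_remaining(test_poh):
    """Power-on hours left for the field after `test_poh` hours of qualification testing."""
    return max(WEAROUT_POH - test_poh, 0.0)

print(field_life_remaining(200.0))  # "fully qualified": ~200 POH left
print(field_life_remaining(24.0))   # "cowboyed":        ~376 POH left
# Against a 300 POH target life, only the lightly tested parts can still meet it.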

Reply to
JosephKK

There's an excellent book called "Normal Accidents: Living With High Risk Technologies" by Charles Perrow. He points out that accidents in complicated, highly nonlinear, tightly coupled situations like nuke plants, refineries, and air traffic control almost always have multiple causes--e.g. silent failures in safety devices, unusual operating conditions, poorly understood fault dynamics, etc.

Part of the reason for this is that people pay very close attention to the fault trees, so that the easily foreseeable things aren't usually responsible. So the fault tree's function is precisely to work itself out of a job--as it apparently does.

The real problem is when we believe that the fault tree is exhaustive, so that hammering in all the nails means that the system can't fail.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal
ElectroOptical Innovations
55 Orchard Rd
Briarcliff Manor NY 10510
845-480-2058

email: hobbs at electrooptical dot net
http://electrooptical.net
Reply to
Phil Hobbs

[...]

Somehow I thought of Charles Perrault's "Puss in Boots". Very risky, indeed :)

And the answer is?

VLV

Reply to
Vladimir Vassilevsky

Read Richard Feynman's personal observations on the reliability of the Shuttle:

formatting link

--
"For a successful technology, reality must take precedence 
over public relations, for nature cannot be fooled."
                                       (Richard Feynman)
Reply to
Fred Abse

Also see Nassim Nicholas Taleb's book "The Black Swan"

formatting link

and "How Complex Systems Fail" by Richard I Cook

formatting link

and "A Brief Look at The New Look..."

formatting link

Perhaps not quite perfectly on topic, but all engaging reads.

Reply to
Ralph Barone

I did a fair amount of research investigating "software verification", and a basic problem (which applies as well to several examples others have brought up, where systems either involve software or have logic-like states) is that the nature of discrete mathematics is such that there is not really any provable way to make a program "6 times less likely to fail" in a manner analogous to building a bridge 6 times stronger than its maximum load rating.
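A contrived Python example (not from the post) of that difference: one wrong branch in discrete logic is simply a missing case, and no amount of "margin" elsewhere in the program compensates for it the way extra steel compensates in a bridge.

def days_in_february(year):
    # Passes every test year someone happened to pick...
    if year % 4 == 0:
        return 29
    return 28

print(days_in_february(1996))  # 29, correct
print(days_in_february(2023))  # 28, correct
print(days_in_february(1900))  # 29 -- wrong: century years must also satisfy year % 400 == 0
# There is no continuous safety factor to apply to that branch; the case is
# simply absent, and invisible until the unconsidered input arrives.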

If you believe the MTBF figures of the software verification industry, there have been several million years' worth of failures in "verified" software over the past 20 years or so, and in quite a few of those cases people died as a result (since a usual reason to go to the great expense of verification is that failure might kill people).

My personal conclusion was that serious consideration should be given to reviving analog computers for this type of task, but that does not seem terribly likely. The other obvious conclusion is that verification is snake oil.

--
Cats, coffee, chocolate...vices to live by
Reply to
Ecnerwal

Should have told the idiots that the best way to test a fuse is to see how much current it can stand...

Reply to
Robert Baer

Please tell us how analog computers can help...

Reply to
Robert Baer

Tee hee hee. You are the guy they designed the resettable jobs for.

Reply to
Pieyed Piper

Analog computers are handy toys for science fair demos of dynamic systems, but they aren't particularly good computers in the modern, procedural sense of "computer".

It's quite easy to make accurate circuits that add, subtract, multiply, divide, exponentiate, logarithmize, integrate and differentiate voltages (all the fundamental functions, from which everything else can be derived to arbitrary accuracy, including infinite series and chaotic systems). No, the question is, how exactly do you make an if/then/else clause and still call it analog? What about data storage? There is nothing in the century-old paradigm of computing to accommodate that. Only the sharpest of physicists can think in differential equations*.

*Richard Feynman was hired to help build the Connection Machine,
formatting link
and he wrote a differential equation concerning the behavior of the router.
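For what it's worth, here is a small Python sketch (simple Euler stepping, an assumption for illustration) of the paradigm being described: two integrators and a summer wired up to realise a damped oscillator, x'' + 2*zeta*w*x' + w^2*x = 0.

w, zeta, dt = 2.0, 0.1, 1e-3   # natural frequency, damping ratio, time step
x, v = 1.0, 0.0                # initial "voltages" on the two integrators

for _ in range(5000):          # simulate 5 seconds
    a = -2.0 * zeta * w * v - w * w * x   # summing amplifier
    v += a * dt                           # first integrator
    x += v * dt                           # second integrator

print(f"x(5 s) ~ {x:+.3f}")
# Everything above is sums, gains and integrals -- natural territory for op-amps.
# What has no analog counterpart is the for/if machinery around it: the
# sequencing, branching and storage that make a stored-program computer general.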

Tim

--
Deep Friar: a very philosophical monk. Website:

formatting link

Reply to
Tim Williams

But the original article is correct. You *have* to make informed engineering decisions on risk and reward with imperfect information. That imperfect information is the best you have available at the time.

There are some *very* good heuristics available when you know, for example, that a part has a significant (e.g. >1%) risk of infant mortality and an otherwise reasonable MTBF. If you do not understand this, you end up changing the part too often and get a higher number of serious plant failures per year. A filament light bulb obeys this distribution, as do various other electronic and mechanical components.
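A minimal Python sketch of that heuristic, with made-up rates (1% infant mortality per installation, 10-year MTBF) and ignoring wear-out for simplicity:

def failures_per_year(replacements_per_year, infant_mortality=0.01, mtbf_years=10.0):
    """Expected failures/year: one infant-mortality lottery per swap,
    plus the steady random-failure rate of a surviving part."""
    return replacements_per_year * infant_mortality + 1.0 / mtbf_years

print(failures_per_year(0.5))   # swap every 2 years: ~0.105 failures per year
print(failures_per_year(12.0))  # swap monthly:       ~0.220 failures per year
# Swapping more often mainly buys you more spins of the infant-mortality wheel.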

The Bayesian solution is to allow a small proportion of in-service failures to occur, in order to maximise the proportion of time the plant is running at full capacity. Obviously you have to include a suitable penalty function where lives are at stake in mission-critical gear.
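A sketch of that trade-off in Python with invented numbers (planned and unplanned downtime per event, plus an optional safety penalty), just to show how the penalty term changes the answer:

def lost_hours_per_year(swaps_per_year, planned_hours_per_swap,
                        failures_per_year, unplanned_hours_per_failure,
                        penalty_hours_per_failure=0.0):
    """Expected lost plant-hours per year under a given replacement policy."""
    return (swaps_per_year * planned_hours_per_swap
            + failures_per_year * (unplanned_hours_per_failure
                                   + penalty_hours_per_failure))

# Aggressive preventive replacement vs. tolerating a small in-service failure rate:
print(lost_hours_per_year(12, 4, 0.05, 48))   # ~50.4 lost hours/year
print(lost_hours_per_year(2, 4, 0.30, 48))    # ~22.4 lost hours/year
# With a heavy safety penalty attached to each failure, the ranking flips:
print(lost_hours_per_year(2, 4, 0.30, 48, penalty_hours_per_failure=1000))  # ~322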

Otherwise you get bean-counter results like the US Ford Pinto and its self-immolating vehicles in a rear-end shunt. They did the calculation and decided it was cheaper to kill people than to do the recall and fix it.

formatting link

This isn't about engineering so much as business ethics.

The fault trees are useful in making sure that there are no easy ways for a foreseeable chain of events to go catastrophic. How good that is depends on the quality of the engineers who developed it, or on the brief they were given.

One example from my own experience involved an atmospheric RF plasma and a safety cooling interlock intended to prevent it from melting the sample interface. The unit protected itself, but it did not notice that the drain had blocked until the water in the basement room reached waist height. It wasn't in our spec to make sure the waste cooling water was actually going down the drain!

Regards, Martin Brown

Reply to
Martin Brown

That's cute. I did some fuse testing, pushing pulses of several times the rated steady current, at two levels: the must-not-blow level and the must-blow level. Also fast pulses through all the cabling necessary to get into and back out of a vacuum chamber. Lots of shielded twisted pair. Stiff high-current pulse generators, lots of fun.
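One common way to pick those two levels (a generic I^2*t method with assumed margins, not necessarily what was done here) is sketched below in Python:

import math

def pulse_current_limits(i2t_melting_a2s, pulse_width_s, lower=0.25, upper=4.0):
    """Return (must_not_blow_A, must_blow_A) for a rectangular pulse, taking the
    test points at assumed fractions/multiples of the fuse's melting I^2*t."""
    i_no_blow = math.sqrt(lower * i2t_melting_a2s / pulse_width_s)
    i_blow = math.sqrt(upper * i2t_melting_a2s / pulse_width_s)
    return i_no_blow, i_blow

lo, hi = pulse_current_limits(i2t_melting_a2s=10.0, pulse_width_s=1e-3)
print(f"must-not-blow ~ {lo:.0f} A, must-blow ~ {hi:.0f} A for a 1 ms pulse")  # ~50 A, ~200 A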

Reply to
JosephKK
