Embedded systems using hard disks - reliability quandary

We make a product containing, among other things, an LCD and a hard drive. Until now we have been characterizing the product lifespan based on the LCD backlight's rated lifespan, since if you look at the spec'd MTBF column in our BOM, the CCFL has the smallest number.

Recently, however, I was asked to add spindown-HDD-after-idle functionality to reduce acoustic noise. Then someone said "it should also increase drive lifespan greatly" and I started to research this topic using the full 200-page drive datasheets instead of the one-page spec sheets we used previously (which is where we got the single "MTBF" number). Now, I'm thoroughly confused. I'm trying to work out some reasonable defaults to reduce drive spinning time (= reduce noisy time) without overstressing some part of the mechanism and inducing premature failure.

Looking at one particular drive - IBM DJSA-220 - the drive is rated for 5 years or 20,000 power-on hours, whichever comes first. At 24/7 power-on, that's about 2.28 years. However, the assumptions in that lifespan are: less than 333 power-on hours per month (not valid for our product, which is normally powered up 24/7), and seek/read/write operations occupy less than 20% of power-on hours (might or might not be valid for our product, depending on exactly what the user is doing). The datasheet essentially says that all bets are off if those limits are exceeded. Furthermore, the drive is rated for 300,000 normal head unloads, and 20,000 emergency unloads. Our product's power switch is an emergency unload. Me spinning down the drive in software is a normal unload.

These questions become even more interesting for some other drives we use regularly, e.g. the Fujitsu MHK2120AT. They are rated for the same 5 years/20k hours (250 hours/month maximum, and 1 power-cycle per day REQUIRED, but no more than 50 spinup/down operations per day!), but they are only rated for 50,000 spindle start/stop operations total (it appears this limit is related to the associated head load/unload operations, not specifically the spindle motor). If the user has the drive auto-spinning down every 15 minutes (not unreasonable), this means less than 1.5 years before something dies. Weirdly enough, the 250 hours/month and max-50-spinups-per-day limits are removed if you can guarantee to keep the disk envelope at 48 Celsius or below (which we can't).
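For reference, here is the back-of-envelope arithmetic I have been doing, as a small C sketch. The budgets are the datasheet figures above; the "one start/stop per idle interval, around the clock" duty cycle is a worst-case assumption on my part, not a datasheet figure.

/* Back-of-envelope lifetime projection from the budgets quoted above:
 * 20,000 power-on hours, 50,000 spindle start/stop cycles (Fujitsu).
 * The "one start/stop per idle interval, around the clock" duty cycle
 * is a pessimistic assumption, not a datasheet figure.
 */
#include <stdio.h>

#define HOURS_PER_YEAR 8766.0               /* 24/7 operation */

int main(void)
{
    double spindown_interval_min = 15.0;    /* user-selected idle timeout */
    double hour_budget  = 20000.0;          /* rated power-on hours       */
    double cycle_budget = 50000.0;          /* rated spindle start/stops  */

    /* Power-on hours accrue continuously in a 24/7 product. */
    double years_by_hours = hour_budget / HOURS_PER_YEAR;

    /* Worst case: one spin-up/spin-down per idle interval, all day long. */
    double cycles_per_year = (HOURS_PER_YEAR * 60.0) / spindown_interval_min;
    double years_by_cycles = cycle_budget / cycles_per_year;

    printf("power-on-hour budget lasts %.2f years\n", years_by_hours);  /* ~2.28 */
    printf("start/stop budget lasts    %.2f years\n", years_by_cycles); /* ~1.43 */
    printf("projected life is the lesser of the two: %.2f years\n",
           years_by_cycles < years_by_hours ? years_by_cycles : years_by_hours);
    return 0;
}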

Is there anyone else who uses 2.5" IDE hard disks in an embedded system, and has developed a sane method of choosing default power management settings? I would like to be able to say "yes, guarantee it for three years", although two would do. As a secondary point, I'd like to know what to put in the product's instruction manual, since the HDD sleep time is user-configurable. Do I say "anything other than our carefully tuned default setting will reduce the lifespan of your hard disk"? Or should we just put a waiver in the warranty saying the HDD is only warranted for 12 months, and after that time only the labor is free?

Reply to
Lewin A.R.W. Edwards

snipped-for-privacy@larwe.com (Lewin A.R.W. Edwards) wrote in news: snipped-for-privacy@posting.google.com:

[snip]

FWIW,

I think you will find that these drives are not supposed to run 24/7 and will start posting Error[UNC] errors sooner than you'd like under heavy load. I encourage you to enable the maximum AAM level (if the drive supports it) to both quiet the seeks and slow them down, which is good for lifetime. Then enable the drive's APM and set the auto-spin-down feature so that your software does not need to. The drive will spin up when you access it (be sure to allow the required 31 seconds before calling an error, though). I use 3.5" DeskStars.
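Roughly, those three knobs map onto ATA commands like this, assuming a Linux host with the legacy HDIO_DRIVE_CMD ioctl; the device path and the specific AAM/APM levels and timer value below are illustrative guesses, not values anyone in this thread specified. A bare-metal design would poke the same values into the ATA task-file registers directly.

/* Sketch: enable AAM, APM and the drive's own standby timer.
 * Argument layout for HDIO_DRIVE_CMD (as used by hdparm):
 * args[0]=command, args[1]=sector count, args[2]=feature.
 */
#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int ata_cmd(int fd, unsigned char cmd, unsigned char nsect,
                   unsigned char feature)
{
    unsigned char args[4] = { cmd, nsect, feature, 0 };
    return ioctl(fd, HDIO_DRIVE_CMD, args);
}

int main(void)
{
    int fd = open("/dev/hda", O_RDONLY | O_NONBLOCK);   /* assumed device */
    if (fd < 0) { perror("open"); return 1; }

    /* SET FEATURES 0x42: enable AAM, level 0x80 = quietest/slowest seeks */
    if (ata_cmd(fd, 0xEF, 0x80, 0x42) < 0) perror("AAM");

    /* SET FEATURES 0x05: enable APM; levels <= 0x7F allow spin-down */
    if (ata_cmd(fd, 0xEF, 0x7F, 0x05) < 0) perror("APM");

    /* IDLE (0xE3) with a timer count in 5 s units: 120 * 5 s = 10 minutes */
    if (ata_cmd(fd, 0xE3, 120, 0x00) < 0) perror("standby timer");

    close(fd);
    return 0;
}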

--
- Mark ->
--
Reply to
Mark A. Odell

You could consider having battery-powered RAM, plus perhaps a Flash, as short-term storage and using the disk strictly as long-term storage, such that the disk is powered up only once a day for a quick transfer.
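As a toy sketch of that idea (the buffer size, record format, and the disk_append_and_spin_down() helper are all hypothetical):

/* Records accumulate in battery-backed RAM (or flash); the disk is
 * only spun up for one bulk transfer per day or when the buffer fills.
 */
#include <stddef.h>
#include <string.h>
#include <time.h>

#define STAGING_SIZE (256 * 1024)        /* assumed battery-backed SRAM size */
static unsigned char staging[STAGING_SIZE];
static size_t staged;

/* Platform-specific hook: spin the disk up, append the data, spin it down. */
extern void disk_append_and_spin_down(const void *buf, size_t len);

int log_record(const void *rec, size_t len, time_t now)
{
    static time_t last_flush;

    if (staged + len > STAGING_SIZE || now - last_flush >= 24 * 60 * 60) {
        disk_append_and_spin_down(staging, staged);  /* one spin-up per day */
        staged = 0;
        last_flush = now;
    }
    if (len > STAGING_SIZE)
        return -1;                                   /* record too large    */
    memcpy(staging + staged, rec, len);
    staged += len;
    return 0;
}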

Rene

--
Ing.Buero R.Tschaggelar - http://www.ibrtses.com
& commercial newsgroups - http://www.talkto.net
Reply to
Rene Tschaggelar

Hi Mark,

Hmm. Interesting response. I guess I should go into more detail:

Yes, this makes sense given that they're intended to be laptop drives. I will download some of the 3.5" specs and compare the reliability data. But our housing would require radical redesign to accommodate a 3.5" drive, and heat would be a problem with those high-speed desktop drives.

However, the typical failure modes we see are:

  • Spindle bearing noise suddenly increases. This usually causes an end-user complaint before the unit has time to actually fail. This problem happens mostly with pre-Hitachi IBM drives, but it also happens with 20Gb+ Fujitsus. But the spindle noise varies widely even inside a single batch of drives; I've opened a box of 20 to test this phenomenon and found two or three loud ones alongside 17-18 imperceptibly quiet ones, all with the same manufacture date.
  • Drive won't spin up. Interestingly, I have opened a couple of these drives and found the bearing is jammed really, really hard. It's not head stiction, because I can see the drive seeking the heads across the stationary disk. If I break the jam by turning it a couple of revs by hand (with power off!), then replace the cover, the drive operates flawlessly (apart from SMART reporting "imminent failure! Danger, Will Robinson!!"). This problem only occurs on Fujitsu drives 12Gb and smaller (12, 10 and 6 are the sizes we have used; these are non-coincidentally all out of older units, too).

Do you mean the acoustic noise management feature? It's supported on all the drives we have shipped to date. I set it to maximum in this latest beta firmware version (the first version to support spindown). However I have to give the user the option of disabling that feature, because it adversely affects the product's ability to play high-bitrate video.

Oh, I don't spin the drive down manually, I use the normal set sleep time command and let the firmware handle spindown. Sorry, should have clarified that :) I allow the user to configure it to "disabled", or in 1-minute intervals from 1-20 minutes.
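Assuming the "set sleep time" command is the standard ATA standby timer (a guess on my part; it is programmed in 5-second units), the mapping from the user's minutes setting is trivial:

/* Convert the user's 0..20 minute setting to a standby-timer count.
 * 0 disables the timer; 240 * 5 s = 20 minutes is the top of our range.
 */
static unsigned char standby_timer_count(unsigned minutes)
{
    if (minutes == 0)
        return 0;               /* "disabled" */
    if (minutes > 20)
        minutes = 20;
    return (unsigned char)(minutes * 60 / 5);
}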

How long have you been fielding products based around hard drives? I'd be very interested to hear what kind of real-world reliability results you're seeing 12-36 months down. Our normal production process includes a 48-hour burn-in designed to catch as many moribund infants as possible. Recently we have had a spate of units between two weeks and six months old, suddenly dying. In descending order of frequency, the top problems are:

  • Intermittently bad RAM. The RAM is from different vendors/batches, and appears to test good at first glance, but swapping out the SODIMM is guaranteed to fix the problematic unit, and it's not dirty connectors because I've tried cleaning them.
  • Hard disk failures of the type mentioned above, particularly sudden increases in bearing noise.
  • Mainboard failures. In particular, we are observing the CS5530 chip is just not putting out a video signal. Poking around on the board, all the necessary clocks and supplies seem OK, and by inspecting over a serial port I can write/read all the chip registers, but there's just no video output on either the analog or TFT-LCD ports. Weird.

The actual numbers of failed units aren't huge, but they represent a big overall spike; for instance, out of a sample of 250 units between 12-24 months old, we have only four real failures*, all of which were failed power supplies reasonably attributable to mains glitches. But there is no common factor that would be an obvious cause for the current set of problems.

* Not counting units which we have determined were damaged in shipping.

--

-- Lewin A.R.W. Edwards

Learn how to develop high-end embedded systems on a tight budget!

Reply to
Lewin A.R.W. Edwards

Oh, no. This is for storage of multimedia data (pictures, MPEG movies, MP3 audio). Gigabytes of data, not a tiny logfile :)

--

-- Lewin A.R.W. Edwards

Learn how to develop high-end embedded systems on a tight budget!

Reply to
Lewin A.R.W. Edwards

There are mature GB-size flash disks on the market. No noise, no seek time, they do not care about spin-ups/spin-downs; operating temperature from -40 to +85 plus vibrations ... If you want reliability - you gotta pay...

Duke S.

Reply to
Duke Skylurker

Duke Skylurker wrote:
: There are mature GB-size flash disks on the market.
: No noise, no seek time, they do not care about spin-ups/spin-downs;
: operating temperature from -40 to +85 plus vibrations ...
: If you want reliability - you gotta pay...

....and with limited life-cycles depending on how much you write to them....

Flash is not the answer for storing large multimedia files. It dies after a measly million or so writes (depending on the media, of course).

I'm not sure how this compares with the life of an IDE drive though. I think the comparison would suggest buying IDE drives that can be easily replaced......

--buddy

--
Remove '.spaminator' and '.invalid' from email address
when replying.
Reply to
buddy.spaminator.smith

At thousands of dollars each, and not designed for constant-rewrite applications.

*mild exasperation* I think my question was very specific. I am interested in heuristics for calculating power management settings that will extend the life of an IDE hard disk. It is ridiculous to think about using flash media in our application, until 30Gb of flash storage can be obtained for under $80. I know there are people using hard disks in applications such as laptop computers, MP3 players, and TiVos, so somebody probably knows the answer to my question.

Much slower write access than hard disks, also. Depending on the media type, could be slower read access as well.

Reply to
Lewin A.R.W. Edwards

That is a million writes per sector, and they are wear-levelled. Completely rewriting a 1GB FLASH device at, say, 4MB/s will take 256 seconds, and doing that 1 million times will take about 71,000 hours, which compares rather well with the 20,000 power-on hours for a 2.5" hard drive.
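The same arithmetic in a few lines of C, plus the figure at an assumed 100 kB/s average write rate (my own made-up duty cycle) rather than continuous full-speed rewriting:

/* Sanity check of the endurance arithmetic above. */
#include <stdio.h>

int main(void)
{
    double capacity_bytes   = 1024.0 * 1024 * 1024;  /* 1 GB device        */
    double endurance_cycles = 1e6;                   /* erase/write limit  */
    double write_rate       = 4.0 * 1024 * 1024;     /* 4 MB/s, continuous */

    double seconds = capacity_bytes / write_rate * endurance_cycles;
    printf("continuous rewrite: %.0f hours\n", seconds / 3600.0);   /* ~71,000 */

    /* Assumed average write rate of 100 kB/s for a multimedia logger. */
    double avg_rate = 100.0 * 1024;
    printf("at 100 kB/s average: %.0f hours\n",
           capacity_bytes / avg_rate * endurance_cycles / 3600.0);
    return 0;
}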

Reply to
nospam

I can't claim to have the experience you hope to find, but some of my observations may yet be of use to you.

We manufacture video servers, and have been shipping such beasts since 1997. One of the first observations we made was that the drives in use in 1997 (9GB Seagate Barracudas), if run in the open on the bench, would soon rise to the full rated temperature given in the spec sheet (150F). We developed our own mounting subchassis, and sandwiched the drives between aluminum plates, with significant airflow. We then observed that the temperature of each drive maintained at about 105F. In those units (many still in service), we had remarkably few drive failures.

Heat is always a significant factor in failure, and drives generate it on their own, in prodigious amounts. It would be interesting to develop some power cycling controls, and then to graph the performance trade-offs vs. average operating temperature. I'm sure that managing the up-time of the drive will boost MTBF, but by how much, of course, is difficult to ascertain.
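If anyone wanted to collect that temperature data, one possible route is to poll the drive's SMART attribute table periodically. Here is a sketch assuming a Linux host with the HDIO_DRIVE_CMD ioctl and a drive that reports temperature as attribute 194 with the raw byte in degrees C; neither assumption is a given for the drives in this thread.

/* Read the drive's SMART attribute table (SMART READ DATA, feature 0xD0)
 * and return the temperature attribute, or -1 if unavailable.
 */
#include <fcntl.h>
#include <linux/hdreg.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int read_drive_temperature(const char *dev)
{
    unsigned char args[4 + 512] = { 0xB0, 0, 0xD0, 1 };  /* SMART READ DATA */
    const unsigned char *data = args + 4;
    int fd, i;

    fd = open(dev, O_RDONLY | O_NONBLOCK);
    if (fd < 0 || ioctl(fd, HDIO_DRIVE_CMD, args) < 0) {
        if (fd >= 0) close(fd);
        return -1;
    }
    close(fd);

    /* Attribute table: 30 entries of 12 bytes, starting at offset 2. */
    for (i = 0; i < 30; i++) {
        const unsigned char *attr = data + 2 + 12 * i;
        if (attr[0] == 194)      /* temperature; first raw byte is deg C  */
            return attr[5];      /* on many drives (vendor-specific)      */
    }
    return -1;                   /* drive doesn't report attribute 194 */
}

int main(void)
{
    int t = read_drive_temperature("/dev/hda");   /* assumed device path */
    if (t >= 0)
        printf("drive temperature: %d C\n", t);
    return 0;
}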

My observations over the last 6 years have been that drives either fail very quickly from some sort of component failure (typically less than 30 days), or they fail from wear. Since the obvious wear point is the bearing, and it's inaccessible, in my view the only things you can do to extend its life will be to a) reduce running time, and b) reduce heat generation.

It would be great to discuss such things with a drive design engineer or two, but the odds of gaining access to any such seem very poor.

--
Bill
Posted with XanaNews Version 1.15.7.4
Reply to
William Meyer

> observations may yet be of use to you.
>
> One of the first observations we made was that the drives in use in 1997 (9GB Seagate Barracudas), if run in the open on the bench, would soon rise to the full rated temperature given in the spec sheet (150F). We developed our own mounting subchassis, and sandwiched the drives between aluminum plates, with significant airflow. We then observed that the temperature of each drive maintained at about 105F. In those units (many still in service), we had remarkably few drive failures.
>
> own, in prodigious amounts. It would be interesting to develop some power cycling controls, and then to graph the performance trade-offs vs. average operating temperature. I'm sure that managing the up-time of the drive will boost MTBF, but by how much, of course, is difficult to ascertain.
>
> quickly from some sort of component failure (typically less than 30 days), or they fail from wear. Since the obvious wear point is the bearing, and it's inaccessible, in my view the only things you can do to extend its life will be to a) reduce running time, and b) reduce heat generation.
>
> but the odds of gaining access to any such seem very poor.

return key broken?

Reply to
TCS

Hi William,

> ... would soon rise to the full rated temperature given in the spec sheet (150F). We developed our own mounting subchassis, and sandwiched the drives between aluminum plates, with significant airflow. We then observed that the temperature of each drive maintained at about 105F. In those units (many still in service), we had remarkably few drive failures.

Our drives are mounted, PCBA-down, onto a subchassis made of ~2mm steel (it's some crazy non-metric gauge, but it's roughly 2mm thick). We have bent tabs that run up the sides of the drive, and we use the side-entry screw holes, not the bottom screw holes. The "parts side" of the subchassis faces the outside world, with about 2cm of airspace then a perforated thin steel outer housing.

The thing is, we don't temperature-test/characterize every single point over the surface of the subchassis. We have a couple of temperature probe points - on top of hot ICs, and in the power supply - and we test in various environments to make sure we don't exceed rated temperatures of 60 Celsius, with a target temperature of 50 degrees.

> Heat is always a significant factor in failure, and drives generate it on their own, in prodigious amounts.

Unfortunately, the main thing I managed to extract from the datasheets is that you're damned if you do and damned if you don't. Powering down the drive eats into a "number of head unload cycles" lifespan. Leaving the drive running eats into a "number of hours of bearing life" lifespan.

> It would be great to discuss such things with a drive design engineer or two, but the odds of gaining access to any such seem very poor.

Yeah.. I would settle for talking to an engineer at one of the big laptop manufacturers, though. When Toshiba decides on the warranty period for their laptops, they must have some sensible method of determining it...

--

-- Lewin A.R.W. Edwards

Learn how to develop high-end embedded systems on a tight budget!

Reply to
Lewin A.R.W. Edwards

Probably even more difficult than getting a shot at a drive design engineer....

I haven't looked at the head load/unload cycle figures. I'd have thought that the numbers for that activity would be more than sufficient.

We always use the side mounting holes, ensure good contact with metal, and on the 3.5" drives, we use screws in all three positions on each side, also for better thermal coupling. Not sure what I would expect from steel; we have always mounted with aluminum.

It's an interesting problem, and I'd be interested in what you find if you develop some test setups.

--
Bill
Posted with XanaNews Version 1.15.7.4
Reply to
William Meyer
