OT: Dealing with random laptop lockups

I haven't had that happen in many years. Stiction was a problem back in the days when manufacturers landed their heads on the platter instead of using a head lifter. The spindle motor didn't have the torque to break loose the stiction between two parallel surfaces. See item #44 for life with stiction:

Also gone are the days when the drives loaded all their firmware from below track zero off the platter. Even though there were 3 or more copies of the "firmware" (more correctly bootware), dimensional variations in the aluminum case were sometimes sufficient to cause a boot failure. That's because the servo tracks were NOT being used until after the bootware was loaded and therefore could not compensate for these changes.

More common are calls like "I turned off the machine, cleaned out the dust, and now it won't boot or turn on". What that really means is that the electrolytics in the power supply have cooled down and now have a higher ESR than when hot. Replacing the power supply and leaving the HD alone usually fixes that.

That's exactly the scenario I described to DecadentLinuxUserNumeroUno and the reason I don't like RAID.

Try blade or nano-ITX servers and SSD drives for lower power and less noise. At this time, they're probably too expensive for consumers but as prices drop, I think these will be the norm.

Nice. My plans are to do something like that. All storage goes on NAS boxes. All services on a dedicated ITX box running FreeBSD or Debian. Add FreeRadius to your list of services so that I don't have to deal with shared wireless pass phrases (WPA2-PSK). I have it mostly working at home.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

Not stiction. Just "startup surge" trying to get that mass into motion and up to speed. This is especially true of the 15K drives.

But you can get the redundancy of RAID without adopting an "integrated" RAID solution.

E.g., I keep duplicate copies (or more) of all files. And, the meta data for each in a DBMS. So, I can keep 12 copies of the same file on the same volume; or on 12 different volumes; etc. And, I don't have to ensure all 12 copies are "on-line" at the same time in order to be able to retrieve *a* copy of the file.

Or, keep just *one* copy!

So, if the copy I *choose* to retrieve doesn't pass its stored checksum, I know it is corrupt and can go looking for another copy on the same volume or on a different volume -- perhaps one that isn't even *spinning*, currently! (or, even *in* a ZIP/TAR/etc. *archive* having yet another name!)

Because the DBMS tracks all the meta data, I can have code systematically walking through any/every filesystem WHENEVER IT IS MOUNTED checking/verifying all the files known to reside on that filesystem: "Hmmm... the checksum for foo.baz doesn't match!" "Oooops! Read error when trying to access biggle.boo!" "Hey! Where the hell did Taxes.txt disappear to??" By updating the DBMS to reflect "time last checked", this systematic examination can be interrupted whenever I want to spin the drives/boxes down -- then resumed when things come back online days or weeks later.

So, instead of the "nasty surprise" that awaits you when you go looking for a particular file (which may have disappeared, become unreadable, or become corrupted), you get advance notice and, hopefully, will act to ensure no further losses are incurred!
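
A minimal sketch of that sort of catalog sweep, in Python with SQLite standing in for the DBMS (the table layout, column names and paths below are invented for illustration, not the actual schema described above):

# Sketch: verify every cataloged file on a mounted volume against its stored
# checksum, assuming a table like files(path TEXT, sha256 TEXT, last_checked TEXT).
import hashlib, os, sqlite3
from datetime import datetime, timezone

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def verify_mounted_volume(db_path, mount_point):
    db = sqlite3.connect(db_path)
    rows = db.execute("SELECT path, sha256 FROM files WHERE path LIKE ?",
                      (mount_point + "%",)).fetchall()
    for path, expected in rows:
        if not os.path.exists(path):
            print("MISSING :", path)          # "Where did Taxes.txt go??"
            continue
        try:
            actual = sha256_of(path)
        except OSError as e:
            print("READERR :", path, e)       # unreadable; go find another copy
            continue
        if actual != expected:
            print("CORRUPT :", path)          # checksum mismatch
        else:
            db.execute("UPDATE files SET last_checked = ? WHERE path = ?",
                       (datetime.now(timezone.utc).isoformat(), path))
    db.commit()

# verify_mounted_volume("catalog.db", "/mnt/archive1")

Because "time last checked" lives in the table, the sweep can stop whenever the volume spins down and pick up where it left off later.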

I'm spinning many TB currently. I don't care to make that sort of investment in SSD's just to cut down on noise, heat, etc. :> Instead, I'll tough it out when I *need* to access these things.

I have a bunch of Dell FX160's: They'll hold a laptop SATA drive internally. Passive cooling (small fan under the disk drive *for* the disk drive). Runs headless as I've got it tucked under a dresser in my bedroom. (silent!)

This *one* box has all the key network services for my intranet *plus* serves PXE images to the other FX160's which run diskless (and headless). Other boxes have external USB drives and act like NAS's using software served up from that first machine.

The first machine is relatively complete in terms of software so I can telnet to the box and build/rebuild system images (for the other boxes) or run X clients from the box, etc. (so I can be working from an X terminal *or* an X server on a Windows machine and have the benefit of a "real console" even though there's no video connected to the actual box(es)).

Eschew wireless, here. A combination of paranoia and NOT wanting to have to address those security issues. There are close to 100 network drops around the house (about half of them are dedicated to specific devices) so I can sit down virtually anywhere and be a "6 ft patch cord" away from a connection.

Reply to
Don Y

On Wed, 05 Aug 2015 18:11:32 -0700, Don Y Gave us:

Bull. 15k drives typically have 1.5 inch platters and spool up faster than the more massive, 7200 rpm 3.5 inch form factor drives.

You keep guessing at reality and getting it wrong. The industry reduced platter diameters and this aspect was one of the main reasons. They spool up faster, not slower.

Reply to
DecadentLinuxUserNumeroUno

Ummm... 15K drives use the SAS (serial attached SCSI) interface in order to obtain the advertised speeds (currently 6 Gbits/sec and climbing). Most of these drives are 3.5" platters. If you search for "15K SAS 3.5 inch", you'll get plenty of hits. I did find a few 2.5" platters but no 1.5". Unfortunately, I've had no experience with these drives (yet) and know nothing about spinup time.

On my data dumpsters, I usually configure the drives to spin up in a staggered sequence with about 2 seconds delay between each drive. That's probably overkill for small arrays, but starting up 8 drives simultaneously once caused a power supply to complain which then forced me to drive 60 miles round trip at 2AM in order to kick start it.
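
If you'd rather do the staggering in software than with drive jumpers or the controller, a rough sketch on Linux looks like this, assuming the drives are set for power-up-in-standby and sg3_utils is installed (device names and the 2 second delay are just examples):

# Sketch: software-staggered spin-up of a small JBOD shelf. Each drive gets a
# SCSI START STOP UNIT (start) command, with a pause before the next one.
import subprocess, time

DRIVES = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]   # example devices
DELAY_SECONDS = 2

for dev in DRIVES:
    subprocess.run(["sg_start", "1", dev], check=True)  # "1" = start the unit
    time.sleep(DELAY_SECONDS)   # let the startup surge die down before the next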

With all due respect, you might want to double check your allegations, especially if you're not certain, before you post them.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

I took out a current probe and watched the 12V supply on one of these drives when commanded to spin up: just about 3A peak @ 12V (though for just a fraction of a ms), holding steady at 2.5A for a significant portion of the wind up; settling in at about 1A for "idle" (no seeks).

Startup power is thus almost 40W (12V+5V) instantaneous settling to ~12-13W when idle. By contrast, the 640G laptop SATA drive in the FX160 draws a few watts, total! :-/

(we'll see if *it* lasts 10+ years like these others! :> )
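
Back-of-envelope check on those figures (the 5V rail currents below are guesses; only the 12V numbers were actually measured):

# Rough arithmetic only; 5V currents are assumed, not measured.
spinup_w = 3.0 * 12 + 0.8 * 5    # ~36W on 12V plus a few W on 5V  -> ~40W
idle_w   = 1.0 * 12 + 0.15 * 5   # ~12W plus a little on 5V        -> ~13W
print("spin-up ~%.0f W, idle ~%.1f W" % (spinup_w, idle_w))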

I stagger the drives to start 2 every 6 seconds so it takes almost a minute to spin up a shelf. I'm pretty sure the power supply would quit if they were NOT staggered!

Reply to
Don Y

RAID 0+1, as the term is mostly used and as the Wikipedia entry says, uses 4 drives. Drives 0 and 1 are striped to make one RAID0 virtual drive. Drives 2 and 3 are striped to make another RAID0 virtual drive. These two are mirrored to make a RAID1 pair. There is no 5th drive, or parity drive.

RAID 01 is rarely used, because RAID10 is almost always better (less chance of catastrophic failure, faster recovery, and better performance). It occasionally finds use when the two halves of the RAID1 pair are physically separate or have a lower-speed link.

RAID10 takes two drives and creates a RAID1 mirror of them. Then another two drives make another RAID1 mirror. These two RAID1 pairs are then striped together to make a RAID0 pair of RAID1 pairs. Again, there is no parity drive. It is commonly used as a very fast array (especially for work that uses a lot of small writes), with quick recovery. It can tolerate any single drive loss, and two out of three combinations of 2 drive losses.
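
A quick way to convince yourself of that last claim, for a hypothetical 4-drive layout with drives 0/1 mirrored, drives 2/3 mirrored, and the two mirrors striped:

# Enumerate all two-drive failures in a 4-drive RAID10 (mirrors {0,1} and {2,3}).
# The array dies only if both halves of one mirror fail: 2 of the 6 combinations.
from itertools import combinations

mirrors = [{0, 1}, {2, 3}]

def survives(failed):
    return all(not m <= failed for m in mirrors)

results = {c: survives(set(c)) for c in combinations(range(4), 2)}
print(results)                                     # 4 survive, 2 do not
print(sum(results.values()), "of", len(results))   # -> 4 of 6, i.e. two out of three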

As far as I can figure (but I may have interpreted you incorrectly), what you have got is two RAID1 pairs of two drives, connected along with a fifth drive as the parity drive of a RAID3 setup. It is, frankly, a silly arrangement - overly complex, with poor performance for almost everything (assuming there is nothing special about the drives or the links), compared to more conventional systems.

It would give you full two-drive redundancy - but you could achieve that more simply and efficiently with a standard 4-drive RAID6.

Reply to
David Brown

You are mixing up two /completely/ different terms here. An "unrecoverable read error" means that the drive is unable to return the requested data, and says so to the drive electronics. Such URE's are common as drives get older, especially on bigger drives. They are not a problem in a good raid system - the drive reports the failure to the raid controller (OS or hardware raid card), which reads the mirrored version or stripe with parity from the other disks, re-creates the missing data, and re-writes it to the original disk. The original disk writes the data again, possibly re-mapping the bad block if needed.
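
The "re-creates the missing data" step is ordinary redundancy math; in the parity case it is essentially XOR. A toy illustration only, not how any particular controller implements it:

# Toy RAID5-style reconstruction: parity is the XOR of the data blocks, so a
# block lost to an URE can be rebuilt from the surviving blocks plus parity,
# then rewritten to the original disk.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks in one stripe
parity = xor_blocks([d0, d1, d2])        # what the parity disk holds

rebuilt = xor_blocks([d0, d2, parity])   # d1 came back as an URE: rebuild it
assert rebuilt == d1                     # then rewrite it to the bad disk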

The chances of an URE occurring at the same point on two disks (even of the same type and age) are negligible. The risk comes when you have a single-redundancy system (RAID1 pair, RAID5) and a complete disk failure, or you are replacing a disk (without having an advanced system like Linux raid's hot replace). In such situations, you have no redundancy and an unrecoverable read error means missing data.

An /undetected/ read error is completely different. This means the drive has read incorrect data, but thinks the data is correct. This is incredibly rare in normal use. An unrecoverable read error occurs when there are more bit errors in the magnetic surface than the Reed-Solomon codes can correct. But to get an undetectable read error, there must be many more such errors and they must match up in a way that the RS codes pass as correct data. In reality, undetected read errors are almost invariably a sign of electronics failures or firmware bugs. They /do/ occur, but are fortunately seldom an issue.

You are right that having a mirror halves the chances of the undetectable read error reaching the host. And identical drives would share the same firmware bug (if that's the cause), but it is likely that such bugs would be triggered by unusual circumstances (such as particular results from the RS codes), and would not trigger on the same block read. Hardware failures would not be correlated.

That is completely wrong. Identical drives are likely to have the same /distribution/ pattern for failures (percentage of early deaths, roughly similar expected lifetimes in normal use, and similar half-lives after that). But expecting two identical drives to fail at the same time is as rational as expecting two identical dice to roll the same number. You can use the information for statistics in your datacenter with thousands of drives, but not for any given raid array.
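
The dice analogy is easy to make concrete: give two "identical" drives the same lifetime distribution and they still almost never die within the same week of each other. The Weibull numbers below are invented for illustration, not taken from any datasheet:

# Two drives drawn from an *identical* lifetime distribution still fail at
# essentially independent times; shape/scale values here are made up.
import random

random.seed(1)
SHAPE, SCALE_HOURS = 1.5, 40_000          # same distribution for both drives
TRIALS, WINDOW_HOURS = 100_000, 168       # "same time" = within one week

same_week = sum(
    abs(random.weibullvariate(SCALE_HOURS, SHAPE)
        - random.weibullvariate(SCALE_HOURS, SHAPE)) < WINDOW_HOURS
    for _ in range(TRIALS)
)
print("failed within a week of each other: %.2f%%" % (100 * same_week / TRIALS))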

Reply to
David Brown
