OT: Dealing with random laptop lockups

Dunno. I've never even read the "obvious" choices, there.

Most charities are anything *but* "altruistic"! IME, they are all thinly veiled efforts to give a certain group of people JOBS -- at the expense of volunteer labor/donations from a far greater number!

And, lots of "rationalizing" as to why the "CEO" *needs* to go to fancy restaurants for lunch (to coax donations from donors?), *needs* an auto allowance, health care, etc. Yet, can't afford to pay more than minimum wage for the few clerical positions (*and*, NOT pay unemployment insurance on those!!)

OTOH, when the curtain gets peeled back, those folks dipping into the till find they've no place left to go -- who wants to hire an "executive" from a failed non-profit? *Another* non-profit (intent on failing??)

I wanted some guidance so I could design algorithms to tell users of *my* devices when memory integrity was "suspicious". Expecting NO ECC errors is probably not realistic. OTOH, how many are acceptable? When do you start wondering if you are getting uncorrectable errors?
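
For illustration, a "suspicion" test needn't be more than a rate check against a (so far arbitrary) budget; a minimal sketch in Python, with placeholder thresholds rather than anything defensible:

def ecc_suspicious(corrected, hours, uncorrectable=0, max_rate_per_day=1.0):
    """Flag memory as 'suspicious' from ECC event counts.

    corrected        -- corrected (single-bit) errors seen in the window
    hours            -- length of the observation window, in hours
    uncorrectable    -- uncorrectable (multi-bit) events in the window
    max_rate_per_day -- placeholder budget, NOT a vetted number
    """
    if uncorrectable > 0:
        return True               # data already corrupted/at risk
    if hours <= 0:
        return False              # nothing to judge yet
    return corrected * 24.0 / hours > max_rate_per_day

The hard part, of course, is the budget: anything much above "zero per month" probably deserves at least a log entry -- but that's exactly the guidance I was looking for.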

In my case, as there is no payment (from friends), the criterion is "I'll fix anything that a friend/neighbor is willing to *ask* to have fixed -- knowing that it is a significant imposition on me to do so."

E.g., last laptop that went under the knife was an XPS laptop with a bad power connector (snapped off). Trivial repair -- but a lot of work tearing it down to the bare bones to be able to extract the mainboard and refit a new power connector. OTOH, it was a really nice laptop (3D display, etc.).

This turns every donation into several hours of work. *Assuming* they've remembered to bring you the power adapter, etc.

OTOH, if you can dig through a stack of "identical" laptops and quickly put those missing disk drives, optical drives, busted keyboards, etc. off to the side to concentrate on the ones that *look* like they stand a greater chance of success, you can get more results for a given investment of time. Then, start poking at the odds-n-ends as you have time to see if you can piece together N machines from N+m machines worth of components.

It's not just that aspect. Even if you had infinite space, eventually, all the "stuff" becomes distracting. How do you decide where to put your efforts? How do you *find* something that you "know" you've got squirreled away, somewhere?

E.g., I have a bunch of (identical) 10.5x5x18" boxes lining the wall in the garage. (we're talking 150-200 such boxes!) I recover them from one of the local hospitals (they are used to ship "vacutainers" to the hospital so there is an almost endless supply of them!). Each box is labeled: "pointing devices", "mice", "speech synthesizers", "video cables", "DB9 cables", "DB25 cables", "RJ45", "VHDCI", "SCSI2", "Sun SCSI", "SCSI3", "wall warts 5V", "wall warts 12V", "wall warts >12V", "bricks", "appliances", "USB", "ribbon", "SATA", "SATAIDE", "velcro", "cable ties", etc. -- plus a box of spare parts for each of the machines that I have in service.

If I can't quickly find what I am looking for with that sort of fine-grained partitioning, then, chances are, the item is too unique for me to realistically "need" to hang onto it. There's a point where *looking* for a part takes more time than just BUYING it!

I have 4 boxes of power cords -- sorted based on the sorts of connectors that they have. E.g., mickeys, figure-of-eights, 12", 36", 6ft, HP "extenders", right angle (up, down, left and right), etc. Anything that needs a power cord already has a power cord fitted. E.g., each of my scopes/DSO's, logic analyzers, freq generators, DMM's, etc. all have cords (typically right angle) *captive* to them. So, I only need "extra" cords to address changing equipment needs.

E.g., I have a bunch of 14AWG *LONG* power cords that I keep plugged into various outlet strips around the house/office. When I need to pull a piece of equipment out, I'll temporarily plug that long cord into the device instead of trying to climb under/behind a piece of furniture to plug the *correct* cord from the device into that outlet strip.

I took that approach with laserjets, originally. Treat the printer as a "cartridge" that was disposable -- when I run out of ink, I'll just recycle the entire printer!

But, found that the toner cartridges lasted a LONG time. And, in some cases, I was able to rescue new cartridges, as well. E.g., my LJ4m+ has enough toner to last to The End of Days!

Reply to
Don Y

On Sun, 2 Aug 2015 13:21:03 -0400, bitrex Gave us:

Another common occurrence with these is the CPU to heat sink interface. The paste can dry up, and if the interface was not clamped up nice and tight and coplanar, lockups like this can happen.

So it still may not be an actual hardware failure, electronically speaking. It may still be a thermal issue where the heat simply is not being conducted away quickly/well enough.

Reply to
DecadentLinuxUserNumeroUno

On Sun, 02 Aug 2015 13:15:47 -0700, Don Y wrote:

Yep, that's about it. Usually, when the charity is first conceived, good intentions are the driving force. However, as the glamour of helping the impoverished morphs into helping the poverty fighters, the initial intentions are lost in the noise. The road to hell is paved with the best of intentions.

When I'm confronted by the unexpected or irrational, I always ask myself "What problem are they trying to solve?" For charities, it's usually the lack of sufficiently profitable and sustainable employment for those involved. Sometimes, it's more complex, like laundering money contributed by wealthy donors for large charitable income tax deductions.

The "Problem Background" section includes clues, test results, and references that look helpful. I can't provide a guidance, but perhaps some of the references can.

Chuckle. It's amazing how many laptop owners forget the charger, even if I remind them. During the late 1980's, when most PCs had either a mess of jumpers or a setup program, I would ask the customer to bring in all the disks and documentation. Few did, assuming that I maintained a library of everything that was ever produced. Actually, it seemed like I did, with boxes and boxes of disks and docs. It took a few years, but I finally figured out how to inspire the customer to bring everything. I mentioned that if they couldn't find the battery charger, I would be happy to supply them with a spare at my usual exorbitant markup. Amazingly, the laptop chargers began to arrive quite regularly.

That probably can be made to work. I tend to do the easy stuff first, leaving the machines with headaches last. The result is a week or two of easy sailing, followed by a week or two of living hell. These days, I do it the other way around, mostly because I have to order parts for the problem machines, which take time to arrive. Speaking of parts, I've been off for a week and can only imagine the boxes of parts that will await me on Monday.

Easy. I file things chronologically. The oldest units are at the bottom of the pile. If I need to find something, I can usually find it if I remember when it arrived. I use anti-static bags or cardboard boxes for parts that need to be kept near the work in progress. It can get rather messy, but I rarely mix up two machines, or lose the bag. My office isn't big enough for that to become a problem. It would not work if I had to deal with 300 machines at a time, but it's ok for the usual 5-10 machines in work, and 5-10 waiting for parts.

Some of my equipment has attached cords, but most of the stuff uses removable power cords. I leave the cords plugged into the power strip when I move the equipment. That means I need to either have spare cords when I move the test equipment, or fight my way through the Gordian knot of power cords to retrieve the original. I would have been doing just fine with spares for everything had I not decided to reduce the number of cords in stock, and overshot.

Laser printers are an oddity and exception to all the economic rules in computing. Users will recycle a computer at the first sign of a problem, but will hang onto their favorite printer as long as possible. I've done repairs that cost more than the value of a new printer, without any complaints from the customer. Business customers will recycle their desktops every 6 years or so, but retain their laser printers for 10 years or more.

I think most of the problem was that laser printers were originally designed to sell replacement toner carts. The printers were sold near cost and the toner carts were expensive. Kinda like inkjet printers. The problem was that refills and clone carts flooded the market with cheap carts, ruining that business plan. Also, most printers came from the copier market, which was based on long life mechanisms.

So, HP and others decided to solve the problem by reducing the designed life of the printer mechanics, while simultaneously decreasing the number of pages that a toner cart would hold. The result was that commodity HP laser printers made in about the last 10 years are mostly junk. 300K pages was common with the older models, while the new printers could barely do 50K pages before wearing out. HP also excelled at leaving known failure mechanisms in place. For example, the built-in duplex paper jam misfeature was well known, but was left in place for about 15 years in the older HP printers.

The newer generations have different failure modes, mostly based on difficult to replace wear parts. The newer printers also lacked the commonly available rebuild kits of the older printers, substituting major assemblies which made repair uneconomical. Most of my business customers have experienced the problem. They ask me to buy older and more rugged printers on eBay, refurbish them with all new rubber parts, and are quite happy with the results.

I had to draw the line somewhere. The LJ4m+ is just too old and slow. Roughly, the HP printers with 4 digit numbers are the good ones. (2100, 2200, 2300, 4100, 4200, 4300, etc).

Ok... Vacation is over for me. Back to having "fun" while working.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

I'm not even sure *that* is true! (Cynic) I see a lot of it as "Hmmm... what 'excuse' can I come up with that will convince people to give me money?"

A problem is that the organizations quickly shift their focus from the original goal to one of "keeping themselves employed" (er, "paid"). To that end, they make all sorts of unsubstantiated claims, etc.

But, to the folks at the cocktail parties (fund raisers), they "talk a good line". Few of those donors are sufficiently interested in seeing what's *really* going on -- so, they just write a check and feel good about themselves. Little genuine concern for the effort they are allegedly funding!

Note that in our case, by the time the easy machines are done and out of the way, another shipment of machines could have arrived. It may include more (good) machines of the same make/model. Or, *better* machines. With limited time/space, you can opt to just ignore the "tough" units in the first batch entirely.

I used to maintain my own "work area", there -- so I could leave things "in pieces" while waiting for parts (or to finish troubleshooting on another day). Particularly useful for things like LCD monitors (which eat up a lot of workspace!)

This was a colossal failure. As I would only "visit" once each week, invariably, someone would "clear off" my work area for some other use. Or, just pile more stuff on top. And, any tools I'd left behind would wander off...

I learned the only real way to keep work in progress was to bring it home and set it aside -- which was routinely met with grumbles and sidelong glances from SWMBO!

My approach with the laptops took this into consideration so I didn't have to leave things "half finished"; just a place to store the server and drag out some network cables to tether the laptops being processed on that day!

Yes. But, I leave them connected to the *device* in question. E.g., usually there's a way to wrap the cord around the device while stored.

There are 36 (6 strips of 6) outlets in the office fastened to the undersides of my workbenches (this keeps the cords up off the floor and makes it easier to access the outlets). Plus another 48 outlets on UPS's (8 sets of 2 groups of 3) -- which consume 8 of these 36. Together, they "permanently" feed:

(8) UPS's (one for each of the "computers")
(7) monitors
(2) PC's
(2) SPARCstations
(2) 12-drive arrays (711 cases)
(2) DLT's
(2) 1U servers
(2) 2U servers
(6) half height SCSI devices (611 cases)
(2) X terminals
(3) laser printers
(4) scanners
(4) NAS's
(1) print server
(1) network switch
(1) Unisite
(1) digitizing tablet
(2) inspection lamps/magnifiers
(1) stereomicroscope illuminator
(1) inspection camera
(1) personal stereo
(1) wireless headphones
(2) uncommitted long power cords

Some of these are wall warts -- which often "waste" an adjacent outlet (or two!)

[Of course, I never have all of this running! But, having to crawl behind/over each device to plug it in when needed is just suicide!]

If I need to pull a piece of test equipment out, I tend to use the "uncommitted long power cords" -- instead of having to crawl under the worktables to find a free outlet for the power cord that is "attached" to the device.

I've had very few problems with lasers. I had an LJII many years ago that had a "pick" problem but that was just rollers. I'd still have it if it wasn't such a power hungry beast!

The (solid ink) phasers are a bit of a pain as they waste so much "ink" on each startup. So, I make sure I have a lot to print when I fire that up. (And, the smell of "melted crayons" takes forever to dissipate!)

I watch for them at local auctions, at this "recycling" facility, etc. People want new and sexy so older units just get dumped. It's usually pretty easy to obtain large MFC units -- but who the heck has the room for such a beast??

I use the LJ4m+ when I need to print (native) PostScript or when I need the duplexor capability. I.e., it saves me paper *and* time if I'm printing something big (like an SoC datasheet). And, with a bunch of new toner carts, it only costs paper to print!

For day to day, "low volume" use, I use the 5p or 6p. And, for color, one of the phasers.

Reply to
Don Y

Yeah, it's bullshit. I had an unrecoverable read error, ran a full surface test; it found no errors and raised no warnings. I ran another, same result. 3 weeks later it failed again, and a repeated scan found over a hundred unrecoverable sectors.

This happened 2 months before the 3-year warranty period ended.

My opinion now is that one unrecoverable read error is too many.

--
  \_(?)_
Reply to
Jasen Betts

Working with EIDE drives, Hitachi's Drive Fitness Test was the go-to diag; it would fail drives that caused mysterious hangups but that other tests would pass. Have you (or anyone) found a similarly reliable diag for SATA drives? TIA

Reply to
Wond

March of Dimes is the poster child. They once had a reason for being. Now their reason for being is to be; don't let a good tear go to waste.

On the other end is the Salvation Army.

Reply to
krw

Gosh... that goes way back. I didn't use it much, but I do recall it took a long time to run.

Nope. HDtune 2.55 (free) and various SMART tools are what I use. I can't afford the time to do a thorough test, so I settle for fast tools and use intuition to make the determination.

Also, I have one accidental tool that works nicely. When I do an image backup with Acronis True Image, it will slow down significantly if there are any read errors. I know how long it takes to back up a good drive. If the test drive seems slow, it's probably doing retries and on its way out. Running a SMART test (SpeedFan) before and after a backup sometimes shows changes, which is always a bad thing.
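
You can gin up the same signal on purpose without a full backup: time sequential reads off the raw device and look for regions that are wildly slower than the rest. A rough Python sketch (assumes Linux, read access to the device and an otherwise idle disk; the device name and the 10x-median cutoff are made up):

import time

DEV = "/dev/sdX"            # placeholder -- point at the drive under test
CHUNK = 1024 * 1024         # 1 MiB per read
SAMPLES = 2000              # ~2 GB sampled from the start of the disk

times = []
with open(DEV, "rb", buffering=0) as f:
    for i in range(SAMPLES):
        t0 = time.monotonic()
        if not f.read(CHUNK):
            break                             # hit end of device
        times.append((i, time.monotonic() - t0))

if times:
    median = sorted(t for _, t in times)[len(times) // 2]
    slow = [i for i, t in times if t > 10 * median]
    print("median read %.1f ms, suspiciously slow chunks: %d"
          % (median * 1000, len(slow)))

A drive that is quietly retrying tends to show up as a scattering of reads many times slower than their neighbors, well before the SMART summary admits anything.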

There's a "14 Free Hard Drive Testing Programs" roundup out there; I've tried about 5 of these with mixed results. In general, the SMART based tools are marginal or worse. I need to try some of the others (yet another project).
--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

I've had no reliable means of predicting when drives would fail. Personally, I've only had one 3.5" and a pair of 2.5" (all PATA) drives fail -- the latter (I hypothesize) because they were installed in a box that ran 24/7/365 and they just couldn't take the constant abuse, spin up/down, etc.

When dealing with recycled machines, you *know* the drives are "used" and have no real idea what sort of service they saw before coming to you. So, it's a crap shoot. But, very frustrating to build a machine with a drive, use that drive as the template for cloning other drives (we have a machine that does this) -- only to find the drive fails to spin up just *after* you've built a system on it!

IIRC, it is possible to get single corrupted sectors from power anomalies. So, I'm not sure I would give up at "one". OTOH, is *two* the right number? Or, 985?? :<
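
One way to split the difference is to not obsess over the absolute count and, instead, watch whether it *grows* between checks. A sketch of that policy (just the decision logic; the counts would come from SMART's reallocated/pending/uncorrectable raw values, and the "2" is a placeholder, not a recommendation):

def drive_suspect(prev, curr, max_static_bad=2):
    """prev/curr: counts sampled on two occasions, e.g.
    {"reallocated": 0, "pending": 1, "uncorrectable": 0}"""
    for key in ("reallocated", "pending", "uncorrectable"):
        if curr[key] > prev[key]:
            return True                   # defect list is growing -- bad
    # a small, *stable* count might just be that one power-anomaly sector
    return curr["pending"] + curr["uncorrectable"] > max_static_bad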

Reply to
Don Y

SMART will tell you how many hours they've been running, the number of spin-ups, max and min temperatures etc...
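
For the record, smartmontools will cough those up from the command line; a quick-and-dirty way to pull the raw values (needs smartctl and root; attribute names vary a little between vendors, so treat these as typical rather than universal):

import subprocess

WANTED = ("Power_On_Hours", "Start_Stop_Count",
          "Power_Cycle_Count", "Temperature_Celsius")

def smart_summary(dev="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    summary = {}
    for fields in (line.split() for line in out.splitlines()):
        if len(fields) >= 10 and fields[1] in WANTED:
            summary[fields[1]] = fields[9]    # RAW_VALUE column
    return summary

print(smart_summary())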

I've never seen that happen. If the drive's on a UPS, or the power has been good since the file was written, one is definitely the right number.

--
  \_(?)_
Reply to
Jasen Betts

Yes, but none of that is an accurate predictor of whether/when the drive will die. As I said, Google did an analysis of their disk farm and came to essentially the same conclusion: SMART isn't very helpful (I think they claimed half of their failures were "surprises" in terms of lack of any indication from SMART reporting).

Remember, it's not the power at the AC outlet that the drive is concerned with but, rather, the power at the disk *and* the controller talking to the disk.

E.g., I've found power supplies with bad caps so you know the drive isn't seeing "DC" or even the "correct" voltages continuously. It can mostly work, always work, never work or work intermittently. Unfortunately, your data doesn't get a second chance!

[Search for the google study; it's a good read -- few folks have that large a population to survey!]
Reply to
Don Y

"Failure Trends in a Large Disk Drive Population"

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

You asked for "what sort of service they saw before". I don't find that information very useful myself.

Google said something similar. "39 times more likely to fail" -- but they mean "more likely to be replaced", which is not the same thing at all.

--
  \_(?)_
Reply to
Jasen Betts

On 5 Aug 2015 13:24:50 GMT, Jasen Betts Gave us:

Redundancy in your backups via volume mirroring, even on the same physical drive, reduces this possibility to near nil.

Reply to
DecadentLinuxUserNumeroUno

It still doesn't tell you the type of service seen. A drive that spends its life thrashing sees different sorts of wear than one that just spins -- even at elevated temperature.

It isn't. Tape drives, DLP's, etc. could provide insight into the amount of "remaining life" with metrics like "tape motion hours", "lamp hours", etc. There's no real analog in the disk world (bytes transferred? seek-inches??)

39 times a small number can still be a small number. Note they also claim that half of their "failures" had no corresponding "early (SMART) warning". With that large a population, you'd expect a more definitive result! I can run with a growing defect list; I can't run with a failed spindle motor!
Reply to
Don Y

I do. The drives that seem to last forever are the ones that had been running in servers spinning 24x7. The ones that die quickly are those that spin up/down the drive to save energy. I have one old SCO Unix 3.2v4.2 server that was running 24x7 from about 1996 until last year, when I got tired of waiting for it to die and shut it down. I can't check right now but I think it was a Conner Peripherals 1GB SCSI drive. The drive was running an email server, so it got plenty of action processing spam. I've had similar experiences with server drives.

If you want the drive to last, leave it spinning and don't let the drive go through too many thermal cycles.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

Really? The way I see it, if you take two identical drives, mirroring the data with RAID 1 will cut the chances of undetected data failure in half. This is good. However, with double the number of drives, you'll have double the chances of having a single drive failure. If the drives are absolutely identical, chances are good that both drives will have nearly identical lifetimes, and possibly near simultaneous and identical drive failures. At best RAID 1 (mirroring) gives you a fighting chance to remirror a failed drive if you're lucky, and the mirrored drive survives long enough to complete the remirroring. I haven't missed yet, but I've come close. See my other comments on RAID a few messages upthread.
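
To put very rough numbers on that trade-off (the per-drive failure probability and rebuild window below are assumptions, and independence is assumed even though identical drives with identical wear make reality worse):

p = 0.05                 # assumed chance that one drive fails this year
rebuild_days = 1.0       # assumed time to notice the failure and remirror

p_any_failure = 1 - (1 - p) ** 2              # ~2p: twice the service calls
p_second_during_rebuild = p * rebuild_days / 365.0
p_array_loss = p_any_failure * p_second_during_rebuild

print("chance of at least one dead drive this year: %.3f" % p_any_failure)
print("chance of losing the mirror this year:       %.6f" % p_array_loss)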

From my experiences, RAID creates as many problems as it solves.

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann

On Wed, 05 Aug 2015 11:16:28 -0700, Jeff Liebermann Gave us:

Except note where I clearly stated "on the same physical drive".

Not at all. The components last a very long time. A failed drive is usually due to some component that does NOT match the characteristics of the average lot.

I stated as much.

Not multi-drive RAID. They can handle up to two drives dropping out and still fully recover all data in nearly every instance.

Simple mirroring is not the RAID level I would generally choose.

Reply to
DecadentLinuxUserNumeroUno

Except when they *are* spun down (power outage, hardware maintenance, etc.) -- and refuse to spin back *up* again!

There are other usage patterns that can be applied. E.g., I've been running a mirrored 1.5T archive on "consumer" quality drives (e.g., Costco) for 6 or 7 years without a problem. Of course, they are only powered *up* when I need something off of them. And, I don't need to spin up a drive AND its mirror to recover what I'm interested in!

[This is one of the problems with COTS RAID solutions: both drives see the same sorts of wear, environmental conditions, traffic, etc. So, in a RAID1 or RAID5 (and variants thereof) configuration, when the first failure is *detected* (keeping in mind that many RAID arrays don't implement scrubbing), the "backup(s)" are already in a similarly dubious state. Rebuilding the array (many hours) can cause the/a remaining "good" drive to crap out -- leaving you with nothing but a pile of dead drives. Exactly when you NEED the redundancy *most*!]
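
"Scrubbing" is nothing exotic, by the way -- just reading everything back periodically so latent errors surface while the redundancy still exists. A naive userland version for a two-disk mirror might look like the following (device names are placeholders; where an md/hardware array has its own scrub facility, that's the right tool):

A, B = "/dev/sdX", "/dev/sdY"     # the two mirror members (placeholders)
CHUNK = 4 * 1024 * 1024

def scrub(dev_a, dev_b):
    """Read both members end to end; an I/O error or mismatch here is
    exactly the latent problem you want to find NOW, not mid-rebuild."""
    mismatches = 0
    with open(dev_a, "rb", buffering=0) as fa, \
         open(dev_b, "rb", buffering=0) as fb:
        offset = 0
        while True:
            a, b = fa.read(CHUNK), fb.read(CHUNK)
            if not a and not b:
                break
            if a != b:
                print("mismatch near offset", offset)
                mismatches += 1
            offset += CHUNK
    return mismatches

print("mismatching chunks:", scrub(A, B))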

I wouldn't even consider leaving my 12 drive arrays spinning "all day" -- let alone "forever"! The noise, wasted power and excess heat thrown off are just prohibitive.

OTOH, my little DNS/TFTP/NTP/FTP/font/LPR/etc. box has been running for the better part of a year with a dinky USED 640G laptop drive.

Reply to
Don Y

You called it "volume mirroring". As I understand it, a volume is a physical drive, as opposed to a partition, which is a division on a physical drive allowing one to produce logical drive letters (Windoze) or mountable partitions (Unix, OS/X). In the distant past, I would put backup images on the same physical drive with lousy results. If there were errors on one part of the drive, there would also be errors on the backup partition, which will prevent recovery of the backup. Unfortunately, there are backup programs (i.e. Acronis True Image) that still do the same thing. It creates a "secure partition" on the main drive (or any other drive), where it saves backups. Bad idea.

I go a step further and make sure my backups are on a different machine (or NAS box). Ideally, in another room or building. Using rsync between duplicate servers also allows a ready-to-run backup server, which has sometimes come in handy when I need to work on the main server.
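
The rsync arrangement amounts to a nightly job along these lines (hosts and paths are placeholders, and -aH/--delete is just one reasonable set of options, not necessarily what I run):

import subprocess

SRC = "/srv/data/"                   # trailing slash: copy contents, not the dir
DST = "backup-host:/srv/data/"       # the standby server (placeholder)

# -a preserves permissions/times, -H keeps hard links, --delete keeps the
# standby from accumulating files the live server has since removed.
subprocess.run(["rsync", "-aH", "--delete", SRC, DST], check=True)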

Do we live on the same planet? I'm somewhat in the repair biz. I see plenty of machines, mostly failing. While I'm sure that there are users out there that never have a problem or a failure, I don't see them. Component failures, in particular anything that moves, are very common.

Not quite. In the RAID 0+1 systems I was previously building, that was not the case. There were 5 drives. Two that were striped (RAID 0) and two that were mirrored (RAID 1). The 5th drive stored the parity bit for all 4 drives, which was necessary to detect errors and to help recover from single-bit errors.

It could recover nicely from single drive failures of any of the 5 drives. It could also recover from two drives failing, as long as two drives remained working that contained both 4 bit stripes. However, if two drives failed on the same 4 bit stripe, it's game over.
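
The parity trick itself is just XOR -- a toy illustration of how a lost member gets rebuilt from whatever survived:

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four data drives
parity = xor_blocks(data)                      # the fifth (parity) drive

lost = 2                                       # pretend drive 2 died
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
assert xor_blocks(survivors) == data[lost]     # recovered bit-for-bit

Lose two members of the same stripe, though, and the XOR no longer carries enough information -- which is the "game over" case above.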

What would you choose? At this very moment, I should be working on a bid and proposal for a new server and backup system for a Vcc (very cheap company). For storage and backup, I'm thinking about a NAS box from Synology with WD Red drives (64MB cache) in RAID 1. Backup will be a duplicate Synology NAS box, located in an adjacent building with a dedicated fiber link. I'm stuck with Windoze for clients. Got a better topology or cheaper/better hardware?

--
Jeff Liebermann     jeffl@cruzio.com 
150 Felker St #D    http://www.LearnByDestroying.com 
Santa Cruz CA 95060 http://802.11junk.com 
Skype: JeffLiebermann     AE6KS    831-336-2558
Reply to
Jeff Liebermann
