/thanks/ to buggy control software and a possibly wrong approach to error handling, we have wiped out the factory-marked bad-block information on a NAND flash device (Micron).
Now, according to the manufacturer (AN2917), if those blocks are used, they "may appear to operate normally but may cause other good blocks to fail or create additional system errors."
Since the above sentence amounts to little more than "hey, we cannot guarantee a damned thing, so don't blame us if you screw up!", are there any techniques that I can use to recover?
I believe that on a critical system I would not take any risk and replace the chip, but should I care otherwise?
Why is a factory-marked bad block any different from a block that goes bad after shipping? After all, if I keep using a bad block I will sooner or later realize it is bad and eventually mark it, so it's not a big deal.
Any pointer/suggestion is appreciated.
At least theoretically, some blocks could be bad because of a group error - say a decoding problem in addressing the array. So the blocks might be good, you just aren't actually consistently writing to the blocks you think you are. While I know that's been done with DRAMs, I have no idea if it applies to flash devices.
The key point here is 'consistently'. I haven't found lots of literature/references on this subject, but as I understand it the addresses for a specific 'list' of blocks may not work properly (as if there were some sort of propagation delay error in the address decoder).
But I don't quite follow, because if that were the case then the manufacturer could not have marked the block as bad either (unless the address-decoding mechanism for the spare area is /different/ from that of the block itself), being incapable of ensuring that the address was decoded correctly when the bad mark was written.
Moreover, if addressing is not working consistently, how would I recover the bad-block markers? The data retrieved may belong to a different block address and be completely wrong.
Sure! *You* can undertake to qualify the device yourself! (good luck with that! ;-) Solution: discard the trashed device and use this as a learning experience to ensure you NEVER do it again!
Do you care about the reliability of the component as it pertains to your overall device reliability? If the answer is "no"... (then why are you even USING the device if it "doesn't have to work"?)
Think about how you *use* a block: i.e., you program, read and erase it (in various combinations). Associated with *each* action are "disturb events" in which OTHER blocks are compromised by your actions on *this* block. The manufacturer has effectively said: "Use of these blocks leads to stronger than expected disturb events". It's just easier (safer) for the manufacturer to overprovision the device and mark those blocks as "avoid these if you want to trust the component at the level stated in our specifications".
Unfortunately, the analogy of "bad blocks" on a disk doesn't hold up. There, you *will* eventually discover that a portion of the medium is defective and GROW the defect list to, effectively, recreate the PERMANENT defect list (if you managed to wipe it out).
Research "disturb events" so you see how actions on one block can alter the contents of another. Then, imagine the one block is "leakier than most" and consider how *using* it will affect that "other(s)".
This is exactly what I proposed. But this is flight hardware and 'discarding' the device is a ton of money!
reliability issues are not simply a matter of having a 'working device'. Failure-rate calculation for a flash component is a specialization of its own.
I'm not sure how QA departments rule on this issue, but it all boils down to the acceptance level required for that specific mission.
OTOH you may accept the risk of having a *potentially* failed component if, for example, you do not have the time to change the component.
After all, flying to a distant planet is not something you can reschedule outside the target launch window. And even flying to LEO can easily slip six months, causing huge losses.
Yes, indeed using those blocks will invalidate all failure-rate analysis. We may end up no longer meeting the spec. It can happen that a device fails to meet its own spec, but here it is a bit worse than that: we would knowingly be using the device in a condition beyond its 'abs max ratings'.
thanks for that. I was already aware of 'disturb' errors, but AFAIK those are all correctable, i.e. you fix the leakage by 'scrubbing' the entire device at a sufficiently high rate. When you scrub, you fix the errors that have leaked in from the 'once were bad' blocks.
This effect has to be taken into account for the specific use case. We store data for a relatively short time (a few hours) before it is refreshed/overwritten. Imagine a big FIFO compensating between a continuous data stream in and an intermittent data stream out. When writing/reading/erasing a 'once was bad' block, it may leak into (hence disturb) other blocks *faster* than anticipated; indeed, to the point that whenever we use those blocks we corrupt the whole memory content.
At a certain point I'd say: big deal! It just means I need to intentionally add wear in order to regrow the bad block list. Unfortunately nobody can tell me when I'm done... damn it!
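The scrubbing idea mentioned above can be sketched as a periodic pass that rereads every page and refreshes any page whose ECC reported (correctable) bit flips, before disturb-induced leakage grows past what the ECC can fix. A minimal sketch follows; all names here (`read_page_with_ecc`, `relocate_page`) are hypothetical, and a real implementation would live in the flash translation layer:

```python
# Hypothetical scrub pass: refresh pages whose ECC had to correct bits.
def scrub_pass(read_page_with_ecc, relocate_page, num_pages):
    """read_page_with_ecc(page) -> (data, corrected_bits);
    relocate_page(page, data) rewrites the data to a fresh location.
    Returns the number of pages refreshed in this pass."""
    refreshed = 0
    for page in range(num_pages):
        data, corrected_bits = read_page_with_ecc(page)
        if corrected_bits > 0:          # leakage detected, still correctable
            relocate_page(page, data)   # refresh before it becomes fatal
            refreshed += 1
    return refreshed

# Toy usage with faked reads: page 1 had 2 corrected bits, page 2 had 1.
moved = []
reads = {0: (b"a", 0), 1: (b"b", 2), 2: (b"c", 1)}
print(scrub_pass(lambda p: reads[p], lambda p, d: moved.append(p), 3))
```

The point of the thread, of course, is that the rate at which you run such a pass is calibrated against in-spec disturb behavior, which is exactly what using the formerly-marked blocks invalidates.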
I have lots of experience with raw Micron NAND. Though I don't know what level you are working at, you can regrow your bad block list by erasing the entire NAND and reading the first spare-area byte of the first page of each block. Micron's bad blocks are permanently tagged and will continue reading as a bad block after an erase. Also, if you attempt to program a bad block, it should come back as a program failure.
If you are operating at a higher level than this, say above the interface (SATA, SAS, etc.), then a good controller's firmware should rebuild it upon secure erase.
I should append that I don't actually know what Micron's BB process is. I don't know if they recalculate the bad blocks upon erase, or if they do it when binning at production. From my observations, we have only seen the same bad blocks come up again during factory bad-block-list generation following a full NAND erase. Grown bad blocks have to be logged separately using ECC information at runtime.
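The scan described above (erase everything, then read the first spare-area byte of the first page of each block) could be sketched like this. Since real NAND access obviously can't run here, the device is simulated; the method names are hypothetical, and a real driver would issue the READ/ERASE commands per the part's datasheet:

```python
# Sketch of regrowing the bad block list, against a *simulated* NAND.
# Assumption (from the post above): factory-marked bad blocks carry a
# non-0xFF marker in the first spare byte, and the marker survives erase.
GOOD_MARKER = 0xFF  # good/erased blocks read 0xFF in the spare byte

class SimulatedNand:
    """Toy model: only the per-block factory marker is simulated."""
    def __init__(self, num_blocks, factory_bad):
        self.num_blocks = num_blocks
        self._factory_bad = set(factory_bad)

    def erase_block(self, block):
        # Factory markers are permanent: erase does not clear them.
        pass

    def read_spare_byte(self, block, page=0, offset=0):
        return 0x00 if block in self._factory_bad else GOOD_MARKER

def regrow_bad_block_list(nand):
    """Erase every block, reread the first spare byte of its first
    page, and collect the blocks still not reading 0xFF."""
    bad = []
    for blk in range(nand.num_blocks):
        nand.erase_block(blk)
        if nand.read_spare_byte(blk) != GOOD_MARKER:
            bad.append(blk)
    return bad

nand = SimulatedNand(num_blocks=1024, factory_bad={7, 130, 900})
print(regrow_bad_block_list(nand))  # → [7, 130, 900]
```

Whether this works on the real part depends entirely on Micron's markers actually being permanent, which (as noted below) the poster could not confirm.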
Too late to worry about that. That train has left the station.
Things being as they are right now, all previous failure rate calculations are invalidated, and attempting new ones would be futile. The only honest answer to the questions: "How do you model the expected failure rate of this element, and what's the model's result?" would currently be "We can't", and "By default, unacceptably high", in that order.
Right now you can match _no_ requirement worth mentioning.
Forget about the "may" in that statement. That's a certainty. If you can still meet it, it's not worthy of being called a specification.
*Just* the trashed flash? If it's TRULY "a ton of money", talk to the manufacturer and see if they can requalify it for you (for something *less* than "a ton")
Sure. My point was that you *do* care. Thus, want a "reliable" number.
Or, if a replacement simply doesn't exist -- or, is too costly to install. (I've worked on numerous "one off" systems where the cost of replacing the *one* system was exceedingly high)
Likewise, a *failed* mission has direct costs -- as well as indirect (loss of prestige, opportunity, etc.)
If it was *just* the block that was unreliable, then you can quickly return to the point at which those blocks are shuffled out of service (re: my previous discussion on this issue) effectively leaving you with the "good" blocks that you *should* have started with.
The problem is that bad blocks can have consequences that affect other data in the array -- in an unpredictable (without detailed knowledge of the implementation and mask) way.
To use the disk analogy:
If a particular block is truly "bad" (anomalies in the oxide layer in that physical portion of the medium), then you can learn to avoid using it to store data.
OTOH, if using that disk block (which, by itself, *might* be able to retain data perfectly!) causes some *other* block on the medium to be corrupted (or, maybe just *compromised*/"disturbed"), then how will you *know* that this has happened? Examine the *entire* medium to see if any data has changed? What if the magnetic domain hasn't been altered enough for it to be seen as having "flipped"? (i.e., for the flash, what if the charge level has changed -- been compromised -- but not enough for it to appear as a "flipped bit" that your ECC could "notice"). How do you know *which* block operation to associate with each *future* data anomaly? (i.e., when the data in that "other" block degrades to a point of being noticeable)
But, you have set that "scrub rate" based on the metrics related to a set of IN SPEC flash blocks! I.e., you assume the effect of the disturb events can be characterized for a KNOWN GOOD device (or, for a PORTION of a device that the manufacturer has told you is "well behaved" -- meets spec).
Now, suddenly, you are using parts of the device that are NOT well behaved! How do you adjust your scrub rate? Perhaps the particular failure causes *multiple* bits to be disturbed in a single block (something that the manufacturer would know would render the device unusable as the ECC would quickly become ineffective).
I.e., you are now using the device in a manner for which the manufacturer has not provided qualification data. E.g., running TTL off of 8V (I'm sure you can find *some* that won't breakdown at that level -- esp with care on output loading, etc.)
But, you don't know *how* much faster -- or even if the nature of the "leak" is the same as "normal". What happens if, at some particular voltage/temperature/rate, those accesses have catastrophic consequences? I.e., wipe out big swaths of data in unpredictable ways?
[I have no idea how the ACTUAL failures would manifest. But, NEITHER DO YOU! The manufacturer is only telling you how the device will behave *if* you use it in the manner that they have prescribed!]
If it's a "ton of money", then you should be involving the folks that know FOR SURE. Not *us*! :>