[cross-post] nand flash bad blocks management

alb · 2015-01-12T09:23:24+00:00

Hi everyone,

We have ~128 Mbit of configuration to be stored in a Flash device and, for reasons related to qualification (HiRel application), we are more inclined to use NAND technology instead of NOR. Unfortunately NAND flash suffers from bad blocks, which may also develop during the lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there are two essential strategies to cope with bad blocks:

1. skip block
2. reserved block

The first one skips a block whenever it is bad and writes to the first free one, also updating the logical block addressing (LBA). The second strategy reserves a dedicated area to remap the bad blocks. In this second case the LBA shall be kept updated as well.

I do not see much of a difference between the two strategies, except that in case 1 I need to 'search' for the first available free block, while in the second case I have reserved a special area for it. Am I missing any other major difference?

The second question I have is about 'management'. I do not have a software stack to perform the management of these bad blocks and I'm obliged to do it with my FPGA. Does anyone here see any potential risk in doing so? Would I be better off dedicating a small footprint controller in the FPGA to handle the Flash Translation Layer with wear leveling and bad block management? Can anyone here point me to some IP cores readily available for doing this?

There's a high chance I will need to implement some sort of 'scrubbing' to avoid accumulation of errors. All these 'functions' to handle the Flash seem to me very suited for software but not for hardware. Does anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers,

Al

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
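To make the two strategies in the question concrete, here is a minimal behavioral sketch of the resulting logical-to-physical maps. The function names, the 10-block device and the 2-block reserved tail are assumptions for illustration, not anything from a real controller.

```python
def skip_block_map(num_blocks, bad):
    """Skip-block: logical block i is the i-th good physical block."""
    return [p for p in range(num_blocks) if p not in bad]


def reserved_block_map(num_blocks, num_reserved, bad):
    """Reserved-block: logical i stays at physical i unless that block is
    bad, in which case it is redirected into the reserved tail area."""
    user = num_blocks - num_reserved
    spares = [p for p in range(user, num_blocks) if p not in bad]
    mapping = []
    for p in range(user):
        if p in bad:
            if not spares:
                raise RuntimeError("reserved area exhausted")
            mapping.append(spares.pop(0))
        else:
            mapping.append(p)
    return mapping


bad = {2, 5}                               # hypothetical bad blocks
skip = skip_block_map(10, bad)             # [0, 1, 3, 4, 6, 7, 8, 9]
resv = reserved_block_map(10, 2, bad)      # [0, 1, 8, 3, 4, 9, 6, 7]
```

In this toy model one difference shows up directly: with skip-block, a bad block shifts the physical location of every logical block after it, while with reserved-block only the bad block itself moves.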

Don Y 11 years ago

That assumes you have a reliable source of entropy that is not affected by loss of power. I'd be nervous about just how "random" any such source would actually be; if it isn't, then it introduces an irrecoverable bias to the algorithm. Unexpected "patterns" in supposedly random events have a nasty tendency to manifest. E.g., power application introduces a voltage spike to some external circuit that attempts to measure thermal noise on a diode junction, etc.

Any *digital* machine without state would obviously not be "random".

You have to deal with outages with *any* write-erase cycles.

Dimiter_Popoff 11 years ago

I think everyone so far has been dramatically overdoing the complexity. The simplest solution would likely be the most efficient; I think it has been used on HDDs for ages. Reserve some virgin space to relocate bad sectors (blocks, whatever). Then write to the non-reserved area with no further consideration; verify after each write. Once a write fails, relocate the block (obviously you have to keep a backup copy of the block prior to writing). This will take care of variations between blocks, writing frequency etc. Obviously you must ensure you have enough energy in the power capacitors to complete the write, verify and potentially the relocate & write operations, so the thing can survive a blackout at any moment. Then on a space mission you might want to "RAID" the data across, say, 3 flash chips.
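A rough sketch of the write / verify / relocate loop described here, against a toy device model. `FakeNand`, `write_with_relocation` and the spare list are hypothetical names; the caller's `payload` buffer plays the role of the backup copy.

```python
class FakeNand:
    """Toy block device; blocks listed in `failing` corrupt every write."""

    def __init__(self, num_blocks, failing=()):
        self.data = [None] * num_blocks
        self.failing = set(failing)

    def write(self, block, payload):
        self.data[block] = b"\x00" if block in self.failing else payload

    def read(self, block):
        return self.data[block]


def write_with_relocation(nand, remap, spares, logical, payload):
    """Write, read back, and move to a spare block if verification fails."""
    physical = remap.get(logical, logical)
    while True:
        nand.write(physical, payload)
        if nand.read(physical) == payload:   # verify after each write
            remap[logical] = physical
            return physical
        if not spares:
            raise RuntimeError("no spare blocks left")
        physical = spares.pop(0)             # relocate to reserved space


nand = FakeNand(8, failing={1, 6})
remap, spares = {}, [6, 7]                   # blocks 6 and 7 are reserved
landed = write_with_relocation(nand, remap, spares, 1, b"cfg")
```

Here block 1 fails, the first spare (6) also fails, and the data ends up in block 7. The sketch ignores power loss entirely; the remap table would have to be stored durably for this to survive a blackout.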

Dimiter

------------------------------------------------------ Dimiter Popoff, TGI

------------------------------------------------------

alb 11 years ago

Hi Don,

Don Y wrote: []

You have a point. Indeed a block that shows fewer ECC errors will see fewer recycling events, hence fewer write-erase cycles.

I simply rephrased your statement 'Over time, their [preferred cells] performance will degrade'.

OTOH you're implying that write-erase cycle limits are not *equal* for all cells, i.e. more robust cells will likely withstand more than the quoted (conservative) limit.

If this is the case (and I'm not sure whether data exist to back the reasoning), then your approach will level performance instead of write-erase cycles and make the storage last longer.

[]

If the only aim is to preserve write-erase cycles then I wouldn't care; the aim would be to guarantee an equal number of cycles. But if the aim is different (as it sounds like) then it becomes more important to rank the usage with some metric and level performance instead of write-erase cycles.

In the end you'll have an uneven distribution of write-erase cycles, but an equally distributed 'probability' to show errors.
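The two goals can be contrasted with a small sketch of the block-selection step. The field names (`erases`, `ecc_events`) and the spec limit of 1000 cycles are made-up values for illustration; the second policy still refuses blocks that have reached the quoted limit.

```python
def pick_by_erase_count(free_blocks):
    """Classic wear leveling: take the least-erased free block."""
    return min(free_blocks, key=lambda b: b["erases"])


def pick_by_error_rate(free_blocks, max_erases):
    """'Performance' leveling: take the block with the fewest observed
    ECC events, among those still below the quoted erase limit."""
    eligible = [b for b in free_blocks if b["erases"] < max_erases]
    return min(eligible, key=lambda b: b["ecc_events"])


free = [
    {"id": 0, "erases": 900, "ecc_events": 1},
    {"id": 1, "erases": 100, "ecc_events": 7},
    {"id": 2, "erases": 1000, "ecc_events": 0},
]
by_count = pick_by_erase_count(free)["id"]        # block 1
by_errors = pick_by_error_rate(free, 1000)["id"]  # block 0 (2 is at the limit)
```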

An ECC warning, as you referred to it, should be acted upon to avoid ending up unable to recover data integrity. Now the only choice left is which block I should use first.

If the faulty block is nevertheless my top one, I may consider rewriting in place, otherwise I'll move the data to yet another place. Rewriting in place is not so trivial since you would need a block-sized buffer at your disposal, rather than the page-sized one you would need for a move to another location.

If the algorithm needs to be twisted because of other constraints, the end result might be biased as well.

[]

Exactly. Whether this is a preferable approach I have a hard time judging.

What I'm sure about is that in HiRel applications you'll have a hard time convincing the customer to accept an algorithm that does not guarantee staying within spec (max number of write-erase cycles), even if the aim is to get the most out of the memory.

It would be like trying to show that your power MOSFET is used beyond max. ratings but 'believe me, this is where I get the most out of it'. A respectable QA would never let that slip out of the house, and if it did, a respectable customer would never let it slip in.

[]

And within your selection essentially lies your goal. Leveling 'performance' might be another strategy which might be the winning one in some applications, but I don't see how that can be leveraged in a field where it will never be accepted to 'go beyond' the quoted limits.

I realize now (sorry for having been so hard headed!) that essentially your thinking might have been driven by another goal.

This goal is certainly not wrong, but if you add that you shall keep the number of write-erase cycles within spec (for each block), your choice does not really matter, since a better performing block will last longer (number of reads before a recycle event) independently of the strategy used to pick it.

A block that is less likely to show errors than others *will* be used more and eventually degrade. Picking the best performing block before the others wouldn't change its performance, so why bother with the ranking (considering that I cannot go beyond the quoted write-erase cycles)?

This is typically handled by the FTL in my filesystem of choice. Pick the right one and you may even be free to modify the wear-leveling algorithm! There's a myriad of flash file systems popping up, in all sorts of flavors. If you are really interested just dig deep enough.

If you are not interested and want to live in a -not so well- fenced world, just select any closed system and pray someone has taken your worries into account. The choice, as usual, is in your hand ;-).

[]

I'm not sure I understand your point here. Why would I ever want to rewrite a block if it is not giving ECC errors?

Because if the goal is to not go beyond the max write-erase count on each block it does not really matter how you choose them.

I think I'll try (in my spare time) to see what would be the footprint of my FTL implemented in software (LOC and memory) in order to have a cost/benefit ratio between the two approaches.

Performance as well should be taken into account (throughput, latency, ...).

That's an interesting approach: altering *wear* would exaggerate the 'not all cells are made equal' effect and magnify the impact of my algorithm. But I believe I would need to repeat the exercise with the tweaked algorithm on a different device with the same starting conditions of induced *wear*, otherwise results may easily be misinterpreted.

[]

Ensuring correctness of the algorithm is a separate subject, I agree. Still you can imagine that while it is *doable* to implement in both types of implementation, the efforts (development time, verification time) and the costs associated to hardware/software/tools/equipment may lead to preferring one vs. the other.

Al

alb 11 years ago

Hi Don,

Don Y wrote: []

Search for 'random number generator fpga' and you'll find plenty of hits! I think you may achieve a flat distribution without much hassle.

We will most probably need to implement it anyway for our dither generator, so the same function can serve multiple uses.

Are you implying a journaled file system? I certainly don't want to go that far. Metadata like the bad block list (BBL) and the logical block address (LBA) map shall be confined to one location only and preferably be 'atomic'. If you find a new bad block and copy the data, nothing will happen if a power cut occurs, since the original data is still accessible. Certainly if the power cut happens while you are updating the BBL and LBA there's not much that can help you, unless you guarantee enough energy storage in bypass caps to hold on until the operation is complete.

Losing the BBL is not a major issue since you can rebuild it from the device (at the expense of rescanning the whole memory). Losing the LBA map is more of an issue since your data are scrambled and may not be simple to recover (but again you may conceive a mechanism to recover this).
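Rebuilding the BBL by rescanning could look roughly like the sketch below. It assumes the common NAND convention that a bad block carries a non-0xFF marker byte in the spare area of one of its first pages; the exact marker location is device specific and has to come from the datasheet. `read_spare_byte` is a hypothetical callback standing in for the real device access, and blocks that went bad at run time only show up if the controller wrote the marker itself when retiring them.

```python
def rebuild_bad_block_list(read_spare_byte, num_blocks, marker_pages=(0, 1)):
    """Return the blocks whose bad-block marker byte is not 0xFF."""
    return [
        block
        for block in range(num_blocks)
        if any(read_spare_byte(block, page) != 0xFF for page in marker_pages)
    ]


# Toy device: blocks 3 and 7 carry a marker, everything else reads 0xFF.
spare = {(3, 0): 0x00, (7, 1): 0x00}
bbl = rebuild_bad_block_list(lambda b, p: spare.get((b, p), 0xFF), 16)
```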

Certainly these functionalities are more appropriate to software rather than hardware.

Al

alb 11 years ago

Hi Dimiter,

Dimiter_Popoff wrote: []

Ideally any solution which is not the simplest to reach the goal is either too complex or too simplistic. Still I'd be more comfortable with something that is overdoing a bit than with something that would not meet the requirement.

This is the first flaw. You cannot anticipate how many bad sectors you will have. Even in worst-case scenarios, manufacturers guarantee only a minimum number of valid blocks over the endurance life of the memory, and endurance is a per-block parameter. So if you're allowed 100K write-erase cycles and you have 4K blocks, with a wear leveling algorithm you may extend the number of cycles to nearly 4K * 100K. At this point the total number of bad blocks quoted by the manufacturer is meaningless.
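Spelled out, the arithmetic behind that figure is just the per-block endurance multiplied by the block count, which is an upper bound that assumes perfectly even wear (taking 4K as 4096 here):

```python
blocks = 4 * 1024                  # 4K blocks
cycles_per_block = 100_000         # quoted write-erase endurance per block
total_block_erases = blocks * cycles_per_block
```

That gives 409,600,000 block erases for the whole device, versus 100,000 if a single block took all the traffic.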

But even if you could estimate the total number of bad blocks you are going to have, putting a cap on it would be rather pessimistic. Once on orbit, the mission may last longer than anticipated, and your cap would limit its functionality once the reserved area is used up.

So if you have a bad block in one of those reserved ones you still need to move it to yet another one. Would the replacement for it be in the same reserved area? What will happen when the reserved area is worn out?

As stated already (IIRC in the OP) the data are mainly configuration, meaning you are reading continuously from these blocks with very few modifications. If you move every faulty block to the reserved area you'll soon be confined to that reserved area and can never get out.

[]

Power management is something to take into account anyhow.

In the past we've triplicated the data stored in the same chip/component. Since metadata are stored locally in RAM (protected by scrubbing) there's no need for triplicating the hardware.
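Reading back triplicated data usually comes down to a bitwise majority vote over the three copies. A minimal sketch, assuming three equally long copies are available as byte strings (the function name is made up):

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 vote over three equally long byte strings."""
    return bytes((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))


good = b"\x12\x34"
copy1 = good
copy2 = b"\x13\x34"      # one flipped bit in the first byte
copy3 = b"\x12\x74"      # one flipped bit in the second byte
voted = majority(copy1, copy2, copy3)
```

The vote recovers the original as long as no bit position is corrupted in more than one copy.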

The system is redundant on its own, so a hard failure on the component will lead to switching to the redundant one.

Al

Don Y 11 years ago

Those are *pseudo* random. How do you "seed" it when power reappears? I.e., it will start up in the same place that it *started* up last time power was cycled -- and the time before that, etc. -- because it doesn't remember where it *was* (in its pseudo-random sequence) when it was powered *off*.

Even if you clock it at some relatively high rate (wrt "replacement block selection rate"), you still need some external source of entropy to ensure your *digital* system doesn't simply pick the same iteration of the RNG each time it gets up and running.

I.e., here's a pseudo-random (over *some* interval) string of digits:

3593450671875346641049135080357695788035424...

But, if the next time I apply power causes the same series to be regenerated (because there is no source of entropy in the system), then the first block chosen for replacement will always be the same -- '3'.

[If you clock the RNG at a high frequency, you'll move the selection of *which* digit you will examine -- but, will always move it by the same amount relative to power-up. Unless there is some external event that is asynchronous wrt your system's operation on which you can rely to introduce some sense of "randomness" to the selection of this first digit]
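The point can be illustrated with a behavioral model of a 16-bit LFSR (taps 16, 14, 13, 11, a commonly used maximal-length configuration). The reset seed, block count and warm-up length are arbitrary assumptions. With a fixed reset value and no external entropy, every power-up selects the same first block; carrying the state across power cycles changes that, but the state then becomes one more thing that has to be stored durably.

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def first_pick(seed, num_blocks, warmup):
    """Block chosen after a fixed number of clocks from power-up."""
    state = seed
    for _ in range(warmup):
        state = lfsr16(state)
    return state % num_blocks


RESET_SEED = 0xACE1                      # hypothetical power-on value
boots = [first_pick(RESET_SEED, 4096, warmup=1000) for _ in range(5)]

# Alternative: persist the LFSR state across power cycles.
saved = RESET_SEED
states = []
for _ in range(5):                       # five simulated power cycles
    saved = lfsr16(saved)                # would be written back to NVM
    states.append(saved)
```

All five entries of `boots` are identical, while the five persisted `states` are all different.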

If accesses are NOT "atomic", then how will you verify the integrity of that data? How will you *know* which updates (of that data) were "in progress" when the power died? Everything that you expect to be "persistent" has to be preserved in some form or another.

I.e., if you have just decided that block #N must be recycled and are in the process of erasing it, you must have made a durable reference to this fact *somewhere* so that if power goes away before you have completed the erasure, you will know which block (N) was in the process of being erased (and can restart the erase operation). Similarly, when you are *done* erasing it, you need to make a durable note of this fact so that you don't re-erase it, unnecessarily.

The same applies to write cycles. Otherwise, an interrupted write can result in you repeating the write when power is restored -- only to encounter a write failure (because you had previously *almost* completed the write and are now restarting it -- effectively lengthening the write timer).

Likewise, if you are tracking block read counts (for read fatigue), each of those data must be persistent.

But you've now got a write that is unaccounted for! I.e., that block is now no longer "erased and ready to be rewritten" nor "written and ready to be used". You have to remember that you were in the process of writing it and something has prevented that write from completing. The only safe recourse is to assume it was completed for the sake of tracking "program cycles"; yet assume it was NOT completed for the sake of the validity of the data that it contains -- thus, re-erase it (and count that erasure)
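A sketch of the "durable note" idea for erases, with the journal modeled as a plain list standing in for whatever non-volatile record is available. All names are hypothetical, and the interrupted attempt is conservatively counted as a full cycle, in line with the reasoning above.

```python
class Block:
    def __init__(self):
        self.erase_count = 0
        self.state = "written"


def erase_with_journal(blocks, journal, n, power_fails=False):
    journal.append(("erase-begin", n))   # durable note BEFORE touching the block
    if power_fails:
        blocks[n].state = "unknown"      # power lost mid-erase
        return
    blocks[n].state = "erased"
    blocks[n].erase_count += 1
    journal.append(("erase-done", n))    # durable note AFTER completion


def recover(blocks, journal):
    """On power-up, redo every erase that has a begin but no done record."""
    pending = set()
    for op, n in journal:
        if op == "erase-begin":
            pending.add(n)
        else:
            pending.discard(n)
    for n in sorted(pending):
        blocks[n].erase_count += 1       # count the interrupted attempt
        erase_with_journal(blocks, journal, n)
    return sorted(pending)


blocks = [Block() for _ in range(4)]
journal = []
erase_with_journal(blocks, journal, 2, power_fails=True)
redone = recover(blocks, journal)        # block 2 is erased again
again = recover(blocks, journal)         # nothing left to redo
```

The same begin/done pairing would apply to page programs and to any counters that are meant to be persistent.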

Exactly. This is true of *anything* in the array that must be persistent.

How do you know (for sure) which blocks are corrupt, retired, etc.?

I think the only sort of algorithm you can hope to implement (in hardware) has to rely on very limited information about the state of the memory array. I.e., a FIFO structure for recycling blocks, an "open loop" criterion for when those blocks get recycled, etc.
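As an illustration of how little state a FIFO recycler needs, here is a toy model with four blocks and one block in use at a time. The numbers and names are assumptions; in hardware this would be little more than a head and a tail pointer.

```python
from collections import deque

free_blocks = deque([0, 1, 2, 3])    # erased blocks, oldest first
order = []                           # sequence of blocks handed out
in_use = None
for _ in range(6):                   # six successive rewrites
    new = free_blocks.popleft()      # always take the oldest free block
    if in_use is not None:
        free_blocks.append(in_use)   # retired block goes to the back
    in_use = new
    order.append(new)
```

The allocation order is 0, 1, 2, 3, 0, 1: wear is spread round-robin without any per-block counters, but also without any feedback about how individual blocks are actually behaving.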

You'd be well served to research the current literature in this regard. Unlike magnetic media whose wear/failure modes are reasonably well understood, there seems to be a lot of hit-or-miss activities when it comes to effective management of flash.

If you can "guarantee", a priori, that you need a fixed life from the design *and* the components will behave in a guaranteed manner (in those conditions), then you can probably come up with a solution that "fits" and forget about it.

The research I've been doing has been geared towards high usage in a consumer grade device; as cheap as possible and as *durable* as possible -- yet trying not to impose too many constraints on how it is *used* (to give the other aspects of the system as much flexibility as possible in their design)

Don Y 11 years ago

Exactly. What you're interested in is preserving data. If the data exhibits no errors, why "do anything" *to* it? And, if you have one portion that exhibits a higher error rate, then *it* (all else being equal) drives the performance of your store. If you can effectively *improve* its performance at the expense of portions that are currently performing *better*...

It's not as simple as this argument has been, to date. I've deliberately avoided other "real" issues that add even more layers of complexity to the analysis, design, argument, etc.

E.g., imagine the existing data have settled into nice, "very reliable" blocks (i.e., no ECC events). But, you need to do a write (because some data has changed). And, by chance, your spare block list is exhausted! Which block do you choose -- from among those *reliable* blocks?

There are two ways to look at this:

- if the memory is performing correctly, why do you care about the number of write cycles that it has incurred? Do you deliberately end the product's life when the number of write cycles has been attained -- even if it *seems* to be working properly?? "Hello, Mars Rover... turn yourself OFF, now; your design life has been attained!"

- qualify the device so you *know* what sorts of ACTUAL performance you can get from it and design your algorithm(s) around that data.

I routinely encounter consumer kit that has been designed with *way* too little margin (9VDC on 10V caps, etc.). For the consumer, this sucks because the devices invariably have short lives. OTOH, for the manufacturer, it's "getting the most out of the least" (cost). They've *obviously* done the math and realized that these design choices will allow the device to survive the warranty interval (so, *they* never see the cost of repair or replacement).

There is also nothing that prevents you from addressing multiple criteria simultaneously. *Except* the complexity of such a solution! (i.e., again returning to the software vs hardware approach and consequences/opportunities of each)

Even if your circuit had provisions to automatically replace the MOSFET when/if it failed?

Again, do you automatically decommission the product when it has achieved its *specified* life expectancy? I.e., under nominal loads, each MOSFET will last XXX POHr's. So, *at* XXX hours, we will replace the MOSFET, discarding it as "used" -- even if it has usable life remaining. When we replace the N-th of these devices (after N * XXX hours), we will have run out of MOSFETs and, by design, have to decommission the device.

Said another way: "We'll select a bigger MOSFET to handle the increased load and determine the PDF over time that it will survive." What if it *doesn't*?

I'm not claiming this is the right solution for you. Rather, I am indicating how the strategy for managing the flash array can easily become complex. Citing a portion of the conversation several posts back:

followed, in a subsequent post, by:

I.e., your choice to implement this *in* an FPGA places some (realistic) constraints on what you can effectively do in your algorithm.

Figure out what your driving priorities are as well as how much leeway you have in achieving them. Then, think about the consequences of your implementation if something *isn't* what you expected it to be (i.e., if a device does NOT meet its stated performance).

E.g., in my application, I want the array to *look* like an ideal disk. I don't want the application to have to be aware of the implementation. Yet, I want the implementation to accommodate the needs of the application without undue wear on the array. So, the application shouldn't have to consider whether it has recently written "this value" to "this sector" -- or not. Yet, the array shouldn't naively do what it is told and, in this case, incur unnecessary wear. All while trying to stay "inexpensive" (i.e., replacing the device because the flash wore out would very significantly increase the TCO)
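The "don't naively do what you are told" behavior can be sketched as a compare-before-write wrapper. The dictionary store and the counters are stand-ins for illustration; on a real array the comparison costs a read, which is cheap in wear terms compared with a program/erase cycle.

```python
def write_if_changed(store, stats, sector, data):
    """Program the sector only if its contents would actually change."""
    if store.get(sector) == data:
        stats["skipped"] += 1
        return False
    store[sector] = data
    stats["programmed"] += 1
    return True


store = {}
stats = {"programmed": 0, "skipped": 0}
write_if_changed(store, stats, 5, b"value")   # first write is programmed
write_if_changed(store, stats, 5, b"value")   # identical rewrite is skipped
write_if_changed(store, stats, 5, b"other")   # changed data is programmed
```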

[Sorry, had to elide the rest as I'm late for a meeting this morning...]
