[cross-post] NAND flash bad block management

Hi everyone,

We have ~128Mbit of configuration to be stored in a Flash device, and for reasons related to qualification (HiRel application) we are more inclined to use NAND technology rather than NOR. Unfortunately NAND flash suffers from bad blocks, which may also develop during the lifetime of the component and have to be handled.

I've read something about bad block management and it looks like there are two essential strategies to cope with the issue of bad blocks:

  1. skip block
  2. reserved block

The first one skips a block whenever it is bad and writes to the first free one, also updating the logical block addressing (LBA). The second strategy reserves a dedicated area to which bad blocks are remapped; in this case the LBA must be kept up to date as well.
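To make the two strategies concrete, here is a behavioral sketch (Python only to show the mapping logic; an FPGA would hold the same tables in block RAM; all block numbers, table layouts and function names below are made up for illustration):

```python
def skip_block_map(logical, bad_blocks, total_blocks):
    """Skip-block strategy: logical block N lands on the N-th *good*
    physical block, i.e. bad blocks are simply walked past."""
    good_seen = 0
    for phys in range(total_blocks):
        if phys in bad_blocks:
            continue                      # skip a known-bad block
        if good_seen == logical:
            return phys
        good_seen += 1
    raise ValueError("not enough good blocks")

def reserved_block_map(logical, remap_table):
    """Reserved-block strategy: logical N maps 1:1 to physical N,
    unless a remap entry redirects it into the reserved area."""
    return remap_table.get(logical, logical)
```

Note how in the skip-block case a single new bad block shifts the mapping of every logical block after it, while in the reserved-block case only the affected entry changes.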

I do not see much of a difference between the two strategies, except that in case 1 I need to 'search' for the first available free block, while in case 2 a special area is reserved for remapping. Am I missing any other major difference?

The second question I have is about 'management'. I do not have a software stack to perform the management of these bad blocks and I'm obliged to do it in my FPGA. Does anyone here see any potential risk in doing so? Would I be better off dedicating a small-footprint controller in the FPGA to handle the Flash Translation Layer with wear leveling and bad block management? Can anyone here point me to some readily available IP cores for doing this?

There's a high chance I will need to implement some sort of 'scrubbing' to avoid accumulation of errors. All these 'functions' to handle the Flash seem to me very suited for software but not for hardware. Does anyone here have a different opinion?

Any comment/suggestion/pointer/rant is appreciated.

Cheers,

Al

--
A: Because it messes up the order in which people normally read text. 
Q: Why is top-posting such a bad thing? 
Reply to
alb

On Mon, 12 Jan 2015 10:23:24 +0100, alb wrote:

The second strategy is required when the total logical storage capacity must be constant. I can imagine the existence of 'bad sectors' degrading performance on some filesystems.

Sounds like you're re-inventing eMMC.

Indeed regular reading (and IIRC also writing) can increase the longevity of the device. But it is up to you whether that is needed at all.

AFAIK, (e)MMC devices all have a small microcontroller inside.

--
(Remove the obvious prefix to reply privately.) 
Created with Opera's e-mail program: http://www.opera.com/mail/
Reply to
Boudewijn Dijkstra

Hi Boudewijn,

In comp.arch.embedded Boudewijn Dijkstra wrote: []

Ok, that's a valid point: since I declare as user space only the total minus the reserved area, the user may rely on that information.

But the total number of bad blocks for the quoted endurance will be exactly the same. Neither of the strategies mentioned wears the device less.

I didn't know there was a name for that. Well, if that's so, yes, but it's not for storing birthday pictures; it's for a space application.

Even though there are several 'experiments' running in low orbit with NAND flash components, I do not know of any operational satellite (meteorological or similar) that carries anything like this.

I'm not aiming to increase longevity. I'm aiming to guarantee that the system will cope with the expected bit flips and still meet mission objectives throughout the intended lifecycle (7.5 years on orbit).

Scrubbing is not so complicated: you read, correct and write back. But doing so when you hit a bad block during the rewrite, while you have tons of other things to do in the meanwhile, may have some side effects... to be evaluated and handled.
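A minimal sketch of that read-correct-writeback step, including the complication of the rewrite itself revealing a bad block (the flash API names here are assumptions, standing in for whatever the FPGA's flash controller exposes):

```python
def scrub_in_place(flash, block, get_spare):
    """Read a block (ECC-corrected), erase it, and write the clean data
    back. If the rewrite fails, retire the block and use a spare."""
    data = flash.read_block(block)           # ECC corrects bit flips here
    flash.erase_block(block)
    if flash.program_block(block, data):     # returns False on failure
        return block
    # Rewrite failed: the block went bad mid-scrub. Retire it and
    # move the (still buffered) data to a spare block instead.
    spare = get_spare()
    flash.program_block(spare, data)
    flash.mark_bad(block)
    return spare
```

One of the side effects to evaluate: erasing before the new copy is safely programmed leaves a window where a power loss destroys the data, so a real design would likely program the spare copy first and erase afterwards.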

It does not surprise me; I have the requirement not to include *any* software onboard! I may let an embedded microcontroller with a hardcoded list of instructions slip through, but I'm not so sure.

Al

Reply to
alb

Um, *reading* also causes fatigue in the array -- just not as quickly as *writing*/erase. In most implementations, this isn't a problem because you're reading the block *into* RAM and then accessing it from RAM. But, if you just keep reading blocks repeatedly, you'll discover your ECC becoming increasingly more active/aggressive in "fixing" the degrading NAND cells.

So, either KNOW that your access patterns (read and write) *won't* disturb the array. *Or*, actively manage it by "refreshing" content after "lots of" accesses (e.g., 100K-ish) PER PAGE/BANK.
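That "refresh after lots of accesses" policy can be sketched as a per-block read counter (a hypothetical sketch: the 100_000 threshold follows the "100K-ish" figure above, and the flash interface names are assumptions):

```python
READ_REFRESH_THRESHOLD = 100_000   # "100K-ish" reads per block/page

class ReadCounter:
    def __init__(self):
        self.reads = {}            # block -> reads since last refresh

    def on_read(self, flash, block):
        """Count a read; refresh (rewrite) the block at the threshold."""
        self.reads[block] = self.reads.get(block, 0) + 1
        if self.reads[block] >= READ_REFRESH_THRESHOLD:
            data = flash.read_block(block)     # ECC-corrected copy
            flash.erase_block(block)
            flash.program_block(block, data)   # rewrite clean data
            self.reads[block] = 0
            return True                        # block was refreshed
        return False
```

In an FPGA this would just be a small counter table in RAM, consulted on every read access.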

I can't see an *economical* way of doing this (in anything less than huge volumes) with dedicated hardware (e.g., FPGA).

Reply to
Don Y

Indeed; my apologies. Performing many reads before an erase will indeed cause bit errors that can be repaired by reprogramming. What I wanted to say, but misremembered, is that *not* reading over extended periods may also cause bit errors, due to charge leakage. This can also be repaired by reprogramming. (ref: Micron TN2917)

Space exploration is not economical (yet). ;)

Reply to
Boudewijn Dijkstra

Hi Don,

In comp.arch.embedded Don Y wrote: []

Reading does not cause *fatigue* in the sense that it does not wear the device. The effect is referred to as 'read disturb', which may cause errors in pages other than the one read. With multiple readings of the same page you may end up inducing so many errors that your ECC cannot cope when you try to access the *other* pages.

These sorts of problems, though, show up when we talk about a number of read cycles in the hundreds of thousands, if not millions (google: The Inconvenient Truths of NAND Flash Memory).

We have to cope with bit flips anyway (low Earth orbit), so we are obliged to scrub the memory. In order to avoid error accumulation, we move the entire block, update the LBA and erase the affected one, so it becomes available again.
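That move-remap-erase flow amounts to something like the following (a behavioral sketch of what the FPGA logic does, with a hypothetical flash interface and a simple free-block list standing in for the real bookkeeping):

```python
def scrub_relocate(flash, lba_table, logical, free_blocks):
    """Scrub one logical block: copy its ECC-corrected contents to a
    free physical block, update the LBA table, erase the old block."""
    old_phys = lba_table[logical]
    data = flash.read_block(old_phys)    # ECC corrects bit flips here
    new_phys = free_blocks.pop(0)        # pick a free physical block
    flash.program_block(new_phys, data)
    lba_table[logical] = new_phys        # remap before erasing
    flash.erase_block(old_phys)
    free_blocks.append(old_phys)         # old block is available again
    return new_phys
```

Because the new copy is programmed and remapped before the old block is erased, a power loss mid-scrub never leaves the logical block without a valid physical copy.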

Well, according to our latest estimates we are at about 30% of cell usage on an AX2000 (2 MGates), not yet including any scrubbing, but including the bad block management.

Al

Reply to
alb

Yes, it's amazing how many of the issues that were troublesome in OLD technologies have modern-day equivalents! E.g., "print through" for tape; write-restore-after-read for core; etc.

Wise ass! :>

Yes, I meant "economical" in terms of device complexity. The more complex the device required for a given functionality, the less reliable it is (in an environment where you don't get second chances).

Reply to
Don Y
