Atmel SAM9 "boot from NAND" is a myth?

Yup, I found those, but was unable to figure out how to download a datasheet. So far, it appears to me that older, lower-density parts are much more likely to guarantee block 0 w/o requiring ECC than the newer, higher-density parts.

--
Grant Edwards               grant.b.edwards        Yow! Hello.  Just walk
                                  at               along and try NOT to think
                              gmail.com            about your INTESTINES being
                                                   almost FORTY YARDS LONG!!
Reply to
Grant Edwards

Since that field is mandatory, this line reads "at least one" to me.

Doesn't that mean that the programming device that is writing the boot sector has to check for errors and, if it finds any, reject the device?

--
Wil
Reply to
Wil Taphoorn

You mean that block 0 is guaranteed good _if_ the customer throws out any devices they find with a bad block 0?

Or, to phrase it differently: "Block 0 is guaranteed to be valid in all devices that have a valid block 0."

That's a statement so meaningless that even George Bush would be proud of it. ;)

--
Grant Edwards               grant.b.edwards        Yow! I smell like a wet
                                  at               reducing clinic on Columbus
                              gmail.com            Day!
Reply to
Grant Edwards

No, I expect that this block can, at least once, be written without any bit errors (i.e. it is able to boot without ECC considerations). What I meant is that it is up to the design to accept the risks of reprogramming this boot sector.

--
Wil
Reply to
Wil Taphoorn

OK, I've gotten more clarification. The 1st block is guaranteed to be a valid block up to 1K cycles with ECC (1 bit/528 bytes).

--
Grant Edwards               grant.b.edwards        Yow! Psychoanalysis??
                                  at               I thought this was a nude
                              gmail.com            rap session!!!
Reply to
Grant Edwards

It doesn't say that anywhere in the spec.

What it says is this:

"The blocks are guaranteed to be valid for the endurance specified for this area (see section 5.6.1.23) when the host follows the specified number of bits to correct."

Note the last phrase:

"when the host follows the specified number of bits to correct"

The blocks are only guaranteed valid _if_ you do ECC to correct the specified number of bit-errors.

OK, I understand what you mean. But that's not what the OneNAND spec says, and the datasheets for many vendors' parts specifically state that you must do ECC if you expect block 0 to be valid.

--
Grant Edwards               grant.b.edwards        Yow! Kids, don't gross me
                                  at               off ... "Adventures with
                              gmail.com            MENTAL HYGIENE" can be
                                                   carried too FAR!
Reply to
Grant Edwards

True, "for the endurance specified", AKA "a number of times programmed".

But that doesn't mean you can't program it the first time. That is what I meant by "expect": I would not accept a device that flips a bit the very first time it is programmed.

--
Wil
Reply to
Wil Taphoorn

That's immaterial. What's important is that it doesn't mean that you _can_ program it the first time (without ECC).

I know what you meant by "expect", but I doubt that what you expect determines what a fab ships.

A bit can fail the first time you program block 0, and it will still meet the spec. That's what matters.

You can expect all sorts of things, but if a feature isn't in the part's specification, then it's foolish to design a product that depends on that feature.

The last batch of NAND chips I played with had 0 bad blocks.

I can "expect" 0 bad blocks all I want, but that's not going to stop the vendor from shipping parts with up to 20 bad blocks out of 1024 next week. A design that relies on NAND parts having 0 bad blocks is a bad idea no matter how hard I expect 100% good blocks.

--
Grant
Reply to
Grant Edwards

IMHO it is actually worse.

The way many NAND datasheets are written, they allow for more than just 1 or 4 or 8 bad bits in a block. A certain number of blocks could go away COMPLETELY, and the part would still be in-spec.

People commonly expect bad blocks to show up as merely having more bit errors than their ECC copes with. However, nowhere in the datasheets is that guaranteed.

For all I know, blocks could just as well become all 1. Or all 0. Or return a read timeout. Or worse, they could become "unerasable" - stuck at your own previous content (with your headers, and valid ECC!).

Now I want to see how your FTL copes with that!

Reply to
Marc Jet

I interpret that to mean that the boot sector can consist of X perfectly reliable bits and Y unreliable bits (e.g. permanently zero). The boot loader would then have to ECC-correct the unreliable bits each time it loads, and the manufacturer guarantees that Y doesn't grow above the ECC requirements.

Stefan

Reply to
Stefan Reuther
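
Taking that interpretation literally, a first-stage loader would push every read of block 0 through ECC on each boot. Here is a minimal sketch in C of what that loop might look like; all function names, the page/spare layout, and the 3-byte-per-sector ECC placement are assumptions for illustration, not any vendor's actual boot ROM:

    /* Boot-time read loop under this interpretation: every load of
     * block 0 runs the data through ECC, so a fixed population of weak
     * bits gets corrected on each boot. Everything here is a
     * hypothetical sketch; ecc1_correct() stands in for a 1-bit Hamming
     * corrector over a 512-byte sector plus its spare bytes (the
     * "1 bit/528 bytes" figure quoted upthread) and returns <0 if more
     * than one bit failed. */
    #include <stdint.h>

    #define PAGE_SIZE   2048
    #define SPARE_SIZE  64
    #define SECTOR_SIZE 512

    extern int nand_read_raw(uint32_t page, uint8_t *data, uint8_t *spare);
    extern int ecc1_correct(uint8_t *sector, const uint8_t *ecc_bytes);

    int load_boot_image(uint8_t *dest, uint32_t pages)
    {
        uint8_t spare[SPARE_SIZE];
        for (uint32_t p = 0; p < pages; p++) {
            uint8_t *page_buf = dest + p * PAGE_SIZE;
            if (nand_read_raw(p, page_buf, spare) != 0)
                return -1;
            /* one 3-byte ECC per sector, assumed stored in the spare area */
            for (uint32_t s = 0; s < PAGE_SIZE / SECTOR_SIZE; s++)
                if (ecc1_correct(page_buf + s * SECTOR_SIZE, spare + s * 3) < 0)
                    return -1;   /* >1 bad bit in a sector: image unusable */
        }
        return 0;
    }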

Looking back, I have never actually used NAND flash in a design. I understand how the bad bits would be managed, but what about bad blocks? Is this a spec on delivery, or are blocks allowed to go bad in the field? I can't see how that could be supported without a very complex scheme along the lines of RAID drives.

Rick

Reply to
rickman

That's what it means to me, that's what it means to the FAEs we're working with, and judging by the parts' datasheets, that's what it means to the guys doing QA at the fabs.

--
Grant
Reply to
Grant Edwards

It's pretty simple actually. When the driver reads a block that has an error, it copies the corrected contents to an unused block and sets the bad block flag in the original block, preventing its reuse. No software will ever clear the bad block flag, which means that the effective size of the device decreases as blocks go bad in the field.

From the point of view of the flash device, the bad block flag is just another bit. The meaning comes from the software behaviour. The device manufacturer will also mark some blocks bad during test. All filesystems will use this same bit. Even if you reformat the device and put a different filesystem on it, the bad block information is retained.

Cheers, Allan

Reply to
Allan Herriman
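
To make the scheme concrete, here is a minimal sketch in C of the relocate-and-retire flow described above. All the nand_*() primitives and the read-status convention are hypothetical; real drivers (e.g. Linux MTD) differ in detail:

    /* Sketch of the relocate-and-retire flow described above. All
     * driver primitives (nand_*) are hypothetical placeholders; the
     * return convention assumed for nand_read_page() is: 0 = clean
     * read, >0 = number of bits ECC corrected, <0 = uncorrectable. */
    #include <stdint.h>

    #define PAGE_SIZE 2048   /* illustrative large-page device */

    extern int      nand_read_page(uint32_t block, uint32_t page, uint8_t *buf);
    extern int      nand_write_page(uint32_t block, uint32_t page, const uint8_t *buf);
    extern void     nand_mark_bad(uint32_t block);   /* set bad-block flag in spare area */
    extern uint32_t nand_alloc_block(void);          /* pick an unused good block */

    /* Read one page; if ECC had to correct bits, salvage the whole
     * block into a fresh one and retire the original. On success
     * *cur_block is updated to wherever the data now lives. Returns 0,
     * or -1 if the data was unrecoverable. */
    int read_page_checked(uint32_t *cur_block, uint32_t page, uint8_t *buf,
                          uint32_t pages_per_block)
    {
        int rc = nand_read_page(*cur_block, page, buf);
        if (rc < 0)
            return -1;           /* uncorrectable: data lost */
        if (rc == 0)
            return 0;            /* clean read, nothing to do */

        /* Bits were corrected: copy the (corrected) block elsewhere
         * before it degrades further, then flag the old block so it is
         * never reused. The device's usable size shrinks by one block. */
        uint32_t dst = nand_alloc_block();
        uint8_t  tmp[PAGE_SIZE];
        for (uint32_t p = 0; p < pages_per_block; p++) {
            if (nand_read_page(*cur_block, p, tmp) < 0 ||
                nand_write_page(dst, p, tmp) != 0)
                return -1;
        }
        nand_mark_bad(*cur_block);
        *cur_block = dst;
        return 0;
    }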


You lost me. If there is a recoverable error, the block is not bad, right? That's the purpose of the ECC. If the block accumulates enough bad bits that the ECC cannot correct them, then you can't recover the data.

Obviously there is something about the definition of "bad block" that I am not getting. Are blocks with *any* bit errors considered bad and not used? What if a block goes bad because it went from no bit errors to more than the correctable number of bit errors? As Marc indicated, a block can go bad for multiple reasons, many of which do not allow the data to be recovered.

This sounds just like a bad block on a hard drive. When the block goes bad, you lose data. No way around it, just tough luck! I suppose that is simply a limitation of both media. I didn't realize that NAND Flash had this same sort of specified behavior which is considered part of normal operation. I'll have to keep that in mind.

Rick

Reply to
rickman

[...]

But where do you store the "bad block" flag? It is pretty common to store it in the bad block itself. The point Marc is making is that this is not guaranteed to work.

In an ideal world, maybe. All file systems I have seen so far use different bad block schemes. Which is not surprising, as NAND flash parts themselves use different schemes to mark factory bad blocks.

Stefan

Reply to
Stefan Reuther
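
For what it's worth, one common (but by no means universal) factory-marking convention on large-page parts is that the first spare-area byte of a block's first page reads non-0xFF if the block shipped bad. A sketch of that check, with an assumed driver call:

    /* Factory bad-block marker check. Convention assumed here:
     * large-page NAND, where the first spare (OOB) byte of a block's
     * first page is 0xFF for a good block. Some parts also mark the
     * last page, and small-page parts use spare byte 6 - always check
     * the datasheet for the specific part. */
    #include <stdint.h>
    #include <stdbool.h>

    extern int nand_read_spare(uint32_t block, uint32_t page,
                               uint8_t *buf, uint32_t len);   /* hypothetical */

    bool block_is_factory_bad(uint32_t block)
    {
        uint8_t marker;
        if (nand_read_spare(block, 0, &marker, 1) != 0)
            return true;          /* unreadable spare area: treat as bad */
        return marker != 0xFF;
    }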

Just like with hard disks, the NAND flash ECC can correct several errors in a block. So when there are a few correctable errors in a block, the block is still "good" and still used. But once you have got close to the correctable limit, you can still read out the data but you mark it as bad so that it won't be used again.

There is always a possibility of a major failure that unexpectedly increases the error rate beyond the capabilities of the ECC. But that should be a fairly rare event - like a head crash on a hard disk. The idea is to detect slow, gradual decay and limit its consequences. If you need to protect against sudden disaster, then something equivalent to RAID is the answer.

Reply to
David Brown
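
The "close to the correctable limit" policy can be as simple as retiring a block one bit short of the ECC strength. A sketch with illustrative numbers (not from any particular datasheet):

    /* Retire-before-failure policy sketch. ECC_STRENGTH is how many
     * bit errors the code can correct per sector; RETIRE_THRESHOLD
     * sits below it so a block is abandoned while its data is still
     * readable. The numbers are illustrative. */
    #include <stdbool.h>

    #define ECC_STRENGTH      4                    /* correctable bits per sector */
    #define RETIRE_THRESHOLD  (ECC_STRENGTH - 1)   /* retire with margin to spare */

    /* corrected = bit errors ECC just fixed in one sector of this block */
    static inline bool block_should_retire(unsigned corrected)
    {
        return corrected >= RETIRE_THRESHOLD;
    }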


"Close" isn't good enough. You can't assume that it will fail gradually. If it goes from good to bad, then you have lost data. Now that I am aware of that, I will treat NAND flash the same as hard disks, not to be counted on for embedded projects where a file system failure is important.

Yes, a bad block happening without warning may be "rare", but the point is that it is JUST like a hard disk drive and cannot be used in an app where this would cause a system failure. Any part of the system can fail, but a bad block is not considered a "failure" of the chip even though it can cause a failure of the system.

Rick

Reply to
rickman


Why do you need a bad block flag? If the block has an ECC failure, it is bad and the OS will note that. You may have to read the block using its ECC the first time it fails, but after that it can be noted in the file system as neither part of a file nor part of the free space on the drive.


I don't see how this is any different from a hard drive. There they use a combination of factory data and the file system to track bad blocks.

Rick

Reply to
rickman

That's just nonsense.

/Everything/ has a chance of failure. Are you going to stop using microcontrollers because you've heard that they occasionally fail? Will you stop driving your car to work because cars sometimes break down?

What is important for building reliable systems is to have an understanding of the failure modes of the parts, the chances of these failures, and the consequences of the failures.

NAND flash has significant risk of failure with reasonably well understood characteristics - the failure of individual bits is mostly independent, and the risk of failure increases with each erase/write cycle. So what you get is a pattern of gradually more random bit failures within any given block, increasing as the block gets erased and re-written. You correct for a few bit failures, but if there are too many errors you consider the block to be failing - you can read from it, but you won't trust it to store new data. In most cases, you'll copy the data over to a different block.

Note that the same principle applies if the ECC coding only corrects a single error - with one correctable error you consider the block too risky for re-use, but trust the (corrected) data read out.

The only way to make a system safe in the event of rare catastrophic failures of critical systems is with redundancy. It applies to NAND devices just like it applies to every other part of the system.

The difference is that with a NAND flash, a bad block is /not/ considered a failure because you take the wear of the blocks into account in the design of the system, so that they don't lead to system failure.

Think of it like a battery - you know it is going to "fail", and plan accordingly so that it does not lead to a catastrophic failure of the system.

Reply to
David Brown
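
One minimal form of that redundancy is keeping two copies of anything critical in independent blocks. A sketch, with assumed helper functions that return 0 on an ECC-clean or ECC-corrected transfer:

    /* Redundancy sketch for the point above: keep two copies of
     * critical data in independent blocks so one catastrophic block
     * failure is survivable. The nand_*_block() helpers are assumed,
     * with nand_write_block() erasing before it programs. */
    #include <stdint.h>

    extern int nand_read_block(uint32_t block, uint8_t *buf);
    extern int nand_write_block(uint32_t block, const uint8_t *buf);

    int read_critical(uint32_t primary, uint32_t backup, uint8_t *buf)
    {
        if (nand_read_block(primary, buf) == 0)
            return 0;
        if (nand_read_block(backup, buf) != 0)
            return -1;                           /* both copies gone */
        (void)nand_write_block(primary, buf);    /* best-effort repair of primary */
        return 0;
    }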

Failures can be intermittent - a partially failed bit could be read correctly or incorrectly depending on the data stored, the temperature, or the voltage. So if you see that you are getting failures, you make a note of them and don't use that block again.

Reply to
David Brown
