Atmel SAM9 "boot from NAND" is a myth?

Our experience has been that parts fresh from the factory will have some blocks flagged as bad. If we (by using a modified driver) write and read those blocks, some of them will actually work ok. Presumably the factory test is rather more rigorous.

("Modified driver" should be read as "partially ported and still bug ridden driver".)

It's been a while, but ISTR that the parts we were using had 0 to 3 bad blocks per device, which was within the manufacturer's spec. We stress tested a bunch of them and we did see a block go bad. The number of erase/ write cycles required exceeded the manufacturer's minimum spec. (This stress testing was performed to test the bad block handling in software.)

To keep this relevant to the OP, we were using parts that had a guaranteed good block 0. The board is still in production.

Regards, Allan

Reply to
Allan Herriman

rickman wrote:

News to you: All flash memories will eventually "wear out". You have to have a strategy to handle this.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Grant Edwards wrote:

The problem is that the NAND flash market has moved on since the AT91SAM9G20 was designed. NAND flash used to guarantee block 0; now there are memories which do not have this guarantee. I think the move in the industry is towards eMMC.

Note that the fact that block 0 is OK is no guarantee that you can boot. The part's configuration must also be recognized by the boot ROM. Some manufacturers "reuse" IDs, so if the table contains two entries with the same ID, only the first will be found.
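
As an illustration of that failure mode, here is a minimal sketch of a first-match ID lookup table; the structure, names and values are invented for the example, not the actual AT91 ROM code:

    /* Hypothetical first-match device table; not the real AT91 ROM code. */
    #include <stdint.h>
    #include <stddef.h>

    struct nand_entry {
        uint8_t  id;          /* device ID read from the NAND           */
        uint16_t page_size;   /* configuration the boot ROM will assume */
        uint8_t  bus_width;
    };

    static const struct nand_entry table[] = {
        { 0xDA, 2048, 8 },    /* older part                              */
        { 0xDA, 4096, 8 },    /* newer part "reusing" the same ID --     */
                              /* never reached by the loop below         */
        { 0xDC, 2048, 8 },
    };

    static const struct nand_entry *lookup(uint8_t id)
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].id == id)
                return &table[i];   /* first match wins */
        return NULL;                /* unknown part: no boot */
    }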

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

It means that it's not doing particularly well and is likely to become bad.

Precisely. Without ECC, you wouldn't be able to evacuate the data to a good block.

In theory I think you could still use them for writing but you'd have to verify the data every time.

That would be "really bad".

--
Gemaakt met Opera's revolutionaire e-mailprogramma:  
http://www.opera.com/mail/
(remove the obvious prefix to reply by mail)
Reply to
Boudewijn Dijkstra

NAND chips allow for up to N "bad blocks" before a device is considered defective. Some blocks come already marked as bad from the factory. It is recommended to preserve this information, as factory testing is usually more exhaustive than what you can implement in a typical embedded system. However, more bad blocks are allowed to develop DURING THE LIFETIME of the device, up to the specified maximum (N).

This means that whatever you write to the device may or may not be readable afterwards! You have three choices for how to handle this:

Choice 1: Use enough error correction to be 100% safe against it.

Note that "normal" ECC is definately not enough. On a device with 0 known bad blocks, up to N blocks can disappear from one moment to the other (in the worst possible scenario). To be safe against this, you must distribute each and every of your precious bits over at least N+1 blocks.

Algorithms exist that can do this (Reed-Solomon, for example), but they are not nice. Besides the algorithmic complexity, there is another problem with this approach: the higher the storage efficiency (data bits versus redundancy bits), the more blocks you have to read before you're able to extract your bits. With N in the range of 20 to 40 in typical NAND chips, this results in an unavoidable and very high read/write latency.
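
To make the "N+1 blocks" point concrete: the crudest possible scheme is plain (N+1)-way replication, i.e. write the same data to N+1 physical blocks and accept that any one surviving copy is enough; Reed-Solomon buys the same loss tolerance with far less overhead. A minimal sketch, assuming hypothetical nand_write_block()/nand_read_block() helpers that return 0 on success:

    /* Survive up to N_MAX_BAD block losses by writing N_MAX_BAD+1 copies. */
    #include <stddef.h>

    #define N_MAX_BAD  40                 /* worst-case bad blocks per device */
    #define COPIES     (N_MAX_BAD + 1)

    extern int nand_write_block(unsigned phys, const void *buf, size_t len);
    extern int nand_read_block(unsigned phys, void *buf, size_t len);

    int replicated_write(const unsigned phys[COPIES], const void *buf, size_t len)
    {
        int ok = 0;
        for (int i = 0; i < COPIES; i++)
            if (nand_write_block(phys[i], buf, len) == 0)
                ok++;
        return ok ? 0 : -1;               /* at least one copy must land */
    }

    int replicated_read(const unsigned phys[COPIES], void *buf, size_t len)
    {
        for (int i = 0; i < COPIES; i++)
            if (nand_read_block(phys[i], buf, len) == 0)
                return 0;                 /* first readable copy wins */
        return -1;                        /* all N+1 copies gone: out of spec */
    }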

Choice 2: Avoid giving more reliability guarantees than your underlying technology.

This sounds simple yet impossible, but in fact it's quite realistic. The problem is not that your storage capacity may go away. It's just that vital data stored in a particular place may become unreadable. If you introduce a procedure to restore the data and make that procedure part of the normal operation of your device, then it's not a real problem.

In practice this means that your bad block layer must be able to identify the bad blocks in all circumstances. I know that many real-world algorithms (like the aforementioned one of using the "bad block bit") are not 100% fit for the task. After all, the bad block bit may be stuck at '1', and then you can't do what's necessary to mark the block bad. But there are more reliable approaches that can achieve the necessary guarantee.

Of course the other essential part for this choice is to provide a way to restore the data, which can be a PC flasher program (like iTunes "restore device").

Then your device can be declared to be always working, without extending the reliability guarantees beyond those given by the NAND manufacturer.

Choice 3: Implement reasonable ECC, give all the guarantees, and hope for the best.

This seems to be "industry standard". It seems to work out quite OK, because NAND failures usually are not very catastrophic. As others have pointed out, creeping failures can be detected and data migrated before ECC capability is exceeded. Usually failures go hand in hand with write activity in the same block or page, and write patterns are under software control.
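
A sketch of that "migrate before ECC capability is exceeded" idea, assuming a read helper that reports how many bits it had to correct (all helper names here are hypothetical):

    /* Copy a block somewhere healthier once correctable errors creep
     * towards the ECC limit.  Helper names are placeholders. */
    #define ECC_MAX_CORRECTABLE  8    /* e.g. 8-bit BCH per sector     */
    #define MIGRATE_THRESHOLD    6    /* act before hitting the limit  */

    /* returns number of bits corrected, or -1 if uncorrectable */
    extern int nand_read_block_ecc(unsigned block, void *buf);
    extern int nand_copy_block(unsigned from, unsigned to);
    extern int pick_replacement_block(void);

    int read_with_scrub(unsigned block, void *buf)
    {
        int corrected = nand_read_block_ecc(block, buf);

        if (corrected < 0)
            return -1;                          /* too late: data lost     */

        if (corrected >= MIGRATE_THRESHOLD) {
            int dst = pick_replacement_block(); /* creeping failure caught */
            if (dst >= 0)
                (void)nand_copy_block(block, (unsigned)dst);
        }
        return 0;
    }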

But then again, to make it very clear: this approach is not 100% safe. It's a compromise between feasibility and reliability.

You will see yield problems. Unless it's life-threatening technology, you're probably better off accepting them than trying to cure them.

Reply to
Marc Jet

That was the conclusion to which I eventually came after reviewing a bunch of datasheets. The parts that you could use to boot a G20 were all several years old, and the parts that required ECC on block 0 were newer. Since the hardware guys wanted a small (read: BGA) package, that pretty much left only the recent parts that require ECC on block 0.

It looks like we're going to either have to settle for TSOP or add a SPI NOR flash to hold the 16KB bootstrap.

--
Grant Edwards               grant.b.edwards        Yow! I wish I was a
                                  at               sex-starved manicurist
                              gmail.com            found dead in the Bronx!!
Reply to
Grant Edwards

That N could be as high as 2% of the total capacity, and the tendency is to allow higher and higher N. The higher the flash density, the lower the reliability. This is especially true for multi-level flash. If the application requires high reliability of data, I avoid using high-density flash. There is also NAND flash of industrial quality, which is substantially more reliable than consumer grade.

You are making unfounded assumptions here.

And higher than the maximum: N+1, N+2 and so on.

Incredible, isn't it?

Only the insurance agencies are promising 100% guaranteed result.

RAID or RAID-like solutions are well known for the safe storage of data.

Until some critical part of the filesystem fails, making all other data unaccessible.

Those intelligent measures introduce a lot of overhead and increase the amount of write activity. Also, they create critical situations when accidental power failure can destroy the filesystem.

Sure. Who cares about an occasionally broken .mp3 or .jpg file.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

formatting link

Reply to
Vladimir Vassilevsky

If you add an SPI flash (or a dataflash) and plan to boot Linux, you are probably better off putting u-boot and the u-boot environment in the dataflash as well. You might also want to consider putting the kernel there.

The reason is the SAM-BA S/W, which only knows how to erase the complete NAND flash. If you plan to program the NAND flash using another method, then of course use the NAND flash for everything except the bootstrap.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

How do you mark it "in the file system" if your file system is actually inside the NAND flash? Thought experiment: your bad block table is stored in a particular block. That block goes bad. Where do you mark that this block is now bad?

State of the art seems to be to use magic numbers for valid data, and destroy the ECC and/or magic numbers for blocks that have gone bad, so you can identify them later. That's the "bad block flag".
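
Roughly, in code (a sketch only: the magic value, page layout and helper names are made up, and a real driver would program the stomp page with ECC disabled):

    /* A block whose first page no longer carries the magic number (or
     * fails the read) is treated as bad at the next scan. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE   2048
    #define GOOD_MAGIC  0x600DB10Cu

    extern int nand_read_page(unsigned block, unsigned page, uint8_t *buf);
    extern int nand_program_page(unsigned block, unsigned page, const uint8_t *buf);

    int block_is_good(unsigned block)
    {
        uint8_t page[PAGE_SIZE];
        uint32_t magic;

        if (nand_read_page(block, 0, page) != 0)
            return 0;                    /* uncorrectable read: bad */
        memcpy(&magic, page, sizeof magic);
        return magic == GOOD_MAGIC;
    }

    void mark_block_bad(unsigned block)
    {
        /* NAND programming can only clear bits (1 -> 0), so stomping
         * zeros over the magic/ECC area is the most we can do; even a
         * flaky block will usually lose enough bits to fail the check. */
        uint8_t zeros[PAGE_SIZE];
        memset(zeros, 0x00, sizeof zeros);
        (void)nand_program_page(block, 0, zeros);   /* best effort only */
    }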

From what I've seen, those temporary failed bits are still within the specs of the NAND flash as long as you're running the part within specs. However, when you're way out of spec (say, 30°C over limit), all hell breaks loose.

Stefan

Reply to
Stefan Reuther

This means you have to distribute each single data block across, say, 161 blocks. With a block size of 4k and NOP=4 this means the minimum amount of data you can write (aka "cluster size") is 161 kBytes. Plus, remember that NAND flash tends to get more forgetful if you actually use NOP=4, so you'd more likely write 161x4 = 644 kBytes.

Well, that's certainly a way to reach 100,000 programming cycles.

That's why you don't use a single bit. If my bad block layer sees a bad block, it tries to actively stomp on all bits that still live there, to destroy as much of the ECC and magic numbers as possible. Remember, we don't need 100% reliability. After all, all components have a finite life, and the flash just needs to live a little longer than the plug connectors or capacitors in the device :-) And by using many bits, I believe the chance that they all refuse to flip is low enough.

It's a flash. It's electrons that tunnel out gradually. It's not an evil gnome sitting inside the package, deciding "today, I'll annoy the engineer in an especially evil, twisted way". So while the data sheet allows a NAND flash to keep its old contents unmodifiable in a bad sector, I assume this doesn't happen in practice. Or, at least, not often enough to be observable.

Stefan

Reply to
Stefan Reuther

No, he is citing usual data sheets. So while skeptics may still doubt that factory testing happens, Marc's claim certainly is not unfounded, because you can read it in every data sheet :-]

Stefan

Reply to
Stefan Reuther

While the ROM bootloader supports the 25xx series "dataflash" parts we got sold, there is no support in the AT91 bootstrap, U-Boot, or Linux -- at least not that I could ever find. I asked about it on the AT91 forum a few months back and got the usual response (IOW, none at all).

Oh, I fixed that ages ago.

I added a few lines of code to the nand-flash, data-flash, and serial-flash applets so that they can all erase a region of flash.

Then I wrote my own ROM-boot-protocol client in Python.

[Besides the lack of an "erase region" command, SAM-BA won't work at all using a serial connection on a Linux host, it's not very usable from the command-line, and it isn't very easy to use as a module from other programs.]

We'll initially use our SAM-BA replacement program to program prototypes. Then for production, the plan is to have the distributor ship the parts with U-Boot preprogrammed so that we can use the TFTP server in U-Boot to do the rest.

--
Grant Edwards               grant.b.edwards        Yow! I Know A Joke!!
                                  at               
                              gmail.com
Reply to
Grant Edwards

This is what some people here apparently have trouble understanding. /Nothing/ is 100% reliable - it's just a matter of taking the reliability of your parts into account when designing a complete system.

Reply to
David Brown

It is hidden in a disused lavatory in the cellar marked: "Beware of the Leopard".

There are three different AT91bootstraps around.

1) The obvious AT91bootstrap you can download from formatting link
2) My derivative of AT91bootstrap, which adds Kconfig etc. and is used by open-source projects like Buildroot and OpenEmbedded.
3) There is normally an AT91bootstrap in the "Softpacks". This is different from (1) and (2). It supports the 25xx series SPI flash but relies on libraries not normally available in arm-linux compilers, so you may have to compile it using arm-newlib, IAR or Keil.

Nice, how about sharing!

Or, if you have an SD-Connector, you can boot from an SD-Card/ external SPI flash which programs the internal flash.

--
Best Regards
Ulf Samuelsson
These are my own personal opinions, which may (or may not)
be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Ah! I just listened to Stephen Fry's audiobook of THHGTTG last weekend while driving home from Chicago.

That's the one I looked at.

Though I am using buildroot for my rootfs, I found it more convenient to build other things (kernel, bootstrap, U-Boot) separately, so I never really looked into that one.

That's interesting. I'll keep that in mind.

Sure. The changes to the applets can certainly be shared. I'll have to check with management regarding my sam-ba client replacement. I just double-checked, and the erase-region command has been added to the nandflash and dataflash applets, but it never got added to the serialflash (AT25xx) applet.

That's also an option, but since we'll have to connect an Ethernet cable anyway as part of the normal production test process, we want to use Ethernet as the programming interface as well.

--
Grant Edwards               grant.b.edwards        Yow! Is a tattoo real, like
                                  at               a curb or a battleship?
                              gmail.com            Or are we suffering in
                                                   Safeway?
Reply to
Grant Edwards

This seems to be "industry standard" from my experience as well. But IMHO it's not a good solution to the problem.

Typical NAND datasheets do not specify the behaviour of bad blocks. The approach you mention relies on certain behaviour from the bad blocks (e.g. the ability to erase or overwrite them). This is why I think it is a bad approach.

Another approach is the following:

The chip is partitioned into data blocks and spare blocks. During mount, all block headers are scanned in a specific order, e.g. ascending order for data blocks, and descending order for spare blocks.

Every data block contains a header which holds its physical block number and a (cryptographic) hash signature. Blocks without a valid hash signature are considered bad or stale (e.g. power fail during erase). In the first pass, every data block that passes this test is considered valid - until a spare block overrides it.

The spare block header contains its own physical block number, the physical block number of the data block it replaces, and a hash signature as well. If a spare block exists for a data block, the data block is degraded to "bad", no matter what the data block's content claimed to be. Likewise, if another valid spare block refers to the same data block, it overrides the previously read spare block (hence the block scanning order). After all, what we arbitrarily designated to be "spare" blocks could be bad blocks too.

This method is able to memorize any combination of up to N bad blocks, no matter what the bad block behaviour is. Up to the collision resistance of the hash algorithm, of course. You can achieve any desired reliability by choosing the hash algorithm accordingly.

The key point to understand is that the bad block information should be stored in the good blocks, not in the bad ones. The good blocks are the ones that have their behaviour specified.
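
A condensed sketch of that mount scan; the block counts, header layout and helpers like read_header()/hash_ok() are placeholders standing in for whatever signature scheme you pick:

    /* Condensed sketch of the two-pass mount scan described above. */
    #include <stdint.h>
    #include <stdbool.h>

    #define DATA_BLOCKS   960
    #define SPARE_BLOCKS   64
    #define TOTAL_BLOCKS  (DATA_BLOCKS + SPARE_BLOCKS)

    struct header {
        uint32_t phys;        /* physical block number of this block       */
        uint32_t replaces;    /* for spare blocks: data block it overrides */
        uint8_t  digest[32];  /* hash/signature over header + payload      */
    };

    extern bool read_header(unsigned phys, struct header *h); /* false on read error */
    extern bool hash_ok(const struct header *h);              /* signature valid?    */

    /* map[logical block] = physical block currently holding its data, or -1 */
    static int32_t map[DATA_BLOCKS];

    void mount_scan(void)
    {
        struct header h;

        for (unsigned i = 0; i < DATA_BLOCKS; i++)
            map[i] = -1;

        /* Pass 1: data blocks in ascending order; a block with a valid
         * signature and a matching self-reference is provisionally good. */
        for (unsigned b = 0; b < DATA_BLOCKS; b++)
            if (read_header(b, &h) && hash_ok(&h) && h.phys == b)
                map[b] = (int32_t)b;

        /* Pass 2: spare blocks in descending order; any valid spare
         * degrades the data block it names to "bad" and takes over, and
         * a spare scanned later overrides one scanned earlier. */
        for (unsigned b = TOTAL_BLOCKS; b-- > DATA_BLOCKS; )
            if (read_header(b, &h) && hash_ok(&h) &&
                h.phys == b && h.replaces < DATA_BLOCKS)
                map[h.replaces] = (int32_t)b;
    }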

Reply to
Marc Jet

So you will also have discovered about

"There is no UP for rain to fall from, therefore rainfall of the universe is none"

Let alone no sex...

Says he the geek with CDs of the original Radio series..

--
Paul Carpenter          | paul@pcserviceselectronics.co.uk
    PC Services
 Timing Diagram Font
  GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
 For those web sites you hate
Reply to
Paul

That's fine. But my point is that if the block is "bad", either you can set a bad block flag or the ECC check will fail when the media is read. In either case you can flag it in your access system (I don't want to call it a file system) and not use that block again until the next reboot. This only has a performance impact at boot time. You don't have to *rely* on a bad block flag, since that can also be faulty, but it can be used in addition to detecting an ECC error.
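
In code, that amounts to little more than a RAM-side bitmap fed by either signal; a rough sketch with hypothetical helpers:

    /* RAM bitmap of blocks to avoid, rebuilt at every boot and fed by
     * either the bad-block flag or an uncorrectable-ECC read result. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BLOCKS 1024

    static uint8_t bad_bitmap[NUM_BLOCKS / 8];

    static void mark_bad(unsigned block)
    {
        bad_bitmap[block / 8] |= (uint8_t)(1u << (block % 8));
    }

    bool block_usable(unsigned block)
    {
        return !(bad_bitmap[block / 8] & (1u << (block % 8)));
    }

    extern bool bad_block_flag_set(unsigned block);        /* OOB marker check    */
    extern int  read_block_ecc(unsigned block, void *buf); /* <0 on uncorrectable */

    int safe_read(unsigned block, void *buf)
    {
        if (!block_usable(block))
            return -1;
        if (bad_block_flag_set(block) || read_block_ecc(block, buf) < 0) {
            mark_bad(block);        /* remembered only until next reboot */
            return -1;
        }
        return 0;
    }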

Rick

Reply to
rickman

Sure, if that is your file system, then it doesn't work very well for NAND flash, does it? The bad sector 0 problem is one that hard drives have to this day, don't they? Or maybe the internal controller can remap that "invisibly" now that there are tons of embedded smarts in them. But that is the point: if your device can't "fix" a bad block at the lowest level of the file system on the drive, then it is subject to failure. If, on boot, the software does what it has to do to recover the structure of the file system, then it will be robust.

Yes, and once you find a "bad block", the "access system" (I really shouldn't call it a file system since you might not be working at that level) will have to remember this block in memory, not on the drive. Each time the system is booted, it will have to either read a valid bad block table or construct its own. I suppose that each time the device needs a new block to write data, it could search for a working block. That would be a very primitive system, as well as slow, but it would work and would not require a bad block table.

BTW, I assume that in order to trust a block on a NAND drive each write would need to be verified in some manner. Is that also included in a NAND access system?
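
One common form that verification takes is simply program-then-read-back; a minimal sketch, assuming hypothetical raw page helpers that return 0 on success:

    /* Program a page and verify it by reading it back and comparing. */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 2048

    extern int nand_program_page(unsigned block, unsigned page, const uint8_t *buf);
    extern int nand_read_page(unsigned block, unsigned page, uint8_t *buf);

    int write_verified(unsigned block, unsigned page, const uint8_t *data)
    {
        uint8_t check[PAGE_SIZE];

        if (nand_program_page(block, page, data) != 0)
            return -1;                       /* chip reported program failure */
        if (nand_read_page(block, page, check) != 0)
            return -1;                       /* read-back failed              */
        if (memcmp(check, data, PAGE_SIZE) != 0)
            return -1;                       /* silent corruption: retire it  */
        return 0;
    }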

Not sure what you mean by "within the specs". Are you saying the spec allows some level of intermittent failure on reads and/or writes? If so, there is still some level of intermittent that would be outside the spec and needs to be flagged as bad.

Rick

Reply to
rickman

You can track bad blocks in all sorts of different ways. Some will involve more work when the bad block is discovered, others will involve more checking before using a block. But any file system, or "access system" if you like, has to have some way of tracking whether a block is in use or not. If you think of bad blocks as being in use in a special file that can't be accessed normally, then you have got simple and efficient bad block tracking (at least, it's as simple and efficient as the rest of your file system).

Reply to
David Brown
