NAND chips allow for up to N "bad blocks" before a device is considered defective. Some blocks come already marked as bad from the factory. It is recommended to preserve this information, as factory testing is usually more exhaustive than what you can implement in a typical embedded system. However, more bad blocks are allowed to develop DURING THE LIFETIME of the device, up to the specified maximum (N).
This means that whatever you write to the device may or may not be readable afterwards! You have 3 choices for how to handle this:
Choice 1: Use enough error correction to be 100% safe against it.
Note that "normal" ECC is definately not enough. On a device with 0 known bad blocks, up to N blocks can disappear from one moment to the other (in the worst possible scenario). To be safe against this, you must distribute each and every of your precious bits over at least N+1 blocks.
Algorithms exist that can do this (for example Reed-Solomon codes), but they are not nice. Besides the algorithmic complexity, there is another problem with this approach: the higher the storage efficiency (data bits versus redundancy bits), the more blocks you have to read before you're able to extract your bits. With N in the range of 20 to 40 in typical NAND chips, this results in an unavoidable and very high read/write latency.
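To make the trade-off concrete, here is a minimal sketch of the degenerate case of such a code: plain (N+1)-way replication, i.e. a code with storage efficiency 1/(N+1). The driver functions nand_read_page/nand_write_page and the geometry constants are hypothetical placeholders, not a real NAND API.

```c
#include <stdbool.h>
#include <stdint.h>

#define N_MAX_BAD_BLOCKS 40            /* the datasheet "N" (assumed value) */
#define COPIES (N_MAX_BAD_BLOCKS + 1)  /* survive N blocks dying at once    */

/* Hypothetical low-level driver; returns false on uncorrectable error. */
bool nand_read_page(uint32_t block, uint32_t page, uint8_t *buf);
bool nand_write_page(uint32_t block, uint32_t page, const uint8_t *buf);

/* Write one logical page by storing a full copy in N+1 different blocks.
 * Storage efficiency is 1/(N+1): the price of the 100% guarantee.       */
bool safe_write(uint32_t first_block, uint32_t page, const uint8_t *buf)
{
    int ok = 0;
    for (uint32_t i = 0; i < COPIES; i++)
        ok += nand_write_page(first_block + i, page, buf);
    /* Blocks that fail right now count against the same lifetime budget
     * of N, so as long as one copy was written, at least one copy is
     * guaranteed to stay readable.                                      */
    return ok > 0;
}

/* Read back: try each copy until one passes ECC. */
bool safe_read(uint32_t first_block, uint32_t page, uint8_t *buf)
{
    for (uint32_t i = 0; i < COPIES; i++)
        if (nand_read_page(first_block + i, page, buf))
            return true;
    return false;  /* more than N failures: the chip itself is out of spec */
}
```

Replication keeps the common-case read cheap (the first good copy wins) but throws away almost all capacity; a Reed-Solomon code with efficiency k/(k+N) recovers the capacity, but then any read must touch k blocks, which is exactly where the latency comes from.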
Choice 2: Avoid giving more reliability guarantees than your underlying technology provides.
This sounds impossible, yet in fact it's quite realistic. The problem is not that your storage capacity may go away. It's just that vital data stored in a particular place may become unreadable. If you introduce a procedure to restore the data and make that procedure part of the normal operation of your device, then it's not a real problem.
In practice this means that your bad block layer must be able to identify the bad blocks in all circumstances. I know that many real-world algorithms (like the one mentioned above, using the "bad block bit") are not 100% fit for the task. After all, the bad block bit may be stuck at '1', so you can't do what's necessary to mark the block bad. But there are more reliable approaches that can achieve the necessary guarantee; one of them is sketched below.
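One such approach is to keep a bad block table (BBT) in a few dedicated, replicated blocks instead of relying on the in-block marker, in the spirit of what the Linux MTD layer does. The following is only a sketch; all function names and the table layout are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_BLOCKS 4096
#define BBT_COPIES 4  /* the table itself is replicated so it can't be lost */

/* Hypothetical driver primitives, not a real NAND API. */
bool nand_read_oob_byte(uint32_t block, uint8_t *marker); /* byte 0 of OOB */
bool nand_write_block(uint32_t block, const uint8_t *buf, size_t len);

static uint8_t bbt[NUM_BLOCKS];   /* 1 = bad, 0 = good */

/* One-time scan of the factory markers, done before the first erase,
 * because erasing a block destroys its factory marker irrecoverably.  */
void bbt_build_from_factory_markers(void)
{
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        uint8_t marker = 0xFF;
        /* Factory-bad blocks carry a non-0xFF marker in the spare area. */
        if (!nand_read_oob_byte(b, &marker) || marker != 0xFF)
            bbt[b] = 1;
    }
}

/* Marking a block bad updates the table copies instead of the dying
 * block itself, so a marker bit stuck at '1' cannot defeat us. (The
 * erase-before-write of the table blocks is glossed over here.)       */
bool bbt_mark_bad(uint32_t block, const uint32_t table_blocks[BBT_COPIES])
{
    bbt[block] = 1;
    int ok = 0;
    for (int i = 0; i < BBT_COPIES; i++)
        ok += nand_write_block(table_blocks[i], bbt, sizeof bbt);
    return ok > 0;   /* fine as long as at least one table copy sticks */
}
```

The important property is that declaring a block bad never requires writing to the failing block itself.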
Of course, the other essential part of this choice is to provide a way to restore the data, which can be a PC flasher program (like iTunes' "restore device").
Your device can then be declared to be always working, without extending the reliability guarantees beyond those given by the NAND manufacturer.
Choice 3: Implement reasonable ECC, give all the guarantees, and hope for the best.
This seems to be the industry standard. It seems to work out quite OK, because NAND failures usually are not very catastrophic. As others have pointed out, creeping failures can be detected and the data migrated before the ECC capability is exceeded. Failures usually go hand in hand with write activity in the same block or page, and write patterns are under software control.
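Here is a sketch of that detect-and-migrate step, assuming the driver reports how many bits ECC had to correct on each read (many NAND controllers expose this). The threshold and all function names are made up for the example:

```c
#include <stdbool.h>
#include <stdint.h>

#define ECC_STRENGTH    8   /* correctable bits per page (assumed)         */
#define SCRUB_THRESHOLD 6   /* migrate well before ECC capability runs out */

/* Hypothetical driver: returns the number of bits ECC corrected, or -1
 * if the page was uncorrectable (too late: that data is gone).          */
int nand_read_page_ecc(uint32_t block, uint32_t page, uint8_t *buf);

/* Hypothetical helper: copy the block's content elsewhere and mark the
 * old block bad in the bad block table.                                 */
bool move_block_and_retire(uint32_t block);

/* Read path with scrubbing: correctable errors creeping toward the ECC
 * limit trigger migration while the data is still fully recoverable.    */
bool read_with_scrub(uint32_t block, uint32_t page, uint8_t *buf)
{
    int corrected = nand_read_page_ecc(block, page, buf);
    if (corrected < 0)
        return false;                 /* uncorrectable: the gamble lost */
    if (corrected >= SCRUB_THRESHOLD)
        move_block_and_retire(block); /* creeping failure caught early  */
    return true;
}
```

The margin between SCRUB_THRESHOLD and ECC_STRENGTH is what lets migration happen while the data is still fully intact.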
But then again, to make it very clear: this approach is not 100% safe. It's a compromise between feasibility and reliability.
You will see yield problems. Unless it's life-threatening technology, you're probably better off accepting them than trying to cure them.