Hamming ECC

bitrex 8 years ago

Do you use Hamming, Reed-Solomon, or similar codes to check for data errors when writing and reading e.g. large battery-backed SRAMs or Flash memory that are used for non-volatile storage of user data for significant periods of time (a couple of weeks)? Or do you find single-bit errors in RAM reads not to be much of an issue in practice?

I'm considering writing the external memory in pages of, say, 8 bytes with 8 nibbles of Hamming code and 1 byte of parity bits, for single-bit correction and 2-bit detection. In this application a single-bit error would not be life-threatening or anything, but it would be significantly annoying for the user; the data isn't intrinsically fault-tolerant to some degree the way, say, image or audio data is.
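A minimal sketch of one way to read the layout described above, in Python for illustration only (the real target would presumably be C on a microcontroller). It assumes each data byte gets its own nibble of Hamming check bits, i.e. a Hamming(12,8) code, plus one overall parity bit, so the 8 parity bits of a page fill the extra byte and a page occupies 8 + 4 + 1 = 13 bytes. The function names and the bit layout are hypothetical, not taken from the post.

```python
# Codeword positions 1..12; powers of two (1, 2, 4, 8) hold the check bits,
# the remaining eight positions hold the data bits (assumed layout).
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]


def _check_nibble(byte):
    """Check bit k is the XOR of the data bits whose position has bit k set."""
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (byte >> i) & 1:
            nibble ^= pos
    return nibble


def _parity(byte, nibble):
    """Parity over the 8 data bits and the 4 check bits."""
    return (bin(byte).count("1") + bin(nibble).count("1")) & 1


def encode(byte):
    """Return (check_nibble, overall_parity_bit) for one data byte."""
    nibble = _check_nibble(byte)
    return nibble, _parity(byte, nibble)


def decode(byte, nibble, overall):
    """Return (status, byte); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = nibble ^ _check_nibble(byte)
    parity_ok = _parity(byte, nibble) == overall
    if syndrome == 0 and parity_ok:
        return "ok", byte
    if parity_ok or syndrome > 12:
        # Even number of flipped bits, or a syndrome that names no position.
        return "uncorrectable", byte
    if syndrome in DATA_POSITIONS:
        byte ^= 1 << DATA_POSITIONS.index(syndrome)
    # Otherwise the flipped bit was a check bit or the overall parity bit,
    # so the data byte itself is intact.
    return "corrected", byte
```

For example, `decode(b ^ 0x10, *encode(b))` gives back `("corrected", b)`, while flipping two data bits is reported as `"uncorrectable"` rather than being silently "fixed".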

Martin Brown 8 years ago

Provided that you are not in a hot environment full of alpha particles or cosmic rays, I'd say the risk is vanishingly small. All bets are off if your bus timings are out of spec, but apart from that such errors are rare.

Parity error detection so that you know there is a problem with a particular byte might be good enough.

Regards, Martin Brown

David Brown 8 years ago

For configuration data and the like, I usually have a CRC checksum and dual copies of the data. The most likely failure mode is a power-off or reset while writing one copy of the data, rather than corruption of the memory, but it would handle that too.
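A minimal sketch of the CRC-plus-two-copies idea, in Python for illustration only. The framing (a sequence number and a length in front of the payload, CRC-32 at the end) and the names `pack_copy`, `unpack_copy` and `load` are assumptions made for this example, not details given in the post; the sequence number is just one way to tell which valid copy is newer.

```python
import struct
import zlib


def pack_copy(payload: bytes, seq: int) -> bytes:
    """Frame one copy as: sequence number, length, payload, CRC-32."""
    body = struct.pack("<IH", seq, len(payload)) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_copy(blob: bytes):
    """Return (seq, payload) if the copy is intact, else None."""
    if len(blob) < 10:  # shorter than header plus CRC
        return None
    body = blob[:-4]
    (stored_crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != stored_crc:
        return None
    seq, length = struct.unpack("<IH", body[:6])
    if length != len(body) - 6:
        return None
    return seq, body[6:]


def load(copy_a: bytes, copy_b: bytes):
    """Return the payload of the newest intact copy, or None if both are bad."""
    valid = [c for c in (unpack_copy(copy_a), unpack_copy(copy_b)) if c is not None]
    if not valid:
        return None
    return max(valid, key=lambda c: c[0])[1]
```

The writer would update one copy completely before touching the other, so a reset in the middle of a write leaves at least one copy that still passes its CRC.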

It is a different matter if you have NAND flash - there bit-level failures are to be expected and you will need some sort of ECC.

If you really expect problems, or have high-value data, then you may want more than one flash chip in case of total failure. I've done that on a board for extreme environments - there were several flash chips and data was stored in RAID-5 stripes (with checksums on each block). I didn't see any stripe failures in practice.

Clive Arthur 8 years ago

In an embedded system, I used a NAND flash memory which had 2 'spare' bytes for every 64 'normal' bytes. I think this may be quite a common arrangement. The uC had a built-in CRC generator, so it was easy to write a 16-bit CRC for every 64 data bytes. As well as detecting errors, this allows you to correct a 1-bit error: not a brilliant performance, but easy to do.

Cheers

Clive
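A minimal sketch of how a 16-bit CRC over 64 bytes can be used to correct a single flipped bit, in Python for illustration only. It assumes the CCITT polynomial 0x1021 as provided by `binascii.crc_hqx` (the post does not say which polynomial the uC's generator used) and simply tries each bit in turn; a real implementation might use a syndrome lookup table instead. The function names are hypothetical.

```python
import binascii


def crc16(data: bytes) -> int:
    """CRC-16 with polynomial 0x1021 and initial value 0xFFFF (assumed)."""
    return binascii.crc_hqx(data, 0xFFFF)


def check_and_repair(data: bytes, stored_crc: int):
    """Return (status, data) for a data block and the CRC stored with it."""
    diff = crc16(data) ^ stored_crc
    if diff == 0:
        return "ok", data
    if diff & (diff - 1) == 0:
        # Exactly one bit differs: the flipped bit is in the stored CRC,
        # so the data itself is taken as good.
        return "crc_bit_error", data
    trial = bytearray(data)
    for bit in range(len(data) * 8):
        trial[bit // 8] ^= 1 << (bit % 8)   # flip one candidate bit
        if crc16(trial) == stored_crc:
            return "corrected", bytes(trial)
        trial[bit // 8] ^= 1 << (bit % 8)   # no match, flip it back
    return "uncorrectable", data
```

This only works because every single-bit error in a block this short produces a distinct CRC difference. With three or more flipped bits the search can land on a wrong "correction", so the result is only as trustworthy as the assumption that errors are single-bit.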

bitrex 8 years ago

Thanks, yeah that's the most realistic situation where errors would pop up in this application, "burst errors" where a page write doesn't complete due to power loss and you have half the page correct and half garbage.

I was thinking one could do like a quasi-"journaling filesystem" that combines strategies but doesn't require storing two complete independent copies of the data. Start with two flag bits in a table in the external SRAM for each memory page and unset them both at the start of a write. In a buffer in internal RAM calculate parity bytes for the data, call the data + parity bytes a page, calculate CRC-16, and write out the CRC to a page header. Set flag bit one for the page. Then write data and parity bytes, set flag bit two when done.

When you come back to read, if the first flag is unset it means that a write to that page was interrupted during the parity/CRC generation process and the page contains whatever it held before. If bit 1 is set but bit 2 isn't, it means the write was interrupted during the page write-out and the page is probably garbage, so there is no point in attempting error correction. If they're both set but the CRC doesn't match, then something else went wrong and you can attempt error correction using the parity bytes.

gnuarm.deletethisbit 8 years ago

You need to give this more thought. Your conclusions about the states of the flags are not correct. Work through your process as a list of steps. See what the states of the flags are at every point in the steps.

Rick C.
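One way to do that walk-through is to simulate a power loss after each step of the proposed two-flag write and record what is left in external memory. The sketch below is in Python for illustration only and rests on assumptions: both flags are cleared in a single write, each step either completes or does not happen at all (a page write cut off halfway is not modelled), and "old"/"new" stand in for the real contents. It is one reading of the scheme, not necessarily the flaw Rick has in mind.

```python
def run_write(store, steps_completed):
    """Apply the first `steps_completed` external-memory steps of one write."""
    steps = [
        lambda: store.update(flag1=0, flag2=0),    # 1. clear both flags
        lambda: store.update(header_crc="new"),    # 2. write CRC to page header
        lambda: store.update(flag1=1),             # 3. set flag 1
        lambda: store.update(page="new"),          # 4. write data + parity bytes
        lambda: store.update(flag2=1),             # 5. set flag 2
    ]
    for step in steps[:steps_completed]:
        step()
    return store


def trace():
    """Return (steps done, flag1, flag2, header CRC, page) for each cut-off point."""
    rows = []
    for n in range(6):
        # Start from a page that was written successfully last time.
        store = dict(flag1=1, flag2=1, header_crc="old", page="old")
        run_write(store, n)
        rows.append((n, store["flag1"], store["flag2"],
                     store["header_crc"], store["page"]))
    return rows


if __name__ == "__main__":
    for row in trace():
        print(row)
```

In this model the flag pair (0, 0) covers two different situations: the header still holds the old CRC, or it already holds the new CRC while the page still holds the old data, in which case the old page can no longer be verified. Likewise (1, 0) covers a page that is still entirely old as well as one that is entirely new, not only a half-written one.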

bitrex 8 years ago

I'm guessing I've simply invented a wrong version of something that's already been invented.

bitrex 8 years ago

The flaw is likely that there's no reasonable way to guarantee the setting of the "flag bits" is itself an atomic operation.

bitrex 8 years ago

To start with the simpler one-flag-bit case it seems hard to get wrong. To begin a write you load the appropriate flag table byte that holds the flag bit you want to zero into say a processor register, flip the bit to zero, write it back out. If power is lost at any point during that process prior to the commit (e.g. WE being de-asserted after the address and data registers are latched on a parallel SRAM) then nothing has changed in the SRAM and the associated data page should have whatever it had originally.

If processor power is lost during the page write-out, then the associated flag bit is still zero and it will be assumed the page write-out did not complete. If power is lost while flipping the flag bit back to 1 after a page write, it's the reverse of the first paragraph: it will be assumed that the write-out did not complete when it actually did. That sucks, but it seems like a fairly unlikely edge case.

+++ATH0 8 years ago

Problems arise if you have any sort of multi-tasking, because then you need to have semaphores/mutexes around all this stuff, and you probably need to make the whole thing asynchronous to avoid deadlocks.

Journalling file systems end up writing a complete copy of the changed data to the journal, then to the file system, then updating the file system metadata and even then it's not 100% guaranteed safe because the final metadata update can be corrupted too. It's just a smaller window at risk.

You need a full copy of the data which is being changed in the journal so you can recover from interrupted partial page writes to the main file system. Simply knowing it failed or no longer has the correct checksum isn't much use.

bitrex 8 years ago

The situation I'm thinking of is a real-time application where battery-backed SRAM is used to store sequences of user-entered data for some purpose, say a list of pulse lengths. User punches "1234567", it shows up on the screen, user hits "commit", and at that point it's stored to a list in "mass storage."

If power is lost during some part of the write sequence after the "commit" button is pushed, and I know that the write failed, then on restart I can at least wipe that block of memory and not add the entry to the table the user sees, so they know it didn't take.

Rob 8 years ago

When that is the usage, it is easy. The usual situation is that the SRAM is used to store configuration data that is valid as a complete set. It has to be checked if valid (e.g. CRC) and if not valid it should revert to the previous state or back to defaults, depending on the specs. That is where the 2 copies potentially come in.

What you have is just a serial storage of data and it requires only a CRC per record. ECC is not much use as bit errors in SRAM are rare, and the cause of errors more likely is power failure (both during the write and during powerdown when the battery runs out), and doing ECC potentially makes things worse ("correcting" bad data to seemingly valid good data that actually is trash).
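A minimal sketch of "a CRC per record" for that kind of serial storage, in Python for illustration only. The record framing (a 2-byte length, the payload, then a CRC-32) and the names `append_record` and `scan` are assumptions for this example; the thread does not specify a format.

```python
import struct
import zlib


def append_record(log: bytearray, payload: bytes) -> None:
    """Append one record: 2-byte length, payload, CRC-32 over both."""
    body = struct.pack("<H", len(payload)) + payload
    log.extend(body + struct.pack("<I", zlib.crc32(body)))


def scan(log: bytes):
    """Walk the log from the start; stop at the first bad or incomplete record.

    Returns (records, valid_length), where valid_length is the number of
    bytes covered by intact records.
    """
    records, pos = [], 0
    while pos + 2 <= len(log):
        (length,) = struct.unpack_from("<H", log, pos)
        end = pos + 2 + length + 4
        if end > len(log):
            break  # record runs past the end: write was cut short
        body = bytes(log[pos:pos + 2 + length])
        (stored_crc,) = struct.unpack_from("<I", log, pos + 2 + length)
        if zlib.crc32(body) != stored_crc:
            break  # contents do not match the CRC: treat as not written
        records.append(body[2:])
        pos = end
    return records, pos
```

On restart the firmware would call `scan`, show only the returned records to the user, and wipe everything from `valid_length` onward, so an interrupted "commit" simply never appears in the list.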
