Library for save an events log in Flash

- P
- pozz
  
  Opzioni di contatto per gli utenti registrati
pubblicata
3 settimane fa

Thu, Apr 18, 2024 2:38 PM

The request is very common: when some interested events occur in the system, they, with the related timestamp, must be saved in a log. The log must be saved in an external SPI Flash connected to a MCU. The log has a maximum number of events. After the log is completely filled, the new event overwrite the oldest event.

I tried to implement such library, but I eventually found it's not a simple task, mostly if you want a reliable library that works even when errors occur (for example, when one writing fails).

I started by assigning 5 sectors of SPI Flash to the log. There are 256 events in a sector (the event is a fixed struct). In this scenario, the maximum log size is 1024 events, because the 5th sector can be erased when the pointer reaches the end of a sector.

The first challenge is how the system can understand what is the first (newest) event in the log at startup. I solved saving a 16-bits counter ID to each event. The initialization routine starts reading all the IDs and taks the greatest as the last event. However initially the log is empty, so all the IDs are 0xFFFF, the maximum. One solution is to stop reading events when 0xFFFF is read and wrap-around ID at 0xFFFE and not 0xFFFF.

However there's another problem. What happens after writing 65535 events in the log? The ID restarts from 0, so the latest event hasn't the greatest ID anymore.

These are the saved IDs after 65536 events:

1^ SECT 2^ SECT 3^ SECT 4^ SECT 5^SECT----------> 0xFB00 ... 0xFC00 ... 0xFD00 ... 0xFE00 ... 0xFF00 ... 0xFFFF

The rule "newest event has greatest ID" is correct yet. Now a new event is written:

1^ SECT-------> 2^ SECT 3^ SECT 4^ SECT 5^SECT---------> 0x0000 0xFB01.. 0xFC00 .. 0xFD00 .. 0xFE00 .. 0xFF00 .. 0xFFFF

Now the rule doesn't work anymore. The solution I found is to detect discontinuity. The IDs are consecutive, so the initialization routine continues reading while the ID(n+1)=ID(n)+1. When there's a gap, the init function stops and found the ID and position of the newest event.

But he problems don't stop here. What happens if an event write failed? Should I verify the real written values, by reading back them and comparing with original data? In a reliable system, yes, I should.

I was thinking to protect an event with a CRC, so to understand at any time if an event is corrupted (bad written). However the possibility to have isolated corrupted events increase the complexity of the task a lot.

Suppose to write the 4th event with ID=3 at position 4 in the sector. Now suppose the write failed. I can try to re-write the same event at poisition 5. Should I use ID=4 or ID=5? At first, I think it's better to use ID=5. The CRC should detect that at position 4 is stored a corrupted event. After that two events are written as well, ID=6 and 7.

Now the higher application wants to read the 4th event of the log (starting from the newest). We know that the newest is ID=7, so the 4th event is ID=7-4+1=4. However the event with ID=4 is corrupted, so we should check the previous ID=3... and so on. Now we can't have a mathematical function that returns the position of an event starting from it's ordinal number from the newest event.

Eventually I think it's better to use a public domain library, if it exists.

- D
- David Brown
  
  Opzioni di contatto per gli utenti registrati
Vote on answer
pubblicata
3 settimane fa

Thu, Apr 18, 2024 7:30 PM

Start at 0xfffe, and count down. Or xor with 0xffff for storage. Or wrap at 0xfffe, as you suggest. Or use 32-bit values. Or have another way to indicate that the log entry is valid. Or, since you have a timestamp, there's no need to track an ID - the timestamp will be increasing monotonically.

Make your counts from 0 to 256*5 - 1, then wrap. Log entry "n" will be at address n * sizeof(log entry), with up to 256 log entries blank. Then you don't need to store a number at all.

An 8-bit or 16-bit CRC is peanuts to calculate and check. Write the whole log entry except for a byte or two (whatever is the minimum write size for the flashes), which must be something other than 0xff's. Check the entry after writing if you really want, then write the final byte or two. Then if there's a power-fail in the middle, it's obvious that the log entry is bad.

For SPI NOR flash, the risk of a bad write - other than through unexpected resets or power fails - is negligible for most purposes. Don't over-complicate things worrying about something that will never happen.

- P
- pozz
  
  Opzioni di contatto per gli utenti registrati
Vote on answer
pubblicata
3 settimane fa

Thu, Apr 18, 2024 8:36 PM

And what to do when the counter reaches zero? It can wrap-around up to

0xfffe (that is very similar to an increasing counter from 0x0000-0xFFFE).

I will add a CRC for each entry and that can be used to validate the event. An empty/erased slot filled with 0xFF will not pass CRC validation.

I don't want to use timestamps for two reasons:

- the system wall clock can be changed (the system is isolated)

- the library I'm writing doesn't know the content of "events", for it the event is an opaque sequence of bytes.

What do you mean with log entry "0"? Is it the oldest or the newest? I think the oldest, because that formula is imho correct in this case.

However it doesn't appear correct when the log has rotated, that happens after writing 5x256+1 events. In this case the newest entry ("n"=1024) is at address 0, not n*sizeof(entry).

I know. Here the increased complexity wasn't related to the CRC calculation, but to the possibility to have isolated corrupted slots in the buffer. Taking into account these corrupted slots isn't so simple for me.

I don't understand why to write in two steps. I think I can write all the bytes of an entry, including CRC. After writing, I can read back all the bytes and check integrity.

Indeed unexpected resets, but mainly power failes, are possible. Rare, but possible.

- D
- David Brown
  
  Opzioni di contatto per gli utenti registrati
Vote on answer
pubblicata
3 settimane fa

Fri, Apr 19, 2024 8:28 AM

How big are your log entries? How many entries are realistic in the lifetime of the system?

That's usually fine.

OK.

(I misread your "5 sectors of an SPI flash chip" as "5 SPI flash chips" when first replying. It makes no real difference to what I wrote, but I might have used "chip" instead of "sector".)

You have 256 entries per flash sector, and 5 flash sectors. For the log entry number "n" - where "n" is an abstract count that never wraps, your index "i" into the flash array is (n % 5*256). The sector number is then (i / 256), and the index into the sector is (i % 256). The position in the log is determined directly by the entry number, and you don't actually need to store it anywhere.

Think of this a different way - dispense with the log entry numbers entirely. When you start up, scan the flash to find the next free slot. You do this by looking at slot 0 first. If that is not empty, keep scanning until you find a free slot - that's the next free slot. If slot 0 is empty, scan until you have non-empty slots, then keep going until you get a free one again, and that's the next free slot. If you never find a used slot, or fail to find a free slot after the non-free slots, then your first free slot is slot 0.

Any new logs are then put in this slot, moving forward. If you need to read out old logs, move backwards. When storing new logs, as you are nearing the end of a flash sector (how near depends on the sector erase time and how often events can occur), start the erase of the next sector in line.

Think how such corruption could happen, and its consequences. For most event logs, it is simply not going to occur in the lifetime of working products - and if it does, as an isolated error in an event log, it doesn't matter significantly. Errors in a sensibly designed SPI NOR flash system would be an indication of serious hardware problems such as erratic power supplies, and then the log is the least of your concerns.

The only thing to consider is a reset or power failure in the middle of writing a log event.

No, you can't. Or to be more accurate, you /can/, but it leaves you open to the risk of undetectable errors in the event of a power fail in the middle of writing. An 8-bit CRC has a 1 in 256 chance of passing if data is partially written.

If you have enough capacitance on the board to ensure that any write completes, and detection of failure of external power so that you never start a log write after power has failed, then it is probably safe to do the write as a single action.

Yes.

In my experience, these things are simple enough to implement, and specific enough to the application and target, that I don't see the need of a library.

- P
- pozz
  
  Opzioni di contatto per gli utenti registrati
Vote on answer
pubblicata
3 settimane fa

Fri, Apr 19, 2024 2:58 PM

16 bytes each entry. It's difficult to reach 0xFFFF events in the log, but why limit our fantasy? :-D

With 20 events per day, the log will be filled after 9 years. It's a very long life, but I think I have solved the problem to understand if the ID was wrapped-around.

Yes, it is what I already do. However I disagree on the formula.

The higher-layer application requests log entry 0. What is it? The newest event. My eventlog library should convert 0 to the slot index in the Flash, that is directly related to the Flash addres (I don't really need the number of sector here).

If the log is empty, event 0 for the application is slot 0 for the eventlog library. However, if there are three events in the log, event 0 for the application is slot 2 for the eventlog library.

The application doesn't know anything about what I named the ID of the event. It's just a number used by the lower-layer eventlog module to find the first free slot at startup.

Yes, I agree with you.

I can use crc16. We are talking of power-fail/reset in the middle of event writing, this is possible, but rare. I think the case a crc16 reporting a valid slot when the writing was wrong or incomplete is extremely rare.

No I can't know in advance when the power will fail.