Linux embedded: how to avoid corruption on power off - Page 2



Re: Linux embedded: how to avoid corruption on power off
On Fri, 16 Jun 2017 23:40:21 -0700, Don Y wrote:


No software can be guaranteed to work correctly in the face of
byzantine failures.  Failing "gracefully" - for some definition - even
if possible, still is failing.


The erase "block" size != write "page" size of SSDs is a known problem.

A DBMS can't address this by itself in software: "huge" VMM pages
[sometimes] are good for in-memory performance - but for reliable i/o,
huge file blocks *suck* both for performance and for space efficiency.

A professional DBMS hosting databases on SSD requires the SSDs to be
battery/supercap backed so power can't fail during a write, and also
that multiple SSDs be configured in a RAID ... not for speed, but for
increased reliability.

Caching on SSD is not really an issue, because if the "fast copy" is
unavailable the DBMS can go back to the [presumably slower] primary
store.  But *hosting* databases completely on SSD is a problem.


My issue with these statements is that they are misleading, and that
in the sense where they are true, the problem can only be handled at
the system level by use of additional hardware - there's no way it can
be addressed locally, entirely in software.

No DBMS will seek into the middle of a file block and try to write 40
bytes.  Storage always is block oriented.

It's *true* that in your example above, e.g., updating the address
field will result in rewriting the entire file block (or blocks if
spanning) that contains the target data.
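A minimal sketch of that read-modify-write cycle (the 4 KB block size, names, and offsets here are illustrative assumptions, not any DBMS's internals):

```python
BLOCK_SIZE = 4096  # assumed file-system block size

def update_field(storage: bytearray, offset: int, new_bytes: bytes) -> None:
    """Read-modify-write: the entire block containing `offset` is rewritten,
    even though only len(new_bytes) bytes logically change."""
    start = (offset // BLOCK_SIZE) * BLOCK_SIZE
    block = bytearray(storage[start:start + BLOCK_SIZE])   # read whole block
    rel = offset - start
    block[rel:rel + len(new_bytes)] = new_bytes            # patch the field
    storage[start:start + BLOCK_SIZE] = block              # write whole block back

store = bytearray(2 * BLOCK_SIZE)
update_field(store, 5000, b"123 Main St")   # an 11-byte logical change
```

Only 11 bytes change logically, but a full 4096-byte block is read and written back.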

But it's *misleading* to say that you need to know, e.g., where is the
name field relative to the address field because the name field might
be corrupted by updating the address field.  It's true, but irrelevant
because the DBMS deals with that possibility automatically.

A proper DBMS always will work with a dynamic copy of the target file
block (the reason it should be run from RAM instead of entirely from
r/w Flash).  The journal (WAL) records the original block(s)
containing the record(s) to be changed, and the modifications made to
them.  If the write to stable storage fails, the journal allows
recovering either the original data or the modified data.

The journal always is written prior to modifying the stable store.  If
the journal writes fail, the write to the stable copy never will be
attempted: a "journal crash" is a halting error.

A DBMS run without journaling enabled is unsafe.
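The journal-first discipline above can be sketched roughly as follows (a toy format: `TinyWAL` and its record layout are invented for illustration, not any real DBMS's on-disk format):

```python
import json, os, tempfile

class TinyWAL:
    """Toy write-ahead journal.  The journal record -- original and modified
    block images -- is forced to media BEFORE the stable copy is touched,
    so either image is recoverable after a failed write."""
    def __init__(self, data_path: str, wal_path: str):
        self.data_path, self.wal_path = data_path, wal_path

    def write_block(self, blk: int, old: bytes, new: bytes) -> None:
        rec = json.dumps({"blk": blk, "old": old.hex(), "new": new.hex()})
        with open(self.wal_path, "a") as w:        # 1. journal first...
            w.write(rec + "\n")
            w.flush(); os.fsync(w.fileno())        # ...and force it to media
        with open(self.data_path, "r+b") as d:     # 2. only then the stable store
            d.seek(blk * len(new)); d.write(new)
            d.flush(); os.fsync(d.fileno())

    def recover(self) -> None:
        """Redo pass: reapply the 'new' image of every journaled write."""
        with open(self.wal_path) as w, open(self.data_path, "r+b") as d:
            for line in w:
                rec = json.loads(line)
                new = bytes.fromhex(rec["new"])
                d.seek(rec["blk"] * len(new)); d.write(new)

# Demo: journal a block write, simulate a torn data write, then recover.
data = tempfile.NamedTemporaryFile(delete=False); data.write(b"\x00" * 32); data.close()
walf = tempfile.NamedTemporaryFile(delete=False); walf.close()
wal = TinyWAL(data.name, walf.name)
wal.write_block(1, b"\x00" * 8, b"NEWDATA!")
with open(data.name, "r+b") as f:          # "power fails": block 1 is garbage
    f.seek(8); f.write(b"\xde\xad\xbe\xef" * 2)
wal.recover()                              # the journal restores the new image
```

Note that if the journal write itself fails, the stable store is never touched, which is exactly the "journal crash is a halting error" rule above.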

The longwinded point is that the DBMS *expects* that any file block it
tries to change may be corrupted during i/o, and it takes steps to
protect against losing data because of that.

But SSDs - even when working (more or less) properly - introduce a
failure mode where updating a single "file block" (SSD page) drops an
atomic bomb in the middle of the file system, with fallout affecting
other, possibly unrelated, "file blocks" (pages) as well.

It's like: what should the flight computer do if the wings fall off?
There's absolutely nothing it can do, so the developer of the flight
software should not waste time worrying about it.  It's a *system*
level issue.


Block based i/o is not the issue.  The issue is that SSDs do the
equivalent of rewriting a whole platter track to change a single sector.


Byzantine failure.

The duration of the "window of vulnerability" is not the issue.  The
issue is the unpredictability of the result.

DBMS were designed at a time when disks were unreliable and operating
systems [if even present] were primitive.  Early DBMS often included
their own device code and took direct control of disk and tape devices
so that they could guarantee operation.


We've had this conversation previously also: database terminology
today is almost universally misunderstood and misused by everyone.

The file(s) on the disk are not the "database" but merely a point in
time snapshot of the database.

The "database" really is the historical evolution of the stable store.
To recover from a failure, you need a point-in-time basis, and the
journal from that instant to the point of failure.  If you have every
journal entry from the beginning, you can reconstruct the data just
prior to the failure starting from an empty basis.
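Reconstruction from an empty basis can be sketched as follows (the key/value operations are illustrative, not any DBMS's journal format):

```python
def reconstruct(journal):
    """Rebuild the data just prior to failure by replaying every journal
    entry, in order, against an empty basis."""
    db = {}                              # empty point-in-time basis
    for op, key, value in journal:
        if op == "put":
            db[key] = value              # insert or update a record
        elif op == "delete":
            db.pop(key, None)            # remove a record if present
    return db

journal = [("put", "name", "Smith"),
           ("put", "addr", "Oak St"),
           ("put", "addr", "Main St"),   # later entries supersede earlier ones
           ("delete", "name", None)]
state = reconstruct(journal)
```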

The point here being that you always need backups unless you can
afford to rebuild from scratch.


DBMS are designed to maintain logically consistent data ... the "I" in
"ACID" stands for "isolation", not for "integrity".

No DBMS can operate reliably if the underlying storage is faulty.


And recall that I warned you about the problems of trying to run a
reliable DBMS on an unattended appliance.  We didn't really discuss
the issue of SSDs per se, but we did discuss journaling (logging) and
trying to run the DBMS out of Flash rather than from RAM.


Re: Linux embedded: how to avoid corruption on power off
On 6/19/2017 6:42 PM, George Neuner wrote:

But that's the nature of the "power down problem" (the nature of the OP's
question)!  It requires special attention in hardware *and* software.
To think you can magically execute a sequence of commands and be
guaranteed NOT to have "corruption" is naive.


Exactly.  I suspect the OP isn't using "rust" (magnetic media) to store
his data (and other corruptible items).  So, he's using some nonvolatile
*semiconductor* medium.  Most probably FLASH.  Most probably NAND
Flash.  Most probably MLC NAND Flash.

(i.e., one of the more problematic media to use reliably in this application.)


No, you're still missing the point.   You need to know physically, on the
*medium*, where the actual cells holding the bits of data for these
"variables" (records, etc.) reside because "issues" that cause the
memory devices to be corrupted have ramifications based on chip
topography (geography).

I.e., if a cell that you ("you" being the file system and FTL layers WELL
BELOW the DBMS) *think* you are altering happens to be adjacent to some
other cell (which it will almost assuredly be), then that adjacent cell
can be corrupted by the malformed actions consequential to the power
transition putting the chip(s) in a compromised operating state.

E.g., you go to circle a name in a (deadtree) phone book and your
hand (or the book) shudders in the process (because you're feeling faint
and on the verge of passing out).  Can you guarantee that you will
circle the *name* that you intended?  Or, some nearby name?  Or,
maybe an address or phone number somewhere in that vicinity?

It doesn't matter that you double or triple checked the spelling of the
name to be sure you'd found the right one.  Or, that you deliberately
chose a very fine point pen to ensure you would ONLY circle the item of
interest (i.e., that your software has been robustly designed).  When
the actual time comes for the pen to touch the paper, if you're not
"fully operational", all bets are off.

I.e., some MECHANISM is needed (not software) that will block your hand
from marking the page if you are unsteady.

Absent that (or, in the presence of a poorly conceived mechanism),
you have no way of knowing *later*, when you've "recovered", if you
may have done some damage (corruption) during that event.  Indeed,
you may not even be aware that you were unsteady at the time!


But, as I noted above, you can't KNOW that the journal hasn't been
collaterally damaged (by your shaky hands).

In *normal* operation, writing (and to a lesser extent, READING) to
FLASH disturbs the data in nearby (i.e., NOT BEING ACCESSED) memory
cells.  When power and signals (levels and timing) are suspect
(i.e., as power is failing), this problem is magnified.


What if I corrupt two blocks at the same time -- two UNRELATED (by any
notion that *you*, the developer, can fathom) blocks.  ANY two that I
want.  Can you recover?  Can you even guarantee to KNOW that this
has happened?

I.e., some other table in the same tablespace has been whacked as a
consequence of this errant write.  A table that hasn't been written
in months (no record of the most recent changes in the journal/WAL).
Will you *know* that it has been whacked?  How?  WHEN??


Exactly.  You can't know -- nor predict -- which blocks/pages/cells
of the medium will be corrupted.  You probably won't even know which
of these were being *targeted* when the event occurred, let alone which
are affected by "collateral damage".

The whole point is that the system isn't operating as *intended*
(by the naive software developer) during these periods.  The hardware
and system designers have to provide guidance for THAT SPECIFIC SYSTEM
so the software developer knows what he can, can't and shouldn't do
as power failure approaches (along with recovery therefrom).

Early nonvolatile semiconductor memory (discounting WAROM) was
typically implemented as BBSRAM.  It was often protected by gating the
write line with a "POWER_OK" signal.  Obvious, right?  Power failing
should block writes!

But, that led to data being corrupted -- because the POWER_OK
(write inhibit) signal was asynchronous with the memory cycle.
So, a write could be prematurely terminated and corrupt the
data that was intended to be written leading to different outcomes:
- old data remains
- new data overwrites
- bogus data results
But, it tended to be just *the* location that was addressed
(unless the write inhibit happened too late in the power loss
sequence).
Moving to bigger blocks of memory, BBDRAM replaced the BBSRAM,
DRAM requiring less power per bit to operate (sustain).

A bit more complicated to implement as the refresh controller
has to remain active in the absence of power.  The flaw, here,
would be failing to synchronize the "inhibit" and potentially
aborting a RAS or CAS -- and clobbering an entire *row* in the
device (leaving it with unpredictable contents).

SRAM is now bigger *and* lower power -- and folks understand the
need to synchronously protect accesses.  So, it's trivial to design
a (large!) block of BBSRAM that operates on lower power.  As you
can't "synchronize with the future", it's easier to just give an
early warning to the processor (e.g., NMI) and have it deliberately
toggle a "protect memory" latch thereafter KNOWING that it shouldn't
even bother trying to write to that memory!
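A sketch of that protect-latch discipline (class and method names here are invented for illustration):

```python
class ProtectedBBSRAM:
    """The early-warning NMI handler sets the latch deliberately, so later
    writes are refused cleanly instead of being chopped off mid-cycle."""
    def __init__(self, size: int):
        self.mem = bytearray(size)
        self.protect = False            # the "protect memory" latch

    def nmi_power_warning(self) -> None:
        self.protect = True             # toggled by the NMI handler

    def write(self, addr: int, value: int) -> bool:
        if self.protect:
            return False                # software KNOWS not to even try
        self.mem[addr] = value
        return True

ram = ProtectedBBSRAM(16)
ok_before = ram.write(0, 0x42)          # accepted while power is good
ram.nmi_power_warning()                 # early warning arrives
ok_after = ram.write(1, 0x99)           # refused; stored contents stay intact
```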

But FLASH devices (esp. SSDs and "memory cards") have progressed to the
point where they have their own controllers, etc. on board.  So,
from the outside, you can neither tell where (physically) a
particular write will affect the contents of the chip(s) packaged
within, nor can you know for sure what is happening (at a signal
level) inside the device.

So, how can you know how far in advance to stop writing?
How can you know, for sure, that your last write will actually
manage to end up being committed to the memory chip(s) within
the device (what if its controller encounters a write error
and opts to retry the write on a different block of memory,
adjusting its bookkeeping in the process)?

You do all the power calculations assuming the bulk capacity
in your power supply is at the *low* end of its rating -- for
the current temperature -- and assume your electronics are using
the *maximum* amount of power (including the memory card!)
and predict how much "up time" you have before the voltage(s)
in the system fall out of spec.  Then, back that off by some
amount of derating to make it easier to sleep at night.
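That calculation can be sketched as follows; the component values below are illustrative assumptions, not from any real design:

```python
def holdup_time_s(c_farads: float, v_start: float, v_min: float,
                  power_w: float, derate: float = 0.8) -> float:
    """Worst-case hold-up time for a bulk capacitor discharging at constant
    power: t = C * (V0^2 - Vmin^2) / (2 * P), then derated for margin.
    Use the cap's low-end tolerance for C and the maximum load for P."""
    t = c_farads * (v_start**2 - v_min**2) / (2.0 * power_w)
    return t * derate

# 4700 uF nominal at -20% tolerance -> 3760 uF effective,
# 12 V rail, 10.8 V out-of-spec threshold, 6 W worst-case load:
t = holdup_time_s(3760e-6, v_start=12.0, v_min=10.8, power_w=6.0)
```

With these assumed numbers the derated answer is only a few milliseconds, which is why last-gasp writes to a medium with an unknowable internal write duration are so risky.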


And, the OP is the system designer, as far as we are concerned.
Or, at least the *conduit* from USENET to that designer!


The salient point in the above is that a write is TWO operations:
erase followed by write.  And, depending on the controller and the
state of wear in the actual underlying medium, possibly some
housekeeping as blocks are remapped.

The issue is that there is a window of time in which the operation is
"in progress".  But, in a VULNERABLE STATE!
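The two-phase update and its vulnerable window can be sketched as follows (the 4-byte page and these names are illustrative):

```python
class FlashPage:
    """A flash 'write' is really two operations: erase (page -> all 0xFF),
    then program.  Power loss in between leaves neither old nor new data."""
    def __init__(self):
        self.data = bytes(4)            # old contents (all zeros)

    def update(self, new: bytes, power_fails_after_erase: bool = False) -> None:
        self.data = bytes([0xFF] * 4)   # erase: vulnerable window opens
        if power_fails_after_erase:
            return                      # interrupted mid-operation
        self.data = new                 # program: window closes

good = FlashPage(); good.update(b"\x01\x02\x03\x04")
torn = FlashPage(); torn.update(b"\x01\x02\x03\x04", power_fails_after_erase=True)
```

The torn page holds neither the old data nor the new, and a real controller's remapping/housekeeping only widens that window.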

If I issue a write to a magnetic disk, the "process" begins the
moment that I issue the write.  But, there are lots of delays
built into that process (rotational delay, access delay, etc.).
So, the actual window of vulnerability is very small:  the interval
when the heads are actually positioned over the correct portion of
the medium to alter the magnetic domains therein.

And, if this event is "interfered with", the consequences are
confined to that portion of the medium -- not some other track
or platter or sector.

That's not the case with the current types of semiconductor
nonvolatile memory.  The "window of vulnerability" extends
throughout the duration of the write operation (erase, write,
internal verify and possible remapping/rewriting).


Of course the size of the window is important!  The software can't do
*squat* while the operation is in progress.  It can't decide that
it doesn't *really* want to do the write, please restore the previous
contents of that memory (block!).

And, the software can do nothing about the power remaining in
the power supply's bulk filter.  It's like skidding on black ice
and just *hoping* things come to a graceful conclusion BEFORE
you slam into the guardrail!


How can I "accidentally" alter block 0 of a mounted tape when we're
at EOT (or any other place physically removed from block 0)?

A semiconductor memory can alter ANYTHING at any time!  Signal
levels inside the die determine which rows are strobed.  It's
possible for NO row to be strobed, two rows, 5 rows, etc. -- the
decoders are only designed to generate "unique" outputs when
they are operating within their specified parameters.

Let Vcc sag, ground bounce, signals shift in amplitude/offset/timing
and you can't predict how they will affect the CHARGE stored in the
memory cells.


This is c.a.e.  Do you really think the OP has a second copy of the
data set hiding on the medium?  And, that it is somehow magically
protected from the sorts of corruption described, here?


Exactly.  And, there are no commands/opcodes that the OP can execute
that will "avoid corruption on power off".  If there were, the DBMS
would employ them and make that guarantee REGARDLESS OF STORAGE MEDIUM!



The firmware in SSD's has to address all types of potential users
and deployments.

I don't.  I have ONE application that is accessing the nonvolatile
memory pool so I can tailor the design of that store to fit the
needs and expected usage of its one "client".  I.e., ensure that
the hardware behaves as the DBMS expects it to.

The bigger problem is addressing applications that screw up their
own datasets.  There is NOTHING that I can do to solve that
problem -- even hiring someone to babysit the system 24/7/365.
A buggy application is a buggy application.  Fix it.

I *can* ensure ApplicationA can't mess with ApplicationB's
dataset(s).  I *can* put triggers and admittance criteria
on data going *into* the tables (to try to intercept
stuff that doesn't pass the smell test).  But, a "determined"
application can still write bogus data to the objects to
which it is granted access.

Just like it could write bogus data in raw "files".

Re: Linux embedded: how to avoid corruption on power off
On 6/19/2017 8:36 PM, Don Y wrote:

I can't afford (thermal budget) to power up yet another server
to access my literature archive (did I mention it is hot, here?
119F, today).  But, some Google-ing turns up a few practical
references.


Remember, even WITHIN a PCB, conditions on and in each chip can differ
from moment-to-moment due to the reactive nature of the traces and
dynamics of power consumption "around" the board.  So, just because
power is "good" at your "power supervisory circuit" doesn't mean
it's good throughout (and through-in?) the circuit.

Re: Linux embedded: how to avoid corruption on power off
On Monday, June 19, 2017 at 9:42:53 PM UTC-4, George Neuner wrote:
Just a small example:
A good DBMS can be surprisingly good at handling corruption.
I read a report, around the time of the Oracle 7 release, of a system
using Oracle that ran reliably for a significant time with bad
RAM on the motherboard.  Their DBMS architecture impressed me and
that report really sold me on its reliability.


Re: Linux embedded: how to avoid corruption on power off
On Fri, 16 Jun 2017 12:10:26 +0200, pozz wrote:


Google "raspberry pi ups".
There are several options if you don't want to roll your own.

Republic of Texas

Re: Linux embedded: how to avoid corruption on power off
On 2017-06-16 at 12:10, pozz wrote:

I think that in this matter, the most reliable solution is a UPS.  For
example, in our RB300 we are using a UPS based on supercaps.  The
microcontroller monitors the power supply voltage and controls the
system shutdown in case of a power failure.  More details:

Re: Linux embedded: how to avoid corruption on power off
On 20/06/2017 09:34, Krzysztof Kajstura wrote:

Does the UPS supply only the CM, so at low voltage?  How long are the
supercaps able to correctly supply the CM after cutting the input voltage?

Re: Linux embedded: how to avoid corruption on power off

The UPS is 5V, so it supplies all devices with 5V, 3.3V and 1.8V power
supply voltages.  The Compute Module is properly powered for about 140
seconds, without devices connected to the USB ports.  In the case of high
energy requirements by USB-connected devices, it is possible to unmount
the devices and turn off the USB power supply (this can be controlled by
GPIO).
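As a back-of-envelope check, the capacitance needed for such a hold-up time can be estimated as below; the load, voltages, and converter efficiency are illustrative assumptions, not the RB300's actual design values:

```python
def supercap_needed_f(runtime_s: float, v_start: float, v_min: float,
                      power_w: float, eff: float = 0.9) -> float:
    """Capacitance needed to ride through `runtime_s` at constant power,
    drawn through a regulator of efficiency `eff`:
        C = 2 * P * t / (eff * (V0^2 - Vmin^2))"""
    return 2.0 * power_w * runtime_s / (eff * (v_start**2 - v_min**2))

# e.g. a 2 W load, supercap bank discharging from 5.0 V down to a
# 3.0 V converter cutoff, for 140 s of hold-up:
c = supercap_needed_f(runtime_s=140.0, v_start=5.0, v_min=3.0, power_w=2.0)
```

With these assumed numbers the answer lands in the tens of farads, i.e. firmly supercap (not electrolytic) territory.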

Re: Linux embedded: how to avoid corruption on power off

In a Linux-based product from another part of the corporation I work  
for, when power loss was detected, all non-essential services were  
killed, and all in-flight data was written out to a log that could be  
replayed on system startup. This was powered by a supercapacitor  
dimensioned to last some small number of seconds. However, this was a  
custom, bare-bones distribution running only their own software, and the  
filesystem was mounted read-only. IIRC the log used dedicated storage.  
My understanding is that this scheme worked well for them.

