Posted to uk.d-i-y
Subject: UPS server wiring no-no

In article ,
The Natural Philosopher writes:
> You don't understand what it does at a brute hardware level, enough for
> me to realise that there's not much point in me actually trying to
> explain; nevertheless I will (try):

> 1/. It is physically impossible for a disk to write to two sectors
> simultaneously,


Correct.

> so a data + metadata transaction is never actually
> written as an atomic action.


Wrong (see below).

> 2/. ZFS simply writes new data to an unused sector and when it's
> complete,


Correct.

> it then updates the file metadata to point to it.


Wrong: it writes the metadata to a new block too. It never _updates_
anything; it only ever writes to free blocks.
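
To make that concrete, here's a toy model of the never-overwrite rule
(a sketch of mine, not ZFS source; the block map and the names are
invented for illustration):

    # Simulated disk: block number -> contents.
    disk = {}
    next_free = 0

    def alloc(contents):
        """Write contents to a free block; live blocks are never reused."""
        global next_free
        disk[next_free] = contents
        next_free += 1
        return next_free - 1

    old_data = alloc("old contents")
    old_meta = alloc({"file.txt": old_data})  # metadata points at the data

    # The application "overwrites" the file: the data AND the metadata
    # each go to a new block; the originals are not touched.
    new_data = alloc("new contents")
    new_meta = alloc({"file.txt": new_data})

    assert disk[old_data] == "old contents"
    assert disk[old_meta] == {"file.txt": old_data}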

> 3/. A crash during the new sector write means 'file as it was' + a
> corrupted sector no one will ever see because it's still 'spare'.


Correct.

> 4/. A crash during a metadata update won't destroy the actual data, but
> will totally **** up the entire file, IF it's beyond the checksumming
> to sort out.


Wrong. The metadata is unchanged, in exactly the same way the file
is unchanged in the case above.

> 5/. IF and ONLY IF the correct write data is in NVRAM, because the disk
> itself has that facility, then it will indeed be able to say 'ah well,
> that's what that sector should have been' and correct it.


ZFS needs no NVRAM for correctness, so assume there isn't any,
and I'll skip past your following assumptions of how NVRAM works,
which are not correct...

> 6/. If it crashes while writing to NVRAM, with luck the whole
> transaction will be lost entirely.
>
> HOWEVER
> (a) this merely preserves transactions as either fully complete, or not
> complete. That's fine for a database app that uses the system in that
> way. Not fine for systems that may use the file system in other ways.
>
> (b) It CRITICALLY depends on the disk having NVRAM.
>
> (c) and the controller accurately reporting the state of the disk.
>
> So in the end it is down to HARDWARE to make sure the 'atomic
> transaction' metaphor is actually preserved (and hardware is NEVER
> directly accessed by the OS anyway, so the myth that 'hardware does
> what you tell it' is just that, a myth. With LUCK it will do what you
> tell it; sometimes it doesn't).





>> That's what it means to be a transactional filesystem.

> No, it isn't. That's what the glossy sales brochures tell you.
>
> You haven't understood at all.
>
> To get to that level - and it is a good level - requires a bit more
> than a random motherboard and disk controller coupled to random disks.
>
> The disks have to be equipped with NVRAM, and the controller has to not
> say 'done it' just because it's passed the data to the disk; it has to
> relay a proper 'write completed' signal from the disk back to the OS.


OK, let's explain in more detail...

I think you do understand what a transactional filesystem is.
You do understand that it is physically impossible for a disk
to write to two sectors simultaneously (and that applies across
multiple disks in an array too). So I presume you can follow on
from that and understand that committing a transaction has to
depend on something atomic such as a single write, and that's
exactly how ZFS works.

As I said before, ZFS only ever writes to free disk space - it
never updates any block that's in use. So if an application does
update a block in a file, ZFS writes that updated block to a free
block on disk, and doesn't touch the original disk block that
file was using. Then the file metadata is updated, and exactly
the same happens - the metadata is written out to a free block,
and the original is untouched. A transaction is built up
entirely in free space which contains all the changes which
need to be applied to the disk. All the changes associated with
any one operation (a write to a file, a file create, a file
delete, a file rename, etc) are contained in the same transaction.
Any power loss (or other cause of unexpected interruption to the
system) will result in the filesystem appearing to be unchanged,
as this is all still just in free blocks, and so the filesystem
is still completely consistent. When the disk(s) have written back
all these changed blocks, ZFS finally writes back the uberblock
(its name for the superblock) which points to the new metadata,
and hence all the updates. This is an atomic single sector write,
which thus commits the transaction. It either happens, or it
doesn't happen. It can't half happen, so there's no window when
the filesystem is inconsistent - it's always consistent on disk,
so it's always safe from ZFS's point of view to power off the
system at any time.
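
As a toy model of that commit sequence (my own sketch, not ZFS code;
"uberblock" below is just a variable standing in for that single-sector
write):

    disk = {}            # simulated disk: block number -> contents
    uberblock = None     # the only thing ever written in place
    next_free = 0

    def alloc(contents):
        # Every write - data or metadata - goes to a free block.
        global next_free
        disk[next_free] = contents
        next_free += 1
        return next_free - 1

    def overwrite_file(contents, crash_before_commit=False):
        global uberblock
        # Build the whole transaction in free space first...
        data = alloc(contents)
        root = alloc({"file.txt": data})
        if crash_before_commit:
            return           # power lost: the live tree was never touched
        uberblock = root     # ...then commit with one atomic write

    def read_file():
        return disk[disk[uberblock]["file.txt"]]

    overwrite_file("v1")                            # initial state
    overwrite_file("v2", crash_before_commit=True)  # interrupted transaction
    assert read_file() == "v1"                      # still consistent
    overwrite_file("v2")                            # completed transaction
    assert read_file() == "v2"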

I've skipped a ton of detail (I usually teach this as a half
day class), but that should be enough to understand the main
principles.

> And NONE of this will help an app that is doing something in a
> non-transactional way either.


I already said that. If someone writes an app wrongly, all bets
are off. In the Enterprise space (where I work), most developers
know how to do this right though - it's key to large sectors of
industry. However, even if the developer did it all wrong, you
still won't ever get a corrupted filesystem.
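
For what it's worth, the usual way a correctly-written app gets the same
either-old-or-new guarantee for its own data on any POSIX filesystem is
the standard write-fsync-rename pattern (not something from this thread,
and the file name is made up):

    import os

    def atomic_replace(path, new_contents):
        # Write the new version to a temporary file first...
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(new_contents)
            f.flush()
            os.fsync(f.fileno())  # push the data through to the disk
        # ...then atomically swap it in. A crash leaves either the old
        # file or the new one, never a half-written mixture.
        os.rename(tmp, path)

    atomic_replace("settings.conf", "timeout=30\n")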

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]