Re: Ordering guarantee inside a single bio?

From: "Valdis Klētnieks" <valdis.kletnieks@vt.edu>
To: 오준택 <na94jun@gmail.com>
Cc: Lukas Straub <lukasstraub2@web.de>, kernelnewbies@kernelnewbies.org
Subject: Re: Ordering guarantee inside a single bio?
Date: Wed, 29 Jan 2020 15:28:37 -0500
Message-ID: <47803.1580329717@turing-police>
In-Reply-To: <CAFyvkd2yYqt=izCg+kyRRv2U=azDiAyGLPVuUWzpjGCUy8aY=w@mail.gmail.com>

On Tue, 28 Jan 2020 13:50:56 +0900, 오준택 said:

(Lukas - there's stuff for you further down...)

> If you write checksum for some data, ordering between checksum and data is
> not needed.

Actually, it is.

> When the crash occurs, we just recalculate checksum with data and compare
> the recalculated one with a written one.

It's required because the read that hits a checksum/data mismatch may come
weeks, months, or even years after the crash.  You don't have any history to
go on, *only* the data as found and the two checksums.

You can't safely just recalculate the checksum, because that's the whole *point*
of the checksum - to detect that something has gone wrong.   And if it's the data
that has gone wrong, just recalculating the checksum is the exact wrong thing
to do.

Failing the read with -EIO, and not touching the data or checksums, is the
proper thing to do.
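
To make that concrete, here is a minimal userspace sketch in Python (not kernel
code, and not how dm-integrity or Lukas's target does it). The name read_block
and the use of CRC32 are assumptions made purely for illustration; the point is
only that a mismatch is reported and nothing is "repaired".

```python
import errno
import zlib


def read_block(data: bytes, stored_csum: int) -> bytes:
    """Return data only if it matches the stored checksum (hypothetical helper).

    CRC32 stands in for whatever checksum the real target uses.
    """
    if zlib.crc32(data) != stored_csum:
        # Mismatch: fail the read. Do NOT recalculate or rewrite anything,
        # since we cannot tell whether the data or the checksum is wrong.
        raise OSError(errno.EIO, "checksum mismatch")
    return data
```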

> Even though checksum is written first, the recalculated checksum will be
> different with the written checksum because data is not written.

You missed an important point.  If you read the block and the checksum and they
don't match, you don't know if the checksum is wrong because it's stale, or if
the data has been corrupted.

That's part of why there's 2 checksums, one before and one after the data block.
That way, if the two checksums match each other but not the data, you know that
something has corrupted the data.  If the two checksums don't match, it gets more
interesting:

If the first one matches the data and the second doesn't, then either the second
one has gotten corrupted, or the system died between writing the data and the
second checksum.  But that's OK, because the first checksum says the data update
did succeed, so simply patching the second checksum is OK.

If the first one doesn't match and the second one *does*, then either the system
died between updating the first checksum and writing the data, or the first one
is corrupted - and you don't have a good way to distinguish between them unless
you have timestamps.

If neither checksum matches the data, then you're pretty sure the system died
between the first checksum and finishing the data write.
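
The cases above can be written down as a small decision function. This is only
a sketch of the reasoning in this mail, assuming CRC32 as the checksum; the
function name and the returned labels are made up for illustration.

```python
import zlib


def classify(csum_before: int, data: bytes, csum_after: int) -> str:
    """Diagnose a checksum|data|checksum record (hypothetical layout)."""
    actual = zlib.crc32(data)
    first_ok = csum_before == actual
    second_ok = csum_after == actual

    if first_ok and second_ok:
        return "consistent"
    if first_ok:
        # Data update succeeded; second checksum is stale or corrupted.
        return "patch-second-checksum"
    if second_ok:
        # Crash before the data write, or a corrupted first checksum:
        # not distinguishable without something like timestamps.
        return "ambiguous-first-checksum"
    if csum_before == csum_after:
        # Both checksums agree with each other but not with the data.
        return "data-corrupted"
    # Neither matches the data and they disagree with each other.
    return "crash-during-data-write"
```

Only the "patch-second-checksum" case is safe to fix up automatically; the
others should still end in -EIO for the reader.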

Questions for Lukas:

First off, see my comment about -EIO.  Do you have plans for an ioctl or
other way for userspace to get the two checksums so diagnostic programs
can do better error diagnosis/recovery?

If I understand what you're doing, each 4096 (or whatever) block will actually
take (4096 + 2* checksum size) bytes, which means each logical consecutive
block will be offset from the start of a physical block by some amount.   This
effectively means that you are guaranteed one read-modify-write and possibly
two, for each write. (The other alternative is to devote an entire block to
each checksum, but that triples the size and at that point you may as well just
do a 2+1 raidset)
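
A quick way to see the misalignment is to compute which physical sectors each
record touches. The sketch below assumes records are packed back to back as
checksum + data + checksum, with a 4-byte checksum and 4096-byte physical
sectors; those numbers and the function name are assumptions, not something
taken from the actual patch.

```python
def physical_span(logical_block, block_size=4096, csum_size=4, sector_size=4096):
    """Return (first_sector, last_sector, partial_sectors) for one record.

    A partially covered sector has to be read before it can be written
    back, so partial_sectors is the number of read-modify-write cycles.
    """
    record = block_size + 2 * csum_size   # bytes actually stored per block
    start = logical_block * record        # byte offset of the record
    end = start + record                  # exclusive end offset
    first = start // sector_size
    last = (end - 1) // sector_size

    partial = set()
    if start % sector_size:
        partial.add(first)                # head of record starts mid-sector
    if end % sector_size:
        partial.add(last)                 # tail of record ends mid-sector
    return first, last, len(partial)
```

With these assumed sizes, block 0 starts aligned and needs one RMW (for its
tail), while block 1 starts 8 bytes into a sector and needs two.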

Even if your hardware is willing to do the RMW cycle in hardware, that still
hits you for at least one rotational latency, and possibly two.  If you have to
do the RMW in software, it gets a *lot* more painful (and actually *ensuring*
atomic writes gets more challenging).   At that point, are you still gaining
performance over the current dm-integrity scheme?

(There's also a lot more ugly that happens on high-end storage devices, where
your logical device is actually an 8+2 RAID6 LUN striped across 10 volumes - even
a single 4K write is guaranteed to be a RMW, and you need to do a 32K write to
make it really be a write.)
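
As a rough illustration of where the 32K comes from: with 8 data volumes and
an assumed 4K chunk per volume, a full stripe is 8 * 4K = 32K, and only writes
that cover whole stripes avoid reading old data or parity first. The function
below is a sketch under that assumption; real arrays may use other chunk sizes.

```python
def needs_rmw(offset, length, data_disks=8, chunk=4096):
    """True if a write does not cover whole stripes (hypothetical geometry)."""
    stripe = data_disks * chunk           # 32768 bytes for 8 x 4K
    # Anything not aligned to, and a multiple of, the full stripe forces
    # the array to read old data/parity before it can update parity.
    return offset % stripe != 0 or length % stripe != 0


print(needs_rmw(0, 4096))    # single 4K write
print(needs_rmw(0, 32768))   # aligned full-stripe write
```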

IBM's GPFS, SGI's CXFS, and probably other high-end file systems as well, go
another level of crazy in order to get high performance - you end up striping
the filesystem across 4 or 8 LUNs, so you want a logical blocksize that gets
you 4 or 8 times the 32K that each LUN wants to see.
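
The arithmetic is simple enough to spell out. Assuming the same 8 data volumes
with 4K chunks per LUN as above (an assumption, not a property of GPFS or
CXFS), the block size that keeps every LUN doing full-stripe writes is:

```python
def preferred_block_size(luns, data_disks=8, chunk=4096):
    """Filesystem block size that is one full stripe on every LUN."""
    return luns * data_disks * chunk


for luns in (4, 8):
    size = preferred_block_size(luns)
    print(f"{luns} LUNs: {size // 1024}K")
```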

At which point the storage admin is ready to shoot the end user who writes a
program that does 1K writes, causing your throughput to fall through the
floor. Been there, done that, it gets ugly quickly... :)
