Kernel Newbies archive on lore.kernel.org
* Ordering guarantee inside a single bio?
@ 2020-01-26 12:07 Lukas Straub
  2020-01-27 17:27 ` Valdis Klētnieks
  0 siblings, 1 reply; 7+ messages in thread
From: Lukas Straub @ 2020-01-26 12:07 UTC (permalink / raw)
  To: kernelnewbies

Hello Everyone,
I am planning to write a new device-mapper target and I'm wondering if there is an ordering guarantee for the operations inside a single bio? For example, if I issue a write bio to sector 0 of length 4, is it guaranteed that sector 0 is written first and sector 3 is written last?

Regards,
Lukas Straub

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@kernelnewbies.org
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-26 12:07 Ordering guarantee inside a single bio? Lukas Straub
@ 2020-01-27 17:27 ` Valdis Klētnieks
  2020-01-27 18:22   ` Lukas Straub
  0 siblings, 1 reply; 7+ messages in thread
From: Valdis Klētnieks @ 2020-01-27 17:27 UTC (permalink / raw)
  To: Lukas Straub; +Cc: kernelnewbies

On Sun, 26 Jan 2020 13:07:38 +0100, Lukas Straub said:

> I am planning to write a new device-mapper target and I'm wondering if there
> is an ordering guarantee for the operations inside a single bio? For example, if I
> issue a write bio to sector 0 of length 4, is it guaranteed that sector 0 is
> written first and sector 3 is written last?

I'll bite.  What are you doing where the order of writing out a single bio matters?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-27 17:27 ` Valdis Klētnieks
@ 2020-01-27 18:22   ` Lukas Straub
  2020-01-28  4:50     ` 오준택
  0 siblings, 1 reply; 7+ messages in thread
From: Lukas Straub @ 2020-01-27 18:22 UTC (permalink / raw)
  To: Valdis Klētnieks; +Cc: kernelnewbies

On Mon, 27 Jan 2020 12:27:58 -0500
"Valdis Klētnieks" <valdis.kletnieks@vt.edu> wrote:

> On Sun, 26 Jan 2020 13:07:38 +0100, Lukas Straub said:
> 
> > I am planning to write a new device-mapper target and I'm wondering if there
> > is an ordering guarantee for the operations inside a single bio? For example, if I
> > issue a write bio to sector 0 of length 4, is it guaranteed that sector 0 is
> > written first and sector 3 is written last?
> 
> I'll bite.  What are you doing where the order of writing out a single bio matters?

I plan to improve the performance of dm-integrity on HDDs by removing the requirement for a bitmap or journal (which causes head seeks even for sequential writes). I also want to avoid cache flushes and FUA. The problem with dm-integrity is that the data and checksum updates need to be atomic.
So I came up with the following idea:

The on-disk layout will look like this:
|csum_next-01|data-chunk-01|csum_prev-01|csum_next-02|data-chunk-02|csum_prev-02|...

Under normal conditions, csum_next-01 (a single sector) contains the checksums for data-chunk-01, and csum_prev-01 is a duplicate of csum_next-01.

Updating data will first update csum_next (with FUA), then update the data (FUA), and finally update csum_prev (FUA).
But if there is an ordering guarantee, we have a fast path: if a full chunk of data is written, we simply issue a single big write containing csum_next, data and csum_prev, all without FUA (unless the incoming request asks for that).
So that's why I'm asking.
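A userspace sketch of the slow path above (Python, purely illustrative - not the device-mapper API; Chunk, update() and crash_after are made-up names modelling the three ordered FUA writes and a power loss between them):

```python
# Sketch of the ordered csum_next -> data -> csum_prev update protocol.
# crash_after=N simulates power loss after N of the three writes landed.
import hashlib

def csum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class Chunk:
    """One |csum_next|data-chunk|csum_prev| triple."""
    def __init__(self, data: bytes):
        self.csum_next = csum(data)
        self.data = data
        self.csum_prev = csum(data)

def update(chunk: Chunk, new_data: bytes, crash_after: int = 3) -> None:
    writes = [
        ("csum_next", csum(new_data)),
        ("data", new_data),
        ("csum_prev", csum(new_data)),
    ]
    for done, (field, value) in enumerate(writes):
        if done >= crash_after:
            return              # power loss: later writes never happen
        setattr(chunk, field, value)

def consistent(chunk: Chunk) -> bool:
    c = csum(chunk.data)
    return chunk.csum_next == c and chunk.csum_prev == c

chunk = Chunk(b"old")
update(chunk, b"new", crash_after=2)        # crash before csum_prev lands
print(consistent(chunk))                    # False: the tear is detectable
print(chunk.csum_next == csum(chunk.data))  # True: the data write completed
```

Because each step only starts after the previous FUA write completed, any crash leaves at most one checksum stale, which recovery can detect.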


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-27 18:22   ` Lukas Straub
@ 2020-01-28  4:50     ` 오준택
  2020-01-29 20:28       ` Valdis Klētnieks
  0 siblings, 1 reply; 7+ messages in thread
From: 오준택 @ 2020-01-28  4:50 UTC (permalink / raw)
  To: Lukas Straub; +Cc: Valdis Klētnieks, kernelnewbies

Hello,

As far as I know, there is no way to guarantee ordering between block
writes inside a bio.

That is why, in the JBD2 module, the bio for the journal commit block
write is separate from the bios for the other log block writes.
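A toy model of that JBD2-style separation (userspace Python, not the real jbd2 code; ToyDisk and commit_transaction are made-up names):

```python
# Because no ordering is guaranteed inside or between plain writes, the
# commit record goes in a *separate* write, issued only after all the
# log-block writes have completed.
class ToyDisk:
    """Pretend block device: write() returning models write completion."""
    def __init__(self):
        self.blocks = {}

    def write(self, blknr: int, data: bytes) -> None:
        self.blocks[blknr] = data   # returns only once the write is durable

def commit_transaction(disk: ToyDisk, log_blocks, commit_blknr: int) -> None:
    # Step 1: write all log blocks and wait for every write to complete.
    for blknr, data in log_blocks:
        disk.write(blknr, data)
    # Step 2: only now issue the commit record.  If we crash before this
    # point, recovery finds no commit block and discards the partial
    # transaction, so no ordering guarantee inside a single bio is needed.
    disk.write(commit_blknr, b"COMMIT")

disk = ToyDisk()
commit_transaction(disk, [(1, b"log-a"), (2, b"log-b")], 3)
print(sorted(disk.blocks))   # [1, 2, 3]
```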

And I think your idea can be optimized further.

If you write a checksum for some data, ordering between the checksum and
the data is not needed.

When a crash occurs, we just recalculate the checksum from the data and
compare it with the written one.

Even if the checksum is written first, the recalculated checksum will
differ from the written one because the data was not written.

So I think if you use a checksum, guaranteeing ordering is not needed.

This is the first time I have sent mail to the kernelnewbies mailing
list, so I apologize if I have done anything wrong here.

Thank you.

Joontaek Oh.

On Tue, 28 Jan 2020 at 03:23, Lukas Straub <lukasstraub2@web.de> wrote:

> On Mon, 27 Jan 2020 12:27:58 -0500
> "Valdis Klētnieks" <valdis.kletnieks@vt.edu> wrote:
>
> > On Sun, 26 Jan 2020 13:07:38 +0100, Lukas Straub said:
> >
> > > I am planning to write a new device-mapper target and I'm wondering if there
> > > is an ordering guarantee for the operations inside a single bio? For example, if I
> > > issue a write bio to sector 0 of length 4, is it guaranteed that sector 0 is
> > > written first and sector 3 is written last?
> >
> > I'll bite.  What are you doing where the order of writing out a single bio matters?
>
> I plan to improve the performance of dm-integrity on HDDs by removing the
> requirement for a bitmap or journal (which causes head seeks even for
> sequential writes). I also want to avoid cache flushes and FUA. The problem
> with dm-integrity is that the data and checksum updates need to be atomic.
> So I came up with the following idea:
>
> The on-disk layout will look like this:
>
> |csum_next-01|data-chunk-01|csum_prev-01|csum_next-02|data-chunk-02|csum_prev-02|...
>
> Under normal conditions, csum_next-01 (a single sector) contains the
> checksums for data-chunk-01, and csum_prev-01 is a duplicate of csum_next-01.
>
> Updating data will first update csum_next (with FUA), then update the data
> (FUA), and finally update csum_prev (FUA).
> But if there is an ordering guarantee, we have a fast path: if a full chunk
> of data is written, we simply issue a single big write containing csum_next,
> data and csum_prev, all without FUA (unless the incoming request asks for
> that).
> So that's why I'm asking.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-28  4:50     ` 오준택
@ 2020-01-29 20:28       ` Valdis Klētnieks
  2020-01-30 14:16         ` Lukas Straub
  0 siblings, 1 reply; 7+ messages in thread
From: Valdis Klētnieks @ 2020-01-29 20:28 UTC (permalink / raw)
  To: 오준택
  Cc: Lukas Straub, kernelnewbies


On Tue, 28 Jan 2020 13:50:56 +0900, 오준택 said:

(Lukas - there's stuff for you further down...)

> If you write a checksum for some data, ordering between the checksum and
> the data is not needed.

Actually, it is.

> When a crash occurs, we just recalculate the checksum from the data and
> compare it with the written one.

And it's required because the read of the data that gets a checksum-data mismatch
may be weeks, months, or even years after the crash happens.  You don't have any
history to go on, *only* the data as found and the two checksums.

You can't safely just recalculate the checksum, because that's the whole *point*
of the checksum - to detect that something has gone wrong.   And if it's the data
that has gone wrong, just recalculating the checksum is the exact wrong thing
to do.

Failing the read with a -EIO, and not touching the data or checksums is the proper thing to do.

> Even if the checksum is written first, the recalculated checksum will
> differ from the written one because the data was not written.

You missed an important point.  If you read the block and the checksum and they
don't match, you don't know if the checksum is wrong because it's stale, or if
the data has been corrupted.

That's part of why there are two checksums, one before and one after the data block.
That way, if the two checksums match each other but not the data, you know that
something has corrupted the data.  If the two checksums don't match each other,
it gets more interesting:

If the first one matches the data and the second doesn't, then either the second
one has gotten corrupted, or the system died between writing the data and the
second checksum.  But that's OK, because the first checksum says the data update
did succeed, so simply patching the second checksum is OK.

If the first one doesn't match and the second one *does*, then either the system died
between the first update and the data, or the first one is corrupted - and you don't
have a good way to distinguish between them unless you have timestamps.

If neither checksum matches the data, then you're pretty sure the system died
between the first checksum and finishing the data write.
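That decision table can be sketched as follows (userspace Python; csum() here is a stand-in crc32, not whatever checksum dm-integrity actually uses):

```python
# Recovery classification for the |csum_next|data|csum_prev| layout,
# following the case analysis above.
import zlib

def csum(data: bytes) -> int:
    return zlib.crc32(data)

def classify(csum_next: int, data: bytes, csum_prev: int) -> str:
    c = csum(data)
    first, second = (csum_next == c), (csum_prev == c)
    if first and second:
        return "clean"
    if first and not second:
        # Data update completed; patching csum_prev is safe.
        return "patch csum_prev"
    if second and not first:
        # Crash between csum_next and data, or csum_next corrupted:
        # indistinguishable without timestamps.
        return "ambiguous"
    # Neither matches: crash between csum_next and finishing the data
    # write, or the data itself is corrupt -> fail the read with -EIO.
    return "torn or corrupt data"

good, stale = csum(b"new data"), csum(b"old data")
print(classify(good, b"new data", good))    # clean
print(classify(good, b"new data", stale))   # patch csum_prev
print(classify(stale, b"new data", good))   # ambiguous
```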

Questions for Lukas:

First off, see my comment about -EIO.  Do you have plans for an ioctl or
other way for userspace to get the two checksums so diagnostic programs
can do better error diagnosis/recovery?

If I understand what you're doing, each 4096-byte (or whatever) block will actually
take (4096 + 2 * checksum size) bytes, which means each logically consecutive
block will be offset from the start of a physical block by some amount.   This
effectively means that you are guaranteed one read-modify-write, and possibly
two, for each write. (The other alternative is to devote an entire block to
each checksum, but that triples the size, and at that point you may as well just
do a 2+1 raidset.)

Even if your hardware is willing to do the RMW cycle in hardware, that still
hits you for at least one rotational latency, and possibly two.  If you have to
do the RMW in software, it gets a *lot* more painful (and actually *ensuring*
atomic writes gets more challenging).   At that point, are you still gaining
performance over the current dm-integrity scheme?
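A quick back-of-envelope for that inline-checksum layout (Python; the 8-byte checksum size is an assumption):

```python
# If each logical 4096-byte block is stored as (4096 + 2 * CSUM) bytes,
# logical blocks drift relative to physical 4096-byte blocks, so every
# logical write straddles physical blocks and forces a read-modify-write.
CSUM = 8                    # assumed per-block checksum size in bytes
PHYS = 4096                 # physical block size
STRIDE = PHYS + 2 * CSUM    # on-disk footprint of one logical block

for lba in range(3):
    start = lba * STRIDE
    touched = (start + STRIDE - 1) // PHYS - start // PHYS + 1
    print(f"logical block {lba}: starts {start % PHYS} bytes into a "
          f"physical block, touches {touched} physical blocks")
# Every logical block straddles two physical blocks, since its on-disk
# footprint (4112 bytes) is larger than one physical block.
```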

(There's also a lot more ugliness on high-end storage devices, where
your logical device is actually an 8+2 RAID6 LUN striped across 10 volumes - even a single
4K write is guaranteed to be an RMW, and you need to do a 32K write to make it
really be a write.

IBM's GPFS, SGI's CXFS, and probably other high-end file systems as well, go
another level of crazy in order to get high performance - you end up striping
the filesystem across 4 or 8 LUNs, so you want a logical blocksize that gets
you 4 or 8 times the 32K that each LUN wants to see.

At which point the storage admin is ready to shoot the end user who writes a
program that does 1K writes, causing your throughput to fall through the
floor... Been there, done that, it gets ugly quickly. :) )



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-29 20:28       ` Valdis Klētnieks
@ 2020-01-30 14:16         ` Lukas Straub
  2020-01-31  3:26           ` Valdis Klētnieks
  0 siblings, 1 reply; 7+ messages in thread
From: Lukas Straub @ 2020-01-30 14:16 UTC (permalink / raw)
  To: Valdis Klētnieks; +Cc: 오준택, kernelnewbies

On Wed, 29 Jan 2020 15:28:37 -0500
"Valdis Klētnieks" <valdis.kletnieks@vt.edu> wrote:

> [...]
> 
> Questions for Lukas:
> 
> First off, see my comment about -EIO.  Do you have plans for an ioctl or
> other way for userspace to get the two checksums so diagnostic programs
> can do better error diagnosis/recovery?

Not really, but since I will integrate it with the existing dm-integrity
infrastructure, it will support the recovery mode, which doesn't check the
checksums.

Recovery will more or less happen as you described it above.

> If I understand what you're doing, each 4096 (or whatever) block will actually
> take (4096 + 2* checksum size) bytes, which means each logical consecutive
> block will be offset from the start of a physical block by some amount.   This
> effectively means that you are guaranteed one read-modify-write and possibly
> two, for each write. (The other alternative is to devote an entire block to
> each checksum, but that triples the size and at that point you may as well just
> do a 2+1 raidset)

No, csum_next (and csum_prev) is a whole sector (i.e. a physical block)
containing all the checksums for the following chunk of data (which
spans multiple sectors), so it's pretty similar to the current dm-integrity
implementation, apart from the second checksum sector.
RMW is only needed for the checksum sectors, and because they don't take
up much space, they are easily cached in RAM, so in the best case only
the write is needed.
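Rough capacity math for that layout (Python sketch; the 8-byte per-sector checksum is an assumption - dm-integrity supports various tag sizes):

```python
# One 512-byte checksum sector covers the whole following data chunk, so
# the per-chunk space overhead of the two checksum copies stays small.
SECTOR = 512
CSUM = 8                             # assumed checksum bytes per data sector
per_sector = SECTOR // CSUM          # checksums that fit in one sector
chunk_bytes = per_sector * SECTOR    # data covered by one checksum sector
overhead = 2 * SECTOR / chunk_bytes  # csum_next + csum_prev

print(per_sector)                    # 64 data sectors per chunk
print(chunk_bytes // 1024)           # 32 (KiB per data chunk)
print(f"{overhead:.3%}")             # 3.125% space overhead
```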

Regarding the ordering guarantee, I have now gathered that the kernel
will happily split the bio if its size is not optimal for the hardware,
which means ordering is not guaranteed - right?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Ordering guarantee inside a single bio?
  2020-01-30 14:16         ` Lukas Straub
@ 2020-01-31  3:26           ` Valdis Klētnieks
  0 siblings, 0 replies; 7+ messages in thread
From: Valdis Klētnieks @ 2020-01-31  3:26 UTC (permalink / raw)
  To: Lukas Straub
  Cc: 오준택, kernelnewbies

On Thu, 30 Jan 2020 15:16:17 +0100, Lukas Straub said:

> No, csum_next (and csum_prev) is a whole sector (i.e. physical block)
> containing all the checksums for the following chunk of data (which
> spans multiple sectors)

Oh, OK.  That works too.. :)

> Regarding the ordering guarantee, I have now gathered that the kernel
> will happily split the bio if the size is not optimal for the hardware
> which means it's not guaranteed - right?

And if the kernel doesn't split it and reorder the chunks, the hardware will
happily do so - and lie to you the whole way.  Ever since the hardware people
realized they could get away with lying about turning off the writeback cache,
it's been more and more of a challenge to guarantee correct performance from
storage subsystems.



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-26 12:07 Ordering guarantee inside a single bio? Lukas Straub
2020-01-27 17:27 ` Valdis Klētnieks
2020-01-27 18:22   ` Lukas Straub
2020-01-28  4:50     ` 오준택
2020-01-29 20:28       ` Valdis Klētnieks
2020-01-30 14:16         ` Lukas Straub
2020-01-31  3:26           ` Valdis Klētnieks
