* loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-03-19 17:14 UTC
  To: linux-block

Hello all,

In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
it does not appear that lo_req_flush() does anything to make sure 
ki_complete has been called for pending work; it just calls vfs_fsync().
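
For reference, here is roughly what that path looks like (paraphrased from 
drivers/block/loop.c of that era; details may differ by kernel version):

  /* Paraphrased sketch of lo_req_flush(): the flush handler only
   * fsync()s the backing file; nothing here waits for previously
   * submitted async kiocbs to reach ->ki_complete(). */
  static int lo_req_flush(struct loop_device *lo, struct request *rq)
  {
          int ret = vfs_fsync(lo->lo_backing_file, 0);

          if (unlikely(ret && ret != -EINVAL))
                  ret = -EIO;
          return ret;
  }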

Is this a consistency problem?

For example, if the loop driver has queued async kiocb's to the filesystem 
via .read_iter/.write_iter, is it the filesystem's responsibility to 
complete that work before returning from vfs_fsync(), or is it possible that 
vfs_fsync() completes before ->ki_complete() is called?

Relatedly, does vfs_fsync() do anything for direct IO?  Ultimately 
f_op->fsync() is called, so maybe the FS is told to commit its structures, 
like sparse allocations, that may not be on disk yet.
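
The call chain is short (paraphrased from fs/sync.c; details trimmed and may 
vary by version):

  /* Paraphrased from fs/sync.c: vfs_fsync() forwards to the backing
   * filesystem's ->fsync() over the whole file range, so committing
   * FS structures such as sparse allocations is up to the FS. */
  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
  {
          if (!file->f_op->fsync)
                  return -EINVAL;
          return file->f_op->fsync(file, start, end, datasync);
  }

  int vfs_fsync(struct file *file, int datasync)
  {
          return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
  }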


--
Eric Wheeler


* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-03-21  0:21 UTC
  To: Eric Wheeler; +Cc: linux-block

On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> Hello all,
> 
> In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> it does not appear that lo_req_flush() does anything to make sure 
> ki_complete has been called for pending work; it just calls vfs_fsync().
> 
> Is this a consistency problem?

No. What the FLUSH command provides is just flushing the device-side cache
to the storage medium, so it has nothing to do with pending requests.


Thanks, 
Ming



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-03-23  6:14 UTC
  To: Ming Lei; +Cc: linux-block

On Mon, 21 Mar 2022, Ming Lei wrote:
> On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > Hello all,
> > 
> > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > it does not appear that lo_req_flush() does anything to make sure 
> > ki_complete has been called for pending work; it just calls vfs_fsync().
> > 
> > Is this a consistency problem?
> 
> No. What the FLUSH command provides is just flushing the device-side cache
> to the storage medium, so it has nothing to do with pending requests.

If a flush follows a series of writes, would it be best if the flush 
happened _after_ those writes complete?  Then the storage medium will 
be sure to flush what was intended to be written.

It seems that this series of events could lead to inconsistent data:
	loop		->	filesystem
	write a
	write b
	flush
				write a
				flush
				write b
				crash, b is lost

If write+flush ordering is _not_ important, then can you help me 
understand why?

-Eric



> 
> 
> Thanks, 
> Ming
> 
> 


* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-13 22:49 UTC
  To: Ming Lei; +Cc: linux-block

On Tue, 22 Mar 2022, Eric Wheeler wrote:
> On Mon, 21 Mar 2022, Ming Lei wrote:
> > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > Hello all,
> > > 
> > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > > it does not appear that lo_req_flush() does anything to make sure 
> > > ki_complete has been called for pending work; it just calls vfs_fsync().
> > > 
> > > Is this a consistency problem?
> > 
> > No. What the FLUSH command provides is just flushing the device-side cache
> > to the storage medium, so it has nothing to do with pending requests.
> 
> If a flush follows a series of writes, would it be best if the flush 
> happened _after_ those writes complete?  Then the storage medium will 
> be sure to flush what was intended to be written.
> 
> It seems that this series of events could lead to inconsistent data:
> 	loop		->	filesystem
> 	write a
> 	write b
> 	flush
> 				write a
> 				flush
> 				write b
> 				crash, b is lost
> 
> If write+flush ordering is _not_ important, then can you help me 
> understand why?
> 

Hi Ming, just checking in: did you see the message above?

Do you really mean to say that reordering writes around a flush is safe 
in the presence of a crash?


--
Eric Wheeler



> -Eric
> 
> 
> 
> > 
> > 
> > Thanks, 
> > Ming
> > 
> > 
> 


* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-14  0:15 UTC
  To: Eric Wheeler; +Cc: linux-block

On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > Hello all,
> > > > 
> > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > > > it does not appear that lo_req_flush() does anything to make sure 
> > > > ki_complete has been called for pending work; it just calls vfs_fsync().
> > > > 
> > > > Is this a consistency problem?
> > > 
> > > No. What the FLUSH command provides is just flushing the device-side cache
> > > to the storage medium, so it has nothing to do with pending requests.
> > 
> > If a flush follows a series of writes, would it be best if the flush 
> > happened _after_ those writes complete?  Then the storage medium will 
> > be sure to flush what was intended to be written.
> > 
> > It seems that this series of events could lead to inconsistent data:
> > 	loop		->	filesystem
> > 	write a
> > 	write b
> > 	flush
> > 				write a
> > 				flush
> > 				write b
> > 				crash, b is lost
> > 
> > If write+flush ordering is _not_ important, then can you help me 
> > understand why?
> > 
> 
> Hi Ming, just checking in: did you see the message above?
> 
> Do you really mean to say that reordering writes around a flush is safe 
> in the presence of a crash?

As I explained before, FLUSH doesn't provide any ordering guarantee; that
is supposed to be done by the upper layer, such as the FS.

Again, what FLUSH does is just to flush the device-side cache to the
medium, that is it.


Thanks,
Ming



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-14  0:28 UTC
  To: Eric Wheeler; +Cc: linux-block

On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > Hello all,
> > > > 
> > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > > > it does not appear that lo_req_flush() does anything to make sure 
> > > > ki_complete has been called for pending work; it just calls vfs_fsync().
> > > > 
> > > > Is this a consistency problem?
> > > 
> > > No. What the FLUSH command provides is just flushing the device-side cache
> > > to the storage medium, so it has nothing to do with pending requests.
> > 
> > If a flush follows a series of writes, would it be best if the flush 
> > happened _after_ those writes complete?  Then the storage medium will 
> > be sure to flush what was intended to be written.
> > 
> > It seems that this series of events could lead to inconsistent data:
> > 	loop		->	filesystem
> > 	write a
> > 	write b
> > 	flush
> > 				write a
> > 				flush
> > 				write b
> > 				crash, b is lost
> > 
> > If write+flush ordering is _not_ important, then can you help me 
> > understand why?
> > 
> 
> Hi Ming, just checking in: did you see the message above?
> 
> Do you really mean to say that reordering writes around a flush is safe 
> in the presence of a crash?

Sorry, replied too quickly.

BTW, what is the actual crash? Any dmesg log? From the above description, b is
just not yet flushed to storage when the flush runs, and sooner or later it will
land, so what is the real issue?


Thanks,
Ming



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-15  2:10 UTC
  To: Ming Lei; +Cc: linux-block

On Thu, 14 Apr 2022, Ming Lei wrote:
> On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> > On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > > Hello all,
> > > > > 
> > > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > > > > it does not appear that lo_req_flush() does anything to make sure 
> > > > > ki_complete has been called for pending work; it just calls vfs_fsync().
> > > > > 
> > > > > Is this a consistency problem?
> > > > 
> > > > No. What the FLUSH command provides is just flushing the device-side cache
> > > > to the storage medium, so it has nothing to do with pending requests.
> > > 
> > > If a flush follows a series of writes, would it be best if the flush 
> > > happened _after_ those writes complete?  Then the storage medium will 
> > > be sure to flush what was intended to be written.
> > > 
> > > It seems that this series of events could lead to inconsistent data:
> > > 	loop		->	filesystem
> > > 	write a
> > > 	write b
> > > 	flush
> > > 				write a
> > > 				flush
> > > 				write b
> > > 				crash, b is lost
> > > 
> > > If write+flush ordering is _not_ important, then can you help me 
> > > understand why?
> > > 
> > 
> > Hi Ming, just checking in: did you see the message above?
> > 
> > Do you really mean to say that reordering writes around a flush is safe 
> > in the presence of a crash?
> 
> Sorry, replied too quickly.

That's OK, thanks for considering the issue!
 
> BTW, what is the actual crash? Any dmesg log? From the above description, b is
> just not yet flushed to storage when the flush runs, and sooner or later it will
> land, so what is the real issue?

In this case "crash" is actually a filesystem snapshot using most any 
snapshot technology such as dm-thin, btrfs, bcachefs, or zfs.  From the 
filesystem inside the loop file's point of view, a snapshot is the same as 
a hardware crash.

Some background:

  We've already seen journal commit re-ordering caused by dm-crypt in 
  dm-thin snapshots:
	dm-thin -> dm-crypt -> [kvm with a disk using ext4]

  This is the original email about dm-crypt ordering:
	https://listman.redhat.com/archives/dm-devel/2016-September/msg00035.html 

  We "fixed" the dm-crypt issue by disabling parallel dm-crypt threads 
  with dm-crypt flags same_cpu_crypt+submit_from_crypt_cpus and haven't 
  seen the issue since. (It's noticeably slower, but I'll take consistency 
  over performance any day! Not sure if that old dm-crypt ordering has 
  been fixed.)

So back to considering loop devs:

Having seen the dm-crypt issue I would like to verify that loop isn't 
susceptible to the same issue in the presence of lower-level 
snapshots---but it looks like it could be since flushes don't enforce 
ordering.  Here is why:

In ext4/super.c:ext4_sync_fs(), the ext4 code calls 
blkdev_issue_flush(sb->s_bdev) when barriers are enabled (which is 
default).  blkdev_issue_flush() sets REQ_PREFLUSH and according to 
blk_types.h this is a "request for cache flush"; you could think of 
in-flight IO's on the way through loop.ko and into the hypervisor 
filesystem where the loop's backing file lives as being in a "cache" of 
sorts---especially for non-DIO loopdevs hitting the pagecache.
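
For reference, blkdev_issue_flush() itself is tiny (paraphrased from 
block/blk-flush.c; the exact signature has changed across kernel versions):

  /* Paraphrased from block/blk-flush.c: an empty REQ_PREFLUSH write,
   * submitted synchronously against the target block device. */
  int blkdev_issue_flush(struct block_device *bdev)
  {
          struct bio bio;

          bio_init(&bio, NULL, 0);
          bio_set_dev(&bio, bdev);
          bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
          return submit_bio_wait(&bio);
  }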

Thus, ext4 critically expects that all IOs preceding a flush will hit 
persistent storage before all future IOs.  Failing that, journal replay 
cannot return the filesystem to a consistent state without a `fsck`.  

(Note that ext4 is just an example, really any journaled filesystem is at 
risk.)

Let's say a virtual machine uses a loopback file for a disk and the VM 
issues the following to delete some file called "/b":

  unlink("/b"):
	write: journal commit: unlink /b
	flush: blkdev_issue_flush()
	write: update fs datastructures (remove /b from a dentry)
	<hypervisor snapshot>

If the flush happens out of order then an operation like unlink("/b")  
could look like this where the guest VM's filesystem is on the left and 
the hypervisor's loop filesystem operations are on the right:

  VM ext4 filesystem            |  Hypervisor loop dev ordering
--------------------------------+--------------------------------
write: journal commit: unlink /b
flush: blkdev_issue_flush()
write: update fs dentry's
                                queued to loop: [journal commit: unlink /b]
                                queued to loop: [update fs dentry's]
                                flush: vfs_fsync() - out of order
                                queued to ext4: [journal commit: unlink /b]
                                queued to ext4: [update fs dentry's]
                                write lands: [update fs dentry's]
                                <snapshot!>
                                write lands: [journal commit: unlink /b]
				
Notice that the last two "write lands" are out of order because the 
vfs_fsync() does not separate them as expected by the VM's ext4 
implementation.

Presently, loop.c will never re-order actual WRITE's: they will be 
submitted to loopdev's `file*` handle in-submit-order because the 
workqueue will keep them ordered.  This is good[*].

But, REQ_FLUSH is not a write:

The vfs_fsync() in lo_req_flush() is _not_ ordered by the write workqueue, and 
there exists no mechanism in loop.c to enforce completion of IOs submitted 
to the loopdev's `file*` handle prior to completing the vfs_fsync(), nor 
are subsequent IOs thereby required to complete after the flush.
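
To make the gap concrete, here is a hypothetical sketch of what a draining 
flush could look like.  This is not what loop.c does today; the inflight 
counter and all names are invented for illustration:

  /* HYPOTHETICAL: count async kiocbs in flight so that a flush can
   * drain them before fsync'ing the backing file. */
  static void lo_rw_aio_complete_draining(struct loop_device *lo)
  {
          /* ...existing ->ki_complete() work... */
          if (atomic_dec_and_test(&lo->inflight))      /* invented field */
                  wake_up(&lo->inflight_wq);           /* invented field */
  }

  static int lo_req_flush_draining(struct loop_device *lo)
  {
          /* drain: wait for every previously submitted kiocb to finish */
          wait_event(lo->inflight_wq, atomic_read(&lo->inflight) == 0);
          return vfs_fsync(lo->lo_backing_file, 0);
  }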

Thus, the hypervisor's snapshot-capable filesystem can re-order the last 
two writes because the flush happened early.

In the re-ordered case on the hypervisor side:

  If a snapshot happens after the dentry removal but _before_ the journal 
  commit, then a journal replay of the resulting snapshot will be 
  inconsistent.

Flush re-ordering creates an inconsistency in two possible cases:

   a. In the snapshot after dentry removal but before journal commit.
   b. Crash after dentry removal but before journal commit.

Really a snapshot looks like a crash to the filesystem, so (a) and (b) are 
equivalent but (a) is easier to reason about. In either case, mounting the 
out-of-order filesystem (snapshot if (a), origin if (b)) will present 
kernel errors in the VM when the dentry is read:

	kernel: EXT4-fs error (device dm-2): ext4_lookup:1441: inode 
	 #1196093: comm rsync: deleted inode referenced: 1188710
	[ https://listman.redhat.com/archives/dm-devel/2016-September/028121.html ]


Fixing flush ordering provides for two possible improvements:
  ([*] from above about write ordering)

1. Consistency, as above.

2. Performance.  Right now loopdev IOs are serialized by a single write 
   queue per loopdev.  Parallel work queues could be used to submit IOs in 
   parallel to the filesystem serving the loopdev's `file*` handle since 
   VMs may submit IOs from different CPU cores.  Special care would need 
   to be taken for the following cases:

     a. Ordering of dependent IOs (i.e., reads or writes of preceding 
        writes).
     b. Flushes need to wait for all workqueues to quiesce (see the sketch 
        after this list).

   W.r.t choosing the number of WQ's: Certainly not 1:1 CPU to workqueue 
   mapping because of the old WQ issue running out of pid's with lots of 
   CPUs, but here are some possibilities:

     a. 1:1 per socket
     b. User configurable as a module parameter
     c. Dedicated pool of workqueues for all loop devs that dispatch 
        queued IOs to the correct `file*` handle.  RCU might be useful.
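
A sketch of quiescing idea (b), with a hypothetical pool layout (the 
constant and the wq_pool field are invented):

  /* HYPOTHETICAL: with a pool of work queues per loop device, a flush
   * would first wait for every queue in the pool to go idle. */
  static int lo_req_flush_pooled(struct loop_device *lo)
  {
          int i;

          for (i = 0; i < LO_NR_WORKQUEUES; i++)   /* invented constant */
                  flush_workqueue(lo->wq_pool[i]); /* invented field */
          return vfs_fsync(lo->lo_backing_file, 0);
  }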


What next?

Ok, so assuming consistency is an issue, #1 is the priority and #2 is 
nice-to-have.  This might be the right time to consider these since 
there is so much discussion about loop.c on the list right now.

According to my understanding of the research above this appears to be an 
issue, but there are other kernel developers who know this code better than 
I do, and I want to know if this understanding is correct.

Should others be CC'ed on this topic?  If so, who?


--
Eric Wheeler



> 
> 
> Thanks,
> Ming
> 
> 


* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Ming Lei @ 2022-04-15 14:29 UTC
  To: Eric Wheeler; +Cc: linux-block, linux-ext4

On Thu, Apr 14, 2022 at 07:10:04PM -0700, Eric Wheeler wrote:
> On Thu, 14 Apr 2022, Ming Lei wrote:
> > On Wed, Apr 13, 2022 at 03:49:07PM -0700, Eric Wheeler wrote:
> > > On Tue, 22 Mar 2022, Eric Wheeler wrote:
> > > > On Mon, 21 Mar 2022, Ming Lei wrote:
> > > > > On Sat, Mar 19, 2022 at 10:14:29AM -0700, Eric Wheeler wrote:
> > > > > > Hello all,
> > > > > > 
> > > > > > In loop.c do_req_filebacked() for REQ_OP_FLUSH, lo_req_flush() is called: 
> > > > > > it does not appear that lo_req_flush() does anything to make sure 
> > > > > > ki_complete has been called for pending work; it just calls vfs_fsync().
> > > > > > 
> > > > > > Is this a consistency problem?
> > > > > 
> > > > > No. What the FLUSH command provides is just flushing the device-side cache
> > > > > to the storage medium, so it has nothing to do with pending requests.
> > > > 
> > > > If a flush follows a series of writes, would it be best if the flush 
> > > > happened _after_ those writes complete?  Then the storage medium will 
> > > > be sure to flush what was intended to be written.
> > > > 
> > > > It seems that this series of events could lead to inconsistent data:
> > > > 	loop		->	filesystem
> > > > 	write a
> > > > 	write b
> > > > 	flush
> > > > 				write a
> > > > 				flush
> > > > 				write b
> > > > 				crash, b is lost
> > > > 
> > > > If write+flush ordering is _not_ important, then can you help me 
> > > > understand why?
> > > > 
> > > 
> > > Hi Ming, just checking in: did you see the message above?
> > > 
> > > Do you really mean to say that reordering writes around a flush is safe 
> > > in the presence of a crash?
> > 
> > Sorry, replied too quickly.
> 
> That's OK, thanks for considering the issue!
>  
> > BTW, what is the actual crash? Any dmesg log? From the above description, b is
> > just not yet flushed to storage when the flush runs, and sooner or later it will
> > land, so what is the real issue?
> 
> In this case "crash" is actually a filesystem snapshot using most any 
> snapshot technology such as dm-thin, btrfs, bcachefs, or zfs.  From the 
> filesystem inside the loop file's point of view, a snapshot is the same as 
> a hardware crash.
> 
> Some background:
> 
>   We've already seen journal commit re-ordering caused by dm-crypt in 
>   dm-thin snapshots:
> 	dm-thin -> dm-crypt -> [kvm with a disk using ext4]
> 
>   This is the original email about dm-crypt ordering:
> 	https://listman.redhat.com/archives/dm-devel/2016-September/msg00035.html 
> 
>   We "fixed" the dm-crypt issue by disabling parallel dm-crypt threads 
>   with dm-crypt flags same_cpu_crypt+submit_from_crypt_cpus and haven't 
>   seen the issue since. (It's noticeably slower, but I'll take consistency 
>   over performance any day! Not sure if that old dm-crypt ordering has 
>   been fixed.)
> 
> So back to considering loop devs:
> 
> Having seen the dm-crypt issue I would like to verify that loop isn't 
> susceptible to the same issue in the presence of lower-level 
> snapshots---but it looks like it could be since flushes don't enforce 
> ordering.  Here is why:
> 
> In ext4/super.c:ext4_sync_fs(), the ext4 code calls 
> blkdev_issue_flush(sb->s_bdev) when barriers are enabled (which is 
> default).  blkdev_issue_flush() sets REQ_PREFLUSH and according to 
> blk_types.h this is a "request for cache flush"; you could think of 
> in-flight IO's on the way through loop.ko and into the hypervisor 
> filesystem where the loop's backing file lives as being in a "cache" of 
> sorts---especially for non-DIO loopdevs hitting the pagecache.
> 
> Thus, ext4 critically expects that all IOs preceding a flush will hit 
> persistent storage before all future IOs.  Failing that, journal replay 
> cannot return the filesystem to a consistent state without a `fsck`.  

If ext4 expects the following order, it is ext4's responsibility to
maintain that order; the block layer may re-order all these IOs at will,
so do not expect IOs to be issued to the device in submission order

1) IOs preceding the flush in 2)

2) flush

3) IOs after the flush

Even the device driver or the drive itself may re-order these IOs, as with AHCI/NCQ.

> 
> (Note that ext4 is just an example, really any journaled filesystem is at 
> risk.)
> 
> Let's say a virtual machine uses a loopback file for a disk and the VM 
> issues the following to delete some file called "/b":
> 
>   unlink("/b"):
> 	write: journal commit: unlink /b
> 	flush: blkdev_issue_flush()
> 	write: update fs datastructures (remove /b from a dentry)
> 	<hypervisor snapshot>
> 
> If the flush happens out of order then an operation like unlink("/b")  
> could look like this where the guest VM's filesystem is on the left and 
> the hypervisor's loop filesystem operations are on the right:
> 
>   VM ext4 filesystem            |  Hypervisor loop dev ordering
> --------------------------------+--------------------------------
> write: journal commit: unlink /b
> flush: blkdev_issue_flush()
> write: update fs dentry's
>                                 queued to loop: [journal commit: unlink /b]
>                                 queued to loop: [update fs dentry's]
>                                 flush: vfs_fsync() - out of order
>                                 queued to ext4: [journal commit: unlink /b]
>                                 queued to ext4: [update fs dentry's]
>                                 write lands: [update fs dentry's]
>                                 <snapshot!>
>                                 write lands: [journal commit: unlink /b]

If the VM's ext4 requires the above order, the ext4 FS code has to hold the
flush until write(unlink /b) is done, or do write(unlink /b) and flush in
one single command, and start write(update fs dentry's) only after the
flush is done.
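
A minimal sketch of that submit-wait pattern, with invented bio names (the 
waiting is the submitter's job; the block layer only provides the flush):

  /* Illustrative only: wait for each step to complete before issuing
   * the next, since the block layer preserves no submission order. */
  static int ordered_unlink_commit(struct block_device *bdev,
                                   struct bio *commit_bio,
                                   struct bio *dentry_bio)
  {
          int ret = submit_bio_wait(commit_bio);  /* 1) commit completes */

          if (ret)
                  return ret;
          ret = blkdev_issue_flush(bdev);         /* 2) cache hits medium */
          if (ret)
                  return ret;
          return submit_bio_wait(dentry_bio);     /* 3) only now the update */
  }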

Once ext4 submits IOs in this way, you will see that the issue shouldn't be
related to the hypervisor.

One issue I saw is the case of a block-layer snapshot with loop in
buffered-io mode. When a loop write is done, the data is just in the FS
page cache, which may be invisible to a block snapshot (dm-snap), and even
page writeback may be re-ordered. But if flush is used correctly, this way
still works fine.

> 				
> Notice that the last two "write lands" are out of order because the 
> vfs_fsync() does not separate them as expected by the VM's ext4 
> implementation.
> 
> Presently, loop.c will never re-order actual WRITE's: they will be 
> submitted to loopdev's `file*` handle in-submit-order because the 
> workqueue will keep them ordered.  This is good[*].

Again, do not expect the block layer to maintain IO order.

> 
> But, REQ_FLUSH is not a write:
> 
> The vfs_fsync() in lo_req_flush() is _not_ ordered by the write workqueue, and 
> there exists no mechanism in loop.c to enforce completion of IOs submitted 
> to the loopdev's `file*` handle prior to completing the vfs_fsync(), nor 
> are subsequent IOs thereby required to complete after the flush.

Yes, but flush is just to flush the device cache to the medium, and the
block layer doesn't maintain IO order; it is the responsibility of the
block layer's user (the FS) to maintain IO order.

> 
> Thus, the hypervisor's snapshot-capable filesystem can re-order the last 
> two writes because the flush happened early.
> 
> In the re-ordered case on the hypervisor side:
> 
>   If a snapshot happens after the dentry removal but _before_ the journal 
>   commit, then a journal replay of the resulting snapshot will be 
>   inconsistent.
> 
> Flush re-ordering creates an inconsistency in two possible cases:
> 
>    a. In the snapshot after dentry removal but before journal commit.
>    b. Crash after dentry removal but before journal commit.
> 
> Really a snapshot looks like a crash to the filesystem, so (a) and (b) are 
> equivalent but (a) is easier to reason about. In either case, mounting the 
> out-of-order filesystem (snapshot if (a), origin if (b)) will present 
> kernel errors in the VM when the dentry is read:
> 
> 	kernel: EXT4-fs error (device dm-2): ext4_lookup:1441: inode 
> 	 #1196093: comm rsync: deleted inode referenced: 1188710
> 	[ https://listman.redhat.com/archives/dm-devel/2016-September/028121.html ]
> 
> 
> Fixing flush ordering provides for two possible improvements:

No, what you should fix is to enforce the IO ordering in the FS or journal
code, instead of the block layer.

>   ([*] from above about write ordering)
> 
> 1. Consistency, as above.
> 
> 2. Performance.  Right now loopdev IOs are serialized by a single write 
>    queue per loopdev.  Parallel work queues could be used to submit IOs in 
>    parallel to the filesystem serving the loopdev's `file*` handle since 
>    VMs may submit IOs from different CPU cores.  Special care would need 
>    to be taken for the following cases:
> 
>      a. Ordering of dependent IOs (i.e., reads or writes of preceding 
>         writes).
>      b. Flushes need to wait for all workqueues to quiesce.
> 
>    W.r.t choosing the number of WQ's: Certainly not 1:1 CPU to workqueue 
>    mapping because of the old WQ issue running out of pid's with lots of 
>    CPUs, but here are some possibilities:
> 
>      a. 1:1 per socket
>      b. User configurable as a module parameter
>      c. Dedicated pool of workqueues for all loop devs that dispatch 
>         queued IOs to the correct `file*` handle.  RCU might be useful.
> 
> 
> What next?
> 
> Ok, so assuming consistency is an issue, #1 is the priority and #2 is 
> nice-to-have.  This might be the right time for to consider these since 
> there is so much discussion about loop.c on the list right now.
> 
> According to my understanding of the research above this appears to be an 
> issue, but there are other kernel developers who know this code better than 
> I do, and I want to know if this understanding is correct.
> 
> Should others be CC'ed on this topic?  If so, who?

ext4 or fsdevel guys.


Thanks,
Ming



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Christoph Hellwig @ 2022-04-16  5:18 UTC
  To: Ming Lei; +Cc: Eric Wheeler, linux-block, linux-ext4

On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> If ext4 expects the following order, it is ext4's responsibility to
> maintain that order; the block layer may re-order all these IOs at will,
> so do not expect IOs to be issued to the device in submission order

Yes, and it has been so since REQ_FLUSH (which later became
REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:

commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
Author: Tejun Heo <tj@kernel.org>
Date:   Fri Sep 3 11:56:16 2010 +0200

block: drop barrier ordering by queue draining
    
    Filesystems will take all the responsibilities for ordering requests
    around commit writes and will only indicate how the commit writes
    themselves should be handled by block layers.  This patch drops
    barrier ordering by queue draining from block layer.



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-16 20:05 UTC
  To: Christoph Hellwig; +Cc: Ming Lei, linux-block, linux-ext4

On Fri, 15 Apr 2022, Christoph Hellwig wrote:
> On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> > If ext4 expects the following order, it is ext4's responsibility to
> > maintain that order; the block layer may re-order all these IOs at will,
> > so do not expect IOs to be issued to the device in submission order
> 
> Yes, and it has been so since REQ_FLUSH (which later became
> REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
> 
> commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
> Author: Tejun Heo <tj@kernel.org>
> Date:   Fri Sep 3 11:56:16 2010 +0200
> 
> block: drop barrier ordering by queue draining
>     
>     Filesystems will take all the responsibilities for ordering requests
>     around commit writes and will only indicate how the commit writes
>     themselves should be handled by block layers.  This patch drops
>     barrier ordering by queue draining from block layer.

Thanks Christoph. I think this answers my original question, too.

You may have already answered this implicitly above.  If you would be so 
kind as to confirm or correct my understanding with a few more 
questions:

1. Is the only way for a filesystem to know if one IO completed before a 
   second IO to track the first IO's completion and submit the second IO 
   when the first IO's completes (eg a journal commit followed by the 
   subsequent metadata update)?  If not, then what block-layer mechanism 
   should be used?

2. Are there any IO ordering flags or mechanisms in the block layer at 
   this point---or---is it simply that all IOs entering the block layer 
   can always be re-ordered before reaching the media?

Thanks!

--
Eric Wheeler




* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Jens Axboe @ 2022-04-17  0:59 UTC
  To: Eric Wheeler, Christoph Hellwig; +Cc: Ming Lei, linux-block, linux-ext4

On 4/16/22 2:05 PM, Eric Wheeler wrote:
> On Fri, 15 Apr 2022, Christoph Hellwig wrote:
>> On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
>>> If ext4 expects the following order, it is ext4's responsibility to
>>> maintain that order; the block layer may re-order all these IOs at will,
>>> so do not expect IOs to be issued to the device in submission order
>>
>> Yes, and it has been so since REQ_FLUSH (which later became
>> REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
>>
>> commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
>> Author: Tejun Heo <tj@kernel.org>
>> Date:   Fri Sep 3 11:56:16 2010 +0200
>>
>> block: drop barrier ordering by queue draining
>>     
>>     Filesystems will take all the responsibilities for ordering requests
>>     around commit writes and will only indicate how the commit writes
>>     themselves should be handled by block layers.  This patch drops
>>     barrier ordering by queue draining from block layer.
> 
> Thanks Christoph. I think this answers my original question, too.
> 
> You may have already answered this implicitly above.  If you would be so 
> kind as to confirm or correct my understanding with a few more 
> questions:
> 
> 1. Is the only way for a filesystem to know if one IO completed before a 
>    second IO to track the first IO's completion and submit the second IO 
>    when the first IO's completes (eg a journal commit followed by the 
>    subsequent metadata update)?  If not, then what block-layer mechanism 
>    should be used?

You either need to have a callback or wait on the IO; there's no other
way.

> 2. Are there any IO ordering flags or mechanisms in the block layer at 
>    this point---or---is it simply that all IOs entering the block layer 
>    can always be re-ordered before reaching the media?

No, no ordering flags are provided for this kind of use case. Any IO can
be reordered; hence the only reliable solution is to ensure the previous
IOs have completed.
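
A minimal sketch of the callback approach, with invented names (the "wait" 
variant is simply submit_bio_wait() on the first IO; error handling of the 
dependent bio is omitted for brevity):

  /* HYPOTHETICAL illustration of "callback or wait": chain the
   * dependent bio off the first bio's completion, rather than relying
   * on any submission-order guarantee. */
  static void first_write_done(struct bio *bio)
  {
          struct bio *next = bio->bi_private;  /* dependent IO stashed here */

          if (!bio->bi_status)
                  submit_bio(next);            /* submit only after completion */
          bio_put(bio);
  }

  static void submit_ordered_pair(struct bio *first, struct bio *second)
  {
          first->bi_private = second;
          first->bi_end_io = first_write_done;
          submit_bio(first);
  }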

-- 
Jens Axboe



* Re: loop: it looks like REQ_OP_FLUSH could return before IO completion.
From: Eric Wheeler @ 2022-04-17 16:32 UTC
  To: Jens Axboe; +Cc: Christoph Hellwig, Ming Lei, linux-block, linux-ext4

On Sat, 16 Apr 2022, Jens Axboe wrote:
> On 4/16/22 2:05 PM, Eric Wheeler wrote:
> > On Fri, 15 Apr 2022, Christoph Hellwig wrote:
> >> On Fri, Apr 15, 2022 at 10:29:34PM +0800, Ming Lei wrote:
> >>> If ext4 expects the following order, it is ext4's responsibility to
> >>> maintain that order; the block layer may re-order all these IOs at will,
> >>> so do not expect IOs to be issued to the device in submission order
> >>
> >> Yes, and it has been so since REQ_FLUSH (which later became
> >> REQ_OP_FLUSH) replaced REQ_BARRIER 12 years ago:
> >>
> >> commit 28e7d1845216538303bb95d679d8fd4de50e2f1a
> >> Author: Tejun Heo <tj@kernel.org>
> >> Date:   Fri Sep 3 11:56:16 2010 +0200
> >>
> >> block: drop barrier ordering by queue draining
> >>     
> >>     Filesystems will take all the responsibilities for ordering requests
> >>     around commit writes and will only indicate how the commit writes
> >>     themselves should be handled by block layers.  This patch drops
> >>     barrier ordering by queue draining from block layer.
> > 
> > Thanks Christoph. I think this answers my original question, too.
> > 
> > You may have already answered this implicitly above.  If you would be so 
> > kind as to confirm or correct my understanding with a few more 
> > questions:
> > 
> > 1. Is the only way for a filesystem to know if one IO completed before a 
> >    second IO to track the first IO's completion and submit the second IO 
> >    when the first IO's completes (eg a journal commit followed by the 
> >    subsequent metadata update)?  If not, then what block-layer mechanism 
> >    should be used?
> 
> You either need to have a callback or wait on the IO; there's no other
> way.
> 
> > 2. Are there any IO ordering flags or mechanisms in the block layer at 
> >    this point---or---is it simply that all IOs entering the block layer 
> >    can always be re-ordered before reaching the media?
> 
> No, no ordering flags are provided for this kind of use case. Any IO can
> be reordered; hence the only reliable solution is to ensure the previous
> IOs have completed.

Perfect, thanks Jens!

> 
> -- 
> Jens Axboe
> 
> 



--
Eric Wheeler


