* Writeback bug causing writeback stalls
@ 2020-05-22  9:57 Martijn Coenen
  2020-05-22 14:41 ` Jan Kara
       [not found] ` <20200524140522.14196-1-hdanton@sina.com>
  0 siblings, 2 replies; 12+ messages in thread
From: Martijn Coenen @ 2020-05-22  9:57 UTC (permalink / raw)
  To: Al Viro, Jan Kara, Jens Axboe, miklos, tj
  Cc: linux-fsdevel, LKML, android-storage-core, kernel-team

Hi,

We've been working on implementing a FUSE filesystem in Android, and have
run into what appears to be a bug in the kernel writeback code. The problem
that we observed is that an inode in the filesystem is on the b_dirty_time list,
but its i_state field has I_DIRTY_PAGES set, which I think means that
the inode is on the wrong list. This condition doesn't appear to fix itself even
as new pages are dirtied, because __mark_inode_dirty() has this check:

    if ((inode->i_state & flags) != flags) {

before considering moving the inode to another list. Since the inode already
has I_DIRTY_PAGES set, we're not going to move it to the dirty list. I *think*
the only way the inode gets out of this condition is when the b_dirty_time
list is eventually processed, which can take a while.
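
For context, here is a paraphrased sketch of the surrounding logic in
__mark_inode_dirty() (not the verbatim kernel code; the unlocked fast-path
checks are elided):

    spin_lock(&inode->i_lock);
    if ((inode->i_state & flags) != flags) {
        inode->i_state |= flags;

        /*
         * The I_SYNC check and the writeback-list requeueing live here;
         * they are skipped entirely when no new flag bits were added.
         */
    }
    spin_unlock(&inode->i_lock);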

The reason we even noticed this bug in the first place is that FUSE has a very
low wb max_ratio by default (1), and so at some point processes get stuck in
balance_dirty_pages_ratelimited(), waiting for pages to be written. They hold
the inode's write lock, and when other processes try to acquire it, they get
stuck. We have a watchdog that reboots the device after ~10 mins of a task
being stuck in D state, and this triggered regularly in some tests.

After careful studying of the kernel code, I found a reliable repro scenario
for this condition, which is described in more detail below. But essentially
what I think is happening is this:

__mark_inode_dirty() has an early return condition for when a sync is in
progress, where it updates the inode i_state but not the writeback list:

    inode->i_state |= flags;

    /*
     * If the inode is being synced, just update its dirty state.
     * The unlocker will place the inode on the appropriate
     * superblock list, based upon its state.
     */
    if (inode->i_state & I_SYNC)
        goto out_unlock_inode;

Now, this comment holds for the generic flusher threads, which run
writeback_sb_inodes(); that function calls requeue_inode() to move the inode
back to the correct wb list when the sync is done. However, there is another
function that uses I_SYNC: writeback_single_inode(). This function has
comments saying it prefers not to touch the writeback lists, and in fact it
only removes the inode from them when it is completely clean:

    /*
     * Skip inode if it is clean and we have no outstanding writeback in
     * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this
     * function since flusher thread may be doing for example sync in
     * parallel and if we move the inode, it could get skipped. So here we
     * make sure inode is on some writeback list and leave it there unless
     * we have completely cleaned the inode.
     */

writeback_single_inode() is called from a few places, in particular
write_inode_now(), which in turn is called by FUSE's flush() file operation.
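
For reference, the call path looks roughly like this (an abridged sketch of
fuse_flush(); error handling and the FUSE request plumbing are elided):

    static int fuse_flush(struct file *file, fl_owner_t id)
    {
        struct inode *inode = file_inode(file);
        int err;

        /* ... */
        err = write_inode_now(inode, 1);  /* ends up in writeback_single_inode() */
        if (err)
            return err;

        /* ... send the FUSE_FLUSH request to the userspace daemon ... */
        return 0;
    }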

So, the sequence of events is something like this. Let's assume the inode is
already on b_dirty_time for valid reasons. Then:

CPU1                                          CPU2
fuse_flush()
  write_inode_now()
    writeback_single_inode()
      sets I_SYNC
        __writeback_single_inode()
          writes back data
          clears inode dirty flags
          unlocks inode
          calls mark_inode_dirty_sync()
            sets I_DIRTY_SYNC, but doesn't
            update wb list because I_SYNC is
            still set
                                              write() // somebody else writes
                                              mark_inode_dirty(I_DIRTY_PAGES)
                                              sets I_DIRTY_PAGES on i_state
                                              doesn't update wb list,
                                              because I_SYNC set
      locks inode again
      sees inode is still dirty,
      doesn't touch WB list
      clears I_SYNC

So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
I_DIRTY_SYNC will do nothing to change that. The flusher won't touch the
inode either, because it's not on a b_dirty or b_io list.

The easiest way to fix this, I think, is to call requeue_inode() at the end of
writeback_single_inode(), much like it is called from writeback_sb_inodes().
However, requeue_inode() has the following ominous warning:

/*
 * Find proper writeback list for the inode depending on its current state and
 * possibly also change of its state while we were doing writeback.  Here we
 * handle things such as livelock prevention or fairness of writeback among
 * inodes. This function can be called only by flusher thread - noone else
 * processes all inodes in writeback lists and requeueing inodes behind flusher
 * thread's back can have unexpected consequences.
 */

Obviously this is very critical code both from a correctness and a performance
point of view, so I wanted to run this by the maintainers and folks who have
contributed to this code first.
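
For concreteness, the change I have in mind would look roughly like this (an
untested sketch; locking details elided):

    static int writeback_single_inode(struct inode *inode,
                                      struct writeback_control *wbc)
    {
        /*
         * ... existing code: set I_SYNC, run __writeback_single_inode(),
         * then re-take wb->list_lock and inode->i_lock ...
         */

        if (!(inode->i_state & I_DIRTY_ALL))
            inode_io_list_del_locked(inode, wb);
        else
            requeue_inode(inode, wb, wbc);  /* proposed addition */

        /* ... drop locks, inode_sync_complete(inode) ... */
    }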

The way I reproduced this reliably was with what is essentially a pass-through
FUSE filesystem, running the following two commands in parallel:

[1] fio --rw=write --size=5G -blocksize=80000 --name=test --directory=/sdcard/

[2] while true; do echo flushme >> /sdcard/test.0.0; sleep 0.1; done

I doubt the blocksize matters; it's just the value I observed in one of our
test runs that hit this. [2] essentially calls fuse_flush() every 100ms, which
is often enough to reproduce the problem within seconds: fio will stall in
balance_dirty_pages_ratelimited(), and [2] will hang because it needs the
inode write lock.

Other filesystems may hit the same problem, though write_inode_now() is
usually only used when no more dirty pages are expected (eg in the final
iput()). There are other, more widely used functions that call
writeback_single_inode(), like sync_inode() and sync_inode_metadata().

Curious to hear your thoughts on this. I'm happy to provide more info or
traces if needed.

Thanks,
Martijn


* Re: Writeback bug causing writeback stalls
  2020-05-22  9:57 Writeback bug causing writeback stalls Martijn Coenen
@ 2020-05-22 14:41 ` Jan Kara
  2020-05-22 15:23   ` Martijn Coenen
       [not found] ` <20200524140522.14196-1-hdanton@sina.com>
  1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-05-22 14:41 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Al Viro, Jan Kara, Jens Axboe, miklos, tj, linux-fsdevel, LKML,
	android-storage-core, kernel-team

Hi!

On Fri 22-05-20 11:57:42, Martijn Coenen wrote:
<snip>

> So, the sequence of events is something like this. Let's assume the inode is
> already on b_dirty_time for valid reasons. Then:
> 
> CPU1                                          CPU2
> fuse_flush()
>   write_inode_now()
>     writeback_single_inode()
>       sets I_SYNC
>         __writeback_single_inode()
>           writes back data
>           clears inode dirty flags
>           unlocks inode
>           calls mark_inode_dirty_sync()
>             sets I_DIRTY_SYNC, but doesn't
>             update wb list because I_SYNC is
>             still set
>                                               write() // somebody else writes
>                                               mark_inode_dirty(I_DIRTY_PAGES)
>                                               sets I_DIRTY_PAGES on i_state
>                                               doesn't update wb list,
>                                               because I_SYNC set
>       locks inode again
>       sees inode is still dirty,
>       doesn't touch WB list
>       clears I_SYNC
> 
> So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
> and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
> I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
> the inode either,
> because it's not on a b_dirty or b_io list.

Thanks for the detailed analysis and explanation! I agree that what you
describe is a bug in the writeback code.

> The easiest way to fix this, I think, is to call requeue_inode() at the end of
> writeback_single_inode(), much like it is called from writeback_sb_inodes().
> However, requeue_inode() has the following ominous warning:
> 
> /*
>  * Find proper writeback list for the inode depending on its current state and
>  * possibly also change of its state while we were doing writeback.  Here we
>  * handle things such as livelock prevention or fairness of writeback among
>  * inodes. This function can be called only by flusher thread - noone else
>  * processes all inodes in writeback lists and requeueing inodes behind flusher
>  * thread's back can have unexpected consequences.
>  */
> 
> Obviously this is very critical code both from a correctness and a performance
> point of view, so I wanted to run this by the maintainers and folks who have
> contributed to this code first.

Sadly, the fix won't be so easy. The main problem with calling
requeue_inode() from writeback_single_inode() is that if there's a parallel
sync(2) call, inode->i_io_list is used to track all inodes that need writing
before sync(2) can complete. So requeueing inodes in parallel while sync(2)
runs can break its data integrity guarantees. But I agree we need to find
some mechanism to safely move the inode to the appropriate dirty list
reasonably quickly.

Probably I'd add an inode state flag telling us that the inode is queued for
writeback by the flush worker, and we won't touch the dirty lists in that
case; otherwise we are safe to update the current writeback list as needed.
I'll work on fixing this, since while reading the code I noticed there are
other quirks in it as well. Thanks for the report!
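
For illustration, a minimal sketch of that idea (the flag name, bit value,
and placement are assumptions at this point):

    /* Set while the flush worker has the inode queued on b_io/b_more_io,
     * cleared when the worker itself requeues the inode. */
    #define I_SYNC_QUEUED   (1 << 16)

    /* In __mark_inode_dirty(): skip the writeback-list update only while
     * the flusher owns the inode's queueing, instead of keying off I_SYNC. */
    if (inode->i_state & I_SYNC_QUEUED)
        goto out_unlock_inode;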

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-22 14:41 ` Jan Kara
@ 2020-05-22 15:23   ` Martijn Coenen
  2020-05-22 15:36     ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Martijn Coenen @ 2020-05-22 15:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

[ dropped android-storage-core@google.com from CC: since that list
can't receive emails from outside google.com - sorry about that ]

Hi Jan,

On Fri, May 22, 2020 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> > The easiest way to fix this, I think, is to call requeue_inode() at the end of
> > writeback_single_inode(), much like it is called from writeback_sb_inodes().
> > However, requeue_inode() has the following ominous warning:
> >
> > /*
> >  * Find proper writeback list for the inode depending on its current state and
> >  * possibly also change of its state while we were doing writeback.  Here we
> >  * handle things such as livelock prevention or fairness of writeback among
> >  * inodes. This function can be called only by flusher thread - noone else
> >  * processes all inodes in writeback lists and requeueing inodes behind flusher
> >  * thread's back can have unexpected consequences.
> >  */
> >
> > Obviously this is very critical code both from a correctness and a performance
> > point of view, so I wanted to run this by the maintainers and folks who have
> > contributed to this code first.
>
> Sadly, the fix won't be so easy. The main problem with calling
> requeue_inode() from writeback_single_inode() is that if there's parallel
> sync(2) call, inode->i_io_list is used to track all inodes that need writing
> before sync(2) can complete. So requeueing inodes in parallel while sync(2)
> runs can result in breaking data integrity guarantees of it.

Ah, makes sense.

> But I agree
> we need to find some mechanism to safely move inode to appropriate dirty
> list reasonably quickly.
>
> Probably I'd add an inode state flag telling that inode is queued for
> writeback by flush worker and we won't touch dirty lists in that case,
> otherwise we are safe to update current writeback list as needed. I'll work
> on fixing this as when I was reading the code I've noticed there are other
> quirks in the code as well. Thanks for the report!

Thanks! While looking at the code I also saw some other paths that
appeared to be racy, though I haven't worked them out in detail to
confirm that - the locking around the inode and writeback lists is
tricky. What's the best way to follow up on those? Happy to post them
to this same thread after I spend a bit more time looking at the code.

Thanks,
Martijn


>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-22 15:23   ` Martijn Coenen
@ 2020-05-22 15:36     ` Jan Kara
  2020-05-23  8:15       ` Martijn Coenen
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-05-22 15:36 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Jan Kara, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML,
	kernel-team

On Fri 22-05-20 17:23:30, Martijn Coenen wrote:
> [ dropped android-storage-core@google.com from CC: since that list
> can't receive emails from outside google.com - sorry about that ]
> 
> Hi Jan,
> 
> On Fri, May 22, 2020 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> > > The easiest way to fix this, I think, is to call requeue_inode() at the end of
> > > writeback_single_inode(), much like it is called from writeback_sb_inodes().
> > > However, requeue_inode() has the following ominous warning:
> > >
> > > /*
> > >  * Find proper writeback list for the inode depending on its current state and
> > >  * possibly also change of its state while we were doing writeback.  Here we
> > >  * handle things such as livelock prevention or fairness of writeback among
> > >  * inodes. This function can be called only by flusher thread - noone else
> > >  * processes all inodes in writeback lists and requeueing inodes behind flusher
> > >  * thread's back can have unexpected consequences.
> > >  */
> > >
> > > Obviously this is very critical code both from a correctness and a performance
> > > point of view, so I wanted to run this by the maintainers and folks who have
> > > contributed to this code first.
> >
> > Sadly, the fix won't be so easy. The main problem with calling
> > requeue_inode() from writeback_single_inode() is that if there's parallel
> > sync(2) call, inode->i_io_list is used to track all inodes that need writing
> > before sync(2) can complete. So requeueing inodes in parallel while sync(2)
> > runs can result in breaking data integrity guarantees of it.
> 
> Ah, makes sense.
> 
> > But I agree
> > we need to find some mechanism to safely move inode to appropriate dirty
> > list reasonably quickly.
> >
> > Probably I'd add an inode state flag telling that inode is queued for
> > writeback by flush worker and we won't touch dirty lists in that case,
> > otherwise we are safe to update current writeback list as needed. I'll work
> > on fixing this as when I was reading the code I've noticed there are other
> > quirks in the code as well. Thanks for the report!
> 
> Thanks! While looking at the code I also saw some other paths that
> appeared to be racy, though I haven't worked them out in detail to
> confirm that - the locking around the inode and writeback lists is
> tricky. What's the best way to follow up on those? Happy to post them
> to this same thread after I spend a bit more time looking at the code.

Sure, if you are aware of any other problems, just write them to this
thread. FWIW, stuff that I've found so far:

1) __I_DIRTY_TIME_EXPIRED setting in move_expired_inodes() can get lost as
there are other places doing RMW modifications of inode->i_state.

2) sync(2) is prone to livelocks, because when we queue inodes from the
b_dirty_time list, we don't take dirtied_when into account (and that's the
only thing that makes sure an aggressive dirtier cannot livelock sync).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-22 15:36     ` Jan Kara
@ 2020-05-23  8:15       ` Martijn Coenen
  2020-05-25  7:31         ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Martijn Coenen @ 2020-05-23  8:15 UTC (permalink / raw)
  To: Jan Kara, Jaegeuk Kim
  Cc: Al Viro, Jens Axboe, miklos, tj, linux-fsdevel, LKML, kernel-team

Jaegeuk wondered whether callers of write_inode_now() should hold
i_rwsem, and whether that would also prevent this problem. Some
existing callers of write_inode_now() do, eg ntfs and hfs:

hfs_file_fsync()
    inode_lock(inode);

    /* sync the inode to buffers */
    ret = write_inode_now(inode, 0);

but there are also some that don't (eg fat, fuse, orangefs).

Thanks,
Martijn


On Fri, May 22, 2020 at 5:36 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 22-05-20 17:23:30, Martijn Coenen wrote:
> > [ dropped android-storage-core@google.com from CC: since that list
> > can't receive emails from outside google.com - sorry about that ]
> >
> > Hi Jan,
> >
> > On Fri, May 22, 2020 at 4:41 PM Jan Kara <jack@suse.cz> wrote:
> > > > The easiest way to fix this, I think, is to call requeue_inode() at the end of
> > > > writeback_single_inode(), much like it is called from writeback_sb_inodes().
> > > > However, requeue_inode() has the following ominous warning:
> > > >
> > > > /*
> > > >  * Find proper writeback list for the inode depending on its current state and
> > > >  * possibly also change of its state while we were doing writeback.  Here we
> > > >  * handle things such as livelock prevention or fairness of writeback among
> > > >  * inodes. This function can be called only by flusher thread - noone else
> > > >  * processes all inodes in writeback lists and requeueing inodes behind flusher
> > > >  * thread's back can have unexpected consequences.
> > > >  */
> > > >
> > > > Obviously this is very critical code both from a correctness and a performance
> > > > point of view, so I wanted to run this by the maintainers and folks who have
> > > > contributed to this code first.
> > >
> > > Sadly, the fix won't be so easy. The main problem with calling
> > > requeue_inode() from writeback_single_inode() is that if there's parallel
> > > sync(2) call, inode->i_io_list is used to track all inodes that need writing
> > > before sync(2) can complete. So requeueing inodes in parallel while sync(2)
> > > runs can result in breaking data integrity guarantees of it.
> >
> > Ah, makes sense.
> >
> > > But I agree
> > > we need to find some mechanism to safely move inode to appropriate dirty
> > > list reasonably quickly.
> > >
> > > Probably I'd add an inode state flag telling that inode is queued for
> > > writeback by flush worker and we won't touch dirty lists in that case,
> > > otherwise we are safe to update current writeback list as needed. I'll work
> > > on fixing this as when I was reading the code I've noticed there are other
> > > quirks in the code as well. Thanks for the report!
> >
> > Thanks! While looking at the code I also saw some other paths that
> > appeared to be racy, though I haven't worked them out in detail to
> > confirm that - the locking around the inode and writeback lists is
> > tricky. What's the best way to follow up on those? Happy to post them
> > to this same thread after I spend a bit more time looking at the code.
>
> Sure, if you are aware of any other problems, just write them to this
> thread. FWIW, stuff that I've found so far:
>
> 1) __I_DIRTY_TIME_EXPIRED setting in move_expired_inodes() can get lost as
> there are other places doing RMW modifications of inode->i_state.
>
> 2) sync(2) is prone to livelocks as when we queue inodes from b_dirty_time
> list, we don't take dirtied_when into account (and that's the only thing
> that makes sure aggressive dirtier cannot livelock sync).
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-23  8:15       ` Martijn Coenen
@ 2020-05-25  7:31         ` Jan Kara
  2020-05-27  8:14           ` Martijn Coenen
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-05-25  7:31 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj,
	linux-fsdevel, LKML, kernel-team

On Sat 23-05-20 10:15:20, Martijn Coenen wrote:
> Jaegeuk wondered whether callers of write_inode_now() should hold
> i_rwsem, and whether that would also prevent this problem. Some
> existing callers of write_inode_now() do, eg ntfs and hfs:
> 
> hfs_file_fsync()
>     inode_lock(inode);
> 
>     /* sync the inode to buffers */
>     ret = write_inode_now(inode, 0);
> 
> but there are also some that don't (eg fat, fuse, orangefs).

Well, most importantly, filesystems like ext4, xfs, and btrfs don't hold
i_rwsem when writing back an inode, and that's deliberate, for performance
reasons. We don't want to block writes (or even reads, in the case of XFS)
on the inode during writeback.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
       [not found] ` <20200524140522.14196-1-hdanton@sina.com>
@ 2020-05-25  7:38   ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2020-05-25  7:38 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Martijn Coenen, Al Viro, Jan Kara, Jens Axboe, miklos, tj,
	linux-fsdevel, LKML, android-storage-core, kernel-team

On Sun 24-05-20 22:05:22, Hillf Danton wrote:
> 
> On Fri, 22 May 2020 11:57:42 +0200 Martijn Coenen wrote:
> > 
> > So, the sequence of events is something like this. Let's assume the inode is
> > already on b_dirty_time for valid reasons. Then:
> > 
> > CPU1                                          CPU2
> > fuse_flush()
> >   write_inode_now()
> >     writeback_single_inode()
> >       sets I_SYNC
> >         __writeback_single_inode()
> >           writes back data
> >           clears inode dirty flags
> >           unlocks inode
> >           calls mark_inode_dirty_sync()
> >             sets I_DIRTY_SYNC, but doesn't
> >             update wb list because I_SYNC is
> >             still set
> >                                               write() // somebody else writes
> >                                               mark_inode_dirty(I_DIRTY_PAGES)
> >                                               sets I_DIRTY_PAGES on i_state
> >                                               doesn't update wb list,
> >                                               because I_SYNC set
> >       locks inode again
> >       sees inode is still dirty,
> >       doesn't touch WB list
> >       clears I_SYNC
> > 
> > So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
> > and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
> > I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
> > the inode either, because it's not on a b_dirty or b_io list.

Hi Hillf,

> Based on the above analysis, a check of I_DIRTY_TIME is added before and
> after calling __writeback_single_inode() to detect the case you reported.
> 
> If a dirty inode is not on the right io list after writeback, we can
> move it to a new one; and we can do that as we are the I_SYNC owner.
> 
> While changing its io list, the inode's dirty timestamp is also updated
> to the current tick, as is done in __mark_inode_dirty().

Apparently you didn't read my reply to Martijn, because what you did in this
patch is exactly what I described we cannot do: it can cause sync(2) to miss
inodes and thus break its data integrity guarantees. So we have to come up
with a different solution.

								Honza

> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1528,6 +1528,7 @@ static int writeback_single_inode(struct
>  				  struct writeback_control *wbc)
>  {
>  	struct bdi_writeback *wb;
> +	bool dt;
>  	int ret = 0;
>  
>  	spin_lock(&inode->i_lock);
> @@ -1560,6 +1561,7 @@ static int writeback_single_inode(struct
>  	     !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
>  		goto out;
>  	inode->i_state |= I_SYNC;
> +	dt = inode->i_state & I_DIRTY_TIME;
>  	wbc_attach_and_unlock_inode(wbc, inode);
>  
>  	ret = __writeback_single_inode(inode, wbc);
> @@ -1574,6 +1576,14 @@ static int writeback_single_inode(struct
>  	 */
>  	if (!(inode->i_state & I_DIRTY_ALL))
>  		inode_io_list_del_locked(inode, wb);
> +	else if (!(inode->i_state & I_DIRTY_TIME) && dt) {
> +		/*
> +		 * We can correct inode's io list, however, by moving it to
> +		 * b_dirty from b_dirty_time as we are the I_SYNC owner
> +		 */
> +		inode->dirtied_when = jiffies;
> +		inode_io_list_move_locked(inode, wb, &wb->b_dirty);
> +	}
>  	spin_unlock(&wb->list_lock);
>  	inode_sync_complete(inode);
>  out:
> --
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-25  7:31         ` Jan Kara
@ 2020-05-27  8:14           ` Martijn Coenen
  2020-05-29 15:20             ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Martijn Coenen @ 2020-05-27  8:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel,
	LKML, kernel-team

Hi Jan,

On Mon, May 25, 2020 at 9:31 AM Jan Kara <jack@suse.cz> wrote:
> Well, most importantly filesystems like ext4, xfs, btrfs don't hold i_rwsem
> when writing back inode and that's deliberate because of performance. We
> don't want to block writes (or even reads, in the case of XFS) for the inode
> during writeback.

Thanks for clarifying, that makes sense. By the way, do you have an
ETA for your fix? We are under some time pressure to get this fixed in
our downstream kernels, but I'd much rather take a fix from upstream
from somebody who knows this code well. Alternatively, I can take a
stab at the idea you proposed and send a patch to LKML for review this
week.

Thanks,
Martijn


>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-27  8:14           ` Martijn Coenen
@ 2020-05-29 15:20             ` Jan Kara
  2020-05-29 19:37               ` Martijn Coenen
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-05-29 15:20 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj,
	linux-fsdevel, LKML, kernel-team

[-- Attachment #1: Type: text/plain, Size: 983 bytes --]

Hello Martijn!

On Wed 27-05-20 10:14:09, Martijn Coenen wrote:
> On Mon, May 25, 2020 at 9:31 AM Jan Kara <jack@suse.cz> wrote:
> > Well, most importantly filesystems like ext4, xfs, btrfs don't hold i_rwsem
> > when writing back inode and that's deliberate because of performance. We
> > don't want to block writes (or even reads, in the case of XFS) for the inode
> > during writeback.
> 
> Thanks for clarifying, that makes sense. By the way, do you have an
> ETA for your fix? We are under some time pressure to get this fixed in
> our downstream kernels, but I'd much rather take a fix from upstream
> from somebody who knows this code well. Alternatively, I can take a
> stab at the idea you proposed and send a patch to LKML for review this
> week.

I understand. I have written a fix (attached). Currently it's under testing
together with other cleanups. If everything works fine, I plan to submit the
patches on Monday.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Avoid-skipping-inode-writeback.patch --]
[-- Type: text/x-patch, Size: 7847 bytes --]

From 6d0d46bfddccff7aa4ed114ce15c23dfb68ad2b2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Fri, 29 May 2020 15:05:22 +0200
Subject: [PATCH] writeback: Avoid skipping inode writeback

Inode's i_io_list list head is used to attach inode to several different
lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker
prepares a list of inodes to writeback e.g. for sync(2), it moves inodes
to b_io list. Thus it is critical for sync(2) data integrity guarantees
that inode is not requeued to any other writeback list when inode is
queued for processing by flush worker. That's the reason why
writeback_single_inode() does not touch i_io_list (unless the inode is
completely clean) and why __mark_inode_dirty() does not touch i_io_list
if I_SYNC flag is set.

However there are two flaws in the current logic:

1) When inode has only I_DIRTY_TIME set but it is already queued in b_io
list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC)
can still move inode back to b_dirty list resulting in skipping
writeback of inode time stamps during sync(2).

2) When inode is on b_dirty_time list and writeback_single_inode() races
with __mark_inode_dirty() like:

writeback_single_inode()		__mark_inode_dirty(inode, I_DIRTY_PAGES)
  inode->i_state |= I_SYNC
  __writeback_single_inode()
					  inode->i_state |= I_DIRTY_PAGES;
					  if (inode->i_state & I_SYNC)
					    bail
  if (!(inode->i_state & I_DIRTY_ALL))
  - not true so nothing done

We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus
standard background writeback will not writeback this inode leading to
possible dirty throttling stalls etc. (thanks to Martijn Coenen for this
analysis).

Fix these problems by tracking whether inode is queued in b_io or
b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we
know flush worker has queued inode and we should not touch i_io_list.
On the other hand we also know that once flush worker is done with the
inode it will requeue the inode to appropriate dirty list. When
I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode
to appropriate dirty list.

Reported-by: Martijn Coenen <maco@android.com>
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 39 +++++++++++++++++++++++++++++----------
 include/linux/fs.h |  8 ++++++--
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..855c6611710a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -144,7 +144,9 @@ static void inode_io_list_del_locked(struct inode *inode,
 				     struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
+	assert_spin_locked(&inode->i_lock);
 
+	inode->i_state &= ~I_SYNC_QUEUED;
 	list_del_init(&inode->i_io_list);
 	wb_io_lists_depopulated(wb);
 }
@@ -1123,7 +1125,9 @@ void inode_io_list_del(struct inode *inode)
 	struct bdi_writeback *wb;
 
 	wb = inode_to_wb_and_lock_list(inode);
+	spin_lock(&inode->i_lock);
 	inode_io_list_del_locked(inode, wb);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -1172,8 +1176,9 @@ void sb_clear_inode_writeback(struct inode *inode)
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+static void __redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
+	assert_spin_locked(&inode->i_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -1182,6 +1187,14 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 			inode->dirtied_when = jiffies;
 	}
 	inode_io_list_move_locked(inode, wb, &wb->b_dirty);
+	inode->i_state &= ~I_SYNC_QUEUED;
+}
+
+static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+{
+	spin_lock(&inode->i_lock);
+	__redirty_tail(inode, wb);
+	spin_unlock(&inode->i_lock);
 }
 
 /*
@@ -1250,8 +1263,11 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 			break;
 		list_move(&inode->i_io_list, &tmp);
 		moved++;
+		spin_lock(&inode->i_lock);
 		if (flags & EXPIRE_DIRTY_ATIME)
-			set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
+			inode->i_state |= I_DIRTY_TIME_EXPIRED;
+		inode->i_state |= I_SYNC_QUEUED;
+		spin_unlock(&inode->i_lock);
 		if (sb_is_blkdev_sb(inode->i_sb))
 			continue;
 		if (sb && sb != inode->i_sb)
@@ -1394,7 +1410,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * writeback is not making progress due to locked
 		 * buffers. Skip this inode for now.
 		 */
-		redirty_tail(inode, wb);
+		__redirty_tail(inode, wb);
 		return;
 	}
 
@@ -1414,7 +1430,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * retrying writeback of the dirty page/inode
 			 * that cannot be performed immediately.
 			 */
-			redirty_tail(inode, wb);
+			__redirty_tail(inode, wb);
 		}
 	} else if (inode->i_state & I_DIRTY) {
 		/*
@@ -1422,10 +1438,11 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * such as delayed allocation during submission or metadata
 		 * updates after data IO completion.
 		 */
-		redirty_tail(inode, wb);
+		__redirty_tail(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
 		inode->dirtied_when = jiffies;
 		inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+		inode->i_state &= ~I_SYNC_QUEUED;
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
 		inode_io_list_del_locked(inode, wb);
@@ -1669,8 +1686,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 */
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+			inode->i_state &= ~I_SYNC_QUEUED;
+			__redirty_tail(inode, wb);
 			spin_unlock(&inode->i_lock);
-			redirty_tail(inode, wb);
 			continue;
 		}
 		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
@@ -2289,11 +2307,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		inode->i_state |= flags;
 
 		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
+		 * If the inode is queued for writeback by flush worker, just
+		 * update its dirty state. Once the flush worker is done with
+		 * the inode it will place it on the appropriate superblock
+		 * list, based upon its state.
 		 */
-		if (inode->i_state & I_SYNC)
+		if (inode->i_state & I_SYNC_QUEUED)
 			goto out_unlock_inode;
 
 		/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45cc10cdf6dd..b02290d19edd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2156,6 +2156,10 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *
  * I_CREATING		New object's inode in the middle of setting up.
  *
+ * I_SYNC_QUEUED	Inode is queued in b_io or b_more_io writeback lists.
+ *			Used to detect that mark_inode_dirty() should not move
+ * 			inode between dirty lists.
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -2173,11 +2177,11 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
 #define I_DIO_WAKEUP		(1 << __I_DIO_WAKEUP)
 #define I_LINKABLE		(1 << 10)
 #define I_DIRTY_TIME		(1 << 11)
-#define __I_DIRTY_TIME_EXPIRED	12
-#define I_DIRTY_TIME_EXPIRED	(1 << __I_DIRTY_TIME_EXPIRED)
+#define I_DIRTY_TIME_EXPIRED	(1 << 12)
 #define I_WB_SWITCH		(1 << 13)
 #define I_OVL_INUSE		(1 << 14)
 #define I_CREATING		(1 << 15)
+#define I_SYNC_QUEUED		(1 << 16)
 
 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
-- 
2.16.4



* Re: Writeback bug causing writeback stalls
  2020-05-29 15:20             ` Jan Kara
@ 2020-05-29 19:37               ` Martijn Coenen
  2020-06-01  9:09                 ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Martijn Coenen @ 2020-05-29 19:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel,
	LKML, kernel-team

Hi Jan,

On Fri, May 29, 2020 at 5:20 PM Jan Kara <jack@suse.cz> wrote:
> I understand. I have written a fix (attached). Currently it's under testing
> together with other cleanups. If everything works fine, I plan to submit
> the patches on Monday.

Thanks a lot for the quick fix! I ran my usual reproducer and did not see
the problem, so that's good! I do observe write-speed dips - eg we usually
sustain 180 MB/s on this device, but now it regularly dips down to 10 MB/s
and then jumps back up again. That might be unrelated to your patch, though;
I will run more tests over the weekend and report back!

Best,
Martijn

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-05-29 19:37               ` Martijn Coenen
@ 2020-06-01  9:09                 ` Jan Kara
  2020-06-02 12:16                   ` Martijn Coenen
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2020-06-01  9:09 UTC (permalink / raw)
  To: Martijn Coenen
  Cc: Jan Kara, Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj,
	linux-fsdevel, LKML, kernel-team

On Fri 29-05-20 21:37:50, Martijn Coenen wrote:
> Hi Jan,
> 
> On Fri, May 29, 2020 at 5:20 PM Jan Kara <jack@suse.cz> wrote:
> > I understand. I have written a fix (attached). Currently it's under testing
> > together with other cleanups. If everything works fine, I plan to submit
> > the patches on Monday.
> 
> Thanks a lot for the quick fix! I ran my usual way to reproduce the
> problem, and did not see it, so that's good! I do observe write speed
> dips - eg we usually sustain 180 MB/s on this device, but now it
> regularly dips down to 10 MB/s, then jumps back up again. That might
> be unrelated to your patch though, I will run more tests over the
> weekend and report back!

Thanks for testing! My test run has completed fine, so I'll submit the
patches for review. But I'm curious what's causing the dips in throughput in
your test...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Writeback bug causing writeback stalls
  2020-06-01  9:09                 ` Jan Kara
@ 2020-06-02 12:16                   ` Martijn Coenen
  0 siblings, 0 replies; 12+ messages in thread
From: Martijn Coenen @ 2020-06-02 12:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jaegeuk Kim, Al Viro, Jens Axboe, miklos, tj, linux-fsdevel,
	LKML, kernel-team

On Mon, Jun 1, 2020 at 11:09 AM Jan Kara <jack@suse.cz> wrote:
> Thanks for testing! My test run has completed fine so I'll submit patches
> for review. But I'm curious what's causing the dips in throughput in your
> test...

It turned out to be unrelated to your patch. Sorry for the noise! We
have the patch in dogfood on some of our devices, and I will let you
know if we run into any issues. I'll also spend some more time
reviewing your patches and will respond to them later.

Thanks,
Martijn
>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

