From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 00/30] xfs: rework inode flushing to make inode reclaim fully asynchronous
Date: Wed, 1 Jul 2020 07:51:02 +1000
Message-ID: <20200630215102.GM2005@dread.disaster.area>
In-Reply-To: <20200630165203.GW7606@magnolia>

On Tue, Jun 30, 2020 at 09:52:03AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 29, 2020 at 04:01:30PM -0700, Darrick J. Wong wrote:
> > Both of these failure cases have been difficult to reproduce, which is
> > to say that I can't get them to repro reliably.  Turning PREEMPT on
> > seems to make it reproduce faster, which makes me wonder if something in
> > this patchset is screwing up concurrency handling or something?  KASAN
> > and kmemleak have nothing to say.  I've also noticed that the less
> > heavily loaded the underlying VM host's storage system, the less likely
> > it is to happen, though that could be a coincidence.
> > 
> > Anyway, if I figure something out I'll holler, but I thought it was past
> > time to braindump on the mailing list.
> 
> Last night, Dave and I did some live debugging of a failed VM test
> system, and discovered that the xfs_reclaim_inodes() call does not
> actually reclaim all the IRECLAIMABLE inodes.  Because we fail to call
> xfs_reclaim_inode() on all the inodes, there are still inodes in the
> incore inode xarray, and they still have dquots attached.
> 
> This would explain the symptoms I've seen -- since we didn't reclaim the
> inodes, we didn't dqdetach them either, and so the dqpurge_all will spin
> forever on the still-referenced dquots.  This also explains the slub
> complaints about active xfs_inode/xfs_inode_log_item objects if I turn
> off quotas, since we didn't clean those up either.
> 
> Further analysis (aka adding tracepoints) shows xfs_reclaim_inode_grab
> deciding to skip some inodes because IFLOCK is set.  Adding code to
> cycle the i_flags_lock ahead of the unlocked IFLOCK test didn't make the
> symptoms go away, so I instrumented the inode flush "lock" functions to
> see what was going on (full version available here [1]):

[...]

> Bingo!  The xfs_ail_push_all_sync in xfs_unmountfs takes a bunch of
> inode iflocks, starts the inode cluster buffer write, and since the AIL
> is now empty, returns.  The unmount process moves on to calling
> xfs_reclaim_inodes, which as you can see in the last four lines:
> 
>           umount-10409 [001]    44.118882: xfs_reclaim_inode_grab: dev 259:0 ino 0x8a
> 
> This ^^^ is logged at the start of xfs_reclaim_inode_grab.
> 
>           umount-10409 [001]    44.118883: xfs_reclaim_inode_grab_iflock: dev 259:0 ino 0x8a
> 
> This is logged when x_r_i_g observes that the IFLOCK is set and bails out.
> 
>      kworker/2:1-50    [002]    44.118883: xfs_ifunlock:         dev 259:0 ino 0x8a
> 
> And finally this is the inode cluster buffer IO completion calling
> xfs_buf_inode_iodone -> xfs_iflush_done from a workqueue.
> 
> So it seems to me that inode reclaim races with the AIL for the IFLOCK,
> and when inode reclaim at unmount time loses that race, it does the
> wrong thing.

Yeah, that's what I suspected when I finished up yesterday, but I
couldn't quite connect how the AIL wasn't waiting for the inode
completion.

The moment I looked at it again this morning, I realised that it was
simply that xfs_ail_push_all_sync() is woken when the AIL is
emptied, and that happens about 20 lines of code before the flush
lock is dropped. If the wakeup reaches the sleeping task fast
enough, that task can be running again before the IO completion has
even finished issuing the wakeup, let alone dropped the flush lock.
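
Roughly, the interleaving we traced looks like this - a sketch of
the ordering only, using the function names from the trace above,
not verbatim code:

   IO completion (workqueue)           unmount task
   -------------------------           ------------
   xfs_buf_inode_iodone()
     xfs_iflush_done()
       removes the inode log items
       from the AIL; AIL is now
       empty, so the AIL-empty
       waiter is woken  ------------>  xfs_ail_push_all_sync() returns
       <~20 more lines of              xfs_reclaim_inodes()
        completion processing>           xfs_reclaim_inode_grab()
       xfs_ifunlock(ip)                    sees IFLOCK set, skips inode

So reclaim sees a flush-locked inode, skips it, and nothing ever
comes back to reclaim it.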

And with a PREEMPT kernel we can preempt on wakeup (that was the
path to the scheduler bug we kept hitting), which increases the
chance that the unmount task runs before the IO completion finishes
and drops the inode flush lock.

> Questions: Do we need to teach xfs_reclaim_inodes_ag to increment
> @skipped if xfs_reclaim_inode_grab rejects an inode?  xfs_reclaim_inodes
> is the only consumer of the @skipped value, and elevated skipped will
> cause it to rerun the scan, so I think this will work.

No, we just need to get rid of the racy check in
xfs_reclaim_inode_grab(). I'm going to get rid of the whole skipped
thing, too.
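
Something like this for the top level loop - a sketch of the
direction only, not the actual V2 patch, and the perag radix tree
tag check and helper signatures are me paraphrasing from memory:

void
xfs_reclaim_inodes(
	struct xfs_mount	*mp)
{
	int			nr_to_scan = INT_MAX;

	/*
	 * Sketch: no skipped counter. Just keep pushing the AIL and
	 * scanning until nothing is left tagged for reclaim.
	 */
	while (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
		xfs_ail_push_all_sync(mp->m_ail);
		xfs_reclaim_inodes_ag(mp, &nr_to_scan);
	}
}

That way a racing flush completion just means we go around the loop
again rather than silently leaving inodes behind.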

> Or, do we need to wait for the ail items to complete after xfsaild does
> its xfs_buf_delwri_submit_nowait thing?

We've already waited for the -AIL items- to complete, and that's
really all we should be doing at the xfs_ail_push_all_sync layer.

The issue is that xfs_ail_push_all_sync() doesn't quite wait for IO
to complete, so we've been conflating these two different operations
for a long time (essentially since we moved to logging everything
and tracking all dirty metadata in the AIL). In general, they mean
the same thing, but in this specific corner case the subtle
distinction actually matters.
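
For reference, the sync push wait is keyed purely off the AIL
draining. From memory it is roughly this shape - simplified, not
the actual function body:

void
xfs_ail_push_all_sync(
	struct xfs_ail		*ailp)
{
	struct xfs_log_item	*lip;
	DEFINE_WAIT(wait);

	spin_lock(&ailp->ail_lock);
	while ((lip = xfs_ail_max(ailp)) != NULL) {
		/* AIL not empty: aim the pusher at the tail and sleep */
		prepare_to_wait(&ailp->ail_empty, &wait, TASK_UNINTERRUPTIBLE);
		ailp->ail_target = lip->li_lsn;
		wake_up_process(ailp->ail_task);
		spin_unlock(&ailp->ail_lock);
		schedule();
		spin_lock(&ailp->ail_lock);
	}
	spin_unlock(&ailp->ail_lock);
	finish_wait(&ailp->ail_empty, &wait);
}

Nothing in there waits for the iodone processing that emptied the
AIL to finish running, and that is exactly the window the trace
shows.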

It's easy enough to avoid - just get rid of what has, independently,
become a questionable optimisation in xfs_reclaim_inode_grab() with
this patchset. i.e. we no longer block reclaim on locks, so
optimisations that exist solely to avoid blocking on locks no longer
buy us anything.
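
The check in question is the unlocked IFLOCK test at the top of
xfs_reclaim_inode_grab() - something like this, paraphrasing from
memory rather than quoting the exact hunk:

	/*
	 * Racy: IFLOCK is tested without holding i_flags_lock, so a
	 * flush completion that is still running can cause us to skip
	 * an inode that is about to become reclaimable.
	 */
	if (__xfs_iflags_test(ip, XFS_IFLOCK))
		return false;	/* skip - inode is being flushed */

That is the test that skipped inode 0x8a in the trace above.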

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
