All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic
Date: Tue, 5 Jul 2016 12:45:52 -0400	[thread overview]
Message-ID: <20160705164552.GA6317@bfoster.bfoster> (raw)
In-Reply-To: <20160701223011.GA28130@bfoster.bfoster>

On Fri, Jul 01, 2016 at 06:30:12PM -0400, Brian Foster wrote:
> On Fri, Jul 01, 2016 at 08:44:57AM +1000, Dave Chinner wrote:
> > On Thu, Jun 30, 2016 at 08:53:49AM -0400, Brian Foster wrote:
> > > Newly allocated XFS metadata buffers are added to the LRU once the hold
> > > count is released, which typically occurs after I/O completion. There is
> > > no other mechanism at current that tracks the existence or I/O state of
> > > a new buffer. Further, readahead I/O tends to be submitted
> > > asynchronously by nature, which means the I/O can remain in flight and
> > > actually complete long after the calling context is gone. This means
> > > that file descriptors or any other holds on the filesystem can be
> > > released, allowing the filesystem to be unmounted while I/O is still in
> > > flight. When I/O completion occurs, core data structures may have been
> > > freed, causing completion to run into invalid memory accesses and likely
> > > to panic.
> > > 
> > > This problem is reproduced on XFS via directory readahead. A filesystem
> > > is mounted, a directory is opened/closed and the filesystem immediately
> > > unmounted. The open/close cycle triggers a directory readahead that if
> > > delayed long enough, runs buffer I/O completion after the unmount has
> > > completed.
> > > 
> > > To work around this problem, add readahead buffers to the LRU earlier
> > > than other buffers (when the buffer is allocated, specifically). The
> > > buffer hold count will ultimately remain until I/O completion, which
> > > means any shrinker activity will skip the buffer until then. This makes
> > > the buffer visible to xfs_wait_buftarg(), however, which ensures that an
> > > unmount or quiesce waits for I/O completion appropriately.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > This addresses the problem reproduced by the recently posted xfstests
> > > test:
> > > 
> > >   http://thread.gmane.org/gmane.comp.file-systems.fstests/2740
> > > 
> > > This could probably be made more involved, i.e., to create another list
> > > of buffers in flight or some such. This seems more simple/sane to me,
> > > however, and survives my testing so far...
> > > 
> > > Brian
> > > 
> > >  fs/xfs/xfs_buf.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > index 4665ff6..3f03df9 100644
> > > --- a/fs/xfs/xfs_buf.c
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -590,8 +590,20 @@ xfs_buf_get_map(
> > >  		return NULL;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If the buffer found doesn't match the one allocated above, somebody
> > > +	 * else beat us to insertion and we can toss the new one.
> > > +	 *
> > > +	 * If we did add the buffer and it happens to be readahead, add to the
> > > +	 * LRU now rather than waiting until the hold is released. Otherwise,
> > > +	 * the buffer is not visible to xfs_wait_buftarg() while in flight and
> > > +	 * nothing else prevents an unmount before I/O completion.
> > > +	 */
> > >  	if (bp != new_bp)
> > >  		xfs_buf_free(new_bp);
> > > +	else if (flags & XBF_READ_AHEAD &&
> > > +		 list_lru_add(&bp->b_target->bt_lru, &bp->b_lru))
> > > +		atomic_inc(&bp->b_hold);
> > 
> > This doesn't sit right with me. The LRU is for "unused" objects, and
> > readahead objects are not unused until IO completes and nobody is
> > waiting on them.
> > 
> > As it is, this points out another problem with readahead buffers -
> > they aren't actually being cached properly because b_lru_ref == 0,
> > which means they are immediately reclaimed on IO completion rather
> > than being added to the LRU....
> > 
> > I also think that it's not sufficient to cover the generic case of
> > async IO that has no waiter. i.e. we could do get_buf, submit async
> > write, drop submitter reference, and now we have the same problem
> > but on a write.  i.e. this problem is and async IO issue, not a
> > readahead issue.
> > 
> > I think that it might be better to fix it by doing this:
> > 
> > 	1. ensure async IO submission always has b_lru_ref set, and
> > 	if it isn't, set it to 1. This ensures the buffer will be
> > 	added to the LRU on completion if it isn't already there.
> > 
> > 	2. keep a count of async buffer IO in progress. A per-cpu
> > 	counter in the buftarg will be fine for this. Increment in
> > 	xfs_buf_submit(), decrement in the xfs_buf_rele() call from
> > 	xfs_buf_iodone() once we've determined if the buffer needs
> > 	adding to the LRU or not.
> > 
> > 	3. make xfs_wait_buftarg() wait until the async IO count
> > 	goes to zero before it gives up trying to release buffers on
> > 	the LRU.
> > 
> 
> After playing with this a bit this afternoon, I don't think it is so
> straightforward to maintain consistency between xfs_buf_submit() and
> xfs_buf_rele(). Some buffers are actually never released (superblock,
> log buffers). Other buffers can actually be submitted for I/O multiple
> times before they are ultimately released (e.g., log recovery buffer
> read -> delwri submission).
> 

I think I can get around these problems by skipping all uncached I/O and
maintaining a per-buffer I/O count that is sunk into the global buftarg
count once the buffer is released. E.g., something like the following
patch. Not fully tested, but works on some quick tests...

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 4665ff6..8a04b66 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -177,6 +177,7 @@ _xfs_buf_alloc(
 	XB_SET_OWNER(bp);
 	bp->b_target = target;
 	bp->b_flags = flags;
+	atomic_set(&bp->b_io_count, 0);
 
 	/*
 	 * Set length and io_length to the same value initially.
@@ -865,6 +866,21 @@ xfs_buf_hold(
 	atomic_inc(&bp->b_hold);
 }
 
+static inline void
+xfs_buf_rele_iocount(
+	struct xfs_buf	*bp)
+{
+	int		val;
+
+	val = atomic_read(&bp->b_io_count);
+	if (!val)
+		return;
+
+	atomic_sub(val, &bp->b_io_count);
+	percpu_counter_add(&bp->b_target->bt_io_count, -val);
+	wake_up(&bp->b_target->bt_io_wait);
+}
+
 /*
  *	Releases a hold on the specified buffer.  If the
  *	the hold count is 1, calls xfs_buf_free.
@@ -880,8 +896,10 @@ xfs_buf_rele(
 	if (!pag) {
 		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
-		if (atomic_dec_and_test(&bp->b_hold))
+		if (atomic_dec_and_test(&bp->b_hold)) {
+			xfs_buf_rele_iocount(bp);
 			xfs_buf_free(bp);
+		}
 		return;
 	}
 
@@ -890,6 +908,9 @@ xfs_buf_rele(
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
 		spin_lock(&bp->b_lock);
+
+		xfs_buf_rele_iocount(bp);
+
 		if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
 			/*
 			 * If the buffer is added to the LRU take a new
@@ -1277,6 +1298,18 @@ _xfs_buf_ioapply(
 	rw |= REQ_META;
 
 	/*
+	 * XXX: uncached check indirectly filters out the sb buffer and log
+	 * buffers (possibly others?), that are held and never released to the
+	 * LRU
+	 */
+	if (bp->b_flags & XBF_ASYNC &&
+	    bp->b_bn != XFS_BUF_DADDR_NULL &&	/* uncached */
+	    atomic_read(&bp->b_lru_ref) && list_empty(&bp->b_lru)) {
+		percpu_counter_inc(&bp->b_target->bt_io_count);
+		atomic_inc(&bp->b_io_count);
+	}
+
+	/*
 	 * Walk all the vectors issuing IO on them. Set up the initial offset
 	 * into the buffer and the desired IO size before we start -
 	 * _xfs_buf_ioapply_vec() will modify them appropriately for each
@@ -1533,6 +1566,8 @@ xfs_wait_buftarg(
 	 * ensure here that all reference counts have been dropped before we
 	 * start walking the LRU list.
 	 */
+	wait_event(btp->bt_io_wait,
+		   (percpu_counter_sum(&btp->bt_io_count) == 0));
 	drain_workqueue(btp->bt_mount->m_buf_workqueue);
 
 	/* loop until there is nothing left on the lru list. */
@@ -1629,6 +1664,8 @@ xfs_free_buftarg(
 	struct xfs_buftarg	*btp)
 {
 	unregister_shrinker(&btp->bt_shrinker);
+	ASSERT(percpu_counter_sum(&btp->bt_io_count) == 0);
+	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
@@ -1693,6 +1730,10 @@ xfs_alloc_buftarg(
 	if (list_lru_init(&btp->bt_lru))
 		goto error;
 
+	if (percpu_counter_init(&btp->bt_io_count, 0, GFP_KERNEL))
+		goto error;
+	init_waitqueue_head(&btp->bt_io_wait);
+
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 8bfb974..2518d09 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -115,6 +115,10 @@ typedef struct xfs_buftarg {
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_lru		bt_lru;
+
+	/* XXX: would atomic_t suffice? */
+	struct percpu_counter	bt_io_count;
+	wait_queue_head_t	bt_io_wait;
 } xfs_buftarg_t;
 
 struct xfs_buf;
@@ -183,6 +187,7 @@ typedef struct xfs_buf {
 	unsigned int		b_page_count;	/* size of page array */
 	unsigned int		b_offset;	/* page offset in first page */
 	int			b_error;	/* error code on I/O */
+	atomic_t		b_io_count;
 
 	/*
 	 * async write failure retry count. Initialised to zero on the first

> I have a semi-functional patch that holds more of a pure I/O count,
> which means the count is decremented immediately in xfs_buf_ioend()
> rather than deferred to release. One downside is that while this
> technically still resolves the original problem, it's racy in that the
> count is dropped before the buffer is added to the LRU. This still works
> for the original problem because we also drain the ioend workqueue in
> xfs_wait_buftarg(), but it's not correct because we allow for
> non-deferred completion in the event of I/O errors (i.e.,
> xfs_buf_ioend() called directly from xfs_buf_submit()).
> 
> Brian
> 
> > That will ensure readahead buffers are cached, and we capture both
> > async read and async write buffers in xfs_wait_buftarg().
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > _______________________________________________
> > xfs mailing list
> > xfs@oss.sgi.com
> > http://oss.sgi.com/mailman/listinfo/xfs
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2016-07-05 16:46 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-30 12:53 [PATCH] xfs: add readahead bufs to lru early to prevent post-unmount panic Brian Foster
2016-06-30 22:44 ` Dave Chinner
2016-06-30 23:56   ` Brian Foster
2016-07-01  4:33     ` Dave Chinner
2016-07-01 12:53       ` Brian Foster
2016-07-04  0:08         ` Dave Chinner
2016-07-05 13:42           ` Brian Foster
2016-07-01 22:30   ` Brian Foster
2016-07-05 16:45     ` Brian Foster [this message]
2016-07-11  5:20       ` Dave Chinner
2016-07-11 13:52         ` Brian Foster
2016-07-11 15:29           ` Brian Foster
2016-07-11 22:44             ` Dave Chinner
2016-07-12 12:03               ` Brian Foster
2016-07-12 17:22                 ` Brian Foster
2016-07-12 23:57                   ` Dave Chinner
2016-07-13 11:32                     ` Brian Foster
2016-07-13 12:49                       ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160705164552.GA6317@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.