* discard improvements
@ 2016-10-17 20:22 Christoph Hellwig
  2016-10-17 20:22 ` [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-17 20:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: michaelcallahan

This series improves XFS discard support so that we don't stall the
log commit thread, and improves performance for workloads with lots
of deletions.

The first patch is a fix for a long-standing allocator issue when we
have busy extents in the way of an allocation.  Without the async
discard support it was harder to hit, but it's basically been there
since we added busy extent tracking and non-synchronous delete
transactions.

The second patch is the main async discard support, and the third
one allows batching up multiple extents into a single discard bio,
which is a common occurrence.
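
As a rough illustration of what the third patch does, here is a
minimal stand-alone sketch of the adjacent-range merging; the
issue_discard() helper and the sample ranges are made up for
illustration, while the real code accumulates prev_bno/prev_len the
same way and issues bios via __blkdev_issue_discard:

#include <stdio.h>

struct extent {
	unsigned long long	bno;	/* start, in 512-byte sectors */
	unsigned long long	len;	/* length, in 512-byte sectors */
};

/* Stand-in for __blkdev_issue_discard(); just logs the range. */
static void issue_discard(unsigned long long bno, unsigned long long len)
{
	printf("discard [0x%llx, 0x%llx)\n", bno, bno + len);
}

/*
 * Merge physically adjacent extents into single discard requests.
 * The input must be sorted by block number, which is why the series
 * also changes the busy extent sort to compare the full block number.
 */
static void discard_merged(const struct extent *ext, int nr)
{
	unsigned long long prev_bno = 0, prev_len = 0;
	int have_prev = 0, i;

	for (i = 0; i < nr; i++) {
		if (have_prev && prev_bno + prev_len == ext[i].bno) {
			prev_len += ext[i].len;	/* extend the run */
			continue;
		}
		if (have_prev)
			issue_discard(prev_bno, prev_len);
		prev_bno = ext[i].bno;
		prev_len = ext[i].len;
		have_prev = 1;
	}
	if (have_prev)
		issue_discard(prev_bno, prev_len);
}

int main(void)
{
	/* The first three ranges are adjacent and merge into one. */
	const struct extent ext[] = {
		{ 0x100, 0x20 }, { 0x120, 0x10 }, { 0x130, 0x40 },
		{ 0x400, 0x08 },
	};

	discard_merged(ext, 4);
	return 0;
}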



* [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-10-17 20:22 discard improvements Christoph Hellwig
@ 2016-10-17 20:22 ` Christoph Hellwig
  2016-10-19 13:48   ` Brian Foster
  2016-11-08  6:15   ` Dave Chinner
  2016-10-17 20:22 ` [PATCH 2/3] xfs: don't block the log commit handler for discards Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-17 20:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: michaelcallahan

If we have too many busy extents, or just enough to make our wanted
allocation impossible, we will currently never move on to another AG
in xfs_alloc_vextent.  Change the loop exit condition to keep looking
for an AG if we can't allocate an extent.

For the single AG case we don't need to change anything, as the
higher layers have to handle it anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index effb64c..0a04bec 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2753,13 +2753,18 @@ xfs_alloc_vextent(
 				trace_xfs_alloc_vextent_nofix(args);
 				goto error0;
 			}
+
 			/*
-			 * If we get a buffer back then the allocation will fly.
+			 * If we get a buffer back then the allocation will fly,
+			 * unless there are busy extents in the way.
 			 */
 			if (args->agbp) {
-				if ((error = xfs_alloc_ag_vextent(args)))
+				error = xfs_alloc_ag_vextent(args);
+				if (error)
 					goto error0;
-				break;
+				if (args->agbno != NULLAGBLOCK)
+					break;
+				xfs_trans_brelse(args->tp, args->agbp);
 			}
 
 			trace_xfs_alloc_vextent_loopfailed(args);
-- 
2.1.4



* [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-17 20:22 discard improvements Christoph Hellwig
  2016-10-17 20:22 ` [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents Christoph Hellwig
@ 2016-10-17 20:22 ` Christoph Hellwig
  2016-10-17 23:29   ` Dave Chinner
  2016-10-17 20:22 ` [PATCH 3/3] xfs: merge discard requests Christoph Hellwig
  2016-10-28 10:11 ` discard improvements Avi Kivity
  3 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-17 20:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: michaelcallahan

Instead of blocking the log commit handler until the discards have
finished, submit the discard requests and use another workqueue to
release the extents from the extent busy list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_bmap.c | 11 ++++++-
 fs/xfs/xfs_discard.c     | 29 -----------------
 fs/xfs/xfs_discard.h     |  1 -
 fs/xfs/xfs_log_cil.c     | 84 +++++++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_log_priv.h    |  1 +
 fs/xfs/xfs_mount.c       |  1 +
 fs/xfs/xfs_super.c       |  8 +++++
 fs/xfs/xfs_super.h       |  2 ++
 8 files changed, 98 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c27344c..cbea3b2 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3684,10 +3684,12 @@ xfs_bmap_btalloc(
 	int		tryagain;
 	int		error;
 	int		stripe_align;
+	bool		trydiscard;
 
 	ASSERT(ap->length);
 
 	mp = ap->ip->i_mount;
+	trydiscard = (mp->m_flags & XFS_MOUNT_DISCARD);
 
 	/* stripe alignment for allocation is determined by mount parameters */
 	stripe_align = 0;
@@ -3708,7 +3710,7 @@ xfs_bmap_btalloc(
 		ASSERT(ap->length);
 	}
 
-
+retry:
 	nullfb = *ap->firstblock == NULLFSBLOCK;
 	fb_agno = nullfb ? NULLAGNUMBER : XFS_FSB_TO_AGNO(mp, *ap->firstblock);
 	if (nullfb) {
@@ -3888,6 +3890,13 @@ xfs_bmap_btalloc(
 			return error;
 		ap->dfops->dop_low = true;
 	}
+
+	if (args.fsbno == NULLFSBLOCK && trydiscard) {
+		trydiscard = false;
+		flush_workqueue(xfs_discard_wq);
+		goto retry;
+	}
+
 	if (args.fsbno != NULLFSBLOCK) {
 		/*
 		 * check the allocation happened at the same or higher AG than
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index 4ff499a..d796ffa 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -208,32 +208,3 @@ xfs_ioc_trim(
 		return -EFAULT;
 	return 0;
 }
-
-int
-xfs_discard_extents(
-	struct xfs_mount	*mp,
-	struct list_head	*list)
-{
-	struct xfs_extent_busy	*busyp;
-	int			error = 0;
-
-	list_for_each_entry(busyp, list, list) {
-		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
-					 busyp->length);
-
-		error = blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
-				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
-				XFS_FSB_TO_BB(mp, busyp->length),
-				GFP_NOFS, 0);
-		if (error && error != -EOPNOTSUPP) {
-			xfs_info(mp,
-	 "discard failed for extent [0x%llx,%u], error %d",
-				 (unsigned long long)busyp->bno,
-				 busyp->length,
-				 error);
-			return error;
-		}
-	}
-
-	return 0;
-}
diff --git a/fs/xfs/xfs_discard.h b/fs/xfs/xfs_discard.h
index 344879a..0f070f9 100644
--- a/fs/xfs/xfs_discard.h
+++ b/fs/xfs/xfs_discard.h
@@ -5,6 +5,5 @@ struct fstrim_range;
 struct list_head;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
-extern int	xfs_discard_extents(struct xfs_mount *, struct list_head *);
 
 #endif /* XFS_DISCARD_H */
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index a4ab192..82f1cbc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -30,6 +30,9 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 #include "xfs_log_priv.h"
+#include "xfs_trace.h"
+
+struct workqueue_struct *xfs_discard_wq;
 
 /*
  * Allocate a new ticket. Failing to get a new ticket makes it really hard to
@@ -491,6 +494,75 @@ xlog_cil_free_logvec(
 	}
 }
 
+static void
+xlog_discard_endio_work(
+	struct work_struct	*work)
+{
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, discard_endio_work);
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
+
+	xfs_extent_busy_clear(mp, &ctx->busy_extents, false);
+	kmem_free(ctx);
+}
+
+/*
+ * Queue up the actual completion to a thread to avoid IRQ-safe locking for
+ * pagb_lock.  Note that we need an unbounded workqueue, otherwise we might
+ * get the execution delayed up to 30 seconds for weird reasons.
+ */
+static void
+xlog_discard_endio(
+	struct bio		*bio)
+{
+	struct xfs_cil_ctx	*ctx = bio->bi_private;
+
+	INIT_WORK(&ctx->discard_endio_work, xlog_discard_endio_work);
+	queue_work(xfs_discard_wq, &ctx->discard_endio_work);
+}
+
+static void
+xlog_discard_busy_extents(
+	struct xfs_mount	*mp,
+	struct xfs_cil_ctx	*ctx)
+{
+	struct list_head	*list = &ctx->busy_extents;
+	struct xfs_extent_busy	*busyp;
+	struct bio		*bio = NULL;
+	struct blk_plug		plug;
+	int			error = 0;
+
+	ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
+
+	blk_start_plug(&plug);
+	list_for_each_entry(busyp, list, list) {
+		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
+					 busyp->length);
+
+		error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
+				XFS_FSB_TO_BB(mp, busyp->length),
+				GFP_NOFS, 0, &bio);
+		if (error && error != -EOPNOTSUPP) {
+			xfs_info(mp,
+	 "discard failed for extent [0x%llx,%u], error %d",
+				 (unsigned long long)busyp->bno,
+				 busyp->length,
+				 error);
+			break;
+		}
+	}
+
+	if (bio) {
+		bio->bi_private = ctx;
+		bio->bi_end_io = xlog_discard_endio;
+		submit_bio(bio);
+	} else {
+		xlog_discard_endio_work(&ctx->discard_endio_work);
+	}
+	blk_finish_plug(&plug);
+}
+
 /*
  * Mark all items committed and clear busy extents. We free the log vector
  * chains in a separate pass so that we unpin the log items as quickly as
@@ -525,14 +597,10 @@ xlog_cil_committed(
 
 	xlog_cil_free_logvec(ctx->lv_chain);
 
-	if (!list_empty(&ctx->busy_extents)) {
-		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
-
-		xfs_discard_extents(mp, &ctx->busy_extents);
-		xfs_extent_busy_clear(mp, &ctx->busy_extents, false);
-	}
-
-	kmem_free(ctx);
+	if (!list_empty(&ctx->busy_extents))
+		xlog_discard_busy_extents(mp, ctx);
+	else
+		kmem_free(ctx);
 }
 
 /*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2b6eec5..c2604a5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -257,6 +257,7 @@ struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct xfs_log_callback	log_cb;		/* completion callback hook. */
 	struct list_head	committing;	/* ctx committing list */
+	struct work_struct	discard_endio_work;
 };
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fc78739..2ec411c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1052,6 +1052,7 @@ xfs_unmountfs(
 	cancel_delayed_work_sync(&mp->m_cowblocks_work);
 
 	xfs_fs_unreserve_ag_blocks(mp);
+	flush_workqueue(xfs_discard_wq);
 	xfs_qm_unmount_quotas(mp);
 	xfs_rtunmount_inodes(mp);
 	IRELE(mp->m_rootip);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 56ad5f3..1900233 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1950,12 +1950,20 @@ xfs_init_workqueues(void)
 	if (!xfs_alloc_wq)
 		return -ENOMEM;
 
+	xfs_discard_wq = alloc_workqueue("xfsdiscard", WQ_UNBOUND, 0);
+	if (!xfs_discard_wq)
+		goto out_free_alloc_wq;
+
 	return 0;
+out_free_alloc_wq:
+	destroy_workqueue(xfs_alloc_wq);
+	return -ENOMEM;
 }
 
 STATIC void
 xfs_destroy_workqueues(void)
 {
+	destroy_workqueue(xfs_discard_wq);
 	destroy_workqueue(xfs_alloc_wq);
 }
 
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index b6418ab..5f2f324 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -73,6 +73,8 @@ extern const struct quotactl_ops xfs_quotactl_operations;
 
 extern void xfs_reinit_percpu_counters(struct xfs_mount *mp);
 
+extern struct workqueue_struct *xfs_discard_wq;
+
 #define XFS_M(sb)		((struct xfs_mount *)((sb)->s_fs_info))
 
 #endif	/* __XFS_SUPER_H__ */
-- 
2.1.4



* [PATCH 3/3] xfs: merge discard requests
  2016-10-17 20:22 discard improvements Christoph Hellwig
  2016-10-17 20:22 ` [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents Christoph Hellwig
  2016-10-17 20:22 ` [PATCH 2/3] xfs: don't block the log commit handler for discards Christoph Hellwig
@ 2016-10-17 20:22 ` Christoph Hellwig
  2016-10-28 10:11 ` discard improvements Avi Kivity
  3 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-17 20:22 UTC (permalink / raw)
  To: linux-xfs; +Cc: michaelcallahan

We often have multiple busy extent entries that are adjacent.  Merge them
before calling __blkdev_issue_discard to give the drive larger requests.

To allow for better merging, the busy extent sort function is also updated
to sort by the full block number and not just the AG number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_extent_busy.c | 12 ++++++++----
 fs/xfs/xfs_log_cil.c     | 44 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 162dc18..50c277e 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -596,9 +596,13 @@ xfs_extent_busy_clear(
 int
 xfs_extent_busy_ag_cmp(
 	void			*priv,
-	struct list_head	*a,
-	struct list_head	*b)
+	struct list_head	*l1,
+	struct list_head	*l2)
 {
-	return container_of(a, struct xfs_extent_busy, list)->agno -
-		container_of(b, struct xfs_extent_busy, list)->agno;
+	struct xfs_extent_busy	*b1 =
+		container_of(l1, struct xfs_extent_busy, list);
+	struct xfs_extent_busy	*b2 =
+		container_of(l2, struct xfs_extent_busy, list);
+
+	return b1->agno - b2->agno ?: b1->bno - b2->bno;
 }
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 82f1cbc..c30b19f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -526,11 +526,14 @@ xlog_discard_busy_extents(
 	struct xfs_mount	*mp,
 	struct xfs_cil_ctx	*ctx)
 {
+	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
 	struct list_head	*list = &ctx->busy_extents;
 	struct xfs_extent_busy	*busyp;
 	struct bio		*bio = NULL;
 	struct blk_plug		plug;
 	int			error = 0;
+	bool			have_prev = false;
+	sector_t		bno, len, prev_bno, prev_len;
 
 	ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
 
@@ -539,18 +542,37 @@ xlog_discard_busy_extents(
 		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
 					 busyp->length);
 
-		error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
-				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
-				XFS_FSB_TO_BB(mp, busyp->length),
-				GFP_NOFS, 0, &bio);
-		if (error && error != -EOPNOTSUPP) {
-			xfs_info(mp,
-	 "discard failed for extent [0x%llx,%u], error %d",
-				 (unsigned long long)busyp->bno,
-				 busyp->length,
-				 error);
-			break;
+		bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
+		len = XFS_FSB_TO_BB(mp, busyp->length);
+
+		if (have_prev) {
+			if (prev_bno + prev_len == bno) {
+				prev_len += len;
+				continue;
+			}
+
+			error = __blkdev_issue_discard(bdev, prev_bno, prev_len,
+					GFP_NOFS, 0, &bio);
+			if (error && error != -EOPNOTSUPP)
+				goto done;
 		}
+
+		prev_bno = bno;
+		prev_len = len;
+		have_prev = true;
+	}
+
+	if (have_prev) {
+		error = __blkdev_issue_discard(bdev, prev_bno, prev_len,
+				GFP_NOFS, 0, &bio);
+	}
+
+done:
+	if (error && error != -EOPNOTSUPP) {
+		xfs_info(mp,
+			"discard failed for extent [0x%llx,0x%llx], error %d",
+			(unsigned long long)bno, (unsigned long long)len,
+			error);
 	}
 
 	if (bio) {
-- 
2.1.4



* Re: [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-17 20:22 ` [PATCH 2/3] xfs: don't block the log commit handler for discards Christoph Hellwig
@ 2016-10-17 23:29   ` Dave Chinner
  2016-10-18  5:05     ` Christoph Hellwig
  2016-10-19 10:58     ` Christoph Hellwig
  0 siblings, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2016-10-17 23:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, michaelcallahan

On Mon, Oct 17, 2016 at 10:22:32PM +0200, Christoph Hellwig wrote:
> Instead of blocking the log commit handler until the discards have
> finished, submit the discard requests and use another workqueue to
> release the extents from the extent busy list.

I like the idea, and have toyed with it in the past. A couple of
questions about the implementation, though....

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 11 ++++++-
>  fs/xfs/xfs_discard.c     | 29 -----------------
>  fs/xfs/xfs_discard.h     |  1 -
>  fs/xfs/xfs_log_cil.c     | 84 +++++++++++++++++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_log_priv.h    |  1 +
>  fs/xfs/xfs_mount.c       |  1 +
>  fs/xfs/xfs_super.c       |  8 +++++
>  fs/xfs/xfs_super.h       |  2 ++
>  8 files changed, 98 insertions(+), 39 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index c27344c..cbea3b2 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3684,10 +3684,12 @@ xfs_bmap_btalloc(
>  	int		tryagain;
>  	int		error;
>  	int		stripe_align;
> +	bool		trydiscard;
>  
>  	ASSERT(ap->length);
>  
>  	mp = ap->ip->i_mount;
> +	trydiscard = (mp->m_flags & XFS_MOUNT_DISCARD);
>  
>  	/* stripe alignment for allocation is determined by mount parameters */
>  	stripe_align = 0;
> @@ -3708,7 +3710,7 @@ xfs_bmap_btalloc(
>  		ASSERT(ap->length);
>  	}
>  
> -
> +retry:
>  	nullfb = *ap->firstblock == NULLFSBLOCK;
>  	fb_agno = nullfb ? NULLAGNUMBER : XFS_FSB_TO_AGNO(mp, *ap->firstblock);
>  	if (nullfb) {
> @@ -3888,6 +3890,13 @@ xfs_bmap_btalloc(
>  			return error;
>  		ap->dfops->dop_low = true;
>  	}
> +
> +	if (args.fsbno == NULLFSBLOCK && trydiscard) {
> +		trydiscard = false;
> +		flush_workqueue(xfs_discard_wq);
> +		goto retry;
> +	}

So this is the new behaviour that triggers flushing of the discard
list rather than having it occur from a log force inside
xfs_extent_busy_update_extent().

However, xfs_extent_busy_update_extent() also has backoff when it
finds an extent on the busy list being discarded, which means it
could spin waiting for the discard work to complete.

Wouldn't it be better to trigger this workqueue flush in
xfs_extent_busy_update_extent() in both these cases so that the
behaviour remains the same for userdata allocations hitting
uncommitted busy extents, but also allow us to remove the spinning
for allocations where the busy extent is currently being discarded?
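
Something like the sketch below, perhaps - note that the surrounding
control flow in xfs_extent_busy_update_extent() is paraphrased here,
so the restart label and the exact locking shape are assumptions, not
the actual code:

	if (busyp->flags & XFS_EXTENT_BUSY_DISCARDED) {
		/*
		 * Instead of the current delay-and-retry backoff, drop
		 * the lock and wait for the discard workqueue to drain
		 * so the extent is off the busy list when we retry.
		 */
		spin_unlock(&pag->pagb_lock);
		flush_workqueue(xfs_discard_wq);
		spin_lock(&pag->pagb_lock);
		goto restart;
	}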

> +xlog_discard_endio_work(
> +	struct work_struct	*work)
> +{
> +	struct xfs_cil_ctx	*ctx =
> +		container_of(work, struct xfs_cil_ctx, discard_endio_work);
> +	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
> +
> +	xfs_extent_busy_clear(mp, &ctx->busy_extents, false);
> +	kmem_free(ctx);
> +}
> +
> +/*
> + * Queue up the actual completion to a thread to avoid IRQ-safe locking for
> + * pagb_lock.  Note that we need an unbounded workqueue, otherwise we might
> + * get the execution delayed up to 30 seconds for weird reasons.
> + */
> +static void
> +xlog_discard_endio(
> +	struct bio		*bio)
> +{
> +	struct xfs_cil_ctx	*ctx = bio->bi_private;
> +
> +	INIT_WORK(&ctx->discard_endio_work, xlog_discard_endio_work);
> +	queue_work(xfs_discard_wq, &ctx->discard_endio_work);
> +}
> +
> +static void
> +xlog_discard_busy_extents(
> +	struct xfs_mount	*mp,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	struct list_head	*list = &ctx->busy_extents;
> +	struct xfs_extent_busy	*busyp;
> +	struct bio		*bio = NULL;
> +	struct blk_plug		plug;
> +	int			error = 0;
> +
> +	ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
> +
> +	blk_start_plug(&plug);
> +	list_for_each_entry(busyp, list, list) {
> +		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
> +					 busyp->length);
> +
> +		error = __blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
> +				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
> +				XFS_FSB_TO_BB(mp, busyp->length),
> +				GFP_NOFS, 0, &bio);
> +		if (error && error != -EOPNOTSUPP) {
> +			xfs_info(mp,
> +	 "discard failed for extent [0x%llx,%u], error %d",
> +				 (unsigned long long)busyp->bno,
> +				 busyp->length,
> +				 error);
> +			break;
> +		}
> +	}
> +
> +	if (bio) {
> +		bio->bi_private = ctx;
> +		bio->bi_end_io = xlog_discard_endio;
> +		submit_bio(bio);
> +	} else {
> +		xlog_discard_endio_work(&ctx->discard_endio_work);
> +	}

This creates one long bio chain with all the regions to discard on
it, and then when it all completes we call xlog_discard_endio() to
release all the busy extents.

Why not pull the busy extent from the list and attach it to each
bio returned and submit them individually and run per-busy extent
completions? That will substantially reduce the latency of discard
completions when there are long lists of extents to discard....
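
For illustration, the per-extent variant might look roughly like the
sketch below.  It assumes struct xfs_extent_busy grew an endio_work
member and an mp back-pointer (neither exists in this series), that
the submitter pulls each extent off ctx->busy_extents before wiring
it up to its own bio, and it glosses over the pagb_lock details:

static void
xlog_discard_extent_endio_work(
	struct work_struct	*work)
{
	struct xfs_extent_busy	*busyp =
		container_of(work, struct xfs_extent_busy, endio_work);
	LIST_HEAD(list);

	/* Clear just this one extent instead of the whole ctx list. */
	list_add(&busyp->list, &list);
	xfs_extent_busy_clear(busyp->mp, &list, false);
}

static void
xlog_discard_extent_endio(
	struct bio		*bio)
{
	struct xfs_extent_busy	*busyp = bio->bi_private;

	INIT_WORK(&busyp->endio_work, xlog_discard_extent_endio_work);
	queue_work(xfs_discard_wq, &busyp->endio_work);
	bio_put(bio);	/* drop the submitter's reference */
}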

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-17 23:29   ` Dave Chinner
@ 2016-10-18  5:05     ` Christoph Hellwig
  2016-10-19 10:58     ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-18  5:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Tue, Oct 18, 2016 at 10:29:08AM +1100, Dave Chinner wrote:
> > +	if (args.fsbno == NULLFSBLOCK && trydiscard) {
> > +		trydiscard = false;
> > +		flush_workqueue(xfs_discard_wq);
> > +		goto retry;
> > +	}
> 
> So this is the new behaviour that triggers flushing of the discard
> list rather than having it occur from a log force inside
> xfs_extent_busy_update_extent().
> 
> However, xfs_extent_busy_update_extent() also has backoff when it
> finds an extent on the busy list being discarded, which means it
> could spin waiting for the discard work to complete.
> 
> Wouldn't it be better to trigger this workqueue flush in
> xfs_extent_busy_update_extent() in both these cases so that the
> behaviour remains the same for userdata allocations hitting
> uncommitted busy extents, but also allow us to remove the spinning
> for allocations where the busy extent is currently being discarded?

I've actually tried that, but I never managed to actually hit that
case.  For some reason we mostly hit busy extents, but not those
actually being discarded in my tests - I suspect it's just
the better log throughput we get.  But for completeness we should
probably force it from xfs_extent_busy_update_extent as well.  I'll
see if I can hit it with setups that have slow discard (e.g. old
SATA SSDs without queued trim).

> This creates one long bio chain with all the regions to discard on
> it, and then when it all completes we call xlog_discard_endio() to
> release all the busy extents.
> 
> Why not pull the busy extent from the list and attach it to each
> bio returned and submit them individually and run per-busy extent
> completions? That will substantially reduce the latency of discard
> completions when there are long lists of extents to discard....

I'll look into that.


* Re: [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-17 23:29   ` Dave Chinner
  2016-10-18  5:05     ` Christoph Hellwig
@ 2016-10-19 10:58     ` Christoph Hellwig
  2016-10-28 16:16       ` Michael Callahan
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-19 10:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Tue, Oct 18, 2016 at 10:29:08AM +1100, Dave Chinner wrote:
> > +	if (args.fsbno == NULLFSBLOCK && trydiscard) {
> > +		trydiscard = false;
> > +		flush_workqueue(xfs_discard_wq);
> > +		goto retry;
> > +	}
> 
> So this is the new behaviour that triggers flushing of the discard
> list rather than having it occur from a log force inside
> xfs_extent_busy_update_extent().
> 
> However, xfs_extent_busy_update_extent() also has backoff when it
> finds an extent on the busy list being discarded, which means it
> could spin waiting for the discard work to complete.
> 
> Wouldn't it be better to trigger this workqueue flush in
> xfs_extent_busy_update_extent() in both these cases so that the
> behaviour remains the same for userdata allocations hitting
> uncommitted busy extents, but also allow us to remove the spinning
> for allocations where the busy extent is currently being discarded?

So the current xfs_extent_busy_update_extent busy wait is something we
actually never hit at all - it's only hit when an extent under discard
is reused by an AGFL allocation, which basically does not happen.

I'm not feeling very eager to touch that corner case code, and would
rather leave it as-is.

The new flush deals with the case where we weren't able to find any space
due to the discard list.  To be honest, I almost can't trigger it
anymore now that I've found the issue fixed in patch 1.  It might be
possible to even drop this retry entirely now.

> This creates one long bio chain with all the regions to discard on
> it, and then when all it completes we call xlog_discard_endio() to
> release all the busy extents.
> 
> Why not pull the busy extent from the list and attach it to each
> bio returned and submit them individually and run per-busy extent
> completions? That will substantially reduce the latency of discard
> completions when there are long lists of extents to discard....

Because that would defeat the merging I currently do, which is
very effective.  It would also increase the size of the busy extent
structure as it would grow a work_struct, and increase lock contention
in the completion handler.  All in all not that pretty, especially
as the most common number of discards is in the single digits or low
double digits.  And this is just going to decrease further once I finish
up my block layer patches to allow multi-range discards by merging
multiple discard bios into a single request.  With that even double
digit numbers of discards are fairly rare.

Now if we eventually want to split the completions I think we'll
need to start merging the extent_busy structures once they are added
to the CIL.  That's quite a bit of effort and I'd like to avoid it
for now.


* Re: [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-10-17 20:22 ` [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents Christoph Hellwig
@ 2016-10-19 13:48   ` Brian Foster
  2016-10-21 12:48     ` Christoph Hellwig
  2016-11-08  6:15   ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Foster @ 2016-10-19 13:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, michaelcallahan

On Mon, Oct 17, 2016 at 10:22:31PM +0200, Christoph Hellwig wrote:
> If we have too many busy extents, or just enough to make our wanted
> allocation impossible, we will currently never move on to another AG
> in xfs_alloc_vextent.  Change the loop exit condition to keep looking
> for an AG if we can't allocate an extent.
> 
> For the single AG case we don't need to change anything, as the
> higher layers have to handle it anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index effb64c..0a04bec 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2753,13 +2753,18 @@ xfs_alloc_vextent(
>  				trace_xfs_alloc_vextent_nofix(args);
>  				goto error0;
>  			}
> +
>  			/*
> -			 * If we get a buffer back then the allocation will fly.
> +			 * If we get a buffer back then the allocation will fly,
> +			 * unless there are busy extents in the way.
>  			 */
>  			if (args->agbp) {
> -				if ((error = xfs_alloc_ag_vextent(args)))
> +				error = xfs_alloc_ag_vextent(args);
> +				if (error)
>  					goto error0;
> -				break;
> +				if (args->agbno != NULLAGBLOCK)
> +					break;
> +				xfs_trans_brelse(args->tp, args->agbp);

How is this safe with respect to xfs_alloc_fix_freelist() potentially
dirtying the agf? Haven't we had deadlock and/or other problems due to
xfs_alloc_fix_freelist() succeeding when an allocation ultimately fails,
and subsequently rotating to a potentially lower agno?

Brian

>  			}
>  
>  			trace_xfs_alloc_vextent_loopfailed(args);
> -- 
> 2.1.4
> 


* Re: [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-10-19 13:48   ` Brian Foster
@ 2016-10-21 12:48     ` Christoph Hellwig
  2016-10-21 14:41       ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-21 12:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Wed, Oct 19, 2016 at 09:48:08AM -0400, Brian Foster wrote:
> >  			if (args->agbp) {
> > -				if ((error = xfs_alloc_ag_vextent(args)))
> > +				error = xfs_alloc_ag_vextent(args);
> > +				if (error)
> >  					goto error0;
> > -				break;
> > +				if (args->agbno != NULLAGBLOCK)
> > +					break;
> > +				xfs_trans_brelse(args->tp, args->agbp);
> 
> How is this safe with respect to xfs_alloc_fix_freelist() potentially
> dirtying the agf? Haven't we had deadlock and/or other problems due to
> xfs_alloc_fix_freelist() succeeding when an allocation ultimately fails,
> and subsequently rotating to a potentially lower agno?

We've always had the case where the allocation can fail despite
xfs_alloc_fix_freelist succeeding; it's just that with discard in
general, and async discard in particular, we can hit it much more
easily.

The only way to make the allocation never fail once
xfs_alloc_fix_freelist has succeeded would be to force out any busy
extents at the low level if we are tight on space; I'll have to see
how doable that would be.

The other option would be to roll around the transaction when switching
to a different AG so that we can release the locks.  That sounds a lot
easier, and also less fragile in the long run.


* Re: [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-10-21 12:48     ` Christoph Hellwig
@ 2016-10-21 14:41       ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2016-10-21 14:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, michaelcallahan

On Fri, Oct 21, 2016 at 02:48:24PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 19, 2016 at 09:48:08AM -0400, Brian Foster wrote:
> > >  			if (args->agbp) {
> > > -				if ((error = xfs_alloc_ag_vextent(args)))
> > > +				error = xfs_alloc_ag_vextent(args);
> > > +				if (error)
> > >  					goto error0;
> > > -				break;
> > > +				if (args->agbno != NULLAGBLOCK)
> > > +					break;
> > > +				xfs_trans_brelse(args->tp, args->agbp);
> > 
> > How is this safe with respect to xfs_alloc_fix_freelist() potentially
> > dirtying the agf? Haven't we had deadlock and/or other problems due to
> > xfs_alloc_fix_freelist() succeeding when an allocation ultimately fails,
> > and subsequently rotating to a potentially lower agno?
> 
> We've always had the case where the allocation can fail despite
> xfs_alloc_fix_freelist succeeding; it's just that with discard in
> general, and async discard in particular, we can hit it much more
> easily.
> 

Perhaps, but I think that is beside the point. IME, any time that has
resulted in such a deadlock, we attribute it to the fact that
xfs_alloc_fix_freelist() succeeded in a case where it shouldn't have. I
don't think we should introduce more such cases if we can help it.

> The only way to make the allocation never fail once
> xfs_alloc_fix_freelist has succeeded would be to force out any busy
> extents at the low level if we are tight on space; I'll have to see
> how doable that would be.
> 
> The other option would be to roll around the transaction when switching
> to a different AG so that we can release the locks.  That sounds a lot
> easier, and also less fragile in the long run.

That sounds reasonable, provided we don't have any partial modifications
or whatnot in the transaction. Another angle might be to find a way to
take the busy extents into consideration before xfs_alloc_fix_freelist()
actually makes any changes, but I also don't know how straightforward
that might be either.

Brian



* Re: discard improvements
  2016-10-17 20:22 discard improvements Christoph Hellwig
                   ` (2 preceding siblings ...)
  2016-10-17 20:22 ` [PATCH 3/3] xfs: merge discard requests Christoph Hellwig
@ 2016-10-28 10:11 ` Avi Kivity
  2016-10-31 15:14   ` Christoph Hellwig
  3 siblings, 1 reply; 19+ messages in thread
From: Avi Kivity @ 2016-10-28 10:11 UTC (permalink / raw)
  To: Christoph Hellwig, linux-xfs; +Cc: michaelcallahan

On 10/17/2016 11:22 PM, Christoph Hellwig wrote:
> This series improves XFS discard support so that we don't stall the
> log commit thread, and improves performance for workloads with lots
> of deletions.
>
> The first patch is a fix for a long-standing allocator issue when we
> have busy extents in the way of an allocation.  Without the async
> discard support it was harder to hit, but it's basically been there
> since we added busy extent tracking and non-synchronous delete
> transactions.
>
> The second patch is the main async discard support, and the third
> one allows batching up multiple extents into a single discard bio,
> which is a common occurrence.
>

With these patches, is it reasonable to run with discard enabled, or is 
more work needed?


* Re: [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-19 10:58     ` Christoph Hellwig
@ 2016-10-28 16:16       ` Michael Callahan
  2016-10-31 15:16         ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Michael Callahan @ 2016-10-28 16:16 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: linux-xfs



On 10/19/16, 4:58 AM, "Christoph Hellwig" <hch@lst.de> wrote:

>On Tue, Oct 18, 2016 at 10:29:08AM +1100, Dave Chinner wrote:
>> > +	if (args.fsbno == NULLFSBLOCK && trydiscard) {
>> > +		trydiscard = false;
>> > +		flush_workqueue(xfs_discard_wq);
>> > +		goto retry;
>> > +	}
>> 
>> So this is the new behaviour that triggers flushing of the discard
>> list rather than having it occur from a log force inside
>> xfs_extent_busy_update_extent().
>> 
>> However, xfs_extent_busy_update_extent() also has backoff when it
>> finds an extent on the busy list being discarded, which means it
>> could spin waiting for the discard work to complete.
>> 
>> Wouldn't it be better to trigger this workqueue flush in
>> xfs_extent_busy_update_extent() in both these cases so that the
>> behaviour remains the same for userdata allocations hitting
>> uncommitted busy extents, but also allow us to remove the spinning
>> for allocations where the busy extent is currently being discarded?
>
>So the current xfs_extent_busy_update_extent busy wait is something we
>actually never hit at all - it's only hit when an extent under discard
>is reused by an AGFL allocation, which basically does not happen.
>
>I'm not feeling very eager to touch that corner case code, and would
>rather leave it as-is.
>
>The new flush deals with the case where we weren't able to find any space
>due to the discard list.  To be honest, I almost can't trigger it
>anymore now that I've found the issue fixed in patch 1.  It might be
>possible to even drop this retry entirely now.
>
>> This creates one long bio chain with all the regions to discard on
>> it, and then when it all completes we call xlog_discard_endio() to
>> release all the busy extents.
>> 
>> Why not pull the busy extent from the list and attach it to each
>> bio returned and submit them individually and run per-busy extent
>> completions? That will substantially reduce the latency of discard
>> completions when there are long lists of extents to discard....
>
>Because that would defeat the merging I currently do, which is
>very effective.  It would also increase the size of the busy extent
>structure as it would grow a work_struct, and increase lock contention
>in the completion handler.  All in all not that pretty, especially
>as the most common number of discards is in the single digits or low
>double digits.  And this is just going to decrease further once I finish
>up my block layer patches to allow multi-range discards by merging
>multiple discard bios into a single request.  With that even double
>digit numbers of discards are fairly rare.
>
>Now if we eventually want to split the completions I think we'll
>need to start merging the extent_busy structures once they are added
>to the CIL.  That's quite a bit of effort and I'd like to avoid it
>for now.

Doesn't the block layer already do a reasonable job of merging adjacent
discards?  This is about the only bio-level optimization that blk-mq does,
but it should be working.

Also, last I looked, the md layer of the software RAID stack could re-dice
these into many stripe-sized pieces anyway, and that also needed to be
fixed.

  Michael



* Re: discard improvements
  2016-10-28 10:11 ` discard improvements Avi Kivity
@ 2016-10-31 15:14   ` Christoph Hellwig
  2016-11-05 18:18     ` Avi Kivity
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-31 15:14 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Fri, Oct 28, 2016 at 01:11:55PM +0300, Avi Kivity wrote:
> With these patches, is it reasonable to run with discard enabled, or is 
> more work needed?

As usual the answer is: it depends.  With the high-end NVMe devices
I've been testing with you absolutely should enable discard.  With
SAS SSDs the same is probably true.  I haven't done much testing with
SATA SSDs but I'd be more cautious there, especially as very few seem
to support queued TRIM yet.


* Re: [PATCH 2/3] xfs: don't block the log commit handler for discards
  2016-10-28 16:16       ` Michael Callahan
@ 2016-10-31 15:16         ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-31 15:16 UTC (permalink / raw)
  To: Michael Callahan; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Fri, Oct 28, 2016 at 04:16:01PM +0000, Michael Callahan wrote:
> Doesn't the block layer already do a reasonable job of merging adjacent
> discards?

It certainly didn't with my fast NVMe device.  But even if it did,
we have all the information available here, so it's still cheaper to
do it in the submitter.  One thing we should try is to move it into
__blkdev_issue_discard in the future, though.

Note that I've also started to look into merging disjoint discard
ranges into a single request for ranged trim/unmap/deallocate, and that
also seems to work nicely, but it will need some large tree-wide
changes.


* Re: discard improvements
  2016-10-31 15:14   ` Christoph Hellwig
@ 2016-11-05 18:18     ` Avi Kivity
  2016-11-06 16:36       ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Avi Kivity @ 2016-11-05 18:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, michaelcallahan

On 10/31/2016 05:14 PM, Christoph Hellwig wrote:
> On Fri, Oct 28, 2016 at 01:11:55PM +0300, Avi Kivity wrote:
>> With these patches, is it reasonable to run with discard enabled, or is
>> more work needed?
> As usual the answer is: it depends.  With the high-end NVMe devices
> I've been testing with you absolutely should enable discard.  With
> SAS SSDs the same is probably true.  I haven't done much testing with
> SATA SSDs but I'd be more cautious there, especially as very few seem
> to support queued TRIM yet.

Thanks.

The big problem with "it depends" is that usually the information it 
depends on is not available, so the application has to rely on a human 
to guess correctly.  With the trend for machines to be managed by 
machines, it's really hard to get optimal performance.  So exposure of 
any information that can help to make a decision, including that the 
kernel can make good use of TRIM, will be very helpful.


* Re: discard improvements
  2016-11-05 18:18     ` Avi Kivity
@ 2016-11-06 16:36       ` Christoph Hellwig
  2016-11-06 23:21         ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-11-06 16:36 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Sat, Nov 05, 2016 at 08:18:05PM +0200, Avi Kivity wrote:
> The big problem with "it depends" is that usually the information it 
> depends on is not available, so the application has to rely on a human to 
> guess correctly.  With the trend for machines to be managed by machines, 
> it's really hard to get optimal performance.  So exposure of any 
> information that can help to make a decision, including that the kernel can 
> make good use of TRIM, will be very helpful.

I agree with you, but until devices tell us on their own that their
discard implementation sucks, there is not much I can do to help
you there.


* Re: discard improvements
  2016-11-06 16:36       ` Christoph Hellwig
@ 2016-11-06 23:21         ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2016-11-06 23:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Avi Kivity, linux-xfs, michaelcallahan

On Sun, Nov 06, 2016 at 05:36:42PM +0100, Christoph Hellwig wrote:
> On Sat, Nov 05, 2016 at 08:18:05PM +0200, Avi Kivity wrote:
> > The big problem with "it depends" is that usually the information it 
> > depends on is not available, so the application has to rely on a human to 
> > guess correctly.  With the trend for machines to be managed by machines, 
> > it's really hard to get optimal performance.  So exposure of any 
> > information that can help to make a decision, including that the kernel can 
> > make good use of TRIM, will be very helpful.
> 
> I agree with you, but until devices tell us on their own that their
> discard implementation sucks, there is not much I can do to help
> you there.

The problem is that we keep seeing cases where devices fail discards
because of firmware issues. These problems are not going away, and
it's not just the cheap SSDs that are the problem here.

e.g. an 800GB Intel P3700 NVMe drive that was reported to the ext4
list last week failed discard for all LBAs over 700GB.  Whatever
the problem was, it was quietly fixed in a firmware update...

https://bugzilla.kernel.org/show_bug.cgi?id=186551

If we can't trust the high end enterprise SSDs to have a reliable
discard implementation, then discard will continue to be a "use at
your own risk" kind of feature...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-10-17 20:22 ` [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents Christoph Hellwig
  2016-10-19 13:48   ` Brian Foster
@ 2016-11-08  6:15   ` Dave Chinner
  2016-11-10 19:17     ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2016-11-08  6:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, michaelcallahan

On Mon, Oct 17, 2016 at 10:22:31PM +0200, Christoph Hellwig wrote:
> If we have too many busy extents, or just enough to make our wanted
> allocation impossible, we will currently never move on to another AG
> in xfs_alloc_vextent.  Change the loop exit condition to keep looking
> for an AG if we can't allocate an extent.
> 
> For the single AG case we don't need to change anything, as the
> higher layers have to handle it anyway.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index effb64c..0a04bec 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2753,13 +2753,18 @@ xfs_alloc_vextent(
>  				trace_xfs_alloc_vextent_nofix(args);
>  				goto error0;
>  			}
> +
>  			/*
> -			 * If we get a buffer back then the allocation will fly.
> +			 * If we get a buffer back then the allocation will fly,
> +			 * unless there are busy extents in the way.
>  			 */
>  			if (args->agbp) {
> -				if ((error = xfs_alloc_ag_vextent(args)))
> +				error = xfs_alloc_ag_vextent(args);
> +				if (error)
>  					goto error0;
> -				break;
> +				if (args->agbno != NULLAGBLOCK)
> +					break;
> +				xfs_trans_brelse(args->tp, args->agbp);
>  			}
>  
>  			trace_xfs_alloc_vextent_loopfailed(args);

Here's the problem: At ENOSPC, we dirty AG headers fixing freelists,
then fail to allocate inodes, resulting in a shutdown - a regression
that xfs/076 trips over quite regularly on my test machines.

XFS (pmem1): Internal error xfs_trans_cancel at line 983 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x498/0x780
CPU: 3 PID: 19236 Comm: touch Not tainted 4.9.0-rc4-dgc+ #1019
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
 ffffc900144dbb28 ffffffff81822480 ffff880239580500 ffff88023abd8000
 ffffc900144dbb40 ffffffff8152340c ffffffff81534708 ffffc900144dbb80
 ffffffff8154570b 0000000100000000 ffff88032e9ff800 ffff88023abd8000
Call Trace:
 [<ffffffff81822480>] dump_stack+0x63/0x83
 [<ffffffff8152340c>] xfs_error_report+0x3c/0x40
 [<ffffffff81534708>] ? xfs_create+0x498/0x780
 [<ffffffff8154570b>] xfs_trans_cancel+0x12b/0x150
 [<ffffffff81534708>] xfs_create+0x498/0x780
 [<ffffffff815325af>] xfs_generic_create+0x1df/0x2c0
 [<ffffffff815326c6>] xfs_vn_create+0x16/0x20
 [<ffffffff812120c2>] path_openat+0x1312/0x13e0
 [<ffffffff811a4c86>] ? unlock_page+0x36/0x40
 [<ffffffff8121335e>] do_filp_open+0x7e/0xe0
 [<ffffffff8121234f>] ? getname_flags+0x4f/0x1f0
 [<ffffffff811f81c2>] ? kmem_cache_alloc+0x42/0x180
 [<ffffffff81c8e960>] ? _raw_spin_unlock+0x10/0x30
 [<ffffffff812219e2>] ? __alloc_fd+0xb2/0x160
 [<ffffffff81200fe3>] do_sys_open+0x123/0x200
 [<ffffffff812010de>] SyS_open+0x1e/0x20
 [<ffffffff81c8efb7>] entry_SYSCALL_64_fastpath+0x1a/0xa9

You're going to have to rethink this one, Christoph.

Cheers,

Dave.
> -- 
> 2.1.4
> 

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/3] xfs: fix the AG loop in xfs_alloc_vextent for busy extents
  2016-11-08  6:15   ` Dave Chinner
@ 2016-11-10 19:17     ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-11-10 19:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs, michaelcallahan

On Tue, Nov 08, 2016 at 05:15:39PM +1100, Dave Chinner wrote:
> Here's the problem: At ENOSPC, we dirty AG headers fixing freelists,
> then fail to allocate inodes, resulting in a shutdown - a regression
> that xfs/076 trips over quite regularly on my test machines.

I could not reproduce it, but based on the discussion with Brian I
agree.  But the sad part is that this bug already exists in the
current code due to the busy extent tracking - my little
patch just made it a lot easier to hit.  I'll see how I can come up
with a proper fix for it.

