All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] online discard support V3
@ 2011-05-04 18:55 Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-04 18:55 UTC (permalink / raw)
  To: xfs

Add support for discarding unused blocks at CIL commit time.  The first
patch is the guts of the implementation and relies and the resently
included improved busy extent tracking.  The second patch is an optimization
that helps a lot with performance, and the last two patches submit the
discard requests asynchronously to not stall the log I/O completions threads
and improve performance, but they currently trip over bugs in the block layer.

The performance with these patches is quite bad on the SATA SSDs I tested,
with up to 50% slowdowns on meta data intensive workloads, although ext4 is
much worse and btrfs is almost as bad as bad.  I've demonstrated a prototype
of a vectored discard at LSF that builds on the code subitted here and
only changes the internals of xfs_discard_extents, which brings the performance
back to acceptable levels.

For now my suggestion is to put patches 1 and 2 in to give people a chance
to play with online discard on XFS.  If the block layer issues get fixed
in time we can add patches 3 and 4 later in the 2.6.40 cycle, if not they
will still be needed once we get the proper vectored discard support that
I'm going to start working on soon.

And here's the block layer workaround for the discard merge bug:


Index: xfs/block/blk-core.c
===================================================================
--- xfs.orig/block/blk-core.c	2011-03-30 16:04:45.700659775 +0200
+++ xfs/block/blk-core.c	2011-03-30 16:04:59.775160021 +0200
@@ -1247,7 +1247,7 @@ static int __make_request(struct request
 	 */
 	blk_queue_bounce(q, &bio);
 
-	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA | REQ_DISCARD)) {
 		spin_lock_irq(q->queue_lock);
 		where = ELEVATOR_INSERT_FLUSH;
 		goto get_rq;

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] xfs: add online discard support
  2011-05-04 18:55 [PATCH 0/4] online discard support V3 Christoph Hellwig
@ 2011-05-04 18:55 ` Christoph Hellwig
  2011-05-19 21:53   ` Alex Elder
  2011-05-20 13:45   ` [PATCH 1/4 v2] " Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 2/4] xfs: do not discard alloc btree blocks Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-04 18:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-add-online-discard-support --]
[-- Type: text/plain, Size: 9609 bytes --]

Now that we have reliably tracking of deleted extents in a transaction
we can easily implement "online" discard support which calls
blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have to mark
the busy extent as not available for reuse before we can start the
actual discard.  Note that we don't bother supporting discard for
the non-delaylog mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-05-04 20:44:30.466422727 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-05-04 20:45:06.302895250 +0200
@@ -112,6 +112,8 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_QUOTANOENF  "qnoenforce"	/* same as uqnoenforce */
 #define MNTOPT_DELAYLOG   "delaylog"	/* Delayed loging enabled */
 #define MNTOPT_NODELAYLOG "nodelaylog"	/* Delayed loging disabled */
+#define MNTOPT_DISCARD	"discard"	/* Discard unused blocks */
+#define MNTOPT_NODISCARD "nodiscard"	/* Do not discard unused blocks */
 
 /*
  * Table driven mount option parser.
@@ -355,6 +357,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DELAYLOG;
 		} else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) {
 			mp->m_flags &= ~XFS_MOUNT_DELAYLOG;
+		} else if (!strcmp(this_char, MNTOPT_DISCARD)) {
+			mp->m_flags |= XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
+			mp->m_flags &= ~XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -488,6 +494,7 @@ xfs_showargs(
 		{ XFS_MOUNT_FILESTREAMS,	"," MNTOPT_FILESTREAM },
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DELAYLOG,		"," MNTOPT_DELAYLOG },
+		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
Index: xfs/fs/xfs/xfs_mount.h
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.h	2011-05-04 20:44:30.356423323 +0200
+++ xfs/fs/xfs/xfs_mount.h	2011-05-04 20:45:06.302895250 +0200
@@ -224,6 +224,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
 						   disk errors in metadata */
+#define XFS_MOUNT_DISCARD	(1ULL << 5)	/* discard unused blocks */
 #define XFS_MOUNT_RETERR	(1ULL << 6)     /* return alignment errors to
 						   user */
 #define XFS_MOUNT_NOALIGN	(1ULL << 7)	/* turn off stripe alignment
Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-04 20:44:30.369756584 +0200
+++ xfs/fs/xfs/xfs_log_cil.c	2011-05-04 20:45:06.302895250 +0200
@@ -29,6 +29,7 @@
 #include "xfs_mount.h"
 #include "xfs_error.h"
 #include "xfs_alloc.h"
+#include "xfs_discard.h"
 
 /*
  * Perform initial CIL structure initialisation. If the CIL is not
@@ -361,18 +362,28 @@ xlog_cil_committed(
 	int	abort)
 {
 	struct xfs_cil_ctx	*ctx = args;
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
 	xfs_alloc_busy_sort(&ctx->busy_extents);
-	xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, &ctx->busy_extents);
+	xfs_alloc_busy_clear(mp, &ctx->busy_extents,
+			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
 
 	spin_lock(&ctx->cil->xc_cil_lock);
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_cil_lock);
 
 	xlog_cil_free_logvec(ctx->lv_chain);
+
+	if (!list_empty(&ctx->busy_extents)) {
+		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
+
+		xfs_discard_extents(mp, &ctx->busy_extents);
+		xfs_alloc_busy_clear(mp, &ctx->busy_extents, false);
+	}
+
 	kmem_free(ctx);
 }
 
Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-05-04 20:44:30.329756801 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-05-04 20:45:06.306228566 +0200
@@ -191,3 +191,32 @@ xfs_ioc_trim(
 		return -XFS_ERROR(EFAULT);
 	return 0;
 }
+
+int
+xfs_discard_extents(
+	struct xfs_mount	*mp,
+	struct list_head	*list)
+{
+	struct xfs_busy_extent	*busyp;
+	int			error = 0;
+
+	list_for_each_entry(busyp, list, list) {
+		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
+					 busyp->length);
+
+		error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
+				XFS_FSB_TO_BB(mp, busyp->length),
+				GFP_NOFS, 0);
+		if (error && error != EOPNOTSUPP) {
+			xfs_info(mp,
+	 "discard failed for extent [0x%llu,%u], error %d",
+				 (unsigned long long)busyp->bno,
+				 busyp->length,
+				 error);
+			return error;
+		}
+	}
+
+	return 0;
+}
Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-05-04 20:44:30.343090061 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-05-04 20:45:06.306228566 +0200
@@ -2,7 +2,9 @@
 #define XFS_DISCARD_H 1
 
 struct fstrim_range;
+struct list_head;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
+extern int	xfs_discard_extents(struct xfs_mount *, struct list_head *);
 
 #endif /* XFS_DISCARD_H */
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-05-04 20:44:30.376423214 +0200
+++ xfs/fs/xfs/xfs_ag.h	2011-05-04 20:45:11.406200936 +0200
@@ -187,6 +187,8 @@ struct xfs_busy_extent {
 	xfs_agnumber_t	agno;
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
+	unsigned int	flags;
+#define XFS_ALLOC_BUSY_DISCARDED	0x01	/* undergoing a discard op. */
 };
 
 /*
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-05-04 20:44:30.386423159 +0200
+++ xfs/fs/xfs/xfs_alloc.c	2011-05-04 20:45:11.432867459 +0200
@@ -2610,6 +2610,18 @@ xfs_alloc_busy_update_extent(
 	xfs_agblock_t		bend = bbno + busyp->length;
 
 	/*
+	 * This extent is currently beeing discard.  Give the thread
+	 * performing the discard a chance to mark the extent unbusy
+	 * and retry.
+	 */
+	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
+		spin_unlock(&pag->pagb_lock);
+		delay(1);
+		spin_lock(&pag->pagb_lock);
+		return false;
+	}
+
+	/*
 	 * If there is a busy extent overlapping a user allocation, we have
 	 * no choice but to force the log and retry the search.
 	 *
@@ -2814,7 +2826,8 @@ restart:
 		 * If this is a metadata allocation, try to reuse the busy
 		 * extent instead of trimming the allocation.
 		 */
-		if (!args->userdata) {
+		if (!args->userdata &&
+		    !(busyp->flags & XFS_ALLOC_BUSY_DISCARDED)) {
 			if (!xfs_alloc_busy_update_extent(args->mp, args->pag,
 							  busyp, fbno, flen,
 							  false))
@@ -2980,10 +2993,16 @@ xfs_alloc_busy_clear_one(
 	kmem_free(busyp);
 }
 
+/*
+ * Remove all extents on the passed in list from the busy extents tree.
+ * If do_discard is set skip extents that need to be discarded, and mark
+ * these as undergoing a discard operation instead.
+ */
 void
 xfs_alloc_busy_clear(
 	struct xfs_mount	*mp,
-	struct list_head	*list)
+	struct list_head	*list,
+	bool			do_discard)
 {
 	struct xfs_busy_extent	*busyp, *n;
 	struct xfs_perag	*pag = NULL;
@@ -3000,7 +3019,10 @@ xfs_alloc_busy_clear(
 			agno = busyp->agno;
 		}
 
-		xfs_alloc_busy_clear_one(mp, pag, busyp);
+		if (do_discard && busyp->length)
+			busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
+		else
+			xfs_alloc_busy_clear_one(mp, pag, busyp);
 	}
 
 	if (pag) {
Index: xfs/Documentation/filesystems/xfs.txt
===================================================================
--- xfs.orig/Documentation/filesystems/xfs.txt	2011-05-04 20:44:30.409756366 +0200
+++ xfs/Documentation/filesystems/xfs.txt	2011-05-04 20:45:06.306228566 +0200
@@ -39,6 +39,12 @@ When mounting an XFS filesystem, the fol
 	drive level write caching to be enabled, for devices that
 	support write barriers.
 
+  discard
+	Issue command to let the block device reclaim space freed by the
+	filesystem.  This is useful for SSD devices, thinly provisioned
+	LUNs and virtual machine images, but may have a performance
+	impact.
+
   dmapi
 	Enable the DMAPI (Data Management API) event callouts.
 	Use with the "mtpt" option.
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-05-04 20:44:30.393089791 +0200
+++ xfs/fs/xfs/xfs_alloc.h	2011-05-04 20:45:11.452867351 +0200
@@ -140,7 +140,8 @@ xfs_alloc_busy_insert(struct xfs_trans *
 	xfs_agblock_t bno, xfs_extlen_t len);
 
 void
-xfs_alloc_busy_clear(struct xfs_mount *mp, struct list_head *list);
+xfs_alloc_busy_clear(struct xfs_mount *mp, struct list_head *list,
+	bool do_discard);
 
 int
 xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
Index: xfs/fs/xfs/xfs_trans.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.c	2011-05-04 20:44:30.399756421 +0200
+++ xfs/fs/xfs/xfs_trans.c	2011-05-04 20:45:06.306228566 +0200
@@ -609,7 +609,7 @@ xfs_trans_free(
 	struct xfs_trans	*tp)
 {
 	xfs_alloc_busy_sort(&tp->t_busy);
-	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy);
+	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	atomic_dec(&tp->t_mountp->m_active_trans);
 	xfs_trans_free_dqinfo(tp);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/4] xfs: do not discard alloc btree blocks
  2011-05-04 18:55 [PATCH 0/4] online discard support V3 Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
@ 2011-05-04 18:55 ` Christoph Hellwig
  2011-05-19 21:54   ` Alex Elder
  2011-05-04 18:55 ` [PATCH 3/4] xfs: add a reference count to the CIL context Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 4/4] xfs: make discard operations asynchronous Christoph Hellwig
  3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-04 18:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-dont-discard-allocbtblocks --]
[-- Type: text/plain, Size: 3995 bytes --]

Blocks for the allocation btree are allocated from and release to
the AGFL, and thus frequently reused.  Even worse we do not have
an easy way to avoid using an AGFL block when it is discarded due
to the simple FILO list of free blocks, and thus can frequently
stall on blocks that are currently undergoing a discard.

Add a flag to the busy extent tracking structure to skip the discard for
allocation btree blocks.  In normal operation these blocks are reused
frequently enough that there is no need to discard them anyway, but
if they spill over to the allocation btree as part of a balance we
"leak" blocks that we would otherwise discard.  We could fix this
by adding another flag and keeping these block in the rbtree even
after they aren't busy any more so that we could discard them when
they migrate out of the AGFL.  Given that this would cause significant
overhead I don't think it's worthwile for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-05-04 20:45:11.406200936 +0200
+++ xfs/fs/xfs/xfs_ag.h	2011-05-04 20:45:15.309513124 +0200
@@ -189,6 +189,7 @@ struct xfs_busy_extent {
 	xfs_extlen_t	length;
 	unsigned int	flags;
 #define XFS_ALLOC_BUSY_DISCARDED	0x01	/* undergoing a discard op. */
+#define XFS_ALLOC_BUSY_SKIP_DISCARD	0x02	/* do not discard */
 };
 
 /*
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-05-04 20:45:11.419534199 +0200
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-05-04 20:45:15.309513124 +0200
@@ -120,7 +120,8 @@ xfs_allocbt_free_block(
 	if (error)
 		return error;
 
-	xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1);
+	xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1,
+			      XFS_ALLOC_BUSY_SKIP_DISCARD);
 	xfs_trans_agbtree_delta(cur->bc_tp, -1);
 	return 0;
 }
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-05-04 20:45:11.432867459 +0200
+++ xfs/fs/xfs/xfs_alloc.c	2011-05-04 20:45:15.312846439 +0200
@@ -2470,7 +2470,7 @@ xfs_free_extent(
 
 	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0);
 	if (!error)
-		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);
+		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len, 0);
 error0:
 	xfs_perag_put(args.pag);
 	return error;
@@ -2481,7 +2481,8 @@ xfs_alloc_busy_insert(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		bno,
-	xfs_extlen_t		len)
+	xfs_extlen_t		len,
+	unsigned int		flags)
 {
 	struct xfs_busy_extent	*new;
 	struct xfs_busy_extent	*busyp;
@@ -2505,6 +2506,7 @@ xfs_alloc_busy_insert(
 	new->bno = bno;
 	new->length = len;
 	INIT_LIST_HEAD(&new->list);
+	new->flags = flags;
 
 	/* trace before insert to be able to see failed inserts */
 	trace_xfs_alloc_busy(tp->t_mountp, agno, bno, len);
@@ -3019,7 +3021,8 @@ xfs_alloc_busy_clear(
 			agno = busyp->agno;
 		}
 
-		if (do_discard && busyp->length)
+		if (do_discard && busyp->length &&
+		    !(busyp->flags & XFS_ALLOC_BUSY_SKIP_DISCARD))
 			busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
 		else
 			xfs_alloc_busy_clear_one(mp, pag, busyp);
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-05-04 20:45:11.452867351 +0200
+++ xfs/fs/xfs/xfs_alloc.h	2011-05-04 20:45:15.312846439 +0200
@@ -137,7 +137,7 @@ xfs_alloc_longest_free_extent(struct xfs
 #ifdef __KERNEL__
 void
 xfs_alloc_busy_insert(struct xfs_trans *tp, xfs_agnumber_t agno,
-	xfs_agblock_t bno, xfs_extlen_t len);
+	xfs_agblock_t bno, xfs_extlen_t len, unsigned int flags);
 
 void
 xfs_alloc_busy_clear(struct xfs_mount *mp, struct list_head *list,

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/4] xfs: add a reference count to the CIL context
  2011-05-04 18:55 [PATCH 0/4] online discard support V3 Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
  2011-05-04 18:55 ` [PATCH 2/4] xfs: do not discard alloc btree blocks Christoph Hellwig
@ 2011-05-04 18:55 ` Christoph Hellwig
  2011-05-19 21:54   ` Alex Elder
  2011-05-04 18:55 ` [PATCH 4/4] xfs: make discard operations asynchronous Christoph Hellwig
  3 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-04 18:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cil-ctx-refcounting --]
[-- Type: text/plain, Size: 4229 bytes --]

For the upcoming asynchronoyus discard support we need to be able to delay
freeing the CIL context until the last discard request that reference it
has completed.  Add a reference count to the CIL context, and only clear
the busy extents and free the CIL context structure when it reaches zero,
and a work item to allow freeing it from irq context.

Note that this means delaying the clearing of the busy extents for a little
bit even on non-discard mounts, but with the new busy extent trim/reuse
code there is no real life impact of this change.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-03 12:00:49.000000000 +0200
+++ xfs/fs/xfs/xfs_log_cil.c	2011-05-03 12:17:19.399350953 +0200
@@ -20,7 +20,7 @@
 #include "xfs_types.h"
 #include "xfs_bit.h"
 #include "xfs_log.h"
-#include "xfs_inum.h"
+#include "xfs_inum.h" 
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log_priv.h"
@@ -31,6 +31,46 @@
 #include "xfs_alloc.h"
 #include "xfs_discard.h"
 
+static void
+xlog_cil_ctx_free(
+	struct xfs_cil_ctx	*ctx)
+{
+	if (!list_empty(&ctx->busy_extents)) {
+		xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp,
+				     &ctx->busy_extents, false);
+	}
+	kmem_free(ctx);
+}
+
+static void
+xlog_cil_ctx_free_work(
+	struct work_struct	*work)
+{
+	xlog_cil_ctx_free(container_of(work, struct xfs_cil_ctx, work));
+}
+
+static void
+xlog_cil_ctx_init(
+	struct xfs_cil_ctx	*ctx,
+	struct xfs_cil		*cil,
+	xfs_lsn_t		sequence)
+{
+	INIT_LIST_HEAD(&ctx->committing);
+	INIT_LIST_HEAD(&ctx->busy_extents);
+	ctx->sequence = sequence;
+	ctx->cil = cil;
+	atomic_set(&ctx->ref, 1);
+	INIT_WORK(&ctx->work, xlog_cil_ctx_free_work);
+	cil->xc_ctx = ctx;
+
+	/*
+	 * Mirror the sequence into the cil structure so that we can do
+	 * unlocked checks against the current sequence in log forces without
+	 * risking deferencing a freed context pointer.
+	 */
+	cil->xc_current_sequence = ctx->sequence;
+}
+
 /*
  * Perform initial CIL structure initialisation. If the CIL is not
  * enabled in this filesystem, ensure the log->l_cilp is null so
@@ -64,12 +104,7 @@ xlog_cil_init(
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
 
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
+	xlog_cil_ctx_init(ctx, cil, 1);
 
 	cil->xc_log = log;
 	log->l_cilp = cil;
@@ -381,10 +416,10 @@ xlog_cil_committed(
 		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
 
 		xfs_discard_extents(mp, &ctx->busy_extents);
-		xfs_alloc_busy_clear(mp, &ctx->busy_extents, false);
 	}
 
-	kmem_free(ctx);
+	if (atomic_dec_and_test(&ctx->ref))
+		xlog_cil_ctx_free(ctx);
 }
 
 /*
@@ -491,18 +526,7 @@ xlog_cil_push(
 	 * during log forces to extract the commit lsn of the sequence that
 	 * needs to be forced.
 	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * mirror the new sequence into the cil structure so that we can do
-	 * unlocked checks against the current sequence in log forces without
-	 * risking deferencing a freed context pointer.
-	 */
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_init(new_ctx, cil, ctx->sequence + 1);
 
 	/*
 	 * The switch is now done, so we can drop the context lock and move out
Index: xfs/fs/xfs/xfs_log_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log_priv.h	2011-05-03 11:41:29.000000000 +0200
+++ xfs/fs/xfs/xfs_log_priv.h	2011-05-03 12:13:49.743820088 +0200
@@ -390,6 +390,8 @@ struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	xfs_log_callback_t	log_cb;		/* completion callback hook. */
 	struct list_head	committing;	/* ctx committing list */
+	atomic_t		ref;		/* reference count */
+	struct work_struct	work;		/* for deferred freeing */
 };
 
 /*

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 4/4] xfs: make discard operations asynchronous
  2011-05-04 18:55 [PATCH 0/4] online discard support V3 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-05-04 18:55 ` [PATCH 3/4] xfs: add a reference count to the CIL context Christoph Hellwig
@ 2011-05-04 18:55 ` Christoph Hellwig
  3 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-04 18:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-async-discard-workqueue --]
[-- Type: text/plain, Size: 8912 bytes --]

Instead of waiting for each discard request keep the CIL context alive
until all of them are done, at which point we can tear it down completly
and remove the busy extents from the rbtree.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-05-03 19:43:13.467745055 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-05-03 19:43:14.514406051 +0200
@@ -30,6 +30,7 @@
 #include "xfs_inode.h"
 #include "xfs_alloc.h"
 #include "xfs_error.h"
+#include "xfs_log_priv.h"
 #include "xfs_discard.h"
 #include "xfs_trace.h"
 
@@ -192,31 +193,88 @@ xfs_ioc_trim(
 	return 0;
 }
 
-int
-xfs_discard_extents(
-	struct xfs_mount	*mp,
-	struct list_head	*list)
+STATIC void
+xfs_discard_end_io(
+	struct bio		*bio,
+	int			err)
 {
-	struct xfs_busy_extent	*busyp;
-	int			error = 0;
+	struct xfs_cil_ctx	*ctx = bio->bi_private;
 
-	list_for_each_entry(busyp, list, list) {
-		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
-					 busyp->length);
-
-		error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
-				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
-				XFS_FSB_TO_BB(mp, busyp->length),
-				GFP_NOFS, 0);
-		if (error && error != EOPNOTSUPP) {
-			xfs_info(mp,
-	 "discard failed for extent [0x%llu,%u], error %d",
-				 (unsigned long long)busyp->bno,
-				 busyp->length,
-				 error);
-			return error;
+	if (err && err != EOPNOTSUPP) {
+		xfs_info(ctx->cil->xc_log->l_mp,
+			 "discard failed at sector 0x%llu, error %d",
+			 (unsigned long long)bio->bi_sector, err);
+	}
+
+	if (atomic_dec_and_test(&ctx->ref))
+		queue_work(xfs_discard_workqueue, &ctx->work);
+	bio_put(bio);
+}
+
+STATIC int
+xfs_issue_discard(
+	struct xfs_cil_ctx	*ctx,
+	struct xfs_busy_extent	*busyp)
+{
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
+	struct block_device	*bdev = mp->m_ddev_targp->bt_bdev;
+	struct request_queue	*q = bdev_get_queue(bdev);
+	unsigned int		max_discard_sectors;
+	struct bio		*bio;
+	sector_t		sector;
+	sector_t		nr_sects;
+
+	if (!blk_queue_discard(q))
+		return -EOPNOTSUPP;
+
+	trace_xfs_discard_extent(mp, busyp->agno, busyp->bno, busyp->length);
+
+	sector = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
+	nr_sects = XFS_FSB_TO_BB(mp, busyp->length);
+
+	/*
+	 * Ensure that max_discard_sectors is of the proper granularity
+	 */
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	if (q->limits.discard_granularity) {
+		unsigned int disc_sects = q->limits.discard_granularity >> 9;
+
+		max_discard_sectors &= ~(disc_sects - 1);
+	}
+
+	while (nr_sects) {
+		bio = bio_alloc(GFP_NOFS, 1);
+		if (!bio)
+			return -ENOMEM;
+
+		bio->bi_sector = sector;
+		bio->bi_end_io = xfs_discard_end_io;
+		bio->bi_bdev = bdev;
+		bio->bi_private = ctx;
+
+		if (nr_sects > max_discard_sectors) {
+			bio->bi_size = max_discard_sectors << 9;
+			nr_sects -= max_discard_sectors;
+			sector += max_discard_sectors;
+		} else {
+			bio->bi_size = nr_sects << 9;
+			nr_sects = 0;
 		}
+
+		atomic_inc(&ctx->ref);
+		submit_bio(REQ_WRITE | REQ_DISCARD, bio);
 	}
 
 	return 0;
 }
+
+int
+xfs_discard_extents(
+	struct xfs_cil_ctx	*ctx)
+{
+	struct xfs_busy_extent	*busyp;
+
+	list_for_each_entry(busyp, &ctx->busy_extents, list)
+		xfs_issue_discard(ctx, busyp);
+	return 0;
+}
Index: xfs/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.c	2011-05-03 19:43:07.164445869 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.c	2011-05-03 19:43:14.517739367 +0200
@@ -48,6 +48,7 @@ STATIC void xfs_buf_delwri_queue(xfs_buf
 static struct workqueue_struct *xfslogd_workqueue;
 struct workqueue_struct *xfsdatad_workqueue;
 struct workqueue_struct *xfsconvertd_workqueue;
+struct workqueue_struct *xfs_discard_workqueue;
 
 #ifdef XFS_BUF_LOCK_TRACKING
 # define XB_SET_OWNER(bp)	((bp)->b_last_holder = current->pid)
@@ -1785,6 +1786,7 @@ xfs_flush_buftarg(
 	LIST_HEAD(wait_list);
 	struct blk_plug plug;
 
+	xfs_buf_runall_queues(xfs_discard_workqueue);
 	xfs_buf_runall_queues(xfsconvertd_workqueue);
 	xfs_buf_runall_queues(xfsdatad_workqueue);
 	xfs_buf_runall_queues(xfslogd_workqueue);
@@ -1848,8 +1850,15 @@ xfs_buf_init(void)
 	if (!xfsconvertd_workqueue)
 		goto out_destroy_xfsdatad_workqueue;
 
+	xfs_discard_workqueue = alloc_workqueue("xfs_discard",
+					WQ_MEM_RECLAIM | WQ_HIGHPRI, 1);
+	if (!xfs_discard_workqueue)
+		goto out_destroy_xfsconvertd_workqueue;
+
 	return 0;
 
+ out_destroy_xfsconvertd_workqueue:
+	destroy_workqueue(xfsconvertd_workqueue);
  out_destroy_xfsdatad_workqueue:
 	destroy_workqueue(xfsdatad_workqueue);
  out_destroy_xfslogd_workqueue:
@@ -1863,6 +1872,7 @@ xfs_buf_init(void)
 void
 xfs_buf_terminate(void)
 {
+	destroy_workqueue(xfs_discard_workqueue);
 	destroy_workqueue(xfsconvertd_workqueue);
 	destroy_workqueue(xfsdatad_workqueue);
 	destroy_workqueue(xfslogd_workqueue);
Index: xfs/fs/xfs/linux-2.6/xfs_buf.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_buf.h	2011-05-03 19:43:07.181112445 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_buf.h	2011-05-03 19:43:14.517739367 +0200
@@ -348,4 +348,6 @@ extern struct list_head *xfs_get_buftarg
 #define xfs_binval(buftarg)		xfs_flush_buftarg(buftarg, 1)
 #define XFS_bflush(buftarg)		xfs_flush_buftarg(buftarg, 1)
 
+extern struct workqueue_struct *xfs_discard_workqueue;
+
 #endif	/* __XFS_BUF_H__ */
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-05-03 19:43:13.967742345 +0200
+++ xfs/fs/xfs/xfs_alloc.c	2011-05-03 19:43:14.517739367 +0200
@@ -1084,6 +1084,7 @@ restart:
 		if (!forced++) {
 			trace_xfs_alloc_near_busy(args);
 			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			flush_workqueue(xfs_discard_workqueue);
 			goto restart;
 		}
 
@@ -1243,8 +1244,10 @@ restart:
 				xfs_btree_del_cursor(cnt_cur,
 						     XFS_BTREE_NOERROR);
 				trace_xfs_alloc_size_busy(args);
-				if (!forced++)
+				if (!forced++) {
 					xfs_log_force(args->mp, XFS_LOG_SYNC);
+					flush_workqueue(xfs_discard_workqueue);
+				}
 				goto restart;
 			}
 		}
@@ -1314,6 +1317,7 @@ restart:
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
 			trace_xfs_alloc_size_busy(args);
 			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			flush_workqueue(xfs_discard_workqueue);
 			goto restart;
 		}
 		goto out_nominleft;
@@ -2612,13 +2616,13 @@ xfs_alloc_busy_update_extent(
 	xfs_agblock_t		bend = bbno + busyp->length;
 
 	/*
-	 * This extent is currently beeing discard.  Give the thread
-	 * performing the discard a chance to mark the extent unbusy
-	 * and retry.
+	 * This extent is currently beeing discard.  Flush the discard
+	 * completion queue and retry the search.
 	 */
 	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
 		spin_unlock(&pag->pagb_lock);
-		delay(1);
+		flush_workqueue(xfs_discard_workqueue);
+		trace_xfs_alloc_busy_discarded(mp, pag->pag_agno, fbno, flen);
 		spin_lock(&pag->pagb_lock);
 		return false;
 	}
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-05-03 19:43:07.194445707 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-05-03 19:43:14.521072682 +0200
@@ -1183,6 +1183,7 @@ DEFINE_BUSY_EVENT(xfs_alloc_busy_enomem)
 DEFINE_BUSY_EVENT(xfs_alloc_busy_force);
 DEFINE_BUSY_EVENT(xfs_alloc_busy_reuse);
 DEFINE_BUSY_EVENT(xfs_alloc_busy_clear);
+DEFINE_BUSY_EVENT(xfs_alloc_busy_discarded);
 
 TRACE_EVENT(xfs_alloc_busy_trim,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-05-03 19:43:13.467745055 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-05-03 19:43:14.521072682 +0200
@@ -2,9 +2,9 @@
 #define XFS_DISCARD_H 1
 
 struct fstrim_range;
-struct list_head;
+struct xfs_cil_ctx;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
-extern int	xfs_discard_extents(struct xfs_mount *, struct list_head *);
+extern int	xfs_discard_extents(struct xfs_cil_ctx *);
 
 #endif /* XFS_DISCARD_H */
Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-03 19:43:14.214407676 +0200
+++ xfs/fs/xfs/xfs_log_cil.c	2011-05-03 19:43:14.521072682 +0200
@@ -415,7 +415,7 @@ xlog_cil_committed(
 	if (!list_empty(&ctx->busy_extents)) {
 		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
 
-		xfs_discard_extents(mp, &ctx->busy_extents);
+		xfs_discard_extents(ctx);
 	}
 
 	if (atomic_dec_and_test(&ctx->ref))

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] xfs: add online discard support
  2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
@ 2011-05-19 21:53   ` Alex Elder
  2011-05-20 10:24     ` Christoph Hellwig
  2011-05-20 13:45   ` [PATCH 1/4 v2] " Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Elder @ 2011-05-19 21:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-05-04 at 14:55 -0400, Christoph Hellwig wrote:
> Now that we have reliably tracking of deleted extents in a transaction
> we can easily implement "online" discard support which calls
> blkdev_issue_discard once a transaction commits.
> 
> The actual discard is a two stage operation as we first have to mark
> the busy extent as not available for reuse before we can start the
> actual discard.  Note that we don't bother supporting discard for
> the non-delaylog mode.

Generally, this is OK--I don't really see bugs.
But I have some comments and questions.

The first is, why not support it for non-delaylog?

We have not formally deprecated that mode of operation
(though I don't have a problem with doing that).  What
I mean is, if we're going to kill off non-delaylog
we ought to announce that intention and have a plan
and schedule for when we can kill off the old code
for good.  (But that's a separate discussion...)

In any case, it looks like it would not be *that*
hard to maintain discard support.  (I haven't looked
at the follow-on patches yet though; maybe it's
harder than it seems.)
- Add a "do_discard" flag to xfs_trans_free() and pass
  !(abortflag & XFS_LI_ABORTED) from xfs_trans_committed()
  and false from all other callers.
- Add the same discard supporting code to xfs_trans_free()
  that you're adding to xlog_cil_committed().  I.e. use:
     xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, do_discard);
     ...
     if (!list_empty(&tp->t_busy)) {
         ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
         ...
     }
  That last block could be encapsulated into a function like:
    void xfs_discard_busy(struct xfs_mount *mp, struct list_head *list);


Second, why is it a two phase operation (marking an
extent for discard, then doing all the discards at
once)?  Is it just so you can do the discards without
holding the perag lock?

Other thoughts below.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/linux-2.6/xfs_super.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-05-04 20:44:30.466422727 +0200
> +++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-05-04 20:45:06.302895250 +0200
> @@ -112,6 +112,8 @@ mempool_t *xfs_ioend_pool;
>  #define MNTOPT_QUOTANOENF  "qnoenforce"	/* same as uqnoenforce */
>  #define MNTOPT_DELAYLOG   "delaylog"	/* Delayed loging enabled */
>  #define MNTOPT_NODELAYLOG "nodelaylog"	/* Delayed loging disabled */
> +#define MNTOPT_DISCARD	"discard"	/* Discard unused blocks */
> +#define MNTOPT_NODISCARD "nodiscard"	/* Do not discard unused blocks */

Alignment in this section of this file looks a bit funny.
Perhaps you could clean it up (though your options are
limited).

>  
>  /*
>   * Table driven mount option parser.

. . .

> Index: xfs/fs/xfs/xfs_log_cil.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-04 20:44:30.369756584 +0200
> +++ xfs/fs/xfs/xfs_log_cil.c	2011-05-04 20:45:06.302895250 +0200

. . .

> @@ -361,18 +362,28 @@ xlog_cil_committed(
>  	int	abort)
>  {
>  	struct xfs_cil_ctx	*ctx = args;
> +	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
>  
>  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
>  					ctx->start_lsn, abort);
>  
>  	xfs_alloc_busy_sort(&ctx->busy_extents);

I still think sorting the list belongs inside xfs_alloc_busy_clear().
I see that list_sort() is not necessarily trivial for an already
sorted list though...

> -	xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, &ctx->busy_extents);
> +	xfs_alloc_busy_clear(mp, &ctx->busy_extents,
> +			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
>  
>  	spin_lock(&ctx->cil->xc_cil_lock);
>  	list_del(&ctx->committing);
>  	spin_unlock(&ctx->cil->xc_cil_lock);
>  
>  	xlog_cil_free_logvec(ctx->lv_chain);
> +
> +	if (!list_empty(&ctx->busy_extents)) {
> +		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
> +
> +		xfs_discard_extents(mp, &ctx->busy_extents);
> +		xfs_alloc_busy_clear(mp, &ctx->busy_extents, false);
> +	}
> +
>  	kmem_free(ctx);
>  }
>  
> Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-05-04 20:44:30.329756801 +0200
> +++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-05-04 20:45:06.306228566 +0200
> @@ -191,3 +191,32 @@ xfs_ioc_trim(
>  		return -XFS_ERROR(EFAULT);
>  	return 0;
>  }
> +
> +int
> +xfs_discard_extents(
> +	struct xfs_mount	*mp,
> +	struct list_head	*list)
> +{
> +	struct xfs_busy_extent	*busyp;
> +	int			error = 0;
> +
> +	list_for_each_entry(busyp, list, list) {
> +		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
> +					 busyp->length);
> +
> +		error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
> +				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
> +				XFS_FSB_TO_BB(mp, busyp->length),
> +				GFP_NOFS, 0);

		if (error == EOPNOTSUPP) {
			/*
			 * Report this once per mount point somehow?
			 * If so, turn off the mount option?
			 */
			break;
		} else if (error) {
> +		if (error && error != EOPNOTSUPP) {
> +			xfs_info(mp,
> +	 "discard failed for extent [0x%llu,%u], error %d",
> +				 (unsigned long long)busyp->bno,
> +				 busyp->length,
> +				 error);
> +			return error;
> +		}
> +	}
> +
> +	return 0;
> +}

. . .
> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-05-04 20:44:30.386423159 +0200
> +++ xfs/fs/xfs/xfs_alloc.c	2011-05-04 20:45:11.432867459 +0200
> @@ -2610,6 +2610,18 @@ xfs_alloc_busy_update_extent(
>  	xfs_agblock_t		bend = bbno + busyp->length;
>  
>  	/*
> +	 * This extent is currently beeing discard.  Give the thread

				currently being discarded.

> +	 * performing the discard a chance to mark the extent unbusy
> +	 * and retry.
> +	 */
> +	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
> +		spin_unlock(&pag->pagb_lock);
> +		delay(1);

I hate seeing calls to delay() although sometimes
it's the right thing to do...  I don't have a feel
for how long a discard is likely to take so I don't
know whether waiting here instead would be worth
the effort.

> +		spin_lock(&pag->pagb_lock);
> +		return false;
> +	}
> +
> +	/*
>  	 * If there is a busy extent overlapping a user allocation, we have
>  	 * no choice but to force the log and retry the search.
>  	 *

. . .

> Index: xfs/Documentation/filesystems/xfs.txt
> ===================================================================
> --- xfs.orig/Documentation/filesystems/xfs.txt	2011-05-04 20:44:30.409756366 +0200
> +++ xfs/Documentation/filesystems/xfs.txt	2011-05-04 20:45:06.306228566 +0200
> @@ -39,6 +39,12 @@ When mounting an XFS filesystem, the fol
>  	drive level write caching to be enabled, for devices that
>  	support write barriers.
>  
> +  discard
> +	Issue command to let the block device reclaim space freed by the
> +	filesystem.  This is useful for SSD devices, thinly provisioned
> +	LUNs and virtual machine images, but may have a performance
> +	impact.

If this option is to only be available for delaylog, it should
say so here (and maybe report that it's being ignored if it's
supplied with nodelaylog at mount time).

> +
>    dmapi
>  	Enable the DMAPI (Data Management API) event callouts.
>  	Use with the "mtpt" option.

. . .



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] xfs: do not discard alloc btree blocks
  2011-05-04 18:55 ` [PATCH 2/4] xfs: do not discard alloc btree blocks Christoph Hellwig
@ 2011-05-19 21:54   ` Alex Elder
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2011-05-19 21:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-05-04 at 14:55 -0400, Christoph Hellwig wrote:
> Blocks for the allocation btree are allocated from and release to
> the AGFL, and thus frequently reused.  Even worse we do not have
> an easy way to avoid using an AGFL block when it is discarded due
> to the simple FILO list of free blocks, and thus can frequently
> stall on blocks that are currently undergoing a discard.
> 
> Add a flag to the busy extent tracking structure to skip the discard for
> allocation btree blocks.  In normal operation these blocks are reused
> frequently enough that there is no need to discard them anyway, but
> if they spill over to the allocation btree as part of a balance we
> "leak" blocks that we would otherwise discard.  We could fix this
> by adding another flag and keeping these block in the rbtree even
> after they aren't busy any more so that we could discard them when
> they migrate out of the AGFL.  Given that this would cause significant
> overhead I don't think it's worthwile for now.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This one looks good.  (If you add support for discard
with nodelaylog it will need to be updated.)

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] xfs: add a reference count to the CIL context
  2011-05-04 18:55 ` [PATCH 3/4] xfs: add a reference count to the CIL context Christoph Hellwig
@ 2011-05-19 21:54   ` Alex Elder
  2011-05-20 10:25     ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Elder @ 2011-05-19 21:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-05-04 at 14:55 -0400, Christoph Hellwig wrote:
> plain text document attachment (xfs-cil-ctx-refcounting)
> For the upcoming asynchronoyus discard support we need to be able to delay
> freeing the CIL context until the last discard request that reference it
> has completed.  Add a reference count to the CIL context, and only clear
> the busy extents and free the CIL context structure when it reaches zero,
> and a work item to allow freeing it from irq context.
> 
> Note that this means delaying the clearing of the busy extents for a little
> bit even on non-discard mounts, but with the new busy extent trim/reuse
> code there is no real life impact of this change.

It will affect allocation patterns but it's really hard to say
whether it will make them worse or better...

> Signed-off-by: Christoph Hellwig <hch@lst.de>

This looks good.  Encapsulating the context initialization
was a good step even without the new work item field.  One
little thing below.

Reviewed-by: Alex Elder <aelder@sgi.com>

PS  I've run out of time and will have to review the
    fourth patch in this series tomorrow.

> Index: xfs/fs/xfs/xfs_log_cil.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-03 12:00:49.000000000 +0200
> +++ xfs/fs/xfs/xfs_log_cil.c	2011-05-03 12:17:19.399350953 +0200
> @@ -20,7 +20,7 @@
>  #include "xfs_types.h"
>  #include "xfs_bit.h"
>  #include "xfs_log.h"
> -#include "xfs_inum.h"
> +#include "xfs_inum.h" 
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log_priv.h"

Kill this hunk.  (You added a space at end-of-line.)

. . .


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] xfs: add online discard support
  2011-05-19 21:53   ` Alex Elder
@ 2011-05-20 10:24     ` Christoph Hellwig
  2011-05-20 11:43       ` Lukas Czerner
  2011-05-20 13:40       ` Alex Elder
  0 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-20 10:24 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, May 19, 2011 at 04:53:44PM -0500, Alex Elder wrote:
> The first is, why not support it for non-delaylog?

Because:

 a) performance is going to suck even more horribly with the
    amount of trim commands needed, with no chance of actually
    fixing it
 b) the async discard code in patch 3 not easily applyable to
    the non-delaylog case, we'd need to keep two parallel codebases,
    one of them guaranteed to be untested.

> Second, why is it a two phase operation (marking an
> extent for discard, then doing all the discards at
> once)?  Is it just so you can do the discards without
> holding the perag lock?

Because we must prevent the allocation code from reusing an extent
that is undergoing a discard right now to prevent corruption, thus
we need to mark it as do not touch first. 

> >  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
> >  					ctx->start_lsn, abort);
> >  
> >  	xfs_alloc_busy_sort(&ctx->busy_extents);
> 
> I still think sorting the list belongs inside xfs_alloc_busy_clear().
> I see that list_sort() is not necessarily trivial for an already
> sorted list though...

It's a bad idea to do the sort twice for no good reason, and adding
another parameter to further overload xfs_alloc_busy_clear behaviour
doesn't seem smart either.

> 		if (error == EOPNOTSUPP) {
> 			/*
> 			 * Report this once per mount point somehow?
> 			 * If so, turn off the mount option?
> 			 */
> 			break;

We've been through this discussion again lately with dm and ext4
folks, and the conclusion is that EOPNOTSUPP is perfectly fine to happen
here.

> > +	 * performing the discard a chance to mark the extent unbusy
> > +	 * and retry.
> > +	 */
> > +	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
> > +		spin_unlock(&pag->pagb_lock);
> > +		delay(1);
> 
> I hate seeing calls to delay() although sometimes
> it's the right thing to do...  I don't have a feel
> for how long a discard is likely to take so I don't
> know whether waiting here instead would be worth
> the effort.

It's not nice, but if the block layer gets fixed and we do asynchronous
discards it simply goes away.

> If this option is to only be available for delaylog, it should
> say so here (and maybe report that it's being ignored if it's
> supplied with nodelaylog at mount time).

ok.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/4] xfs: add a reference count to the CIL context
  2011-05-19 21:54   ` Alex Elder
@ 2011-05-20 10:25     ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-20 10:25 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Thu, May 19, 2011 at 04:54:54PM -0500, Alex Elder wrote:
> >  #include "xfs_log.h"
> > -#include "xfs_inum.h"
> > +#include "xfs_inum.h" 
> >  #include "xfs_trans.h"
> >  #include "xfs_trans_priv.h"
> >  #include "xfs_log_priv.h"
> 
> Kill this hunk.  (You added a space at end-of-line.)

Yeah, sorry.  not sure how that happened.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] xfs: add online discard support
  2011-05-20 10:24     ` Christoph Hellwig
@ 2011-05-20 11:43       ` Lukas Czerner
  2011-05-20 13:57         ` Alex Elder
  2011-05-20 13:40       ` Alex Elder
  1 sibling, 1 reply; 15+ messages in thread
From: Lukas Czerner @ 2011-05-20 11:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs, Alex Elder

On Fri, 20 May 2011, Christoph Hellwig wrote:

> On Thu, May 19, 2011 at 04:53:44PM -0500, Alex Elder wrote:
> > The first is, why not support it for non-delaylog?
> 
> Because:
> 
>  a) performance is going to suck even more horribly with the
>     amount of trim commands needed, with no chance of actually
>     fixing it
>  b) the async discard code in patch 3 not easily applyable to
>     the non-delaylog case, we'd need to keep two parallel codebases,
>     one of them guaranteed to be untested.
> 
> > Second, why is it a two phase operation (marking an
> > extent for discard, then doing all the discards at
> > once)?  Is it just so you can do the discards without
> > holding the perag lock?
> 
> Because we must prevent the allocation code from reusing an extent
> that is undergoing a discard right now to prevent corruption, thus
> we need to mark it as do not touch first. 
> 
> > >  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
> > >  					ctx->start_lsn, abort);
> > >  
> > >  	xfs_alloc_busy_sort(&ctx->busy_extents);
> > 
> > I still think sorting the list belongs inside xfs_alloc_busy_clear().
> > I see that list_sort() is not necessarily trivial for an already
> > sorted list though...
> 
> It's a bad idea to do the sort twice for no good reason, and adding
> another parameter to further overload xfs_alloc_busy_clear behaviour
> doesn't seem smart either.
> 
> > 		if (error == EOPNOTSUPP) {
> > 			/*
> > 			 * Report this once per mount point somehow?
Actually, this is a good idea see https://lkml.org/lkml/2011/5/5/162. So
you will get EOPNOTSUPP *only* if the device (as a whole) does not
support discard.

> > 			 * If so, turn off the mount option?
Not so good idea, as some people mentioned several times, you can change
the devices in dmsetup to SSD (for example) without umount and you would
like your previous mount option to work. In the opposite case, the user
just gets warning (once a day perhaps?) and its up to him to deal with
it.

Or, we can turn it of (with warning) and rely on the user to notice that
it is turned off. But I would rather not rely on that.

-Lukas

> > 			 */
> > 			break;
> 
> We've been through this discussion again lately with dm and ext4
> folks, and the conclusion is that EOPNOTSUPP is perfectly fine to happen
> here.
> 
> > > +	 * performing the discard a chance to mark the extent unbusy
> > > +	 * and retry.
> > > +	 */
> > > +	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
> > > +		spin_unlock(&pag->pagb_lock);
> > > +		delay(1);
> > 
> > I hate seeing calls to delay() although sometimes
> > it's the right thing to do...  I don't have a feel
> > for how long a discard is likely to take so I don't
> > know whether waiting here instead would be worth
> > the effort.
> 
> It's not nice, but if the block layer gets fixed and we do asynchronous
> discards it simply goes away.
> 
> > If this option is to only be available for delaylog, it should
> > say so here (and maybe report that it's being ignored if it's
> > supplied with nodelaylog at mount time).
> 
> ok.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

-- 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] xfs: add online discard support
  2011-05-20 10:24     ` Christoph Hellwig
  2011-05-20 11:43       ` Lukas Czerner
@ 2011-05-20 13:40       ` Alex Elder
  1 sibling, 0 replies; 15+ messages in thread
From: Alex Elder @ 2011-05-20 13:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-05-20 at 06:24 -0400, Christoph Hellwig wrote:
> On Thu, May 19, 2011 at 04:53:44PM -0500, Alex Elder wrote:
> > The first is, why not support it for non-delaylog?
> 
> Because:
> 
>  a) performance is going to suck even more horribly with the
>     amount of trim commands needed, with no chance of actually
>     fixing it
>  b) the async discard code in patch 3 not easily applyable to
>     the non-delaylog case, we'd need to keep two parallel codebases,
>     one of them guaranteed to be untested.

I hadn't looked through all of the patches yet.

> > Second, why is it a two phase operation (marking an
> > extent for discard, then doing all the discards at
> > once)?  Is it just so you can do the discards without
> > holding the perag lock?
> 
> Because we must prevent the allocation code from reusing an extent
> that is undergoing a discard right now to prevent corruption, thus
> we need to mark it as do not touch first. 

OK.

> > >  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
> > >  					ctx->start_lsn, abort);
> > >  
> > >  	xfs_alloc_busy_sort(&ctx->busy_extents);
> > 
> > I still think sorting the list belongs inside xfs_alloc_busy_clear().
> > I see that list_sort() is not necessarily trivial for an already
> > sorted list though...
> 
> It's a bad idea to do the sort twice for no good reason, and adding
> another parameter to further overload xfs_alloc_busy_clear behaviour
> doesn't seem smart either.

It's fine, it's just the purist in me that wants the interface
to hide the dirty work of sorting.  But I concede your points
here--not a big deal.

> > 		if (error == EOPNOTSUPP) {
> > 			/*
> > 			 * Report this once per mount point somehow?
> > 			 * If so, turn off the mount option?
> > 			 */
> > 			break;
> 
> We've been through this discussion again lately with dm and ext4
> folks, and the conclusion is that EOPNOTSUPP is perfectly fine to happen
> here.

The main thing I wanted to suggest was dropping out after
the first one, but considering the underlying device may
be a compound logical volume I understand now why it's
better to go through them all.  (I hadn't been paying
attention to the dm and ext4 discussion.)

> > > +	 * performing the discard a chance to mark the extent unbusy
> > > +	 * and retry.
> > > +	 */
> > > +	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
> > > +		spin_unlock(&pag->pagb_lock);
> > > +		delay(1);
> > 
> > I hate seeing calls to delay() although sometimes
> > it's the right thing to do...  I don't have a feel
> > for how long a discard is likely to take so I don't
> > know whether waiting here instead would be worth
> > the effort.
> 
> It's not nice, but if the block layer gets fixed and we do asynchronous
> discards it simply goes away.

Sounds good.

> > If this option is to only be available for delaylog, it should
> > say so here (and maybe report that it's being ignored if it's
> > supplied with nodelaylog at mount time).
> 
> ok.
> 

Thanks for answering my questions.  I'll look for an update but
unless there's something obviously wrong I'll turn it around
pretty quickly so this gets in during the merge window.

					-Alex

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4 v2] xfs: add online discard support
  2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
  2011-05-19 21:53   ` Alex Elder
@ 2011-05-20 13:45   ` Christoph Hellwig
  2011-05-20 15:42     ` Alex Elder
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2011-05-20 13:45 UTC (permalink / raw)
  To: xfs

Now that we have reliably tracking of deleted extents in a transaction
we can easily implement "online" discard support which calls
blkdev_issue_discard once a transaction commits.

The actual discard is a two stage operation as we first have to mark
the busy extent as not available for reuse before we can start the
actual discard.  Note that we don't bother supporting discard for
the non-delaylog mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-05-20 15:25:42.005803658 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-05-20 15:30:10.748304489 +0200
@@ -110,8 +110,10 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_GQUOTANOENF "gqnoenforce"/* group quota limit enforcement */
 #define MNTOPT_PQUOTANOENF "pqnoenforce"/* project quota limit enforcement */
 #define MNTOPT_QUOTANOENF  "qnoenforce"	/* same as uqnoenforce */
-#define MNTOPT_DELAYLOG   "delaylog"	/* Delayed loging enabled */
-#define MNTOPT_NODELAYLOG "nodelaylog"	/* Delayed loging disabled */
+#define MNTOPT_DELAYLOG    "delaylog"	/* Delayed loging enabled */
+#define MNTOPT_NODELAYLOG  "nodelaylog"	/* Delayed loging disabled */
+#define MNTOPT_DISCARD	   "discard"	/* Discard unused blocks */
+#define MNTOPT_NODISCARD   "nodiscard"	/* Do not discard unused blocks */
 
 /*
  * Table driven mount option parser.
@@ -355,6 +357,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DELAYLOG;
 		} else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) {
 			mp->m_flags &= ~XFS_MOUNT_DELAYLOG;
+		} else if (!strcmp(this_char, MNTOPT_DISCARD)) {
+			mp->m_flags |= XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
+			mp->m_flags &= ~XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -388,6 +394,13 @@ xfs_parseargs(
 		return EINVAL;
 	}
 
+	if ((mp->m_flags & XFS_MOUNT_DISCARD) &&
+	    !(mp->m_flags & XFS_MOUNT_DELAYLOG)) {
+		xfs_warn(mp,
+	"the discard option is incompatible with the nodelaylog option");
+		return EINVAL;
+	}
+
 #ifndef CONFIG_XFS_QUOTA
 	if (XFS_IS_QUOTA_RUNNING(mp)) {
 		xfs_warn(mp, "quota support not available in this kernel.");
@@ -488,6 +501,7 @@ xfs_showargs(
 		{ XFS_MOUNT_FILESTREAMS,	"," MNTOPT_FILESTREAM },
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DELAYLOG,		"," MNTOPT_DELAYLOG },
+		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
Index: xfs/fs/xfs/xfs_mount.h
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.h	2011-05-20 15:25:42.025805406 +0200
+++ xfs/fs/xfs/xfs_mount.h	2011-05-20 15:25:59.189804374 +0200
@@ -224,6 +224,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
 						   disk errors in metadata */
+#define XFS_MOUNT_DISCARD	(1ULL << 5)	/* discard unused blocks */
 #define XFS_MOUNT_RETERR	(1ULL << 6)     /* return alignment errors to
 						   user */
 #define XFS_MOUNT_NOALIGN	(1ULL << 7)	/* turn off stripe alignment
Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-05-20 15:25:42.033804966 +0200
+++ xfs/fs/xfs/xfs_log_cil.c	2011-05-20 15:25:59.197804089 +0200
@@ -29,6 +29,7 @@
 #include "xfs_mount.h"
 #include "xfs_error.h"
 #include "xfs_alloc.h"
+#include "xfs_discard.h"
 
 /*
  * Perform initial CIL structure initialisation. If the CIL is not
@@ -361,18 +362,28 @@ xlog_cil_committed(
 	int	abort)
 {
 	struct xfs_cil_ctx	*ctx = args;
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
 	xfs_alloc_busy_sort(&ctx->busy_extents);
-	xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, &ctx->busy_extents);
+	xfs_alloc_busy_clear(mp, &ctx->busy_extents,
+			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
 
 	spin_lock(&ctx->cil->xc_cil_lock);
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_cil_lock);
 
 	xlog_cil_free_logvec(ctx->lv_chain);
+
+	if (!list_empty(&ctx->busy_extents)) {
+		ASSERT(mp->m_flags & XFS_MOUNT_DISCARD);
+
+		xfs_discard_extents(mp, &ctx->busy_extents);
+		xfs_alloc_busy_clear(mp, &ctx->busy_extents, false);
+	}
+
 	kmem_free(ctx);
 }
 
Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-05-20 15:25:42.009803019 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-05-20 15:25:59.209804819 +0200
@@ -191,3 +191,32 @@ xfs_ioc_trim(
 		return -XFS_ERROR(EFAULT);
 	return 0;
 }
+
+int
+xfs_discard_extents(
+	struct xfs_mount	*mp,
+	struct list_head	*list)
+{
+	struct xfs_busy_extent	*busyp;
+	int			error = 0;
+
+	list_for_each_entry(busyp, list, list) {
+		trace_xfs_discard_extent(mp, busyp->agno, busyp->bno,
+					 busyp->length);
+
+		error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+				XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno),
+				XFS_FSB_TO_BB(mp, busyp->length),
+				GFP_NOFS, 0);
+		if (error && error != EOPNOTSUPP) {
+			xfs_info(mp,
+	 "discard failed for extent [0x%llu,%u], error %d",
+				 (unsigned long long)busyp->bno,
+				 busyp->length,
+				 error);
+			return error;
+		}
+	}
+
+	return 0;
+}
Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-05-20 15:25:42.017803624 +0200
+++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-05-20 15:25:59.213864508 +0200
@@ -2,7 +2,9 @@
 #define XFS_DISCARD_H 1
 
 struct fstrim_range;
+struct list_head;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
+extern int	xfs_discard_extents(struct xfs_mount *, struct list_head *);
 
 #endif /* XFS_DISCARD_H */
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-05-20 15:25:42.045804125 +0200
+++ xfs/fs/xfs/xfs_ag.h	2011-05-20 15:25:59.217804311 +0200
@@ -187,6 +187,8 @@ struct xfs_busy_extent {
 	xfs_agnumber_t	agno;
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
+	unsigned int	flags;
+#define XFS_ALLOC_BUSY_DISCARDED	0x01	/* undergoing a discard op. */
 };
 
 /*
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-05-20 15:25:52.409233306 +0200
+++ xfs/fs/xfs/xfs_alloc.c	2011-05-20 15:27:46.436367797 +0200
@@ -2609,6 +2609,18 @@ xfs_alloc_busy_update_extent(
 	xfs_agblock_t		bend = bbno + busyp->length;
 
 	/*
+	 * This extent is currently being discarded.  Give the thread
+	 * performing the discard a chance to mark the extent unbusy
+	 * and retry.
+	 */
+	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED) {
+		spin_unlock(&pag->pagb_lock);
+		delay(1);
+		spin_lock(&pag->pagb_lock);
+		return false;
+	}
+
+	/*
 	 * If there is a busy extent overlapping a user allocation, we have
 	 * no choice but to force the log and retry the search.
 	 *
@@ -2813,7 +2825,8 @@ restart:
 		 * If this is a metadata allocation, try to reuse the busy
 		 * extent instead of trimming the allocation.
 		 */
-		if (!args->userdata) {
+		if (!args->userdata &&
+		    !(busyp->flags & XFS_ALLOC_BUSY_DISCARDED)) {
 			if (!xfs_alloc_busy_update_extent(args->mp, args->pag,
 							  busyp, fbno, flen,
 							  false))
@@ -2979,10 +2992,16 @@ xfs_alloc_busy_clear_one(
 	kmem_free(busyp);
 }
 
+/*
+ * Remove all extents on the passed in list from the busy extents tree.
+ * If do_discard is set skip extents that need to be discarded, and mark
+ * these as undergoing a discard operation instead.
+ */
 void
 xfs_alloc_busy_clear(
 	struct xfs_mount	*mp,
-	struct list_head	*list)
+	struct list_head	*list,
+	bool			do_discard)
 {
 	struct xfs_busy_extent	*busyp, *n;
 	struct xfs_perag	*pag = NULL;
@@ -2999,7 +3018,10 @@ xfs_alloc_busy_clear(
 			agno = busyp->agno;
 		}
 
-		xfs_alloc_busy_clear_one(mp, pag, busyp);
+		if (do_discard && busyp->length)
+			busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
+		else
+			xfs_alloc_busy_clear_one(mp, pag, busyp);
 	}
 
 	if (pag) {
Index: xfs/Documentation/filesystems/xfs.txt
===================================================================
--- xfs.orig/Documentation/filesystems/xfs.txt	2011-05-20 15:25:42.096307936 +0200
+++ xfs/Documentation/filesystems/xfs.txt	2011-05-20 15:30:29.096409728 +0200
@@ -39,6 +39,12 @@ When mounting an XFS filesystem, the fol
 	drive level write caching to be enabled, for devices that
 	support write barriers.
 
+  discard
+	Issue command to let the block device reclaim space freed by the
+	filesystem.  This is useful for SSD devices, thinly provisioned
+	LUNs and virtual machine images, but may have a performance
+	impact.  This option is incompatible with the nodelaylog option.
+
   dmapi
 	Enable the DMAPI (Data Management API) event callouts.
 	Use with the "mtpt" option.
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-05-20 15:25:42.065803648 +0200
+++ xfs/fs/xfs/xfs_alloc.h	2011-05-20 15:25:59.225803393 +0200
@@ -140,7 +140,8 @@ xfs_alloc_busy_insert(struct xfs_trans *
 	xfs_agblock_t bno, xfs_extlen_t len);
 
 void
-xfs_alloc_busy_clear(struct xfs_mount *mp, struct list_head *list);
+xfs_alloc_busy_clear(struct xfs_mount *mp, struct list_head *list,
+	bool do_discard);
 
 int
 xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
Index: xfs/fs/xfs/xfs_trans.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.c	2011-05-20 15:25:42.077804275 +0200
+++ xfs/fs/xfs/xfs_trans.c	2011-05-20 15:25:59.229804062 +0200
@@ -609,7 +609,7 @@ xfs_trans_free(
 	struct xfs_trans	*tp)
 {
 	xfs_alloc_busy_sort(&tp->t_busy);
-	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy);
+	xfs_alloc_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	atomic_dec(&tp->t_mountp->m_active_trans);
 	xfs_trans_free_dqinfo(tp);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4] xfs: add online discard support
  2011-05-20 11:43       ` Lukas Czerner
@ 2011-05-20 13:57         ` Alex Elder
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2011-05-20 13:57 UTC (permalink / raw)
  To: Lukas Czerner; +Cc: Christoph Hellwig, xfs

On Fri, 2011-05-20 at 13:43 +0200, Lukas Czerner wrote:
> On Fri, 20 May 2011, Christoph Hellwig wrote:
> 
> > On Thu, May 19, 2011 at 04:53:44PM -0500, Alex Elder wrote:
> > > The first is, why not support it for non-delaylog?
> > 
> > Because:
> > 
. . .
> > 
> > It's a bad idea to do the sort twice for no good reason, and adding
> > another parameter to further overload xfs_alloc_busy_clear behaviour
> > doesn't seem smart either.
> > 
> > > 		if (error == EOPNOTSUPP) {
> > > 			/*
> > > 			 * Report this once per mount point somehow?
> Actually, this is a good idea see https://lkml.org/lkml/2011/5/5/162. So
> you will get EOPNOTSUPP *only* if the device (as a whole) does not
> support discard.

I.e., if *any* component of the underlying storage supports
discard, the aggregate device supports discard.  (Rather than
only if *all* components support it.)  This seems pretty reasonable.

> > > 			 * If so, turn off the mount option?
> Not so good idea, as some people mentioned several times, you can change
> the devices in dmsetup to SSD (for example) without umount and you would
> like your previous mount option to work. In the opposite case, the user
> just gets warning (once a day perhaps?) and its up to him to deal with
> it.

Sorry, I wasn't following that discussion closely.

> Or, we can turn it of (with warning) and rely on the user to notice that
> it is turned off. But I would rather not rely on that.

I agree with you.  I didn't realize the underlying storage could
change attributes without notification of some kind.  The FS layer
might benefit from knowing when such changes take place.

					-Alex

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/4 v2] xfs: add online discard support
  2011-05-20 13:45   ` [PATCH 1/4 v2] " Christoph Hellwig
@ 2011-05-20 15:42     ` Alex Elder
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2011-05-20 15:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-05-20 at 09:45 -0400, Christoph Hellwig wrote:
> Now that we have reliably tracking of deleted extents in a transaction
> we can easily implement "online" discard support which calls
> blkdev_issue_discard once a transaction commits.
> 
> The actual discard is a two stage operation as we first have to mark
> the busy extent as not available for reuse before we can start the
> actual discard.  Note that we don't bother supporting discard for
> the non-delaylog mode.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Given Lukas' note about EOPNOTSUPP getting reported only
if 100% of the underlying storage does not (at the moment)
support discard:
- You might as well drop out of the loop in xfs_discard_extents()
  once EOPNOTSUPP gets reported for any extent.
- It might be nice to report something the first time that
  happens on a particular mount point, record that fact,
  then report again if it begins to *not* happen at some
  point thereafter (due to the underlying support being
  changed via dmsetup while still mounted).

But I'm not going to wait for that.  I'll take in this
patch as-is and if you care to act on those suggestions
will rely on you to do so by sending a new patch.

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-05-20 15:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-04 18:55 [PATCH 0/4] online discard support V3 Christoph Hellwig
2011-05-04 18:55 ` [PATCH 1/4] xfs: add online discard support Christoph Hellwig
2011-05-19 21:53   ` Alex Elder
2011-05-20 10:24     ` Christoph Hellwig
2011-05-20 11:43       ` Lukas Czerner
2011-05-20 13:57         ` Alex Elder
2011-05-20 13:40       ` Alex Elder
2011-05-20 13:45   ` [PATCH 1/4 v2] " Christoph Hellwig
2011-05-20 15:42     ` Alex Elder
2011-05-04 18:55 ` [PATCH 2/4] xfs: do not discard alloc btree blocks Christoph Hellwig
2011-05-19 21:54   ` Alex Elder
2011-05-04 18:55 ` [PATCH 3/4] xfs: add a reference count to the CIL context Christoph Hellwig
2011-05-19 21:54   ` Alex Elder
2011-05-20 10:25     ` Christoph Hellwig
2011-05-04 18:55 ` [PATCH 4/4] xfs: make discard operations asynchronous Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.