* [PATCH 0/6] more efficient busy extent handling and discard support
@ 2011-03-22 19:55 Christoph Hellwig
  2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

This series updates the busy extent tracking to:

 a) keep exact track of busy extents
 b) allow reallocations where they are not a problem, that is generally
    everything but metadata -> userdata reallocations
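
The policy in (b) boils down to roughly the following check (editorial
sketch only, not code from the series; the userdata classification is
what patch 4 introduces):

	/*
	 * Sketch: may an allocation reuse a busy (freed, but freeing
	 * transaction not yet committed) extent?  Metadata allocations
	 * always may; user data may only go on top of blocks that
	 * previously held user data, since the problematic case is a
	 * metadata -> userdata reallocation.
	 */
	static bool busy_extent_reusable(bool alloc_is_userdata,
					 bool busy_was_userdata)
	{
		if (!alloc_is_userdata)
			return true;
		return busy_was_userdata;
	}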

The result is that compilebench is around 5% faster and neither it nor
any other workload I've thrown at the patchset causes a log force anymore.

The second to last patch also adds discard support, which is rather trivial
now that we do track the busy extents exactly.  The last patch further
optimizes the discard code to not stall the log state machine but
execute asynchronously.  This does, however, trigger a bug in the block
layer; for the report and a workaround see:

	http://marc.info/?l=linux-kernel&m=130082329118907&w=2

There are a few further optimizations we can add that I haven't done yet:

 - do not attempt to discard blocks on the AGFL, as the chance of reuse
   is very high, and we can't avoid reallocating from them
 - on SSDs preferably reuse blocks on the busy list to avoid discards

* [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-22 19:55 [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-22 22:30   ` Alex Elder
  2011-03-22 23:30   ` Dave Chinner
  2011-03-22 19:55 ` [PATCH 2/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-optimize-freelist-refills --]
[-- Type: text/plain, Size: 3496 bytes --]

Avoid forcing out busy extents when moving blocks from/to the AGFL.  We
achieve this by moving the busy search out of xfs_alloc_get_freelist into
the callers that need it, and by moving the busy list insert from
xfs_free_ag_extent, which is used both by AGFL refills and real
allocations, to xfs_free_extent, which is only used by the latter.
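
In short (editorial condensation of the hunks below), the check that
used to sit at the end of xfs_alloc_get_freelist():

	if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1))
		xfs_trans_set_sync(tp);

now only happens in the callers that actually need it, e.g. in
xfs_alloc_ag_vextent_small() and xfs_allocbt_alloc_block():

	if (xfs_alloc_busy_search(args->mp, args->agno, fbno, 1))
		xfs_trans_set_sync(args->tp);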

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-19 16:49:23.774797370 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-19 16:49:38.882797272 +0100
@@ -1326,6 +1326,8 @@ xfs_alloc_ag_vextent_small(
 		if (error)
 			goto error0;
 		if (fbno != NULLAGBLOCK) {
+			if (xfs_alloc_busy_search(args->mp, args->agno, fbno, 1))
+				xfs_trans_set_sync(args->tp);
 			if (args->userdata) {
 				xfs_buf_t	*bp;
 
@@ -1617,18 +1619,6 @@ xfs_free_ag_extent(
 
 	trace_xfs_free_extent(mp, agno, bno, len, isfl, haveleft, haveright);
 
-	/*
-	 * Since blocks move to the free list without the coordination
-	 * used in xfs_bmap_finish, we can't allow block to be available
-	 * for reallocation and non-transaction writing (user data)
-	 * until we know that the transaction that moved it to the free
-	 * list is permanently on disk.  We track the blocks by declaring
-	 * these blocks as "busy"; the busy list is maintained on a per-ag
-	 * basis and each transaction records which entries should be removed
-	 * when the iclog commits to disk.  If a busy block is allocated,
-	 * the iclog is pushed up to the LSN that freed the block.
-	 */
-	xfs_alloc_busy_insert(tp, agno, bno, len);
 	return 0;
 
  error0:
@@ -1923,21 +1913,6 @@ xfs_alloc_get_freelist(
 	xfs_alloc_log_agf(tp, agbp, logflags);
 	*bnop = bno;
 
-	/*
-	 * As blocks are freed, they are added to the per-ag busy list and
-	 * remain there until the freeing transaction is committed to disk.
-	 * Now that we have allocated blocks, this list must be searched to see
-	 * if a block is being reused.  If one is, then the freeing transaction
-	 * must be pushed to disk before this transaction.
-	 *
-	 * We do this by setting the current transaction to a sync transaction
-	 * which guarantees that the freeing transaction is on disk before this
-	 * transaction. This is done instead of a synchronous log force here so
-	 * that we don't sit and wait with the AGF locked in the transaction
-	 * during the log force.
-	 */
-	if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1))
-		xfs_trans_set_sync(tp);
 	return 0;
 }
 
@@ -2407,6 +2382,8 @@ xfs_free_extent(
 		be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length));
 #endif
 	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0);
+	if (!error)
+		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);
 error0:
 	xfs_perag_put(args.pag);
 	return error;
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-03-19 16:49:22.929299332 +0100
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-03-19 16:49:30.382841378 +0100
@@ -94,6 +94,8 @@ xfs_allocbt_alloc_block(
 		*stat = 0;
 		return 0;
 	}
+	if (xfs_alloc_busy_search(cur->bc_mp, cur->bc_private.a.agno, bno, 1))
+		xfs_trans_set_sync(cur->bc_tp);
 
 	xfs_trans_agbtree_delta(cur->bc_tp, 1);
 	new->s = cpu_to_be32(bno);

* [PATCH 2/6] xfs: do not immediately reuse busy extent ranges
  2011-03-22 19:55 [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
  2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-22 22:30   ` Alex Elder
  2011-03-25 21:03   ` Alex Elder
  2011-03-22 19:55 ` [PATCH 3/6] xfs: exact busy extent tracking Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-skip-busy-extents --]
[-- Type: text/plain, Size: 21217 bytes --]

Every time we reallocate a busy extent, we cause a synchronous log force
to occur to ensure the freeing transaction is on disk before we continue
and use the newly allocated extent.  This is extremely sub-optimal as we
have to mark every transaction with blocks that get reused as synchronous.

Instead of searching the busy extent list after deciding on the extent to
allocate, check each candidate extent during the allocation decisions as
to whether they are in the busy list.  If they are in the busy list, we
trim the busy range out of the extent we have found and determine if that
trimmed range is still OK for allocation. In many cases, this check can
be incorporated into the allocation extent alignment code which already
does trimming of the found extent before determining if it is a valid
candidate for allocation.
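
Reduced to a single busy range, the trimming idea looks like this
(editorial sketch, assuming the usual xfs_agblock_t/xfs_extlen_t
typedefs; the real xfs_alloc_busy_trim() below walks the whole busy
rbtree and also weighs allocation direction):

	static void
	trim_around_busy(
		xfs_agblock_t	fbno,	/* candidate free extent */
		xfs_extlen_t	flen,
		xfs_agblock_t	bbno,	/* busy extent */
		xfs_extlen_t	blen,
		xfs_agblock_t	*rbno,	/* trimmed candidate */
		xfs_extlen_t	*rlen)
	{
		xfs_agblock_t	fend = fbno + flen;
		xfs_agblock_t	bend = bbno + blen;

		if (bend <= fbno || bbno >= fend) {
			/* no overlap, candidate is usable as found */
		} else if (bbno <= fbno && bend >= fend) {
			/* candidate completely busy, nothing usable */
			flen = 0;
		} else if (bbno <= fbno) {
			/* busy head, keep the tail */
			fbno = bend;
			flen = fend - bend;
		} else if (bend >= fend) {
			/* busy tail, keep the head */
			flen = bbno - fbno;
		} else {
			/* busy middle: prefer the lower piece to keep
			   allocations moving forward */
			flen = bbno - fbno;
		}
		*rbno = fbno;
		*rlen = flen;
	}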

Based on two earlier patches from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-19 16:51:44.645299267 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-19 18:11:35.157296743 +0100
@@ -41,19 +41,13 @@
 #define	XFSA_FIXUP_BNO_OK	1
 #define	XFSA_FIXUP_CNT_OK	2
 
-/*
- * Prototypes for per-ag allocation routines
- */
-
 STATIC int xfs_alloc_ag_vextent_exact(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
-	xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
-
-/*
- * Internal functions.
- */
+		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
+STATIC void xfs_alloc_busy_trim(struct xfs_alloc_arg *,
+		xfs_agblock_t, xfs_extlen_t, xfs_agblock_t *, xfs_extlen_t *);
 
 /*
  * Lookup the record equal to [bno, len] in the btree given by cur.
@@ -154,19 +148,21 @@ xfs_alloc_compute_aligned(
 	xfs_extlen_t	*reslen)	/* result length */
 {
 	xfs_agblock_t	bno;
-	xfs_extlen_t	diff;
 	xfs_extlen_t	len;
 
-	if (args->alignment > 1 && foundlen >= args->minlen) {
-		bno = roundup(foundbno, args->alignment);
-		diff = bno - foundbno;
-		len = diff >= foundlen ? 0 : foundlen - diff;
+	/* Trim busy sections out of found extent */
+	xfs_alloc_busy_trim(args, foundbno, foundlen, &bno, &len);
+
+	if (args->alignment > 1 && len >= args->minlen) {
+		xfs_agblock_t	aligned_bno = roundup(bno, args->alignment);
+		xfs_extlen_t	diff = aligned_bno - bno;
+
+		*resbno = aligned_bno;
+		*reslen = diff >= len ? 0 : len - diff;
 	} else {
-		bno = foundbno;
-		len = foundlen;
+		*resbno = bno;
+		*reslen = len;
 	}
-	*resbno = bno;
-	*reslen = len;
 }
 
 /*
@@ -541,16 +537,8 @@ xfs_alloc_ag_vextent(
 		if (error)
 			return error;
 
-		/*
-		 * Search the busylist for these blocks and mark the
-		 * transaction as synchronous if blocks are found. This
-		 * avoids the need to block due to a synchronous log
-		 * force to ensure correct ordering as the synchronous
-		 * transaction will guarantee that for us.
-		 */
-		if (xfs_alloc_busy_search(args->mp, args->agno,
-					args->agbno, args->len))
-			xfs_trans_set_sync(args->tp);
+		ASSERT(!xfs_alloc_busy_search(args->mp, args->agno,
+					      args->agbno, args->len));
 	}
 
 	if (!args->isfl) {
@@ -577,14 +565,14 @@ xfs_alloc_ag_vextent_exact(
 {
 	xfs_btree_cur_t	*bno_cur;/* by block-number btree cursor */
 	xfs_btree_cur_t	*cnt_cur;/* by count btree cursor */
-	xfs_agblock_t	end;	/* end of allocated extent */
 	int		error;
 	xfs_agblock_t	fbno;	/* start block of found extent */
-	xfs_agblock_t	fend;	/* end block of found extent */
 	xfs_extlen_t	flen;	/* length of found extent */
+	xfs_agblock_t	tbno;	/* start block of trimmed extent */
+	xfs_extlen_t	tlen;	/* length of trimmed extent */
+	xfs_agblock_t	tend;	/* end block of trimmed extent */
+	xfs_agblock_t	end;	/* end of allocated extent */
 	int		i;	/* success/failure of operation */
-	xfs_agblock_t	maxend;	/* end of maximal extent */
-	xfs_agblock_t	minend;	/* end of minimal extent */
 	xfs_extlen_t	rlen;	/* length of returned extent */
 
 	ASSERT(args->alignment == 1);
@@ -614,14 +602,22 @@ xfs_alloc_ag_vextent_exact(
 		goto error0;
 	XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
 	ASSERT(fbno <= args->agbno);
-	minend = args->agbno + args->minlen;
-	maxend = args->agbno + args->maxlen;
-	fend = fbno + flen;
 
 	/*
-	 * Give up if the freespace isn't long enough for the minimum request.
+	 * Check for overlapping busy extents.
+	 */
+	xfs_alloc_busy_trim(args, fbno, flen, &tbno, &tlen);
+
+	/*
+	 * Give up if the start of the extent is busy, or the freespace isn't
+	 * long enough for the minimum request.
 	 */
-	if (fend < minend)
+	if (tbno > args->agbno)
+		goto not_found;
+	if (tlen < args->minlen)
+		goto not_found;
+	tend = tbno + tlen;
+	if (tend < args->agbno + args->minlen)
 		goto not_found;
 
 	/*
@@ -630,14 +626,14 @@ xfs_alloc_ag_vextent_exact(
 	 *
 	 * Fix the length according to mod and prod if given.
 	 */
-	end = XFS_AGBLOCK_MIN(fend, maxend);
+	end = XFS_AGBLOCK_MIN(tend, args->agbno + args->maxlen);
 	args->len = end - args->agbno;
 	xfs_alloc_fix_len(args);
 	if (!xfs_alloc_fix_minleft(args))
 		goto not_found;
 
 	rlen = args->len;
-	ASSERT(args->agbno + rlen <= fend);
+	ASSERT(args->agbno + rlen <= tend);
 	end = args->agbno + rlen;
 
 	/*
@@ -686,11 +682,11 @@ xfs_alloc_find_best_extent(
 	struct xfs_btree_cur	**scur,	/* searching cursor */
 	xfs_agblock_t		gdiff,	/* difference for search comparison */
 	xfs_agblock_t		*sbno,	/* extent found by search */
-	xfs_extlen_t		*slen,
-	xfs_extlen_t		*slena,	/* aligned length */
+	xfs_extlen_t		*slen,	/* extent length */
+	xfs_agblock_t		*sbnoa,	/* aligned extent found by search */
+	xfs_extlen_t		*slena,	/* aligned extent length */
 	int			dir)	/* 0 = search right, 1 = search left */
 {
-	xfs_agblock_t		bno;
 	xfs_agblock_t		new;
 	xfs_agblock_t		sdiff;
 	int			error;
@@ -708,16 +704,16 @@ xfs_alloc_find_best_extent(
 		if (error)
 			goto error0;
 		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-		xfs_alloc_compute_aligned(args, *sbno, *slen, &bno, slena);
+		xfs_alloc_compute_aligned(args, *sbno, *slen, sbnoa, slena);
 
 		/*
 		 * The good extent is closer than this one.
 		 */
 		if (!dir) {
-			if (bno >= args->agbno + gdiff)
+			if (*sbnoa >= args->agbno + gdiff)
 				goto out_use_good;
 		} else {
-			if (bno <= args->agbno - gdiff)
+			if (*sbnoa <= args->agbno - gdiff)
 				goto out_use_good;
 		}
 
@@ -729,8 +725,8 @@ xfs_alloc_find_best_extent(
 			xfs_alloc_fix_len(args);
 
 			sdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-						       args->alignment, *sbno,
-						       *slen, &new);
+						       args->alignment, *sbnoa,
+						       *slena, &new);
 
 			/*
 			 * Choose closer size and invalidate other cursor.
@@ -780,7 +776,7 @@ xfs_alloc_ag_vextent_near(
 	xfs_agblock_t	gtbnoa;		/* aligned ... */
 	xfs_extlen_t	gtdiff;		/* difference to right side entry */
 	xfs_extlen_t	gtlen;		/* length of right side entry */
-	xfs_extlen_t	gtlena = 0;	/* aligned ... */
+	xfs_extlen_t	gtlena;		/* aligned ... */
 	xfs_agblock_t	gtnew;		/* useful start bno of right side */
 	int		error;		/* error code */
 	int		i;		/* result code, temporary */
@@ -789,9 +785,10 @@ xfs_alloc_ag_vextent_near(
 	xfs_agblock_t	ltbnoa;		/* aligned ... */
 	xfs_extlen_t	ltdiff;		/* difference to left side entry */
 	xfs_extlen_t	ltlen;		/* length of left side entry */
-	xfs_extlen_t	ltlena = 0;	/* aligned ... */
+	xfs_extlen_t	ltlena;		/* aligned ... */
 	xfs_agblock_t	ltnew;		/* useful start bno of left side */
 	xfs_extlen_t	rlen;		/* length of returned extent */
+	int		forced = 0;
 #if defined(DEBUG) && defined(__KERNEL__)
 	/*
 	 * Randomly don't execute the first algorithm.
@@ -800,13 +797,20 @@ xfs_alloc_ag_vextent_near(
 
 	dofirst = random32() & 1;
 #endif
+
+restart:
+	bno_cur_lt = NULL;
+	bno_cur_gt = NULL;
+	ltlen = 0;
+	gtlena = 0;
+	ltlena = 0;
+
 	/*
 	 * Get a cursor for the by-size btree.
 	 */
 	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
 		args->agno, XFS_BTNUM_CNT);
-	ltlen = 0;
-	bno_cur_lt = bno_cur_gt = NULL;
+
 	/*
 	 * See if there are any free extents as big as maxlen.
 	 */
@@ -822,11 +826,13 @@ xfs_alloc_ag_vextent_near(
 			goto error0;
 		if (i == 0 || ltlen == 0) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			trace_xfs_alloc_near_noentry(args);
 			return 0;
 		}
 		ASSERT(i == 1);
 	}
 	args->wasfromfl = 0;
+
 	/*
 	 * First algorithm.
 	 * If the requested extent is large wrt the freespaces available
@@ -890,7 +896,7 @@ xfs_alloc_ag_vextent_near(
 			if (args->len < blen)
 				continue;
 			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, ltbno, ltlen, &ltnew);
+				args->alignment, ltbnoa, ltlena, &ltnew);
 			if (ltnew != NULLAGBLOCK &&
 			    (args->len > blen || ltdiff < bdiff)) {
 				bdiff = ltdiff;
@@ -1042,11 +1048,12 @@ xfs_alloc_ag_vextent_near(
 			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
 			xfs_alloc_fix_len(args);
 			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, ltbno, ltlen, &ltnew);
+				args->alignment, ltbnoa, ltlena, &ltnew);
 
 			error = xfs_alloc_find_best_extent(args,
 						&bno_cur_lt, &bno_cur_gt,
-						ltdiff, &gtbno, &gtlen, &gtlena,
+						ltdiff, &gtbno, &gtlen,
+						&gtbnoa, &gtlena,
 						0 /* search right */);
 		} else {
 			ASSERT(gtlena >= args->minlen);
@@ -1057,11 +1064,12 @@ xfs_alloc_ag_vextent_near(
 			args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen);
 			xfs_alloc_fix_len(args);
 			gtdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, gtbno, gtlen, &gtnew);
+				args->alignment, gtbnoa, gtlena, &gtnew);
 
 			error = xfs_alloc_find_best_extent(args,
 						&bno_cur_gt, &bno_cur_lt,
-						gtdiff, &ltbno, &ltlen, &ltlena,
+						gtdiff, &ltbno, &ltlen,
+						&ltbnoa, &ltlena,
 						1 /* search left */);
 		}
 
@@ -1073,6 +1081,12 @@ xfs_alloc_ag_vextent_near(
 	 * If we couldn't get anything, give up.
 	 */
 	if (bno_cur_lt == NULL && bno_cur_gt == NULL) {
+		if (!forced++) {
+			trace_xfs_alloc_near_busy(args);
+			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			goto restart;
+		}
+
 		trace_xfs_alloc_size_neither(args);
 		args->agbno = NULLAGBLOCK;
 		return 0;
@@ -1107,12 +1121,13 @@ xfs_alloc_ag_vextent_near(
 		return 0;
 	}
 	rlen = args->len;
-	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment, ltbno,
-		ltlen, &ltnew);
+	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment,
+				     ltbnoa, ltlena, &ltnew);
 	ASSERT(ltnew >= ltbno);
-	ASSERT(ltnew + rlen <= ltbno + ltlen);
+	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
 	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
 	args->agbno = ltnew;
+
 	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
 			ltnew, rlen, XFSA_FIXUP_BNO_OK)))
 		goto error0;
@@ -1155,26 +1170,35 @@ xfs_alloc_ag_vextent_size(
 	int		i;		/* temp status variable */
 	xfs_agblock_t	rbno;		/* returned block number */
 	xfs_extlen_t	rlen;		/* length of returned extent */
+	int		forced = 0;
 
+restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
 	 */
 	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
 		args->agno, XFS_BTNUM_CNT);
 	bno_cur = NULL;
+
 	/*
 	 * Look for an entry >= maxlen+alignment-1 blocks.
 	 */
 	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0,
 			args->maxlen + args->alignment - 1, &i)))
 		goto error0;
+
 	/*
-	 * If none, then pick up the last entry in the tree unless the
-	 * tree is empty.
-	 */
-	if (!i) {
-		if ((error = xfs_alloc_ag_vextent_small(args, cnt_cur, &fbno,
-				&flen, &i)))
+	 * If none or we have busy extents that we cannot allocate from, then
+	 * we have to settle for a smaller extent. In the case that there are
+	 * no large extents, this will return the last entry in the tree unless
+	 * the tree is empty. In the case that there are only busy large
+	 * extents, this will return the largest small extent unless there
+	 * are no smaller extents available.
+	 */
+	if (!i || forced > 1) {
+		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
+						   &fbno, &flen, &i);
+		if (error)
 			goto error0;
 		if (i == 0 || flen == 0) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
@@ -1182,22 +1206,56 @@ xfs_alloc_ag_vextent_size(
 			return 0;
 		}
 		ASSERT(i == 1);
-	}
-	/*
-	 * There's a freespace as big as maxlen+alignment-1, get it.
-	 */
-	else {
-		if ((error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i)))
-			goto error0;
-		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-	}
+		xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen);
+	} else {
+		/*
+		 * Search for a non-busy extent that is large enough.
+		 * If we are at low space, don't check, or if we fall off
+		 * the end of the btree, turn off the busy check and
+		 * restart.
+		 */
+		for (;;) {
+			error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+			xfs_alloc_compute_aligned(args, fbno, flen,
+						  &rbno, &rlen);
+
+			if (rlen >= args->maxlen)
+				break;
+
+			error = xfs_btree_increment(cnt_cur, 0, &i);
+			if (error)
+				goto error0;
+			if (i == 0) {
+				/*
+				 * Our only valid extents must have been busy.
+				 * Make it unbusy by forcing the log out and
+				 * retrying. If we've been here before, forcing
+				 * the log isn't making the extents available,
+				 * which means they have probably been freed in
+				 * this transaction.  In that case, we have to
+				 * give up on them and we'll attempt a minlen
+				 * allocation the next time around.
+				 */
+				xfs_btree_del_cursor(cnt_cur,
+						     XFS_BTREE_NOERROR);
+				trace_xfs_alloc_size_busy(args);
+				if (!forced++)
+					xfs_log_force(args->mp, XFS_LOG_SYNC);
+				goto restart;
+			}
+		}
+ 	}
+
 	/*
 	 * In the first case above, we got the last entry in the
 	 * by-size btree.  Now we check to see if the space hits maxlen
 	 * once aligned; if not, we search left for something better.
 	 * This can't happen in the second case above.
 	 */
-	xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen);
 	rlen = XFS_EXTLEN_MIN(args->maxlen, rlen);
 	XFS_WANT_CORRUPTED_GOTO(rlen == 0 ||
 			(rlen <= flen && rbno + rlen <= fbno + flen), error0);
@@ -1251,13 +1309,19 @@ xfs_alloc_ag_vextent_size(
 	 * Fix up the length.
 	 */
 	args->len = rlen;
-	xfs_alloc_fix_len(args);
-	if (rlen < args->minlen || !xfs_alloc_fix_minleft(args)) {
-		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-		trace_xfs_alloc_size_nominleft(args);
-		args->agbno = NULLAGBLOCK;
-		return 0;
+	if (rlen < args->minlen) {
+		if (!forced++) {
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			trace_xfs_alloc_size_busy(args);
+			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			goto restart;
+		}
+		goto out_nominleft;
 	}
+	xfs_alloc_fix_len(args);
+
+	if (!xfs_alloc_fix_minleft(args))
+		goto out_nominleft;
 	rlen = args->len;
 	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0);
 	/*
@@ -1287,6 +1351,12 @@ error0:
 	if (bno_cur)
 		xfs_btree_del_cursor(bno_cur, XFS_BTREE_ERROR);
 	return error;
+
+out_nominleft:
+	xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+	trace_xfs_alloc_size_nominleft(args);
+	args->agbno = NULLAGBLOCK;
+	return 0;
 }
 
 /*
@@ -2634,6 +2704,181 @@ xfs_alloc_busy_search(
 	return match;
 }
 
+/*
+ * For a given extent [fbno, flen], search the busy extent list
+ * to find a subset of the extent that is not busy.
+ */
+STATIC void
+xfs_alloc_busy_trim(
+	struct xfs_alloc_arg	*args,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen,
+	xfs_agblock_t		*rbno,
+	xfs_extlen_t		*rlen)
+{
+	struct rb_node		*rbp;
+
+	ASSERT(flen > 0);
+
+	spin_lock(&args->pag->pagb_lock);
+	rbp = args->pag->pagb_tree.rb_node;
+	while (rbp && flen >= args->minlen) {
+		struct xfs_busy_extent *busyp =
+			rb_entry(rbp, struct xfs_busy_extent, rb_node);
+		xfs_agblock_t	fend = fbno + flen;
+		xfs_agblock_t	bbno = busyp->bno;
+		xfs_agblock_t	bend = bbno + busyp->length;
+
+		if (fend <= bbno) {
+			rbp = rbp->rb_left;
+			continue;
+		} else if (fbno >= bend) {
+			rbp = rbp->rb_right;
+			continue;
+		}
+
+		if (bbno <= fbno) {
+			/* start overlap */
+
+			/*
+			 * Case 1:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *        +---------+
+			 *        fbno   fend
+			 *
+			 * Case 2:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *    +-------------+
+			 *    fbno       fend
+			 *
+			 * Case 3:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *        +-------------+
+			 *        fbno       fend
+			 *
+			 * Case 4:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *    +-----------------+
+			 *    fbno           fend
+			 *
+			 * No unbusy region in extent, return failure.
+			 */
+			if (fend <= bend)
+				goto fail;
+
+			/*
+			 * Case 5:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *        +----------------------+
+			 *        fbno                fend
+			 *
+			 * Case 6:
+			 *    bbno           bend
+			 *    +BBBBBBBBBBBBBBBBB+
+			 *    +--------------------------+
+			 *    fbno                    fend
+			 *
+			 * Needs to be trimmed to:
+			 *                       +-------+
+			 *                       fbno fend
+			 */
+			fbno = bend;
+		} else if (bend >= fend) {
+			/* end overlap */
+
+			/*
+			 * Case 7:
+			 *             bbno           bend
+			 *             +BBBBBBBBBBBBBBBBB+
+			 *    +------------------+
+			 *    fbno            fend
+			 *
+			 * Case 8:
+			 *             bbno           bend
+			 *             +BBBBBBBBBBBBBBBBB+
+			 *    +--------------------------+
+			 *    fbno                    fend
+			 *
+			 * Needs to be trimmed to:
+			 *    +-------+
+			 *    fbno fend
+			 */
+			fend = bbno;
+		} else {
+			/* middle overlap */
+
+			/*
+			 * Case 9:
+			 *             bbno           bend
+			 *             +BBBBBBBBBBBBBBBBB+
+			 *    +-----------------------------------+
+			 *    fbno                             fend
+			 *
+			 * Can be trimmed to:
+			 *    +-------+        OR         +-------+
+			 *    fbno fend                   fbno fend
+			 *
+			 * We prefer the lower bno extent because the next
+			 * allocation for this inode will use "end" as the
+			 * target for first block.  If the busy segment has
+			 * cleared, this will get a contiguous allocation next
+			 * time around; if the busy segment has not cleared,
+			 * it will get an allocation at bend, which is a forward
+			 * allocation.
+			 *
+			 * If we choose segment at bend, and this remains the
+			 * best extent for the next allocation (e.g. NEAR_BNO
+			 * allocation) we'll next allocate at bno, which will
+			 * give us backwards allocation.  We already know that
+			 * backwards allocation direction causes significant
+			 * fragmentation of directories and degradation of
+			 * directory performance.
+			 *
+			 * Always choose the option that produces forward
+			 * allocation patterns so that sequential reads and
+			 * writes only ever seek in one direction.  Only choose
+			 * the higher bno extent if the remaining unused extent
+			 * length is much larger than the current allocation
+			 * request, promising us a contiguous allocation in
+			 * the following free space.
+			 */
+
+			if (bbno - fbno >= args->maxlen) {
+				/* left candidate fits perfectly */
+				fend = bbno;
+			} else if (fend - bend >= args->maxlen * 4) {
+				/* right candidate has enough free space */
+				fbno = bend;
+			} else if (bbno - fbno >= args->minlen) {
+				/* left candidate fits minimum requirement */
+				fend = bbno;
+			} else {
+				goto fail;
+			}
+		}
+
+		flen = fend - fbno;
+	}
+	spin_unlock(&args->pag->pagb_lock);
+
+	*rbno = fbno;
+	*rlen = flen;
+	return;
+fail:
+	/*
+	 * Return a zero extent length as a failure indication.  All callers
+	 * re-check if the trimmed extent satisfies the minlen requirement.
+	 */
+	spin_unlock(&args->pag->pagb_lock);
+	*rbno = fbno;
+	*rlen = 0;
+}
+
 void
 xfs_alloc_busy_clear(
 	struct xfs_mount	*mp,
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-03-19 16:51:44.661297219 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-03-19 16:52:48.258797626 +0100
@@ -1433,11 +1433,14 @@ DEFINE_ALLOC_EVENT(xfs_alloc_near_first)
 DEFINE_ALLOC_EVENT(xfs_alloc_near_greater);
 DEFINE_ALLOC_EVENT(xfs_alloc_near_lesser);
 DEFINE_ALLOC_EVENT(xfs_alloc_near_error);
+DEFINE_ALLOC_EVENT(xfs_alloc_near_noentry);
+DEFINE_ALLOC_EVENT(xfs_alloc_near_busy);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_neither);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_noentry);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_nominleft);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_done);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_error);
+DEFINE_ALLOC_EVENT(xfs_alloc_size_busy);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_freelist);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_notenough);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_done);

* [PATCH 3/6] xfs: exact busy extent tracking
  2011-03-22 19:55 [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
  2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
  2011-03-22 19:55 ` [PATCH 2/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-22 23:47   ` Dave Chinner
  2011-03-25 21:04   ` Alex Elder
  2011-03-22 19:55 ` [PATCH 4/6] xfs: allow reusing busy extents where safe Christoph Hellwig
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-better-busy-extent-tracking --]
[-- Type: text/plain, Size: 16502 bytes --]

Update the extent tree in case we have to reuse a busy extent, so that it
is always kept up to date.  This is done by replacing the busy list searches
with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
in case of a reuse.  Also replace setting transactions to sync with forcing
the log out in case we found a busy extent to reuse.  This makes the code a
lot simpler, and is required for discard support later on.  While it will
cause performance regressions with just this patch applied, the impact is
more than mitigated by the next patch in the series.
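
At the reuse sites the change boils down to (editorial condensation of
the hunks below):

	/* before: make the whole transaction synchronous */
	if (xfs_alloc_busy_search(cur->bc_mp, cur->bc_private.a.agno, bno, 1))
		xfs_trans_set_sync(cur->bc_tp);

	/* after: keep the busy tree exact and force the log instead */
	xfs_alloc_busy_reuse(cur->bc_tp, cur->bc_private.a.agno, bno, 1);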

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-20 19:41:55.835479390 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:49:14.157973188 +0100
@@ -1396,8 +1396,8 @@ xfs_alloc_ag_vextent_small(
 		if (error)
 			goto error0;
 		if (fbno != NULLAGBLOCK) {
-			if (xfs_alloc_busy_search(args->mp, args->agno, fbno, 1))
-				xfs_trans_set_sync(args->tp);
+			xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1);
+
 			if (args->userdata) {
 				xfs_buf_t	*bp;
 
@@ -2459,100 +2459,6 @@ error0:
 	return error;
 }
 
-
-/*
- * AG Busy list management
- * The busy list contains block ranges that have been freed but whose
- * transactions have not yet hit disk.  If any block listed in a busy
- * list is reused, the transaction that freed it must be forced to disk
- * before continuing to use the block.
- *
- * xfs_alloc_busy_insert - add to the per-ag busy list
- * xfs_alloc_busy_clear - remove an item from the per-ag busy list
- * xfs_alloc_busy_search - search for a busy extent
- */
-
-/*
- * Insert a new extent into the busy tree.
- *
- * The busy extent tree is indexed by the start block of the busy extent.
- * there can be multiple overlapping ranges in the busy extent tree but only
- * ever one entry at a given start block. The reason for this is that
- * multi-block extents can be freed, then smaller chunks of that extent
- * allocated and freed again before the first transaction commit is on disk.
- * If the exact same start block is freed a second time, we have to wait for
- * that busy extent to pass out of the tree before the new extent is inserted.
- * There are two main cases we have to handle here.
- *
- * The first case is a transaction that triggers a "free - allocate - free"
- * cycle. This can occur during btree manipulations as a btree block is freed
- * to the freelist, then allocated from the free list, then freed again. In
- * this case, the second extxpnet free is what triggers the duplicate and as
- * such the transaction IDs should match. Because the extent was allocated in
- * this transaction, the transaction must be marked as synchronous. This is
- * true for all cases where the free/alloc/free occurs in the one transaction,
- * hence the addition of the ASSERT(tp->t_flags & XFS_TRANS_SYNC) to this case.
- * This serves to catch violations of the second case quite effectively.
- *
- * The second case is where the free/alloc/free occur in different
- * transactions. In this case, the thread freeing the extent the second time
- * can't mark the extent busy immediately because it is already tracked in a
- * transaction that may be committing.  When the log commit for the existing
- * busy extent completes, the busy extent will be removed from the tree. If we
- * allow the second busy insert to continue using that busy extent structure,
- * it can be freed before this transaction is safely in the log.  Hence our
- * only option in this case is to force the log to remove the existing busy
- * extent from the list before we insert the new one with the current
- * transaction ID.
- *
- * The problem we are trying to avoid in the free-alloc-free in separate
- * transactions is most easily described with a timeline:
- *
- *      Thread 1	Thread 2	Thread 3	xfslogd
- *	xact alloc
- *	free X
- *	mark busy
- *	commit xact
- *	free xact
- *			xact alloc
- *			alloc X
- *			busy search
- *			mark xact sync
- *			commit xact
- *			free xact
- *			force log
- *			checkpoint starts
- *			....
- *					xact alloc
- *					free X
- *					mark busy
- *					finds match
- *					*** KABOOM! ***
- *					....
- *							log IO completes
- *							unbusy X
- *			checkpoint completes
- *
- * By issuing a log force in thread 3 @ "KABOOM", the thread will block until
- * the checkpoint completes, and the busy extent it matched will have been
- * removed from the tree when it is woken. Hence it can then continue safely.
- *
- * However, to ensure this matching process is robust, we need to use the
- * transaction ID for identifying transaction, as delayed logging results in
- * the busy extent and transaction lifecycles being different. i.e. the busy
- * extent is active for a lot longer than the transaction.  Hence the
- * transaction structure can be freed and reallocated, then mark the same
- * extent busy again in the new transaction. In this case the new transaction
- * will have a different tid but can have the same address, and hence we need
- * to check against the tid.
- *
- * Future: for delayed logging, we could avoid the log force if the extent was
- * first freed in the current checkpoint sequence. This, however, requires the
- * ability to pin the current checkpoint in memory until this transaction
- * commits to ensure that both the original free and the current one combine
- * logically into the one checkpoint. If the checkpoint sequences are
- * different, however, we still need to wait on a log force.
- */
 void
 xfs_alloc_busy_insert(
 	struct xfs_trans	*tp,
@@ -2564,9 +2470,7 @@ xfs_alloc_busy_insert(
 	struct xfs_busy_extent	*busyp;
 	struct xfs_perag	*pag;
 	struct rb_node		**rbp;
-	struct rb_node		*parent;
-	int			match;
-
+	struct rb_node		*parent = NULL;
 
 	new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
 	if (!new) {
@@ -2583,66 +2487,30 @@ xfs_alloc_busy_insert(
 	new->agno = agno;
 	new->bno = bno;
 	new->length = len;
-	new->tid = xfs_log_get_trans_ident(tp);
-
 	INIT_LIST_HEAD(&new->list);
 
 	/* trace before insert to be able to see failed inserts */
 	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
 
 	pag = xfs_perag_get(tp->t_mountp, new->agno);
-restart:
 	spin_lock(&pag->pagb_lock);
 	rbp = &pag->pagb_tree.rb_node;
-	parent = NULL;
-	busyp = NULL;
-	match = 0;
-	while (*rbp && match >= 0) {
+	while (*rbp) {
 		parent = *rbp;
 		busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
 
 		if (new->bno < busyp->bno) {
 			/* may overlap, but exact start block is lower */
 			rbp = &(*rbp)->rb_left;
-			if (new->bno + new->length > busyp->bno)
-				match = busyp->tid == new->tid ? 1 : -1;
+			BUG_ON(new->bno + new->length > busyp->bno);
 		} else if (new->bno > busyp->bno) {
 			/* may overlap, but exact start block is higher */
 			rbp = &(*rbp)->rb_right;
-			if (bno < busyp->bno + busyp->length)
-				match = busyp->tid == new->tid ? 1 : -1;
+			BUG_ON(bno < busyp->bno + busyp->length);
 		} else {
-			match = busyp->tid == new->tid ? 1 : -1;
-			break;
+			BUG();
 		}
 	}
-	if (match < 0) {
-		/* overlap marked busy in different transaction */
-		spin_unlock(&pag->pagb_lock);
-		xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
-		goto restart;
-	}
-	if (match > 0) {
-		/*
-		 * overlap marked busy in same transaction. Update if exact
-		 * start block match, otherwise combine the busy extents into
-		 * a single range.
-		 */
-		if (busyp->bno == new->bno) {
-			busyp->length = max(busyp->length, new->length);
-			spin_unlock(&pag->pagb_lock);
-			ASSERT(tp->t_flags & XFS_TRANS_SYNC);
-			xfs_perag_put(pag);
-			kmem_free(new);
-			return;
-		}
-		rb_erase(&busyp->rb_node, &pag->pagb_tree);
-		new->length = max(busyp->bno + busyp->length,
-					new->bno + new->length) -
-				min(busyp->bno, new->bno);
-		new->bno = min(busyp->bno, new->bno);
-	} else
-		busyp = NULL;
 
 	rb_link_node(&new->rb_node, parent, rbp);
 	rb_insert_color(&new->rb_node, &pag->pagb_tree);
@@ -2650,7 +2518,6 @@ restart:
 	list_add(&new->list, &tp->t_busy);
 	spin_unlock(&pag->pagb_lock);
 	xfs_perag_put(pag);
-	kmem_free(busyp);
 }
 
 /*
@@ -2704,6 +2571,173 @@ xfs_alloc_busy_search(
 	return match;
 }
 
+STATIC int
+xfs_alloc_busy_try_reuse(
+	struct xfs_perag	*pag,
+	struct xfs_busy_extent	*busyp,
+	xfs_agblock_t		fbno,
+	xfs_agblock_t		fend)
+{
+	xfs_agblock_t	bbno = busyp->bno;
+	xfs_agblock_t	bend = bbno + busyp->length;
+
+	if (bbno < fbno && bend > fend) {
+		/*
+		 * Case 1:
+		 *    bbno           bend
+		 *    +BBBBBBBBBBBBBBBBB+
+		 *        +---------+
+		 *        fbno   fend
+		 */
+
+		/*
+		 * We would have to split the busy extent to be able
+		 * to track it correctly, which we cannot do because we
+		 * would have to modify the list of busy extents
+		 * attached to the transaction/CIL context, which
+		 * is mutable.
+		 *
+		 * Force out the log to clear the busy extents
+		 * and retry the search.
+		 */
+		return -1;
+	} else if (bbno >= fbno && bend <= fend) {
+		/*
+		 * Case 2:
+		 *    bbno           bend
+		 *    +BBBBBBBBBBBBBBBBB+
+		 *    +-----------------+
+		 *    fbno           fend
+		 *
+		 * Case 3:
+		 *    bbno           bend
+		 *    +BBBBBBBBBBBBBBBBB+
+		 *    +--------------------------+
+		 *    fbno                    fend
+		 *
+		 * Case 4:
+		 *             bbno           bend
+		 *             +BBBBBBBBBBBBBBBBB+
+		 *    +--------------------------+
+		 *    fbno                    fend
+		 *
+		 * Case 5:
+		 *             bbno           bend
+		 *             +BBBBBBBBBBBBBBBBB+
+		 *    +-----------------------------------+
+		 *    fbno                             fend
+		 *
+		 */
+
+		/*
+		 * The busy extent is fully covered by the extent
+		 * we are allocating, and can simply be removed from
+		 * the rbtree.  However we cannot remove it from the
+		 * immutable list tracking busy extents in the
+		 * transaction or CIL context, so set the length
+		 * to zero to mark it invalid.
+		 */
+		rb_erase(&busyp->rb_node, &pag->pagb_tree);
+		busyp->length = 0;
+	} else if (bbno == fbno) {
+		/*
+		 * Case 6:
+		 *              bbno           bend
+		 *             +BBBBBBBBBBBBBBBBB+
+		 *             +---------+
+		 *             fbno   fend
+		 *
+		 * Case 7:
+		 *             bbno           bend
+		 *             +BBBBBBBBBBBBBBBBB+
+		 *    +------------------+
+		 *    fbno            fend
+		 *
+		 */
+
+		busyp->bno = fend;
+	} else if (bend == fend) {
+		/*
+		 * Case 8:
+		 *    bbno           bend
+		 *    +BBBBBBBBBBBBBBBBB+
+		 *        +-------------+
+		 *        fbno       fend
+		 *
+		 * Case 9:
+		 *    bbno           bend
+		 *    +BBBBBBBBBBBBBBBBB+
+		 *        +----------------------+
+		 *        fbno                fend
+		 */
+
+		busyp->length = fbno - busyp->bno;
+	} else {
+		BUG();
+	}
+
+	return 1;
+}
+
+
+/*
+ * For a given extent [fbno, flen], make sure we can reuse it safely.
+ */
+void
+xfs_alloc_busy_reuse(
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen)
+{
+	struct xfs_perag	*pag;
+	struct rb_node		*rbp;
+
+	ASSERT(flen > 0);
+
+	pag = xfs_perag_get(tp->t_mountp, agno);
+restart:
+	spin_lock(&pag->pagb_lock);
+	rbp = pag->pagb_tree.rb_node;
+	while (rbp) {
+		struct xfs_busy_extent *busyp =
+			rb_entry(rbp, struct xfs_busy_extent, rb_node);
+		xfs_agblock_t	fend = fbno + flen;
+		xfs_agblock_t	bbno = busyp->bno;
+		xfs_agblock_t	bend = bbno + busyp->length;
+		int		overlap;
+
+		if (fend <= bbno) {
+			rbp = rbp->rb_left;
+			continue;
+		} else if (fbno >= bend) {
+			rbp = rbp->rb_right;
+			continue;
+		}
+
+		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
+						   fbno, fbno + flen);
+		if (overlap) {
+			spin_unlock(&pag->pagb_lock);
+			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
+			goto restart;
+		}
+
+		/*
+		 * No more busy extents to search.
+		 */
+		if (bbno <= fbno && bend >= fend)
+			break;
+
+		if (fbno < bbno)
+			rbp = rbp->rb_left;
+		else
+			rbp = rbp->rb_right;
+	}
+	spin_unlock(&pag->pagb_lock);
+	xfs_perag_put(pag);
+}
+
 /*
  * For a given extent [fbno, flen], search the busy extent list
  * to find a subset of the extent that is not busy.
@@ -2889,14 +2923,12 @@ xfs_alloc_busy_clear(
 	trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno,
 						busyp->length);
 
-	ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno,
-						busyp->length) == 1);
-
 	list_del_init(&busyp->list);
 
 	pag = xfs_perag_get(mp, busyp->agno);
 	spin_lock(&pag->pagb_lock);
-	rb_erase(&busyp->rb_node, &pag->pagb_tree);
+	if (busyp->length)
+		rb_erase(&busyp->rb_node, &pag->pagb_tree);
 	spin_unlock(&pag->pagb_lock);
 	xfs_perag_put(pag);
 
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-03-20 19:41:55.847481367 +0100
+++ xfs/fs/xfs/xfs_alloc.h	2011-03-21 14:48:04.202480897 +0100
@@ -145,6 +145,10 @@ xfs_alloc_busy_clear(struct xfs_mount *m
 int
 xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
 	xfs_agblock_t bno, xfs_extlen_t len);
+
+void
+xfs_alloc_busy_reuse(struct xfs_trans *tp, xfs_agnumber_t agno,
+	xfs_agblock_t fbno, xfs_extlen_t flen);
 #endif	/* __KERNEL__ */
 
 /*
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-03-20 19:41:55.863479131 +0100
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-03-21 14:48:04.210474164 +0100
@@ -94,8 +94,8 @@ xfs_allocbt_alloc_block(
 		*stat = 0;
 		return 0;
 	}
-	if (xfs_alloc_busy_search(cur->bc_mp, cur->bc_private.a.agno, bno, 1))
-		xfs_trans_set_sync(cur->bc_tp);
+
+	xfs_alloc_busy_reuse(cur->bc_tp, cur->bc_private.a.agno, bno, 1);
 
 	xfs_trans_agbtree_delta(cur->bc_tp, 1);
 	new->s = cpu_to_be32(bno);
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-03-21 14:37:58.553976032 +0100
+++ xfs/fs/xfs/xfs_ag.h	2011-03-21 14:48:04.222474401 +0100
@@ -187,7 +187,6 @@ struct xfs_busy_extent {
 	xfs_agnumber_t	agno;
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
-	xlog_tid_t	tid;		/* transaction that created this */
 };
 
 /*
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-03-21 14:37:58.565974096 +0100
+++ xfs/fs/xfs/xfs_log.c	2011-03-21 14:38:03.000000000 +0100
@@ -3248,13 +3248,6 @@ xfs_log_ticket_get(
 	return ticket;
 }
 
-xlog_tid_t
-xfs_log_get_trans_ident(
-	struct xfs_trans	*tp)
-{
-	return tp->t_ticket->t_tid;
-}
-
 /*
  * Allocate and initialise a new log ticket.
  */
Index: xfs/fs/xfs/xfs_log.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log.h	2011-03-21 14:37:58.578474771 +0100
+++ xfs/fs/xfs/xfs_log.h	2011-03-21 14:38:03.000000000 +0100
@@ -189,8 +189,6 @@ void	  xlog_iodone(struct xfs_buf *);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp);
-
 void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
 				struct xfs_log_vec *log_vector,
 				xfs_lsn_t *commit_lsn, int flags);
Index: xfs/fs/xfs/xfs_log_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log_priv.h	2011-03-21 14:37:58.590474432 +0100
+++ xfs/fs/xfs/xfs_log_priv.h	2011-03-21 14:38:03.000000000 +0100
@@ -144,6 +144,8 @@ static inline uint xlog_get_client_id(__
 #define	XLOG_RECOVERY_NEEDED	0x4	/* log was recovered */
 #define XLOG_IO_ERROR		0x8	/* log hit an I/O error, and being
 					   shutdown */
+typedef __uint32_t xlog_tid_t;
+
 
 #ifdef __KERNEL__
 /*
Index: xfs/fs/xfs/xfs_types.h
===================================================================
--- xfs.orig/fs/xfs/xfs_types.h	2011-03-21 14:37:58.610475776 +0100
+++ xfs/fs/xfs/xfs_types.h	2011-03-21 14:38:03.000000000 +0100
@@ -73,8 +73,6 @@ typedef	__int32_t	xfs_tid_t;	/* transact
 typedef	__uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
 typedef	__uint32_t	xfs_dahash_t;	/* dir/attr hash value */
 
-typedef __uint32_t	xlog_tid_t;	/* transaction ID type */
-
 /*
  * These types are 64 bits on disk but are either 32 or 64 bits in memory.
  * Disk based types:

* [PATCH 4/6] xfs: allow reusing busy extents where safe
  2011-03-22 19:55 [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-03-22 19:55 ` [PATCH 3/6] xfs: exact busy extent tracking Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-23  0:20   ` Dave Chinner
  2011-03-25 21:04   ` Alex Elder
  2011-03-22 19:55 ` [PATCH 5/6] xfs: add online discard support Christoph Hellwig
  2011-03-22 19:55 ` [PATCH 6/6] xfs: make discard operations asynchronous Christoph Hellwig
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-simplify-user-allocations --]
[-- Type: text/plain, Size: 16231 bytes --]

Allow reusing any busy extent for metadata allocations, and reusing busy
userdata extents for userdata allocations.  Most of the complexity is in
propagating the userdata information from the XFS_BMAPI_METADATA flag
passed to xfs_bunmapi down into the low-level extent freeing routines.
After that we can just track what type of busy extent we have and treat it
accordingly.
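
As an editorial roadmap of that propagation (function names as in the
hunks below, not code from the patch itself):

	/*
	 * xfs_bmap_del_extent():  userdata iff !(flags & XFS_BMAPI_METADATA)
	 *   -> xfs_bmap_add_free(mp, flist, bno, len,
	 *			  userdata ? XFS_BFI_USERDATA : 0)
	 *   -> xfs_bmap_finish():
	 *	  xfs_free_extent(tp, bno, len,
	 *			  !!(free->xbfi_flags & XFS_BFI_USERDATA))
	 *   -> xfs_alloc_busy_insert(tp, agno, agbno, len, userdata)
	 *	  new->flags = userdata ? XFS_ALLOC_BUSY_USERDATA : 0;
	 *
	 * The allocation side (xfs_alloc_busy_trim/xfs_alloc_busy_reuse)
	 * then reuses a busy extent if the allocation is metadata, or if
	 * both the allocation and the busy extent are userdata.
	 */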

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-21 14:49:14.000000000 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:51:31.746155282 +0100
@@ -1396,7 +1396,8 @@ xfs_alloc_ag_vextent_small(
 		if (error)
 			goto error0;
 		if (fbno != NULLAGBLOCK) {
-			xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1);
+			xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1,
+					     args->userdata);
 
 			if (args->userdata) {
 				xfs_buf_t	*bp;
@@ -2431,7 +2432,8 @@ int				/* error */
 xfs_free_extent(
 	xfs_trans_t	*tp,	/* transaction pointer */
 	xfs_fsblock_t	bno,	/* starting block number of extent */
-	xfs_extlen_t	len)	/* length of extent */
+	xfs_extlen_t	len,	/* length of extent */
+	bool		userdata)
 {
 	xfs_alloc_arg_t	args;
 	int		error;
@@ -2444,6 +2446,7 @@ xfs_free_extent(
 	ASSERT(args.agno < args.mp->m_sb.sb_agcount);
 	args.agbno = XFS_FSB_TO_AGBNO(args.mp, bno);
 	args.pag = xfs_perag_get(args.mp, args.agno);
+	args.userdata = userdata;
 	if ((error = xfs_alloc_fix_freelist(&args, XFS_ALLOC_FLAG_FREEING)))
 		goto error0;
 #ifdef DEBUG
@@ -2453,7 +2456,7 @@ xfs_free_extent(
 #endif
 	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0);
 	if (!error)
-		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);
+		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len, userdata);
 error0:
 	xfs_perag_put(args.pag);
 	return error;
@@ -2464,7 +2467,8 @@ xfs_alloc_busy_insert(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		bno,
-	xfs_extlen_t		len)
+	xfs_extlen_t		len,
+	bool			userdata)
 {
 	struct xfs_busy_extent	*new;
 	struct xfs_busy_extent	*busyp;
@@ -2487,6 +2491,7 @@ xfs_alloc_busy_insert(
 	new->agno = agno;
 	new->bno = bno;
 	new->length = len;
+	new->flags = userdata ? XFS_ALLOC_BUSY_USERDATA : 0;
 	INIT_LIST_HEAD(&new->list);
 
 	/* trace before insert to be able to see failed inserts */
@@ -2688,7 +2693,8 @@ xfs_alloc_busy_reuse(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		fbno,
-	xfs_extlen_t		flen)
+	xfs_extlen_t		flen,
+	bool			userdata)
 {
 	struct xfs_perag	*pag;
 	struct rb_node		*rbp;
@@ -2717,7 +2723,7 @@ restart:
 
 		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
 						   fbno, fbno + flen);
-		if (overlap) {
+		if (overlap == -1 || (overlap && userdata)) {
 			spin_unlock(&pag->pagb_lock);
 			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 			goto restart;
@@ -2754,6 +2760,7 @@ xfs_alloc_busy_trim(
 
 	ASSERT(flen > 0);
 
+restart:
 	spin_lock(&args->pag->pagb_lock);
 	rbp = args->pag->pagb_tree.rb_node;
 	while (rbp && flen >= args->minlen) {
@@ -2771,6 +2778,31 @@ xfs_alloc_busy_trim(
 			continue;
 		}
 
+		if (!args->userdata ||
+		    (busyp->flags & XFS_ALLOC_BUSY_USERDATA)) {
+			int overlap;
+
+			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,
+							   fbno, fbno + flen);
+			if (unlikely(overlap == -1)) {
+				spin_unlock(&args->pag->pagb_lock);
+				xfs_log_force(args->mp, XFS_LOG_SYNC);
+				goto restart;
+			}
+
+			/*
+			 * No more busy extents to search.
+			 */
+			if (bbno <= fbno && bend >= fend)
+				goto out;
+
+			if (fbno < bbno)
+				rbp = rbp->rb_left;
+			else
+				rbp = rbp->rb_right;
+			continue;
+		}
+
 		if (bbno <= fbno) {
 			/* start overlap */
 
@@ -2898,8 +2930,8 @@ xfs_alloc_busy_trim(
 
 		flen = fend - fbno;
 	}
+out:
 	spin_unlock(&args->pag->pagb_lock);
-
 	*rbno = fbno;
 	*rlen = flen;
 	return;
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_alloc.h	2011-03-21 14:49:21.933973842 +0100
@@ -137,7 +137,7 @@ xfs_alloc_longest_free_extent(struct xfs
 #ifdef __KERNEL__
 void
 xfs_alloc_busy_insert(struct xfs_trans *tp, xfs_agnumber_t agno,
-	xfs_agblock_t bno, xfs_extlen_t len);
+	xfs_agblock_t bno, xfs_extlen_t len, bool userdata);
 
 void
 xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp);
@@ -148,7 +148,7 @@ xfs_alloc_busy_search(struct xfs_mount *
 
 void
 xfs_alloc_busy_reuse(struct xfs_trans *tp, xfs_agnumber_t agno,
-	xfs_agblock_t fbno, xfs_extlen_t flen);
+	xfs_agblock_t fbno, xfs_extlen_t flen, bool userdata);
 #endif	/* __KERNEL__ */
 
 /*
@@ -224,7 +224,8 @@ int				/* error */
 xfs_free_extent(
 	struct xfs_trans *tp,	/* transaction pointer */
 	xfs_fsblock_t	bno,	/* starting block number of extent */
-	xfs_extlen_t	len);	/* length of extent */
+	xfs_extlen_t	len,	/* length of extent */
+	bool		userdata);
 
 int					/* error */
 xfs_alloc_lookup_le(
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-03-21 14:49:21.937977488 +0100
@@ -95,7 +95,7 @@ xfs_allocbt_alloc_block(
 		return 0;
 	}
 
-	xfs_alloc_busy_reuse(cur->bc_tp, cur->bc_private.a.agno, bno, 1);
+	xfs_alloc_busy_reuse(cur->bc_tp, cur->bc_private.a.agno, bno, 1, false);
 
 	xfs_trans_agbtree_delta(cur->bc_tp, 1);
 	new->s = cpu_to_be32(bno);
@@ -120,18 +120,8 @@ xfs_allocbt_free_block(
 	if (error)
 		return error;
 
-	/*
-	 * Since blocks move to the free list without the coordination used in
-	 * xfs_bmap_finish, we can't allow block to be available for
-	 * reallocation and non-transaction writing (user data) until we know
-	 * that the transaction that moved it to the free list is permanently
-	 * on disk. We track the blocks by declaring these blocks as "busy";
-	 * the busy list is maintained on a per-ag basis and each transaction
-	 * records which entries should be removed when the iclog commits to
-	 * disk. If a busy block is allocated, the iclog is pushed up to the
-	 * LSN that freed the block.
-	 */
-	xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1);
+	xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno,
+			      1, false);
 	xfs_trans_agbtree_delta(cur->bc_tp, -1);
 	return 0;
 }
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_ag.h	2011-03-21 14:49:21.941981228 +0100
@@ -187,6 +187,8 @@ struct xfs_busy_extent {
 	xfs_agnumber_t	agno;
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
+	unsigned int	flags;
+#define XFS_ALLOC_BUSY_USERDATA		0x01	/* freed data extents */
 };
 
 /*
Index: xfs/fs/xfs/xfs_bmap.c
===================================================================
--- xfs.orig/fs/xfs/xfs_bmap.c	2011-03-21 14:48:04.238474593 +0100
+++ xfs/fs/xfs/xfs_bmap.c	2011-03-21 14:49:21.949973870 +0100
@@ -180,22 +180,6 @@ xfs_bmap_btree_to_extents(
 	int			whichfork); /* data or attr fork */
 
 /*
- * Called by xfs_bmapi to update file extent records and the btree
- * after removing space (or undoing a delayed allocation).
- */
-STATIC int				/* error */
-xfs_bmap_del_extent(
-	xfs_inode_t		*ip,	/* incore inode pointer */
-	xfs_trans_t		*tp,	/* current trans pointer */
-	xfs_extnum_t		idx,	/* extent number to update/insert */
-	xfs_bmap_free_t		*flist,	/* list of extents to be freed */
-	xfs_btree_cur_t		*cur,	/* if null, not a btree */
-	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
-	int			*logflagsp,/* inode logging flags */
-	int			whichfork, /* data or attr fork */
-	int			rsvd);	 /* OK to allocate reserved blocks */
-
-/*
  * Remove the entry "free" from the free item list.  Prev points to the
  * previous entry, unless "free" is the head of the list.
  */
@@ -2811,7 +2795,7 @@ xfs_bmap_btree_to_extents(
 	cblock = XFS_BUF_TO_BLOCK(cbp);
 	if ((error = xfs_btree_check_block(cur, cblock, 0, cbp)))
 		return error;
-	xfs_bmap_add_free(cbno, 1, cur->bc_private.b.flist, mp);
+	xfs_bmap_add_free(mp, cur->bc_private.b.flist, cbno, 1, 0);
 	ip->i_d.di_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
@@ -2838,8 +2822,7 @@ xfs_bmap_del_extent(
 	xfs_btree_cur_t		*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
-	int			whichfork, /* data or attr fork */
-	int			rsvd)	/* OK to allocate reserved blocks */
+	int			flags)	/* XFS_BMAPI_* flags */
 {
 	xfs_filblks_t		da_new;	/* new delay-alloc indirect blocks */
 	xfs_filblks_t		da_old;	/* old delay-alloc indirect blocks */
@@ -2849,7 +2832,6 @@ xfs_bmap_del_extent(
 	int			do_fx;	/* free extent at end of routine */
 	xfs_bmbt_rec_host_t	*ep;	/* current extent entry pointer */
 	int			error;	/* error return value */
-	int			flags;	/* inode logging flags */
 	xfs_bmbt_irec_t		got;	/* current extent entry */
 	xfs_fileoff_t		got_endoff;	/* first offset past got */
 	int			i;	/* temp state */
@@ -2861,12 +2843,17 @@ xfs_bmap_del_extent(
 	uint			qfield;	/* quota field to update */
 	xfs_filblks_t		temp;	/* for indirect length calculations */
 	xfs_filblks_t		temp2;	/* for indirect length calculations */
-	int			state = 0;
+	int			state, whichfork;
 
 	XFS_STATS_INC(xs_del_exlist);
 
-	if (whichfork == XFS_ATTR_FORK)
-		state |= BMAP_ATTRFORK;
+	if (flags & XFS_BMAPI_ATTRFORK) {
+		whichfork = XFS_ATTR_FORK;
+		state = BMAP_ATTRFORK;
+	} else {
+		whichfork = XFS_DATA_FORK;
+		state = 0;
+	}
 
 	mp = ip->i_mount;
 	ifp = XFS_IFORK_PTR(ip, whichfork);
@@ -3121,9 +3108,13 @@ xfs_bmap_del_extent(
 	/*
 	 * If we need to, add to list of extents to delete.
 	 */
-	if (do_fx)
-		xfs_bmap_add_free(del->br_startblock, del->br_blockcount, flist,
-			mp);
+	if (do_fx) {
+		xfs_bmap_add_free(mp, flist, del->br_startblock,
+				  del->br_blockcount,
+				  (flags & XFS_BMAPI_METADATA) ? 0 :
+				   XFS_BFI_USERDATA);
+	}
+
 	/*
 	 * Adjust inode # blocks in the file.
 	 */
@@ -3142,7 +3133,9 @@ xfs_bmap_del_extent(
 	ASSERT(da_old >= da_new);
 	if (da_old > da_new) {
 		xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
-			(int64_t)(da_old - da_new), rsvd);
+			(int64_t)(da_old - da_new),
+			!!(flags & XFS_BMAPI_RSVBLOCKS));
+
 	}
 done:
 	*logflagsp = flags;
@@ -3723,10 +3716,11 @@ error0:
 /* ARGSUSED */
 void
 xfs_bmap_add_free(
+	struct xfs_mount	*mp,		/* mount point structure */
+	struct xfs_bmap_free	*flist,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
 	xfs_filblks_t		len,		/* length of extent */
-	xfs_bmap_free_t		*flist,		/* list of extents */
-	xfs_mount_t		*mp)		/* mount point structure */
+	unsigned int		flags)
 {
 	xfs_bmap_free_item_t	*cur;		/* current (next) element */
 	xfs_bmap_free_item_t	*new;		/* new element */
@@ -3750,6 +3744,7 @@ xfs_bmap_add_free(
 	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
 	new->xbfi_startblock = bno;
 	new->xbfi_blockcount = (xfs_extlen_t)len;
+	new->xbfi_flags = XFS_BFI_USERDATA;
 	for (prev = NULL, cur = flist->xbf_first;
 	     cur != NULL;
 	     prev = cur, cur = cur->xbfi_next) {
@@ -3883,8 +3878,11 @@ xfs_bmap_finish(
 	efd = xfs_trans_get_efd(ntp, efi, flist->xbf_count);
 	for (free = flist->xbf_first; free != NULL; free = next) {
 		next = free->xbfi_next;
-		if ((error = xfs_free_extent(ntp, free->xbfi_startblock,
-				free->xbfi_blockcount))) {
+
+		error = xfs_free_extent(ntp, free->xbfi_startblock,
+				free->xbfi_blockcount,
+				!!(free->xbfi_flags & XFS_BFI_USERDATA));
+		if (error) {
 			/*
 			 * The bmap free list will be cleaned up at a
 			 * higher level.  The EFI will be canceled when
@@ -5278,7 +5276,7 @@ xfs_bunmapi(
 			goto error0;
 		}
 		error = xfs_bmap_del_extent(ip, tp, lastx, flist, cur, &del,
-				&tmp_logflags, whichfork, rsvd);
+				&tmp_logflags, flags);
 		logflags |= tmp_logflags;
 		if (error)
 			goto error0;
Index: xfs/fs/xfs/xfs_bmap.h
===================================================================
--- xfs.orig/fs/xfs/xfs_bmap.h	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_bmap.h	2011-03-21 14:49:21.953974825 +0100
@@ -35,6 +35,8 @@ typedef struct xfs_bmap_free_item
 {
 	xfs_fsblock_t		xbfi_startblock;/* starting fs block number */
 	xfs_extlen_t		xbfi_blockcount;/* number of blocks in extent */
+	unsigned int		xbfi_flags;
+#define XFS_BFI_USERDATA	0x01		/* userdata extent */
 	struct xfs_bmap_free_item *xbfi_next;	/* link to next entry */
 } xfs_bmap_free_item_t;
 
@@ -188,10 +190,11 @@ xfs_bmap_add_attrfork(
  */
 void
 xfs_bmap_add_free(
+	struct xfs_mount	*mp,		/* mount point structure */
+	struct xfs_bmap_free	*flist,		/* list of extents */
 	xfs_fsblock_t		bno,		/* fs block number of extent */
 	xfs_filblks_t		len,		/* length of extent */
-	xfs_bmap_free_t		*flist,		/* list of extents */
-	struct xfs_mount	*mp);		/* mount point structure */
+	unsigned int		flags);
 
 /*
  * Routine to clean up the free list data structure when
Index: xfs/fs/xfs/xfs_bmap_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_bmap_btree.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_bmap_btree.c	2011-03-21 14:49:21.957973331 +0100
@@ -598,7 +598,7 @@ xfs_bmbt_free_block(
 	struct xfs_trans	*tp = cur->bc_tp;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, XFS_BUF_ADDR(bp));
 
-	xfs_bmap_add_free(fsbno, 1, cur->bc_private.b.flist, mp);
+	xfs_bmap_add_free(mp, cur->bc_private.b.flist, fsbno, 1, 0);
 	ip->i_d.di_nblocks--;
 
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
Index: xfs/fs/xfs/xfs_fsops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_fsops.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_fsops.c	2011-03-21 14:49:21.961974307 +0100
@@ -344,7 +344,7 @@ xfs_growfs_data_private(
 		 * Free the new space.
 		 */
 		error = xfs_free_extent(tp, XFS_AGB_TO_FSB(mp, agno,
-			be32_to_cpu(agf->agf_length) - new), new);
+			be32_to_cpu(agf->agf_length) - new), new, false);
 		if (error) {
 			goto error0;
 		}
Index: xfs/fs/xfs/xfs_ialloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_ialloc.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_ialloc.c	2011-03-21 14:49:21.965975280 +0100
@@ -1154,9 +1154,10 @@ xfs_difree(
 			goto error0;
 		}
 
-		xfs_bmap_add_free(XFS_AGB_TO_FSB(mp,
-				agno, XFS_INO_TO_AGBNO(mp,rec.ir_startino)),
-				XFS_IALLOC_BLOCKS(mp), flist, mp);
+		xfs_bmap_add_free(mp, flist,
+				  XFS_AGB_TO_FSB(mp, agno,
+					XFS_INO_TO_AGBNO(mp,rec.ir_startino)),
+				  XFS_IALLOC_BLOCKS(mp), 0);
 	} else {
 		*delete = 0;
 
Index: xfs/fs/xfs/xfs_ialloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_ialloc_btree.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_ialloc_btree.c	2011-03-21 14:49:21.973975583 +0100
@@ -117,7 +117,7 @@ xfs_inobt_free_block(
 	int			error;
 
 	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, XFS_BUF_ADDR(bp));
-	error = xfs_free_extent(cur->bc_tp, fsbno, 1);
+	error = xfs_free_extent(cur->bc_tp, fsbno, 1, false);
 	if (error)
 		return error;
 
Index: xfs/fs/xfs/xfs_log_recover.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_recover.c	2011-03-21 14:48:04.000000000 +0100
+++ xfs/fs/xfs/xfs_log_recover.c	2011-03-21 14:49:21.973975583 +0100
@@ -2907,8 +2907,9 @@ xlog_recover_process_efi(
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
 
 	for (i = 0; i < efip->efi_format.efi_nextents; i++) {
-		extp = &(efip->efi_format.efi_extents[i]);
-		error = xfs_free_extent(tp, extp->ext_start, extp->ext_len);
+		extp = &efip->efi_format.efi_extents[i];
+		error = xfs_free_extent(tp, extp->ext_start, extp->ext_len,
+					false);
 		if (error)
 			goto abort_error;
 		xfs_trans_log_efd_extent(tp, efdp, extp->ext_start,


* [PATCH 5/6] xfs: add online discard support
  2011-03-22 19:55 [PATCH 0/6] [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
                   ` (3 preceding siblings ...)
  2011-03-22 19:55 ` [PATCH 4/6] xfs: allow reusing busy extents where safe Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-23  0:30   ` Dave Chinner
  2011-03-25 21:04   ` Alex Elder
  2011-03-22 19:55 ` [PATCH 6/6] xfs: make discard operations asynchronous Christoph Hellwig
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-add-online-discard-support --]
[-- Type: text/plain, Size: 6812 bytes --]

Now that we have reliable tracking of deleted extents in a transaction,
we can easily implement "online" discard support, which calls
blkdev_issue_discard once a transaction commits.

The actual discard is a two-stage operation, as we first have to mark
the busy extent as not available for reuse before we can start the
actual discard.  Note that we don't bother supporting discard for
the non-delaylog mode.  While that would be possible with this patch,
performance is awful, and the optimization in the next patch won't
work as easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_super.c	2011-03-21 14:51:40.806473979 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_super.c	2011-03-21 14:51:41.437974814 +0100
@@ -112,6 +112,8 @@ mempool_t *xfs_ioend_pool;
 #define MNTOPT_QUOTANOENF  "qnoenforce"	/* same as uqnoenforce */
 #define MNTOPT_DELAYLOG   "delaylog"	/* Delayed loging enabled */
 #define MNTOPT_NODELAYLOG "nodelaylog"	/* Delayed loging disabled */
+#define MNTOPT_DISCARD	"discard"	/* Discard unused blocks */
+#define MNTOPT_NODISCARD "nodiscard"	/* Do not discard unused blocks */
 
 /*
  * Table driven mount option parser.
@@ -356,6 +358,10 @@ xfs_parseargs(
 			mp->m_flags |= XFS_MOUNT_DELAYLOG;
 		} else if (!strcmp(this_char, MNTOPT_NODELAYLOG)) {
 			mp->m_flags &= ~XFS_MOUNT_DELAYLOG;
+		} else if (!strcmp(this_char, MNTOPT_DISCARD)) {
+			mp->m_flags |= XFS_MOUNT_DISCARD;
+		} else if (!strcmp(this_char, MNTOPT_NODISCARD)) {
+			mp->m_flags &= ~XFS_MOUNT_DISCARD;
 		} else if (!strcmp(this_char, "ihashsize")) {
 			xfs_warn(mp,
 	"ihashsize no longer used, option is deprecated.");
@@ -489,6 +495,7 @@ xfs_showargs(
 		{ XFS_MOUNT_FILESTREAMS,	"," MNTOPT_FILESTREAM },
 		{ XFS_MOUNT_GRPID,		"," MNTOPT_GRPID },
 		{ XFS_MOUNT_DELAYLOG,		"," MNTOPT_DELAYLOG },
+		{ XFS_MOUNT_DISCARD,		"," MNTOPT_DISCARD },
 		{ 0, NULL }
 	};
 	static struct proc_xfs_info xfs_info_unset[] = {
Index: xfs/fs/xfs/xfs_mount.h
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.h	2011-03-21 14:47:03.630474095 +0100
+++ xfs/fs/xfs/xfs_mount.h	2011-03-21 14:51:41.437974814 +0100
@@ -227,6 +227,7 @@ typedef struct xfs_mount {
 #define XFS_MOUNT_FS_SHUTDOWN	(1ULL << 4)	/* atomic stop of all filesystem
 						   operations, typically for
 						   disk errors in metadata */
+#define XFS_MOUNT_DISCARD	(1ULL << 5)	/* discard unused blocks */
 #define XFS_MOUNT_RETERR	(1ULL << 6)     /* return alignment errors to
 						   user */
 #define XFS_MOUNT_NOALIGN	(1ULL << 7)	/* turn off stripe alignment
Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-03-21 14:47:03.638475274 +0100
+++ xfs/fs/xfs/xfs_log_cil.c	2011-03-21 14:51:41.441975564 +0100
@@ -29,6 +29,7 @@
 #include "xfs_mount.h"
 #include "xfs_error.h"
 #include "xfs_alloc.h"
+#include "xfs_discard.h"
 
 /*
  * Perform initial CIL structure initialisation. If the CIL is not
@@ -361,13 +362,17 @@ xlog_cil_committed(
 	int	abort)
 {
 	struct xfs_cil_ctx	*ctx = args;
+	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 	struct xfs_busy_extent	*busyp, *n;
 
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
-	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
-		xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp);
+	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
+		if (!abort)
+			xfs_discard_extent(mp, busyp);
+		xfs_alloc_busy_clear(mp, busyp);
+	}
 
 	spin_lock(&ctx->cil->xc_cil_lock);
 	list_del(&ctx->committing);
Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-03-21 14:47:03.614474345 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-03-21 14:51:41.449976366 +0100
@@ -191,3 +191,38 @@ xfs_ioc_trim(
 		return -XFS_ERROR(EFAULT);
 	return 0;
 }
+
+int
+xfs_discard_extent(
+	struct xfs_mount	*mp,
+	struct xfs_busy_extent	*busyp)
+{
+	struct xfs_perag	*pag;
+	int			error = 0;
+	xfs_daddr_t		bno;
+	int64_t			len;
+	bool			done  = false;
+
+	if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0)
+		return 0;
+
+	bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
+	len = XFS_FSB_TO_BB(mp, busyp->length);
+
+	pag = xfs_perag_get(mp, busyp->agno);
+	spin_lock(&pag->pagb_lock);
+	if (!busyp->length)
+		done = true;
+	busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
+	spin_unlock(&pag->pagb_lock);
+	xfs_perag_put(pag);
+
+	if (done)
+		return 0;
+
+	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
+				      GFP_NOFS, 0);
+	if (error && error != EOPNOTSUPP)
+		xfs_info(mp, "discard failed, error %d", error);
+	return error;
+}
Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-03-21 14:47:03.622475346 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-03-21 14:51:41.453973708 +0100
@@ -2,7 +2,11 @@
 #define XFS_DISCARD_H 1
 
 struct fstrim_range;
+struct xfs_busy_extent;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
 
+extern int	xfs_discard_extent(struct xfs_mount *,
+				   struct xfs_busy_extent *);
+
 #endif /* XFS_DISCARD_H */
Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-03-21 14:49:21.941981228 +0100
+++ xfs/fs/xfs/xfs_ag.h	2011-03-21 14:51:41.457976522 +0100
@@ -189,6 +189,7 @@ struct xfs_busy_extent {
 	xfs_extlen_t	length;
 	unsigned int	flags;
 #define XFS_ALLOC_BUSY_USERDATA		0x01	/* freed data extents */
+#define XFS_ALLOC_BUSY_DISCARDED	0x02	/* can't be reused */
 };
 
 /*
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-21 14:51:31.746155282 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:51:41.457976522 +0100
@@ -2586,6 +2586,9 @@ xfs_alloc_busy_try_reuse(
 	xfs_agblock_t	bbno = busyp->bno;
 	xfs_agblock_t	bend = bbno + busyp->length;
 
+	if (busyp->flags & XFS_ALLOC_BUSY_DISCARDED)
+		return -1;
+
 	if (bbno < fbno && bend > fend) {
 		/*
 		 * Case 1:
@@ -2778,8 +2781,9 @@ restart:
 			continue;
 		}
 
-		if (!args->userdata ||
-		    (busyp->flags & XFS_ALLOC_BUSY_USERDATA)) {
+		if (!(busyp->flags & XFS_ALLOC_BUSY_DISCARDED) &&
+		    (!args->userdata ||
+		     (busyp->flags & XFS_ALLOC_BUSY_USERDATA))) {
 			int overlap;
 
 			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,


* [PATCH 6/6] xfs: make discard operations asynchronous
  2011-03-22 19:55 [PATCH 0/6] [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
                   ` (4 preceding siblings ...)
  2011-03-22 19:55 ` [PATCH 5/6] xfs: add online discard support Christoph Hellwig
@ 2011-03-22 19:55 ` Christoph Hellwig
  2011-03-23  0:43   ` Dave Chinner
  2011-03-25 21:04   ` Alex Elder
  5 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-22 19:55 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-async-discard-v2 --]
[-- Type: text/plain, Size: 10195 bytes --]

Instead of waiting for each discard request, keep the CIL context alive
until all of them are done, at which point we can tear it down completely
and remove the busy extents from the rbtree.

At this point I'm doing the I/O completion from IRQ context for simplicity,
but I'll benchmark it against a version that uses a workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-03-22 15:58:10.301855813 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-03-22 18:39:09.000000000 +0100
@@ -30,6 +30,7 @@
 #include "xfs_inode.h"
 #include "xfs_alloc.h"
 #include "xfs_error.h"
+#include "xfs_log_priv.h"
 #include "xfs_discard.h"
 #include "xfs_trace.h"
 
@@ -192,37 +193,119 @@ xfs_ioc_trim(
 	return 0;
 }
 
+void
+xfs_cil_discard_done(
+	struct xfs_cil_ctx	*ctx)
+{
+	if (atomic_dec_and_test(&ctx->discards)) {
+		struct xfs_busy_extent	*busyp, *n;
+
+		list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
+			xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp);
+		kmem_free(ctx);
+	}
+}
+
+STATIC void
+xfs_discard_end_io(
+	struct bio		*bio,
+	int			err)
+{
+	struct xfs_cil_ctx	*ctx = bio->bi_private;
+
+	if (err && err != -EOPNOTSUPP) {
+		xfs_info(ctx->cil->xc_log->l_mp,
+			 "I/O error during discard\n");
+	}
+
+	bio_put(bio);
+	xfs_cil_discard_done(ctx);
+}
+
+static int
+xfs_issue_discard(
+	struct block_device	*bdev,
+	sector_t		sector,
+	sector_t		nr_sects,
+	gfp_t			gfp_mask,
+	struct xfs_cil_ctx	*ctx)
+{
+	struct request_queue	*q = bdev_get_queue(bdev);
+	unsigned int		max_discard_sectors;
+	struct bio		*bio;
+	int			ret = 0;
+
+	if (!q)
+		return -ENXIO;
+
+	if (!blk_queue_discard(q))
+		return -EOPNOTSUPP;
+
+	/*
+	 * Ensure that max_discard_sectors is of the proper
+	 * granularity
+	 */
+	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+	if (q->limits.discard_granularity) {
+		unsigned int disc_sects = q->limits.discard_granularity >> 9;
+
+		max_discard_sectors &= ~(disc_sects - 1);
+	}
+
+
+	while (nr_sects && !ret) {
+		bio = bio_alloc(gfp_mask, 1);
+		if (!bio) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		bio->bi_sector = sector;
+		bio->bi_end_io = xfs_discard_end_io;
+		bio->bi_bdev = bdev;
+		bio->bi_private = ctx;
+
+		if (nr_sects > max_discard_sectors) {
+			bio->bi_size = max_discard_sectors << 9;
+			nr_sects -= max_discard_sectors;
+			sector += max_discard_sectors;
+		} else {
+			bio->bi_size = nr_sects << 9;
+			nr_sects = 0;
+		}
+
+		atomic_inc(&ctx->discards);
+		submit_bio(REQ_WRITE | REQ_DISCARD, bio);
+	}
+
+	return ret;
+}
+
 int
 xfs_discard_extent(
 	struct xfs_mount	*mp,
-	struct xfs_busy_extent	*busyp)
+	struct xfs_busy_extent	*busyp,
+	struct xfs_cil_ctx	*ctx)
 {
 	struct xfs_perag	*pag;
-	int			error = 0;
 	xfs_daddr_t		bno;
 	int64_t			len;
 	bool			done  = false;
 
-	if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0)
-		return 0;
-
 	bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
 	len = XFS_FSB_TO_BB(mp, busyp->length);
 
 	pag = xfs_perag_get(mp, busyp->agno);
-	spin_lock(&pag->pagb_lock);
+	spin_lock_irq(&pag->pagb_lock);
 	if (!busyp->length)
 		done = true;
 	busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
-	spin_unlock(&pag->pagb_lock);
+	spin_unlock_irq(&pag->pagb_lock);
 	xfs_perag_put(pag);
 
 	if (done)
 		return 0;
 
-	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
-				      GFP_NOFS, 0);
-	if (error && error != EOPNOTSUPP)
-		xfs_info(mp, "discard failed, error %d", error);
-	return error;
+	return -xfs_issue_discard(mp->m_ddev_targp->bt_bdev,
+				  bno, len, GFP_NOFS, ctx);
 }
Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-03-22 15:58:10.313857879 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-03-22 18:39:09.000000000 +0100
@@ -3,10 +3,13 @@
 
 struct fstrim_range;
 struct xfs_busy_extent;
+struct xfs_cil_ctx;
 
 extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
 
 extern int	xfs_discard_extent(struct xfs_mount *,
-				   struct xfs_busy_extent *);
+				   struct xfs_busy_extent *,
+				   struct xfs_cil_ctx *);
+extern void	xfs_cil_discard_done(struct xfs_cil_ctx	*ctx);
 
 #endif /* XFS_DISCARD_H */
Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2011-03-22 15:58:10.329855977 +0100
+++ xfs/fs/xfs/xfs_log_cil.c	2011-03-22 18:39:09.000000000 +0100
@@ -68,6 +68,7 @@ xlog_cil_init(
 	INIT_LIST_HEAD(&ctx->busy_extents);
 	ctx->sequence = 1;
 	ctx->cil = cil;
+	atomic_set(&ctx->discards, 1);
 	cil->xc_ctx = ctx;
 	cil->xc_current_sequence = ctx->sequence;
 
@@ -364,14 +365,18 @@ xlog_cil_committed(
 	struct xfs_cil_ctx	*ctx = args;
 	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 	struct xfs_busy_extent	*busyp, *n;
+	bool			keep_alive = false;
 
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
-	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
-		if (!abort)
-			xfs_discard_extent(mp, busyp);
-		xfs_alloc_busy_clear(mp, busyp);
+	if (!(mp->m_flags & XFS_MOUNT_DISCARD) || abort) {
+		list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
+			xfs_alloc_busy_clear(mp, busyp);
+	} else if (!list_empty(&ctx->busy_extents)) {
+		list_for_each_entry(busyp, &ctx->busy_extents, list)
+			xfs_discard_extent(mp, busyp, ctx);
+		keep_alive = true;
 	}
 
 	spin_lock(&ctx->cil->xc_cil_lock);
@@ -379,7 +384,10 @@ xlog_cil_committed(
 	spin_unlock(&ctx->cil->xc_cil_lock);
 
 	xlog_cil_free_logvec(ctx->lv_chain);
-	kmem_free(ctx);
+	if (keep_alive)
+		xfs_cil_discard_done(ctx);
+	else
+		kmem_free(ctx);
 }
 
 /*
@@ -490,6 +498,7 @@ xlog_cil_push(
 	INIT_LIST_HEAD(&new_ctx->busy_extents);
 	new_ctx->sequence = ctx->sequence + 1;
 	new_ctx->cil = cil;
+	atomic_set(&ctx->discards, 1);
 	cil->xc_ctx = new_ctx;
 
 	/*
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-22 18:39:05.173855849 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-03-22 18:39:09.000000000 +0100
@@ -2498,7 +2498,7 @@ xfs_alloc_busy_insert(
 	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
 
 	pag = xfs_perag_get(tp->t_mountp, new->agno);
-	spin_lock(&pag->pagb_lock);
+	spin_lock_irq(&pag->pagb_lock);
 	rbp = &pag->pagb_tree.rb_node;
 	while (*rbp) {
 		parent = *rbp;
@@ -2521,7 +2521,7 @@ xfs_alloc_busy_insert(
 	rb_insert_color(&new->rb_node, &pag->pagb_tree);
 
 	list_add(&new->list, &tp->t_busy);
-	spin_unlock(&pag->pagb_lock);
+	spin_unlock_irq(&pag->pagb_lock);
 	xfs_perag_put(pag);
 }
 
@@ -2547,7 +2547,7 @@ xfs_alloc_busy_search(
 	int			match = 0;
 
 	pag = xfs_perag_get(mp, agno);
-	spin_lock(&pag->pagb_lock);
+	spin_lock_irq(&pag->pagb_lock);
 
 	rbp = pag->pagb_tree.rb_node;
 
@@ -2570,7 +2570,7 @@ xfs_alloc_busy_search(
 			break;
 		}
 	}
-	spin_unlock(&pag->pagb_lock);
+	spin_unlock_irq(&pag->pagb_lock);
 	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match);
 	xfs_perag_put(pag);
 	return match;
@@ -2706,7 +2706,7 @@ xfs_alloc_busy_reuse(
 
 	pag = xfs_perag_get(tp->t_mountp, agno);
 restart:
-	spin_lock(&pag->pagb_lock);
+	spin_lock_irq(&pag->pagb_lock);
 	rbp = pag->pagb_tree.rb_node;
 	while (rbp) {
 		struct xfs_busy_extent *busyp =
@@ -2727,7 +2727,7 @@ restart:
 		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
 						   fbno, fbno + flen);
 		if (overlap == -1 || (overlap && userdata)) {
-			spin_unlock(&pag->pagb_lock);
+			spin_unlock_irq(&pag->pagb_lock);
 			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
 			goto restart;
 		}
@@ -2743,7 +2743,7 @@ restart:
 		else
 			rbp = rbp->rb_right;
 	}
-	spin_unlock(&pag->pagb_lock);
+	spin_unlock_irq(&pag->pagb_lock);
 	xfs_perag_put(pag);
 }
 
@@ -2764,7 +2764,7 @@ xfs_alloc_busy_trim(
 	ASSERT(flen > 0);
 
 restart:
-	spin_lock(&args->pag->pagb_lock);
+	spin_lock_irq(&args->pag->pagb_lock);
 	rbp = args->pag->pagb_tree.rb_node;
 	while (rbp && flen >= args->minlen) {
 		struct xfs_busy_extent *busyp =
@@ -2789,7 +2789,7 @@ restart:
 			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,
 							   fbno, fbno + flen);
 			if (unlikely(overlap == -1)) {
-				spin_unlock(&args->pag->pagb_lock);
+				spin_unlock_irq(&args->pag->pagb_lock);
 				xfs_log_force(args->mp, XFS_LOG_SYNC);
 				goto restart;
 			}
@@ -2935,7 +2935,7 @@ restart:
 		flen = fend - fbno;
 	}
 out:
-	spin_unlock(&args->pag->pagb_lock);
+	spin_unlock_irq(&args->pag->pagb_lock);
 	*rbno = fbno;
 	*rlen = flen;
 	return;
@@ -2944,7 +2944,7 @@ fail:
 	 * Return a zero extent length as failure indications.  All callers
 	 * re-check if the trimmed extent satisfies the minlen requirement.
 	 */
-	spin_unlock(&args->pag->pagb_lock);
+	spin_unlock_irq(&args->pag->pagb_lock);
 	*rbno = fbno;
 	*rlen = 0;
 }
@@ -2955,6 +2955,7 @@ xfs_alloc_busy_clear(
 	struct xfs_busy_extent	*busyp)
 {
 	struct xfs_perag	*pag;
+	unsigned long		flags;
 
 	trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno,
 						busyp->length);
@@ -2962,10 +2963,10 @@ xfs_alloc_busy_clear(
 	list_del_init(&busyp->list);
 
 	pag = xfs_perag_get(mp, busyp->agno);
-	spin_lock(&pag->pagb_lock);
+	spin_lock_irqsave(&pag->pagb_lock, flags);
 	if (busyp->length)
 		rb_erase(&busyp->rb_node, &pag->pagb_tree);
-	spin_unlock(&pag->pagb_lock);
+	spin_unlock_irqrestore(&pag->pagb_lock, flags);
 	xfs_perag_put(pag);
 
 	kmem_free(busyp);
Index: xfs/fs/xfs/xfs_log_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log_priv.h	2011-03-22 18:39:05.229883275 +0100
+++ xfs/fs/xfs/xfs_log_priv.h	2011-03-22 18:39:09.000000000 +0100
@@ -389,6 +389,7 @@ struct xfs_cil_ctx {
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	xfs_log_callback_t	log_cb;		/* completion callback hook. */
 	struct list_head	committing;	/* ctx committing list */
+	atomic_t		discards;	/* no. of pending discards */
 };
 
 /*


* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
@ 2011-03-22 22:30   ` Alex Elder
  2011-03-23 12:16     ` Christoph Hellwig
  2011-03-22 23:30   ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-22 22:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Avoid forcing out busy extent when moving blocks from/to the AGFL.  We

Without thinking a bit about it, it wasn't clear why doing
this was OK to do.  It would be nice to record a one-sentence
justification of why we don't need to force busy extents out
in this case.

> archive this my moving the busy search out of xfs_alloc_get_freelist into

 achieve this

> the callers that need it, and by moving the busy list insert from
> xfs_free_ag_extent extent which is used both by AGFL refills and real
                     ^^^^^ drop this word

> allocation to xfs_free_extent, which is only used by the latter.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Otherwise the change itself looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


. . .


* Re: [PATCH 2/6] xfs: do not immediately reuse busy extent ranges
  2011-03-22 19:55 ` [PATCH 2/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
@ 2011-03-22 22:30   ` Alex Elder
  2011-03-23 12:17     ` Christoph Hellwig
  2011-03-25 21:03   ` Alex Elder
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-22 22:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> plain text document attachment (xfs-skip-busy-extents)
> Every time we reallocate a busy extent, we cause a synchronous log force
> to occur to ensure the freeing transaction is on disk before we continue
> and use the newly allocated extent.  This is extremely sub-optimal as we
> have to mark every transaction with blocks that get reused as synchronous.
> 
> Instead of searching the busy extent list after deciding on the extent to
> allocate, check each candidate extent during the allocation decisions as
> to whether they are in the busy list.  If they are in the busy list, we
> trim the busy range out of the extent we have found and determine if that
> trimmed range is still OK for allocation. In many cases, this check can
> be incorporated into the allocation extent alignment code which already
> does trimming of the found extent before determining if it is a valid
> candidate for allocation.
> 
> Based on two earlier patches from Dave Chinner.

Again, this looks nearly identical to your last version.
On that one, I suggested rewording a comment, and you
said "ok."  You did not do so in this version, which I
guess is fine.  I just want to know whether you intended
to.  If so I'll give you a chance to post an update; if
not I can take this one as-is.

(I think I'll have the rest of the series reviewed tomorrow.)

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Christoph Hellwig <hch@lst.de>

. . .


* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
  2011-03-22 22:30   ` Alex Elder
@ 2011-03-22 23:30   ` Dave Chinner
  2011-03-23 12:16     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2011-03-22 23:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 22, 2011 at 03:55:51PM -0400, Christoph Hellwig wrote:
> Avoid forcing out busy extent when moving blocks from/to the AGFL.  We
> archive this my moving the busy search out of xfs_alloc_get_freelist into
> the callers that need it, and by moving the busy list insert from
> xfs_free_ag_extent extent which is used both by AGFL refills and real
> allocation to xfs_free_extent, which is only used by the latter.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-19 16:49:23.774797370 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-03-19 16:49:38.882797272 +0100
> @@ -1326,6 +1326,8 @@ xfs_alloc_ag_vextent_small(
>  		if (error)
>  			goto error0;
>  		if (fbno != NULLAGBLOCK) {
> +			if (xfs_alloc_busy_search(args->mp, args->agno, fbno, 1))
> +				xfs_trans_set_sync(args->tp);
>  			if (args->userdata) {
>  				xfs_buf_t	*bp;
>  
> @@ -1617,18 +1619,6 @@ xfs_free_ag_extent(
>  
>  	trace_xfs_free_extent(mp, agno, bno, len, isfl, haveleft, haveright);
>  
> -	/*
> -	 * Since blocks move to the free list without the coordination
> -	 * used in xfs_bmap_finish, we can't allow block to be available
> -	 * for reallocation and non-transaction writing (user data)
> -	 * until we know that the transaction that moved it to the free
> -	 * list is permanently on disk.  We track the blocks by declaring
> -	 * these blocks as "busy"; the busy list is maintained on a per-ag
> -	 * basis and each transaction records which entries should be removed
> -	 * when the iclog commits to disk.  If a busy block is allocated,
> -	 * the iclog is pushed up to the LSN that freed the block.
> -	 */
> -	xfs_alloc_busy_insert(tp, agno, bno, len);
>  	return 0;
>  
>   error0:
> @@ -1923,21 +1913,6 @@ xfs_alloc_get_freelist(
>  	xfs_alloc_log_agf(tp, agbp, logflags);
>  	*bnop = bno;
>  
> -	/*
> -	 * As blocks are freed, they are added to the per-ag busy list and
> -	 * remain there until the freeing transaction is committed to disk.
> -	 * Now that we have allocated blocks, this list must be searched to see
> -	 * if a block is being reused.  If one is, then the freeing transaction
> -	 * must be pushed to disk before this transaction.
> -	 *
> -	 * We do this by setting the current transaction to a sync transaction
> -	 * which guarantees that the freeing transaction is on disk before this
> -	 * transaction. This is done instead of a synchronous log force here so
> -	 * that we don't sit and wait with the AGF locked in the transaction
> -	 * during the log force.
> -	 */
> -	if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1))
> -		xfs_trans_set_sync(tp);
>  	return 0;
>  }
>  
> @@ -2407,6 +2382,8 @@ xfs_free_extent(
>  		be32_to_cpu(XFS_BUF_TO_AGF(args.agbp)->agf_length));
>  #endif
>  	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno, len, 0);
> +	if (error)
> +		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);

Shouldn't that be "if (!error)"?  i.e. if we freed the extent
successfully we need to insert it in the busy list. This current
code has the effect of never inserting freed data extents in the
busy list....
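
Just to illustrate the intended sense (sketch only, same call site as
quoted above):

	error = xfs_free_ag_extent(tp, args.agbp, args.agno, args.agbno,
				   len, 0);
	if (!error)	/* only track the extent as busy if it was freed */
		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);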

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/6] xfs: exact busy extent tracking
  2011-03-22 19:55 ` [PATCH 3/6] xfs: exact busy extent tracking Christoph Hellwig
@ 2011-03-22 23:47   ` Dave Chinner
  2011-03-23 12:24     ` Christoph Hellwig
  2011-03-25 21:04   ` Alex Elder
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2011-03-22 23:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 22, 2011 at 03:55:53PM -0400, Christoph Hellwig wrote:
> Update the extent tree in case we have to reuse a busy extent, so that it
> always is kept uptodate.  This is done by replacing the busy list searches
> with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
> in case of a reuse.   Also replace setting transactions to sync with forcing
> the log out in case we found a busy extent to reuse.  This makes the code a
> lot more simple, and is required for discard support later on.  While it
> will cause performance regressios with just this patch applied, the impact
> is more than mitigated by the next patch in the series.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-20 19:41:55.835479390 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:49:14.157973188 +0100
.....
> -	new->tid = xfs_log_get_trans_ident(tp);
> -
>  	INIT_LIST_HEAD(&new->list);
>  
>  	/* trace before insert to be able to see failed inserts */
>  	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
>  
>  	pag = xfs_perag_get(tp->t_mountp, new->agno);
> -restart:
>  	spin_lock(&pag->pagb_lock);
>  	rbp = &pag->pagb_tree.rb_node;
> -	parent = NULL;
> -	busyp = NULL;
> -	match = 0;
> -	while (*rbp && match >= 0) {
> +	while (*rbp) {
>  		parent = *rbp;
>  		busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
>  
>  		if (new->bno < busyp->bno) {
>  			/* may overlap, but exact start block is lower */
>  			rbp = &(*rbp)->rb_left;
> -			if (new->bno + new->length > busyp->bno)
> -				match = busyp->tid == new->tid ? 1 : -1;
> +			BUG_ON(new->bno + new->length > busyp->bno);

BUG_ON() inside a spinlock will effectively lock up the machine as
other CPUs try to take the spinlock that was held by the thread that
went bad. It causes the machine to die a horrible, undebuggable death
rather than just killing the thread that caused the oops and leaving
everything else still functional.

If it were an ASSERT(), that's probably OK because it is only debug
systems that would have this problem, but the use of BUG_ON() means
it will cause production systems to die as well.
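
For illustration, an ASSERT()-based variant of the quoted checks could
look something like this (sketch, not a tested patch; ASSERT() compiles
away on non-DEBUG builds, so only debug kernels would trip it while
pagb_lock is held):

	if (new->bno < busyp->bno) {
		/* may overlap, but exact start block is lower */
		rbp = &(*rbp)->rb_left;
		ASSERT(new->bno + new->length <= busyp->bno);
	} else if (new->bno > busyp->bno) {
		/* may overlap, but exact start block is higher */
		rbp = &(*rbp)->rb_right;
		ASSERT(bno >= busyp->bno + busyp->length);
	} else {
		/* an exact start block match should be impossible here */
		ASSERT(0);
	}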

>  		} else if (new->bno > busyp->bno) {
>  			/* may overlap, but exact start block is higher */
>  			rbp = &(*rbp)->rb_right;
> -			if (bno < busyp->bno + busyp->length)
> -				match = busyp->tid == new->tid ? 1 : -1;
> +			BUG_ON(bno < busyp->bno + busyp->length);
>  		} else {
> -			match = busyp->tid == new->tid ? 1 : -1;
> -			break;
> +			BUG();
>  		}
>  	}
> -	if (match < 0) {
> -		/* overlap marked busy in different transaction */
> -		spin_unlock(&pag->pagb_lock);
> -		xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> -		goto restart;
> -	}
> -	if (match > 0) {
> -		/*
> -		 * overlap marked busy in same transaction. Update if exact
> -		 * start block match, otherwise combine the busy extents into
> -		 * a single range.
> -		 */
> -		if (busyp->bno == new->bno) {
> -			busyp->length = max(busyp->length, new->length);
> -			spin_unlock(&pag->pagb_lock);
> -			ASSERT(tp->t_flags & XFS_TRANS_SYNC);
> -			xfs_perag_put(pag);
> -			kmem_free(new);
> -			return;
> -		}
> -		rb_erase(&busyp->rb_node, &pag->pagb_tree);
> -		new->length = max(busyp->bno + busyp->length,
> -					new->bno + new->length) -
> -				min(busyp->bno, new->bno);
> -		new->bno = min(busyp->bno, new->bno);
> -	} else
> -		busyp = NULL;
>  
>  	rb_link_node(&new->rb_node, parent, rbp);
>  	rb_insert_color(&new->rb_node, &pag->pagb_tree);
> @@ -2650,7 +2518,6 @@ restart:
>  	list_add(&new->list, &tp->t_busy);
>  	spin_unlock(&pag->pagb_lock);
>  	xfs_perag_put(pag);
> -	kmem_free(busyp);
>  }
>  
>  /*
> @@ -2704,6 +2571,173 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  
> +STATIC int
> +xfs_alloc_busy_try_reuse(
> +	struct xfs_perag	*pag,
> +	struct xfs_busy_extent	*busyp,
> +	xfs_agblock_t		fbno,
> +	xfs_agblock_t		fend)
> +{
> +	xfs_agblock_t	bbno = busyp->bno;
> +	xfs_agblock_t	bend = bbno + busyp->length;
> +
> +	if (bbno < fbno && bend > fend) {
> +		/*
> +		 * Case 1:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *        +---------+
> +		 *        fbno   fend
> +		 */
> +
> +		/*
> +		 * We would have to split the busy extent to be able
> +		 * to track it correct, which we cannot do because we
> +		 * would have to modify the list of busy extents
> +		 * attached to the transaction/CIL context, which
> +		 * is mutable.

		      immutable?

> +		 *
> +		 * Force out the log to clear the busy extents
> +		 * and retry the search.
> +		 */

BTW, these comments appear to wrap at 68 columns - any particular
reason?

> +		return -1;
> +	} else if (bbno >= fbno && bend <= fend) {
> +		/*
> +		 * Case 2:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *    +-----------------+
> +		 *    fbno           fend
> +		 *
> +		 * Case 3:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *    +--------------------------+
> +		 *    fbno                    fend
> +		 *
> +		 * Case 4:
> +		 *             bbno           bend
> +		 *             +BBBBBBBBBBBBBBBBB+
> +		 *    +--------------------------+
> +		 *    fbno                    fend
> +		 *
> +		 * Case 5:
> +		 *             bbno           bend
> +		 *             +BBBBBBBBBBBBBBBBB+
> +		 *    +-----------------------------------+
> +		 *    fbno                             fend
> +		 *
> +		 */
> +
> +		/*
> +		 * The busy extent is fully covered by the extent
> +		 * we are allocating, and can simply be removed from
> +		 * the rbtree.  However we cannot remove it from the
> +		 * immutable list tracking busy extents in the
> +		 * transaction or CIL context, so set the length
> +		 * to zero to mark it invalid.
> +		 */
> +		rb_erase(&busyp->rb_node, &pag->pagb_tree);
> +		busyp->length = 0;
> +	} else if (bbno == fbno) {
> +		/*
> +		 * Case 6:
> +		 *              bbno           bend
> +		 *             +BBBBBBBBBBBBBBBBB+
> +		 *             +---------+
> +		 *             fbno   fend
> +		 *
> +		 * Case 7:
> +		 *             bbno           bend
> +		 *             +BBBBBBBBBBBBBBBBB+
> +		 *    +------------------+
> +		 *    fbno            fend
> +		 *
> +		 */
> +
> +		busyp->bno = fend;
> +	} else if (bend == fend) {
> +		/*
> +		 * Case 8:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *        +-------------+
> +		 *        fbno       fend
> +		 *
> +		 * Case 9:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *        +----------------------+
> +		 *        fbno                fend
> +		 */
> +
> +		busyp->length = fbno - busyp->bno;
> +	} else {
> +		BUG();
> +	}
> +
> +	return 1;
> +}
> +
> +
> +/*
> + * For a given extent [fbno, flen], make sure we can reuse it safely.
> + */
> +void
> +xfs_alloc_busy_reuse(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		fbno,
> +	xfs_extlen_t		flen)
> +{
> +	struct xfs_perag	*pag;
> +	struct rb_node		*rbp;
> +
> +	ASSERT(flen > 0);
> +
> +	pag = xfs_perag_get(tp->t_mountp, agno);
> +restart:
> +	spin_lock(&pag->pagb_lock);
> +	rbp = pag->pagb_tree.rb_node;
> +	while (rbp) {
> +		struct xfs_busy_extent *busyp =
> +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> +		xfs_agblock_t	fend = fbno + flen;
> +		xfs_agblock_t	bbno = busyp->bno;
> +		xfs_agblock_t	bend = bbno + busyp->length;
> +		int		overlap;
> +
> +		if (fend <= bbno) {
> +			rbp = rbp->rb_left;
> +			continue;
> +		} else if (fbno >= bend) {
> +			rbp = rbp->rb_right;
> +			continue;
> +		}
> +
> +		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
> +						   fbno, fbno + flen);
> +		if (overlap) {
> +			spin_unlock(&pag->pagb_lock);
> +			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> +			goto restart;
> +		}

xfs_alloc_busy_try_reuse() only ever returns -1 or 1, which means
this will always trigger a log force on overlap....

> +
> +		/*
> +		 * No more busy extents to search.
> +		 */
> +		if (bbno <= fbno && bend >= fend)
> +			break;
> +
> +		if (fbno < bbno)
> +			rbp = rbp->rb_left;
> +		else
> +			rbp = rbp->rb_right;

and this code is never executed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/6] xfs: allow reusing busy extents where safe
  2011-03-22 19:55 ` [PATCH 4/6] xfs: allow reusing busy extents where safe Christoph Hellwig
@ 2011-03-23  0:20   ` Dave Chinner
  2011-03-23 12:26     ` Christoph Hellwig
  2011-03-25 21:04   ` Alex Elder
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2011-03-23  0:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 22, 2011 at 03:55:54PM -0400, Christoph Hellwig wrote:
> Allow reusing any busy extent for metadata allocations, and reusing busy
> userdata extents for userdata allocations.  Most of the complexity is
> propagating the userdata information from the XFS_BMAPI_METADATA flag
> to xfs_bunmapi into the low-level extent freeing routines.  After that
> we can just track what type of busy extent we have and treat it accordingly.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
.....

> @@ -2717,7 +2723,7 @@ restart:
>  
>  		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
>  						   fbno, fbno + flen);
> -		if (overlap) {
> +		if (overlap == -1 || (overlap && userdata)) {
>  			spin_unlock(&pag->pagb_lock);
>  			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  			goto restart;

Ok, so the only time we'll do a log force now is on a complete
overlap or a partial userdata overlap?

> @@ -2754,6 +2760,7 @@ xfs_alloc_busy_trim(
>  
>  	ASSERT(flen > 0);
>  
> +restart:
>  	spin_lock(&args->pag->pagb_lock);
>  	rbp = args->pag->pagb_tree.rb_node;
>  	while (rbp && flen >= args->minlen) {
> @@ -2771,6 +2778,31 @@ xfs_alloc_busy_trim(
>  			continue;
>  		}
>  
> +		if (!args->userdata ||
> +		    (busyp->flags & XFS_ALLOC_BUSY_USERDATA)) {
> +			int overlap;
> +
> +			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,
> +							   fbno, fbno + flen);
> +			if (unlikely(overlap == -1)) {
> +				spin_unlock(&args->pag->pagb_lock);
> +				xfs_log_force(args->mp, XFS_LOG_SYNC);
> +				goto restart;
> +			}

Hmmmm - I'm not so sure we can reuse overlapped data extents for
data allocations without a log force at all as there is no guarantee
that the data will not be overwritten before the original free
transaction is on disk.

That is, recovery may not replay the original data extent free
transaction or the new allocation transaction, but there is nothing
stopping us from having written the new data into the extent before
the crash occurred, especially as delayed allocation places the
allocation very close the data IO issue. e.g.:

	thread X		thread Y
	free data extent ABC
				allocate data extent BCD
				partial overlap, no log force
				issue data IO
				.....

		 <crash>

That leads to corruption of the data in the original file because
neither transaction is written to disk, but new data is....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/6] xfs: add online discard support
  2011-03-22 19:55 ` [PATCH 5/6] xfs: add online discard support Christoph Hellwig
@ 2011-03-23  0:30   ` Dave Chinner
  2011-03-23 12:31     ` Christoph Hellwig
  2011-03-25 21:04   ` Alex Elder
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2011-03-23  0:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 22, 2011 at 03:55:55PM -0400, Christoph Hellwig wrote:
> Now that we have reliably tracking of deleted extents in a transaction
> we can easily implement "online" discard support which calls
> blkdev_issue_discard once a transaction commits.
> 
> The actual discard is a two stage operation as we first have to mark
> the busy extent as not available for reuse before we can start the
> actual discard.  Note that we don't bother supporting discard for
> the non-delaylog mode.  While that would be possible with this patch
> performance is awfull, and the optimization in the next patch won't
> work as easily.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
.....
> @@ -361,13 +362,17 @@ xlog_cil_committed(
>  	int	abort)
>  {
>  	struct xfs_cil_ctx	*ctx = args;
> +	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
>  	struct xfs_busy_extent	*busyp, *n;
>  
>  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
>  					ctx->start_lsn, abort);
>  
> -	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
> -		xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp);
> +	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
> +		if (!abort)
> +			xfs_discard_extent(mp, busyp);
> +		xfs_alloc_busy_clear(mp, busyp);
> +	}
>  
>  	spin_lock(&ctx->cil->xc_cil_lock);
>  	list_del(&ctx->committing);
> Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-03-21 14:47:03.614474345 +0100
> +++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-03-21 14:51:41.449976366 +0100
> @@ -191,3 +191,38 @@ xfs_ioc_trim(
>  		return -XFS_ERROR(EFAULT);
>  	return 0;
>  }
> +
> +int
> +xfs_discard_extent(
> +	struct xfs_mount	*mp,
> +	struct xfs_busy_extent	*busyp)
> +{
> +	struct xfs_perag	*pag;
> +	int			error = 0;
> +	xfs_daddr_t		bno;
> +	int64_t			len;
> +	bool			done  = false;
> +
> +	if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0)
> +		return 0;

I'd move this check to the callers; otherwise we are going to be
doing lots of function calls in a relatively performance-sensitive
loop just to run a single check when discard is not enabled...
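
i.e. leave xfs_discard_extent() unconditional and gate it at the call
site instead, something along these lines (sketch only, reusing the
loop quoted above):

	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
		if (!abort && (mp->m_flags & XFS_MOUNT_DISCARD))
			xfs_discard_extent(mp, busyp);
		xfs_alloc_busy_clear(mp, busyp);
	}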

> +
> +	bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
> +	len = XFS_FSB_TO_BB(mp, busyp->length);
> +
> +	pag = xfs_perag_get(mp, busyp->agno);
> +	spin_lock(&pag->pagb_lock);
> +	if (!busyp->length)
> +		done = true;
> +	busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
> +	spin_unlock(&pag->pagb_lock);
> +	xfs_perag_put(pag);
> +
> +	if (done)
> +		return 0;
> +
> +	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
> +				      GFP_NOFS, 0);
> +	if (error && error != EOPNOTSUPP)
> +		xfs_info(mp, "discard failed, error %d", error);

This would be more informative if it also printed the bno and len of
the discard that failed. A couple of tracepoints here (e.g.
discard_extent_issued, discard_extent_failed) could also be useful
for tracking discard operations.
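
e.g. something like this for the message (sketch only; the exact format
string is illustrative, and the tracepoints mentioned above would still
need to be defined):

	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
				      GFP_NOFS, 0);
	if (error && error != EOPNOTSUPP) {
		/* include AG and extent so failures can be correlated */
		xfs_info(mp,
			 "discard failed: agno %u, extent [%u,%u], error %d",
			 busyp->agno, busyp->bno, busyp->length, error);
	}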

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/6] xfs: make discard operations asynchronous
  2011-03-22 19:55 ` [PATCH 6/6] xfs: make discard operations asynchronous Christoph Hellwig
@ 2011-03-23  0:43   ` Dave Chinner
  2011-03-28 12:44     ` Christoph Hellwig
  2011-03-25 21:04   ` Alex Elder
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2011-03-23  0:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 22, 2011 at 03:55:56PM -0400, Christoph Hellwig wrote:
> Instead of waiting for each discard request keep the CIL context alive
> until all of them are done, at which point we can tear it down completly
> and remove the busy extents from the rbtree.
> 
> At this point I'm doing the I/O completion from IRQ context for simplicity,
> but I'll benchmark it against a version that uses a workqueue.

A workqueue is probably a good idea, because then the processing has
some level of concurrency built into it. It also means we don't need
to convert the locking to irq-safe variants and all the overhead
that this introduces.
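
A rough sketch of what that could look like, using the synchronous
xfs_discard_extent() from the previous patch (the xfs_discard_wq
workqueue and the discard_work member are invented here purely for
illustration, and error handling is elided):

	STATIC void
	xlog_discard_busy_extents(
		struct work_struct	*work)
	{
		struct xfs_cil_ctx	*ctx =
			container_of(work, struct xfs_cil_ctx, discard_work);
		struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
		struct xfs_busy_extent	*busyp, *n;

		/*
		 * Runs in process context, so blkdev_issue_discard() can
		 * block and pagb_lock never needs to be irq-safe.
		 */
		list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
			xfs_discard_extent(mp, busyp);
			xfs_alloc_busy_clear(mp, busyp);
		}
		kmem_free(ctx);
	}

	/* and in xlog_cil_committed(), instead of discarding inline: */
	INIT_WORK(&ctx->discard_work, xlog_discard_busy_extents);
	queue_work(xfs_discard_wq, &ctx->discard_work);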

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-03-22 15:58:10.301855813 +0100
> +++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-03-22 18:39:09.000000000 +0100
> @@ -30,6 +30,7 @@
>  #include "xfs_inode.h"
>  #include "xfs_alloc.h"
>  #include "xfs_error.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_discard.h"
>  #include "xfs_trace.h"
>  
> @@ -192,37 +193,119 @@ xfs_ioc_trim(
>  	return 0;
>  }
>  
> +void
> +xfs_cil_discard_done(
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	if (atomic_dec_and_test(&ctx->discards)) {
> +		struct xfs_busy_extent	*busyp, *n;
> +
> +		list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
> +			xfs_alloc_busy_clear(ctx->cil->xc_log->l_mp, busyp);
> +		kmem_free(ctx);
> +	}
> +}
> +
> +STATIC void
> +xfs_discard_end_io(
> +	struct bio		*bio,
> +	int			err)
> +{
> +	struct xfs_cil_ctx	*ctx = bio->bi_private;
> +
> +	if (err && err != -EOPNOTSUPP) {
> +		xfs_info(ctx->cil->xc_log->l_mp,
> +			 "I/O error during discard\n");
> +	}

Same comment about the bno/len in the error message as the previous
patch.

> +
> +	bio_put(bio);
> +	xfs_cil_discard_done(ctx);
> +}
> +
> +static int
> +xfs_issue_discard(
> +	struct block_device	*bdev,
> +	sector_t		sector,
> +	sector_t		nr_sects,
> +	gfp_t			gfp_mask,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	struct request_queue	*q = bdev_get_queue(bdev);
> +	unsigned int		max_discard_sectors;
> +	struct bio		*bio;
> +	int			ret = 0;
> +
> +	if (!q)
> +		return -ENXIO;
> +
> +	if (!blk_queue_discard(q))
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Ensure that max_discard_sectors is of the proper
> +	 * granularity
> +	 */
> +	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> +	if (q->limits.discard_granularity) {
> +		unsigned int disc_sects = q->limits.discard_granularity >> 9;
> +
> +		max_discard_sectors &= ~(disc_sects - 1);
> +	}

This is asking for a helper function....
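
e.g. (sketch; the helper name is invented):

	static unsigned int
	xfs_discard_max_sectors(
		struct request_queue	*q)
	{
		unsigned int		max_discard_sectors;

		/* cap to what fits into bio->bi_size... */
		max_discard_sectors = min(q->limits.max_discard_sectors,
					  UINT_MAX >> 9);

		/* ...and round down to the device's discard granularity */
		if (q->limits.discard_granularity) {
			unsigned int	disc_sects =
				q->limits.discard_granularity >> 9;

			max_discard_sectors &= ~(disc_sects - 1);
		}

		return max_discard_sectors;
	}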

> +
> +
> +	while (nr_sects && !ret) {

no need to check ret here.

> +		bio = bio_alloc(gfp_mask, 1);
> +		if (!bio) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +
> +		bio->bi_sector = sector;
> +		bio->bi_end_io = xfs_discard_end_io;
> +		bio->bi_bdev = bdev;
> +		bio->bi_private = ctx;
> +
> +		if (nr_sects > max_discard_sectors) {
> +			bio->bi_size = max_discard_sectors << 9;
> +			nr_sects -= max_discard_sectors;
> +			sector += max_discard_sectors;
> +		} else {
> +			bio->bi_size = nr_sects << 9;
> +			nr_sects = 0;
> +		}
> +
> +		atomic_inc(&ctx->discards);
> +		submit_bio(REQ_WRITE | REQ_DISCARD, bio);
> +	}
> +
> +	return ret;
> +}
> +
>  int
>  xfs_discard_extent(
>  	struct xfs_mount	*mp,
> -	struct xfs_busy_extent	*busyp)
> +	struct xfs_busy_extent	*busyp,
> +	struct xfs_cil_ctx	*ctx)
>  {
>  	struct xfs_perag	*pag;
> -	int			error = 0;
>  	xfs_daddr_t		bno;
>  	int64_t			len;
>  	bool			done  = false;
>  
> -	if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0)
> -		return 0;
> -
>  	bno = XFS_AGB_TO_DADDR(mp, busyp->agno, busyp->bno);
>  	len = XFS_FSB_TO_BB(mp, busyp->length);
>  
>  	pag = xfs_perag_get(mp, busyp->agno);
> -	spin_lock(&pag->pagb_lock);
> +	spin_lock_irq(&pag->pagb_lock);
>  	if (!busyp->length)
>  		done = true;
>  	busyp->flags = XFS_ALLOC_BUSY_DISCARDED;
> -	spin_unlock(&pag->pagb_lock);
> +	spin_unlock_irq(&pag->pagb_lock);

Disabling/enabling interrupts on these locks could hurt quite a bit.
They are taken quite frequently, and irq operations add quite a
bit of overhead....

>  	xfs_perag_put(pag);
>  
>  	if (done)
>  		return 0;
>  
> -	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
> -				      GFP_NOFS, 0);
> -	if (error && error != EOPNOTSUPP)
> -		xfs_info(mp, "discard failed, error %d", error);
> -	return error;
> +	return -xfs_issue_discard(mp->m_ddev_targp->bt_bdev,
> +				  bno, len, GFP_NOFS, ctx);
>  }
> Index: xfs/fs/xfs/linux-2.6/xfs_discard.h
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_discard.h	2011-03-22 15:58:10.313857879 +0100
> +++ xfs/fs/xfs/linux-2.6/xfs_discard.h	2011-03-22 18:39:09.000000000 +0100
> @@ -3,10 +3,13 @@
>  
>  struct fstrim_range;
>  struct xfs_busy_extent;
> +struct xfs_cil_ctx;
>  
>  extern int	xfs_ioc_trim(struct xfs_mount *, struct fstrim_range __user *);
>  
>  extern int	xfs_discard_extent(struct xfs_mount *,
> -				   struct xfs_busy_extent *);
> +				   struct xfs_busy_extent *,
> +				   struct xfs_cil_ctx *);
> +extern void	xfs_cil_discard_done(struct xfs_cil_ctx	*ctx);
>  
>  #endif /* XFS_DISCARD_H */
> Index: xfs/fs/xfs/xfs_log_cil.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_log_cil.c	2011-03-22 15:58:10.329855977 +0100
> +++ xfs/fs/xfs/xfs_log_cil.c	2011-03-22 18:39:09.000000000 +0100
> @@ -68,6 +68,7 @@ xlog_cil_init(
>  	INIT_LIST_HEAD(&ctx->busy_extents);
>  	ctx->sequence = 1;
>  	ctx->cil = cil;
> +	atomic_set(&ctx->discards, 1);
>  	cil->xc_ctx = ctx;
>  	cil->xc_current_sequence = ctx->sequence;
>  
> @@ -364,14 +365,18 @@ xlog_cil_committed(
>  	struct xfs_cil_ctx	*ctx = args;
>  	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
>  	struct xfs_busy_extent	*busyp, *n;
> +	bool			keep_alive = false;
>  
>  	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
>  					ctx->start_lsn, abort);
>  
> -	list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list) {
> -		if (!abort)
> -			xfs_discard_extent(mp, busyp);
> -		xfs_alloc_busy_clear(mp, busyp);
> +	if (!(mp->m_flags & XFS_MOUNT_DISCARD) || abort) {
> +		list_for_each_entry_safe(busyp, n, &ctx->busy_extents, list)
> +			xfs_alloc_busy_clear(mp, busyp);
> +	} else if (!list_empty(&ctx->busy_extents)) {
> +		list_for_each_entry(busyp, &ctx->busy_extents, list)
> +			xfs_discard_extent(mp, busyp, ctx);
> +		keep_alive = true;
>  	}

Oh, I see you've moved the XFS_MOUNT_DISCARD check into the loop here...

>  
>  	spin_lock(&ctx->cil->xc_cil_lock);
> @@ -379,7 +384,10 @@ xlog_cil_committed(
>  	spin_unlock(&ctx->cil->xc_cil_lock);
>  
>  	xlog_cil_free_logvec(ctx->lv_chain);
> -	kmem_free(ctx);
> +	if (keep_alive)
> +		xfs_cil_discard_done(ctx);
> +	else
> +		kmem_free(ctx);

You could probably just call xfs_cil_discard_done(ctx) here as the
busy extent list will be empty in the !keep_alive case and so it
will simply do the kmem_free(ctx) call there.
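
i.e. (sketch):

	xlog_cil_free_logvec(ctx->lv_chain);

	/*
	 * ctx->discards was initialised to 1, so if no discards were
	 * issued this drops it straight to zero, walks the (now empty)
	 * busy extent list and frees the context - same effect as the
	 * bare kmem_free(ctx).
	 */
	xfs_cil_discard_done(ctx);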

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-22 22:30   ` Alex Elder
@ 2011-03-23 12:16     ` Christoph Hellwig
  2011-03-25 21:03       ` Alex Elder
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:16 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Tue, Mar 22, 2011 at 05:30:51PM -0500, Alex Elder wrote:
> On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> > Avoid forcing out busy extent when moving blocks from/to the AGFL.  We
> 
> Without thinking a bit about it, it wasn't clear why doing
> this was OK to do.  It would be nice to record a one-sentence
> justification of why we don't need to force busy extents out
> in this case.

I've added a sentence, but it seems rather pointless.

>  achieve this

fixed.

> > the callers that need it, and by moving the busy list insert from
> > xfs_free_ag_extent extent which is used both by AGFL refills and real
>                      ^^^^^ drop this word

fixed.


* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-22 23:30   ` Dave Chinner
@ 2011-03-23 12:16     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Mar 23, 2011 at 10:30:54AM +1100, Dave Chinner wrote:
> > +	if (error)
> > +		xfs_alloc_busy_insert(tp, args.agno, args.agbno, len);
> 
> Shouldn't that be "if (!error)"?  i.e. if we freed the extent
> successfully we need to insert it in the busy list. This current
> code has the effect of never inserting freed data extents in the
> busy list....

Yes, it should.  This got messed up at some point during reshuffling
the series.


* Re: [PATCH 2/6] xfs: do not immediately reuse busy extent ranges
  2011-03-22 22:30   ` Alex Elder
@ 2011-03-23 12:17     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:17 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Tue, Mar 22, 2011 at 05:30:59PM -0500, Alex Elder wrote:
> Again, this looks nearly identical to your last version.
> On that one, I suggested rewording a comment, and you
> said "ok."  You did not do so in this version, which I
> guess is fine.  I just want to know whether you intended
> to.  If so I'll give you a chance to post an update; if
> not I can take this one as-is.

Sorry, I fixed half of what you pointed out and postponed the
comment rewrite.  I've taken your version of the comments
for the next spin.  Still not sure what to do about the magic 4,
though.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] xfs: exact busy extent tracking
  2011-03-22 23:47   ` Dave Chinner
@ 2011-03-23 12:24     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Mar 23, 2011 at 10:47:28AM +1100, Dave Chinner wrote:
> BUG_ON() inside a spinlock will effectively lock up the machine as
> other CPUs try to take the spinlock that was held by the thread that
> went bad. It causes the machine to die a horrible, undebuggable death
> rather than just kill the thread that caused the oops and leaving
> everything else still functional.
> 
> If it were an ASSERT(), that's probably OK because it is only debug
> systems that would have this problem, but the use of BUG(_ON) means
> it will cause production systems to die as well.

I've changed it to asserts.
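
Roughly, the checks in xfs_alloc_busy_insert() then read like this (sketch
only; the conditions are inverted because ASSERT() states the invariant
that must hold, while BUG_ON() named the failure case):

	if (new->bno < busyp->bno) {
		rbp = &(*rbp)->rb_left;
		ASSERT(new->bno + new->length <= busyp->bno);
	} else if (new->bno > busyp->bno) {
		rbp = &(*rbp)->rb_right;
		ASSERT(bno >= busyp->bno + busyp->length);
	} else {
		ASSERT(0);
	}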

> > +		 * We would have to split the busy extent to be able
> > +		 * to track it correct, which we cannot do because we
> > +		 * would have to modify the list of busy extents
> > +		 * attached to the transaction/CIL context, which
> > +		 * is mutable.
> 
> 		      immutable?

Yes.

> > +		 *
> > +		 * Force out the log to clear the busy extents
> > +		 * and retry the search.
> > +		 */
> 
> BTW, these comments appear to wrap at 68 columns - any particular
> reason?

They used to sit one more tab to the right and I didn't fix them up
when splitting the function.  I've rewrapped them now.

> > +		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
> > +						   fbno, fbno + flen);
> > +		if (overlap) {
> > +			spin_unlock(&pag->pagb_lock);
> > +			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> > +			goto restart;
> > +		}
> 
> xfs_alloc_busy_try_reuse() only ever returns -1 or 1, which means
> this will always trigger a log force on overlap....

At least for now, yes.  In the next patch we will treat -1 vs 1
differently.  Maybe it should be changed to 0 vs 1, but even that
isn't quite intuitive.  I'll think about it a bit more.
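
One way to sidestep the numeric values entirely would be to name the two
outcomes - purely hypothetical, these identifiers are not part of the
series:

	enum {
		XFS_BUSY_REUSE_TRIMMED,		/* busy record was trimmed, range usable */
		XFS_BUSY_REUSE_NEED_FORCE,	/* would need a split, force the log */
	};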

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/6] xfs: allow reusing busy extents where safe
  2011-03-23  0:20   ` Dave Chinner
@ 2011-03-23 12:26     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Mar 23, 2011 at 11:20:24AM +1100, Dave Chinner wrote:
> >  		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
> >  						   fbno, fbno + flen);
> > -		if (overlap) {
> > +		if (overlap == -1 || (overlap && userdata)) {
> >  			spin_unlock(&pag->pagb_lock);
> >  			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> >  			goto restart;
> 
> Ok, so the only time we'll do a log force now is on a complete
> overlap or a partial userdata overlap?

We do it if we would have to split the busy extent, or reallocate
metadata to userdata.

> > +		if (!args->userdata ||
> > +		    (busyp->flags & XFS_ALLOC_BUSY_USERDATA)) {
> > +			int overlap;
> > +
> > +			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,
> > +							   fbno, fbno + flen);
> > +			if (unlikely(overlap == -1)) {
> > +				spin_unlock(&args->pag->pagb_lock);
> > +				xfs_log_force(args->mp, XFS_LOG_SYNC);
> > +				goto restart;
> > +			}
> 
> Hmmmm - I'm not so sure we can reuse overlapped data extents for
> data allocations without a log force at all as there is no guarantee
> that the data will not be overwritten before the original free
> transaction is on disk.
> 
> That is, recovery may not replay the original data extent free
> transaction or the new allocation transaction, but there is nothing
> stopping us from having written the new data into the extent before
> the crash occurred, especially as delayed allocation places the
> allocation very close to the data IO issue. e.g.:
> 
> 	thread X		thread Y
> 	free data extent ABC
> 				allocate data extent BCD
> 				partial overlap, no log force
> 				issue data IO
> 				.....
> 
> 		 <crash>
> 
> That leads to corruption of the data in the original file because
> neither transaction is written to disk, but new data is....

You're right.  It's fine from the transaction point of view, but
it could leave wrong data in a file.  I'll disallow this case, too.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] xfs: add online discard support
  2011-03-23  0:30   ` Dave Chinner
@ 2011-03-23 12:31     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-23 12:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

> > +	if ((mp->m_flags & XFS_MOUNT_DISCARD) == 0)
> > +		return 0;
> 
> I'd move this check to the callers, otherwise we are going to be
> doing lots of function calls in a relatively performance sensitive
> loop just to run a single check when discard is not enabled...

Ok.  As you noticed it's done in the next patch, and they will probably
be merged into one before the final submission.  But for now I'll
move it to make the patch more obvious.
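
In the patch 6 form of the call site the hoisted check would then look
roughly like this (sketch):

	if (mp->m_flags & XFS_MOUNT_DISCARD) {
		list_for_each_entry(busyp, &ctx->busy_extents, list)
			xfs_discard_extent(mp, busyp, ctx);
	}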

> > +	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
> > +				      GFP_NOFS, 0);
> > +	if (error && error != EOPNOTSUPP)
> > +		xfs_info(mp, "discard failed, error %d", error);
> 
> This would be more informative if it also printed the bno and len of
> the discard that failed. A couple of tracepoints here (e.g.
> discard_extent_issued, discard_extent_failed) could also be useful
> for tracking discard operations.

Yeah, I was planning to add a lot more tracepoints later on, both
for the busy extent handling and discards.
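
For the message itself something like this would do (only a sketch; the
exact types of bno and len depend on the caller, hence the casts):

	error = -blkdev_issue_discard(mp->m_ddev_targp->bt_bdev, bno, len,
				      GFP_NOFS, 0);
	if (error && error != EOPNOTSUPP)
		xfs_info(mp, "discard failed for extent [0x%llx,%u], error %d",
			 (unsigned long long)bno, (unsigned int)len, error);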

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-23 12:16     ` Christoph Hellwig
@ 2011-03-25 21:03       ` Alex Elder
  2011-03-28 12:07         ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, 2011-03-23 at 08:16 -0400, Christoph Hellwig wrote: 
> On Tue, Mar 22, 2011 at 05:30:51PM -0500, Alex Elder wrote:
> > On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> > > Avoid forcing out busy extent when moving blocks from/to the AGFL.  We
> > 
> > Without thinking a bit about it, it wasn't clear why doing
> > this was OK to do.  It would be nice to record a one-sentence
> > justification of why we don't need to force busy extents out
> > in this case.
> 
> I've added a sentence, but it seems rather pointless.

The reason it was not obvious is because the case you're
treating is specifically when moving extents (blocks,
really) between the free list and the free btrees, in
xfs_alloc_fix_freelist().  You still need to force it
out when allocating and freeing "actually used" blocks,
which could also be considered "moving blocks from/to
the AGFL."

					-Alex


> >  achieve this
> 
> fixed.
> 
> > > the callers that need it, and by moving the busy list insert from
> > > xfs_free_ag_extent extent which is used both by AGFL refills and real
> >                      ^^^^^ drop this word
> 
> fixed.
> 




_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] xfs: do not immediately reuse busy extent ranges
  2011-03-22 19:55 ` [PATCH 2/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
  2011-03-22 22:30   ` Alex Elder
@ 2011-03-25 21:03   ` Alex Elder
  2011-03-28 12:07     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Every time we reallocate a busy extent, we cause a synchronous log force
> to occur to ensure the freeing transaction is on disk before we continue
> and use the newly allocated extent.  This is extremely sub-optimal as we
> have to mark every transaction with blocks that get reused as synchronous.
> 
> Instead of searching the busy extent list after deciding on the extent to
> allocate, check each candidate extent during the allocation decisions as
> to whether they are in the busy list.  If they are in the busy list, we
> trim the busy range out of the extent we have found and determine if that
> trimmed range is still OK for allocation. In many cases, this check can
> be incorporated into the allocation extent alignment code which already
> does trimming of the found extent before determining if it is a valid
> candidate for allocation.
> 
> Based on two earlier patches from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I already reviewed this, but I noticed a few things
that I think are worth clarifying in one comment.
There are a few typo's in that same block that you
might as well fix while you're at it.

. . .

> @@ -2634,6 +2704,181 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  
> +/*
> + * For a given extent [fbno, flen], search the busy extent list
> + * to find a subset of the extent that is not busy.
> + */
> +STATIC void
> +xfs_alloc_busy_trim(
> +	struct xfs_alloc_arg	*args,
> +	xfs_agblock_t		fbno,
> +	xfs_extlen_t		flen,
> +	xfs_agblock_t		*rbno,
> +	xfs_extlen_t		*rlen)
> +{

. . .

> +		} else {
> +			/* middle overlap */
> +
> +			/*
> +			 * Case 9:
> +			 *             bbno           bend
> +			 *             +BBBBBBBBBBBBBBBBB+
> +			 *    +-----------------------------------+
> +			 *    fbno                             fend
> +			 *
> +			 * Can be trimmed to:
> +			 *    +-------+        OR         +-------+
> +			 *    fbno fend                   fbno fend
> +			 *
> +			 * We prefer the lower bno extent because the next
> +			 * allocation for this inode will use "end" as the
> +			 * target for first block.  If the busy segment has

...will use the updated value of fend as the target...

> +			 * cleared, this will get a contiguous allocation next
> +			 * time around; if thebusy segment has not cleared,
                                             the busy

> +			 * it will get an allocation at bend, which is a forward
> +			 * allocation.
> +			 *
> +			 * If we choose segment at bend, and this remains the
> +			 * best extent for the next allocation (e.g. NEAR_BNO
> +			 * allocation) we'll next allocate at bno, which will
...we'll next allocate at (pre-update) fbno, which will...

(Actually, correct this if my statements are wrong.  The point is
to use "fbno" and "fend" where you currently just have "bno" and
"end".)

> +			 * give us backwards allocation.  We already know that
> +			 * backwards allocation direction causes significant
> +			 * fragmentation of directories and degradataion of
> +			 * directory performance.
> +			 *
> +			 * Always chose the option that produces forward
                                   choose 
> +			 * allocation patterns so that sequential reads and
> +			 * writes only ever seek in one direction.  Only choose
> +			 * the higher bno extent if the remainin unused extent
                                                          remaining 
> +			 * length is much larger than the current allocation
> +			 * request, promising us a contiguous allocation in
> +			 * the following free space.
> +			 */
> +
> +			if (bbno - fbno >= args->maxlen) {

. . . 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] xfs: exact busy extent tracking
  2011-03-22 19:55 ` [PATCH 3/6] xfs: exact busy extent tracking Christoph Hellwig
  2011-03-22 23:47   ` Dave Chinner
@ 2011-03-25 21:04   ` Alex Elder
  2011-03-28 12:10     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Update the extent tree in case we have to reuse a busy extent, so that it
> always is kept uptodate.  This is done by replacing the busy list searches
> with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
> in case of a reuse.   Also replace setting transactions to sync with forcing
> the log out in case we found a busy extent to reuse.  This makes the code a
> lot more simple, and is required for discard support later on.  While it
> will cause performance regressios with just this patch applied, the impact
> is more than mitigated by the next patch in the series.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

In this one, xfs_alloc_busy_reuse() is broken, or at least it has some
dead code in it now.  (But now that I've looked at the later patches
I see that's a temporary artifact.  I've left my original comments
in place anyway, in case any insights remain despite that.)

More below.

> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-20 19:41:55.835479390 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:49:14.157973188 +0100

. . .

> @@ -2459,100 +2459,6 @@ error0:
>  	return error;
>  }
>  
> -
> -/*
> - * AG Busy list management
> - * The busy list contains block ranges that have been freed but whose
> - * transactions have not yet hit disk.  If any block listed in a busy
> - * list is reused, the transaction that freed it must be forced to disk
> - * before continuing to use the block.
> - *
> - * xfs_alloc_busy_insert - add to the per-ag busy list
> - * xfs_alloc_busy_clear - remove an item from the per-ag busy list
> - * xfs_alloc_busy_search - search for a busy extent
> - */
> -
> -/*
> - * Insert a new extent into the busy tree.

OK, to be honest I haven't re-read this entire block of
comment text to identify what might be of value.  But
is there really *nothing* worth saving?  Is the busy
extent tree documented adequately elsewhere?

> - * The busy extent tree is indexed by the start block of the busy extent.
> - * there can be multiple overlapping ranges in the busy extent tree but only
> - * ever one entry at a given start block. The reason for this is that
> - * multi-block extents can be freed, then smaller chunks of that extent
> - * allocated and freed again before the first transaction commit is on disk.
> - * If the exact same start block is freed a second time, we have to wait for
> - * that busy extent to pass out of the tree before the new extent is inserted.
> - * There are two main cases we have to handle here.

. . .

> @@ -2583,66 +2487,30 @@ xfs_alloc_busy_insert(
>  	new->agno = agno;
>  	new->bno = bno;
>  	new->length = len;
> -	new->tid = xfs_log_get_trans_ident(tp);
> -
>  	INIT_LIST_HEAD(&new->list);
>  
>  	/* trace before insert to be able to see failed inserts */
>  	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
>  
>  	pag = xfs_perag_get(tp->t_mountp, new->agno);
> -restart:
>  	spin_lock(&pag->pagb_lock);
>  	rbp = &pag->pagb_tree.rb_node;
> -	parent = NULL;
> -	busyp = NULL;
> -	match = 0;
> -	while (*rbp && match >= 0) {
> +	while (*rbp) {
>  		parent = *rbp;
>  		busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
>  
>  		if (new->bno < busyp->bno) {
>  			/* may overlap, but exact start block is lower */

This comment isn't really right any more (BUG_ON that condition).

>  			rbp = &(*rbp)->rb_left;
> -			if (new->bno + new->length > busyp->bno)
> -				match = busyp->tid == new->tid ? 1 : -1;
> +			BUG_ON(new->bno + new->length > busyp->bno);
>  		} else if (new->bno > busyp->bno) {
>  			/* may overlap, but exact start block is higher */

This one too.

>  			rbp = &(*rbp)->rb_right;
> -			if (bno < busyp->bno + busyp->length)
> -				match = busyp->tid == new->tid ? 1 : -1;
> +			BUG_ON(bno < busyp->bno + busyp->length);
>  		} else {
> -			match = busyp->tid == new->tid ? 1 : -1;
> -			break;
> +			BUG();
>  		}

. . .

> @@ -2704,6 +2571,173 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  

/*
 * The found free extent [fbno, fend] overlaps part or all
 * of the given busy extent.  If the overlap covers the
 * beginning, the end, or all of the busy extent, the
 * overlapping portion can be made unbusy and used for
 * the allocation.  We can't split a busy extent because
 * we can't modify a transaction/CIL context busy list,
 * but we can update an entry's block number or length.
 *
 * The caller will force the log and re-check the busy
 * list after returning from this function.
 */

> +STATIC int
> +xfs_alloc_busy_try_reuse(
> +	struct xfs_perag	*pag,
> +	struct xfs_busy_extent	*busyp,
> +	xfs_agblock_t		fbno,
> +	xfs_agblock_t		fend)
> +{
> +	xfs_agblock_t	bbno = busyp->bno;
> +	xfs_agblock_t	bend = bbno + busyp->length;
> +
> +	if (bbno < fbno && bend > fend) {
> +		/*
> +		 * Case 1:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *        +---------+
> +		 *        fbno   fend
> +		 */
> +
> +		/*
> +		 * We would have to split the busy extent to be able
> +		 * to track it correct, which we cannot do because we
> +		 * would have to modify the list of busy extents
> +		 * attached to the transaction/CIL context, which
> +		 * is mutable.
> +		 *
> +		 * Force out the log to clear the busy extents
> +		 * and retry the search.

The caller forces the log.  Rely on a comment in this
function's header to say that (not here).

> +		 */
> +		return -1;
> +	} else if (bbno >= fbno && bend <= fend) {
> +		/*
> +		 * Case 2:
> +		 *    bbno           bend
> +		 *    +BBBBBBBBBBBBBBBBB+
> +		 *    +-----------------+
> +		 *    fbno           fend

. . .

> +
> +	return 1;
> +}
> +
> +
> +/*
> + * For a given extent [fbno, flen], make sure we can reuse it safely.
> + */
> +void
> +xfs_alloc_busy_reuse(
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		fbno,
> +	xfs_extlen_t		flen)
> +{
> +	struct xfs_perag	*pag;
> +	struct rb_node		*rbp;
> +
> +	ASSERT(flen > 0);
> +
> +	pag = xfs_perag_get(tp->t_mountp, agno);
> +restart:
> +	spin_lock(&pag->pagb_lock);
> +	rbp = pag->pagb_tree.rb_node;
> +	while (rbp) {
> +		struct xfs_busy_extent *busyp =
> +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> +		xfs_agblock_t	fend = fbno + flen;
> +		xfs_agblock_t	bbno = busyp->bno;
> +		xfs_agblock_t	bend = bbno + busyp->length;
> +		int		overlap;
> +
> +		if (fend <= bbno) {
> +			rbp = rbp->rb_left;
> +			continue;
> +		} else if (fbno >= bend) {
> +			rbp = rbp->rb_right;
> +			continue;
> +		}
> +

I was going to suggest:
		/* Extent overlaps busy extent */

here, but I think that may not be the case any more if
busyp->length is zero.  Or rather, the extent may
surround the zero-length busy extent (which I suppose
could be considered overlap).

If busyp->length is zero, I think the call to
xfs_alloc_busy_try_reuse() is not needed; in fact,
if it is already 0, that function will call
rb_erase() on the entry again.

> +		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
> +						   fbno, fbno + flen);

Note that xfs_alloc_busy_try_reuse() only returns 1 or -1 now...

> +		if (overlap) {

...therefore this branch is always taken, and the code
below this block to the end of the loop is not reached.

Since this is the only place it's used, xfs_alloc_busy_try_reuse()
might as well be defined as a void function.

(Ahh, but now that I've looked at the later patches
I see it gets used again later.  I'm leaving my
comments here nevertheless.)

> +			spin_unlock(&pag->pagb_lock);
> +			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
> +			goto restart;
> +		}

+------ code starting here is never reached
|
> +		/*
> +		 * No more busy extents to search.
> +		 */
> +		if (bbno <= fbno && bend >= fend)
> +			break;
> +
> +		if (fbno < bbno)
> +			rbp = rbp->rb_left;
> +		else
> +			rbp = rbp->rb_right;
|
+------ end of code not reached

> +	}
> +	spin_unlock(&pag->pagb_lock);
> +	xfs_perag_put(pag);
> +}
> +
>  /*
>   * For a given extent [fbno, flen], search the busy extent list
>   * to find a subset of the extent that is not busy.

 * If part or all of the extent is busy, force the log, then
 * verify it is no longer busy before returning.

> @@ -2889,14 +2923,12 @@ xfs_alloc_busy_clear(
>  	trace_xfs_alloc_unbusy(mp, busyp->agno, busyp->bno,
>  						busyp->length);
>  
> -	ASSERT(xfs_alloc_busy_search(mp, busyp->agno, busyp->bno,
> -						busyp->length) == 1);
> -
>  	list_del_init(&busyp->list);
>  
>  	pag = xfs_perag_get(mp, busyp->agno);
>  	spin_lock(&pag->pagb_lock);
> -	rb_erase(&busyp->rb_node, &pag->pagb_tree);
> +	if (busyp->length)
> +		rb_erase(&busyp->rb_node, &pag->pagb_tree);
>  	spin_unlock(&pag->pagb_lock);
>  	xfs_perag_put(pag);
>  

. . .



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/6] xfs: allow reusing busy extents where safe
  2011-03-22 19:55 ` [PATCH 4/6] xfs: allow reusing busy extents where safe Christoph Hellwig
  2011-03-23  0:20   ` Dave Chinner
@ 2011-03-25 21:04   ` Alex Elder
  1 sibling, 0 replies; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Allow reusing any busy extent for metadata allocations, and reusing busy
> userdata extents for userdata allocations.  Most of the complexity is
> propagating the userdata information from the XFS_BMAPI_METADATA flag
> to xfs_bunmapi into the low-level extent freeing routines.  After that
> we can just track what type of busy extent we have and treat it accordingly.

Why is it OK to reuse user data extents for user data allocations?
I accept it is, I just haven't worked through in my mind why.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-03-21 14:49:14.000000000 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-03-21 14:51:31.746155282 +0100
> @@ -1396,7 +1396,8 @@ xfs_alloc_ag_vextent_small(
>  		if (error)
>  			goto error0;
>  		if (fbno != NULLAGBLOCK) {
> -			xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1);
> +			xfs_alloc_busy_reuse(args->tp, args->agno, fbno, 1,
> +					     args->userdata);
>  
>  			if (args->userdata) {
>  				xfs_buf_t	*bp;
> @@ -2431,7 +2432,8 @@ int				/* error */
>  xfs_free_extent(
>  	xfs_trans_t	*tp,	/* transaction pointer */
>  	xfs_fsblock_t	bno,	/* starting block number of extent */
> -	xfs_extlen_t	len)	/* length of extent */
> +	xfs_extlen_t	len,
> +	bool		userdata)/* length of extent */

	xfs_extlen_t	len,	/* length of extent */
	bool		userdata)

>  {
>  	xfs_alloc_arg_t	args;
>  	int		error;

. . .


> @@ -2717,7 +2723,7 @@ restart:		(in xfs_alloc_busy_reuse())
>  
>  		overlap = xfs_alloc_busy_try_reuse(pag, busyp,
>  						   fbno, fbno + flen);
> -		if (overlap) {
> +		if (overlap == -1 || (overlap && userdata)) {

xfs_alloc_busy_try_reuse() (still) never returns zero,
so this could just be:
    if (overlap == -1 || userdata) {

I understand why we can skip forcing the log if
we're not doing a userdata allocation.  But why
don't you also check busyp->flags here when it's
a userdata allocation, to see if it represents a
busy userdata section and therefore would allow
avoiding the log force (like is done below in
xfs_alloc_busy_trim())?  You would have to grab
the flag value in busyp before the call.
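
A sketch of that variant, using only flags and helpers already in the
patch (the log force is skipped when a userdata allocation reuses a
userdata busy extent):

	bool	busy_userdata = busyp->flags & XFS_ALLOC_BUSY_USERDATA;

	overlap = xfs_alloc_busy_try_reuse(pag, busyp, fbno, fbno + flen);
	if (overlap == -1 || (userdata && !busy_userdata)) {
		spin_unlock(&pag->pagb_lock);
		xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
		goto restart;
	}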

>  			spin_unlock(&pag->pagb_lock);
>  			xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
>  			goto restart;
> @@ -2754,6 +2760,7 @@ xfs_alloc_busy_trim(
>  
>  	ASSERT(flen > 0);
>  
> +restart:
>  	spin_lock(&args->pag->pagb_lock);
>  	rbp = args->pag->pagb_tree.rb_node;
>  	while (rbp && flen >= args->minlen) {
> @@ -2771,6 +2778,31 @@ xfs_alloc_busy_trim(
>  			continue;
>  		}
>  
> +		if (!args->userdata ||
> +		    (busyp->flags & XFS_ALLOC_BUSY_USERDATA)) {
> +			int overlap;
> +
> +			overlap = xfs_alloc_busy_try_reuse(args->pag, busyp,
> +							   fbno, fbno + flen);
> +			if (unlikely(overlap == -1)) {
> +				spin_unlock(&args->pag->pagb_lock);
> +				xfs_log_force(args->mp, XFS_LOG_SYNC);
> +				goto restart;
> +			}
> +
> +			/*
> +			 * No more busy extents to search.
> +			 */
> +			if (bbno <= fbno && bend >= fend)
> +				goto out;
> +
> +			if (fbno < bbno)
> +				rbp = rbp->rb_left;
> +			else
> +				rbp = rbp->rb_right;
> +			continue;
> +		}
> +
>  		if (bbno <= fbno) {
>  			/* start overlap */
>  

. . .

> Index: xfs/fs/xfs/xfs_ag.h
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_ag.h	2011-03-21 14:48:04.000000000 +0100
> +++ xfs/fs/xfs/xfs_ag.h	2011-03-21 14:49:21.941981228 +0100

. . .

> @@ -3750,6 +3744,7 @@ xfs_bmap_add_free(
>  	new = kmem_zone_alloc(xfs_bmap_free_item_zone, KM_SLEEP);
>  	new->xbfi_startblock = bno;
>  	new->xbfi_blockcount = (xfs_extlen_t)len;
> +	new->xbfi_flags = XFS_BFI_USERDATA;

Couldn't you arrange for the xfs_bmbt_free_block() path
to *not* set this?  (As it stands, it will always be set.)
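
A minimal way to do that would be to let callers say what kind of range
is being freed, e.g. (hypothetical - xfs_bmap_add_free() does not take a
userdata argument in this series):

	new->xbfi_flags = userdata ? XFS_BFI_USERDATA : 0;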

>  	for (prev = NULL, cur = flist->xbf_first;
>  	     cur != NULL;
>  	     prev = cur, cur = cur->xbfi_next) {

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] xfs: add online discard support
  2011-03-22 19:55 ` [PATCH 5/6] xfs: add online discard support Christoph Hellwig
  2011-03-23  0:30   ` Dave Chinner
@ 2011-03-25 21:04   ` Alex Elder
  2011-03-28 12:35     ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Now that we have reliably tracking of deleted extents in a transaction
> we can easily implement "online" discard support which calls
> blkdev_issue_discard once a transaction commits.
> 
> The actual discard is a two stage operation as we first have to mark
> the busy extent as not available for reuse before we can start the
> actual discard.  Note that we don't bother supporting discard for
> the non-delaylog mode.  While that would be possible with this patch
> performance is awful, and the optimization in the next patch won't
> work as easily.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

. . .

This one looks good overall.  How about adding
documentation to Documentation/filesystems/xfs.txt
to describe the new mount options?  (Dave, same
goes for you with delaylog.  I haven't looked for
any other things missing from there.)

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] xfs: make discard operations asynchronous
  2011-03-22 19:55 ` [PATCH 6/6] xfs: make discard operations asynchronous Christoph Hellwig
  2011-03-23  0:43   ` Dave Chinner
@ 2011-03-25 21:04   ` Alex Elder
  1 sibling, 0 replies; 31+ messages in thread
From: Alex Elder @ 2011-03-25 21:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, 2011-03-22 at 15:55 -0400, Christoph Hellwig wrote:
> Instead of waiting for each discard request keep the CIL context alive
> until all of them are done, at which point we can tear it down completely
> and remove the busy extents from the rbtree.
> 
> At this point I'm doing the I/O completion from IRQ context for simplicity,
> but I'll benchmark it against a version that uses a workqueue.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] xfs: optimize AGFL refills
  2011-03-25 21:03       ` Alex Elder
@ 2011-03-28 12:07         ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-28 12:07 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

> The reason it was not obvious is because the case you're
> treating is specifically when moving extents (blocks,
> really) between the free list and the free btrees, in
> xfs_alloc_fix_freelist().  You still need to force it
> out when allocating and freeing "actually used" blocks,
> which could also be considered "moving blocks from/to
> the AGFL."

I've updated the comment to make that more clear.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] xfs: do not immediately reuse busy extent ranges
  2011-03-25 21:03   ` Alex Elder
@ 2011-03-28 12:07     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-28 12:07 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Fri, Mar 25, 2011 at 04:03:19PM -0500, Alex Elder wrote:
> > +			 * Can be trimmed to:
> > +			 *    +-------+        OR         +-------+
> > +			 *    fbno fend                   fbno fend
> > +			 *
> > +			 * We prefer the lower bno extent because the next
> > +			 * allocation for this inode will use "end" as the
> > +			 * target for first block.  If the busy segment has
> 
> ...will use the updated value of fend as the target...

I've replaced all these comments based on the text you replied to the
last round (finally..)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] xfs: exact busy extent tracking
  2011-03-25 21:04   ` Alex Elder
@ 2011-03-28 12:10     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-28 12:10 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Fri, Mar 25, 2011 at 04:04:20PM -0500, Alex Elder wrote:
> > - * Insert a new extent into the busy tree.
> 
> OK, to be honest I haven't re-read this entire block of
> comment text to identify what might be of value.  But
> is there really *nothing* worth saving?  Is the busy
> extent tree documented adequately elsewhere?

The old text doesn't really apply to the new code anymore.  I'll see
if there's a place I can insert some more documentation, but it already
has quite a lot of comments.

> >  		if (new->bno < busyp->bno) {
> >  			/* may overlap, but exact start block is lower */
>
> This comment isn't really right any more (BUG_ON that condition).

> >  		} else if (new->bno > busyp->bno) {
> >  			/* may overlap, but exact start block is higher */
> 
> This one too.

Removed.

> > +			continue;
> > +		} else if (fbno >= bend) {
> > +			rbp = rbp->rb_right;
> > +			continue;
> > +		}
> > +
> 
> I was going to suggest:
> 		/* Extent overlaps busy extent */
> 
> here, but I think that may not be the case any more if
> busyp->length is zero.  Or rather, the extent may
> surround the zero-length busy extent (which I suppose
> could be considered overlap).
> 
> If busyp->length is zero, I think the call to
> xfs_alloc_busy_try_reuse() is not needed; in fact,
> if it is already 0, that function will call
> rb_erase() on the entry again.

We will never see zero-length busy extents here.  While they have
to remain on the per-transaction/cil_context list they are properly
removed from the rbtree.

> ...therefore this branch is always taken, and the code
> below this block to the end of the loop is not reached.
> 
> Since this is the only place it's used, xfs_alloc_busy_try_reuse()
> might as well be defined as a void function.
> 
> (Ahh, but now that I've looked at the later patches
> I see it gets used again later.  I'm leaving my
> comments here nevertheless.)

Yes, it's changing in the next patch.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] xfs: add online discard support
  2011-03-25 21:04   ` Alex Elder
@ 2011-03-28 12:35     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-28 12:35 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Fri, Mar 25, 2011 at 04:04:52PM -0500, Alex Elder wrote:
> This one looks good overall.  How about adding
> documentation to Documentation/filesystems/xfs.txt
> to describe the new mount options?

I've added it.
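
For the record, the entry could read something like this (only a sketch,
assuming the options end up being called discard/nodiscard):

  discard/nodiscard
	Enable/disable the issuing of commands to let the block
	device reclaim space freed by the filesystem.  This is
	useful for SSD devices, thinly provisioned LUNs and virtual
	machine images, but may have a performance impact.  The
	default is nodiscard.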

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] xfs: make discard operations asynchronous
  2011-03-23  0:43   ` Dave Chinner
@ 2011-03-28 12:44     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2011-03-28 12:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Mar 23, 2011 at 11:43:23AM +1100, Dave Chinner wrote:
> > +	struct xfs_cil_ctx	*ctx = bio->bi_private;
> > +
> > +	if (err && err != -EOPNOTSUPP) {
> > +		xfs_info(ctx->cil->xc_log->l_mp,
> > +			 "I/O error during discard\n");
> > +	}
> 
> Same comment about the bno/len in the error message as the previous
> patch.

We don't have the len information available anymore at this point,
but I've added the startblock.

> > +	max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
> > +	if (q->limits.discard_granularity) {
> > +		unsigned int disc_sects = q->limits.discard_granularity >> 9;
> > +
> > +		max_discard_sectors &= ~(disc_sects - 1);
> > +	}
> 
> This is asking for a helper function....

Yes, but that helper needs to be shared with the block layer discard
code, so I'll send it through Jens' tree later.
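
For reference, the shared helper could be as simple as this (a sketch;
the name is made up and the real version would live in the block layer):

	static unsigned int blk_limit_discard_sectors(struct request_queue *q)
	{
		unsigned int max;

		max = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
		if (q->limits.discard_granularity) {
			unsigned int disc_sects =
					q->limits.discard_granularity >> 9;

			max &= ~(disc_sects - 1);
		}
		return max;
	}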

> > +	if (keep_alive)
> > +		xfs_cil_discard_done(ctx);
> > +	else
> > +		kmem_free(ctx);
> 
> You could probably just call xfs_cil_discard_done(ctx) here as the
> busy extent list will be empty in the !keep_alive case and so it
> will simply do the kmem_free(ctx) call there.

This means we clear the busy list a little later in the non-discard
case, but it cleans up the code nicely.  I've added it to the series
as a separate patch.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2011-03-28 12:40 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-22 19:55 [PATCH 0/6] [PATCH 0/6] more efficient busy extent handling and discard support Christoph Hellwig
2011-03-22 19:55 ` [PATCH 1/6] xfs: optimize AGFL refills Christoph Hellwig
2011-03-22 22:30   ` Alex Elder
2011-03-23 12:16     ` Christoph Hellwig
2011-03-25 21:03       ` Alex Elder
2011-03-28 12:07         ` Christoph Hellwig
2011-03-22 23:30   ` Dave Chinner
2011-03-23 12:16     ` Christoph Hellwig
2011-03-22 19:55 ` [PATCH 2/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
2011-03-22 22:30   ` Alex Elder
2011-03-23 12:17     ` Christoph Hellwig
2011-03-25 21:03   ` Alex Elder
2011-03-28 12:07     ` Christoph Hellwig
2011-03-22 19:55 ` [PATCH 3/6] xfs: exact busy extent tracking Christoph Hellwig
2011-03-22 23:47   ` Dave Chinner
2011-03-23 12:24     ` Christoph Hellwig
2011-03-25 21:04   ` Alex Elder
2011-03-28 12:10     ` Christoph Hellwig
2011-03-22 19:55 ` [PATCH 4/6] xfs: allow reusing busy extents where safe Christoph Hellwig
2011-03-23  0:20   ` Dave Chinner
2011-03-23 12:26     ` Christoph Hellwig
2011-03-25 21:04   ` Alex Elder
2011-03-22 19:55 ` [PATCH 5/6] xfs: add online discard support Christoph Hellwig
2011-03-23  0:30   ` Dave Chinner
2011-03-23 12:31     ` Christoph Hellwig
2011-03-25 21:04   ` Alex Elder
2011-03-28 12:35     ` Christoph Hellwig
2011-03-22 19:55 ` [PATCH 6/6] xfs: make discard operations asynchronous Christoph Hellwig
2011-03-23  0:43   ` Dave Chinner
2011-03-28 12:44     ` Christoph Hellwig
2011-03-25 21:04   ` Alex Elder
