* [PATCH 0/6] do not reuse busy extents
@ 2011-01-21  9:22 Christoph Hellwig
  2011-01-21  9:22 ` [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention Christoph Hellwig
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

This series makes sure we never reallocate a busy extent.  This is absolutely
needed for a working online discard implementation, and should also speed
up large concurrent workloads that would otherwise trip over busy extents a lot.

The two most significant patches are from a series Dave posted more than
two years ago, although the main patch to avoid busy ranges in the normal
allocator needed a lot of bug fixes to survive QA.  In addition to that we
now also completely avoid reuse for AGFL based allocations.  While this
might not be important from a performance perspective, it is critical for
the discard implementation I'm currently working on.


* [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention
  2011-01-21  9:22 [PATCH 0/6] do not reuse busy extents Christoph Hellwig
@ 2011-01-21  9:22 ` Christoph Hellwig
  2011-01-25  4:23   ` Dave Chinner
  2011-01-27 23:21   ` Alex Elder
  2011-01-21  9:22 ` [PATCH 3/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-xfs_alloc_compute_aligned --]
[-- Type: text/plain, Size: 4121 bytes --]

Pass an xfs_alloc_arg structure to xfs_alloc_compute_aligned and derive
the alignment and minlen parameters from it.  This cleans up the existing
callers, and we'll need even more information from the xfs_alloc_arg
in subsequent patches.  Based on a patch from Dave Chinner.
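
As a standalone illustration of the new convention, here is a compilable
userspace sketch with simplified typedefs; the real xfs_alloc_arg_t
carries many more fields than the two shown:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t xfs_agblock_t;
typedef uint32_t xfs_extlen_t;

typedef struct xfs_alloc_arg {
	xfs_extlen_t	alignment;	/* alignment for allocation */
	xfs_extlen_t	minlen;		/* minimum length for allocation */
	/* ... the real structure carries many more fields ... */
} xfs_alloc_arg_t;

#define roundup(x, y)	((((x) + (y) - 1) / (y)) * (y))

/* New convention: alignment and minlen come from the args structure. */
static void
xfs_alloc_compute_aligned(
	xfs_alloc_arg_t	*args,
	xfs_agblock_t	foundbno,	/* starting block in found extent */
	xfs_extlen_t	foundlen,	/* length in found extent */
	xfs_agblock_t	*resbno,	/* result block number */
	xfs_extlen_t	*reslen)	/* result length */
{
	xfs_agblock_t	bno;
	xfs_extlen_t	len;

	if (args->alignment > 1 && foundlen >= args->minlen) {
		xfs_extlen_t	diff;

		bno = roundup(foundbno, args->alignment);
		diff = bno - foundbno;
		len = diff >= foundlen ? 0 : foundlen - diff;
	} else {
		bno = foundbno;
		len = foundlen;
	}
	*resbno = bno;
	*reslen = len;
}

int main(void)
{
	xfs_alloc_arg_t args = { .alignment = 8, .minlen = 4 };
	xfs_agblock_t	bno;
	xfs_extlen_t	len;

	/* The free extent [10, 30) rounds up to the aligned [16, 30). */
	xfs_alloc_compute_aligned(&args, 10, 20, &bno, &len);
	printf("bno=%u len=%u\n", bno, len);
	return 0;
}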

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-03 13:06:52.386254734 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-01-03 13:07:19.545002883 +0100
@@ -147,10 +147,9 @@ xfs_alloc_get_rec(
  */
 STATIC void
 xfs_alloc_compute_aligned(
+	xfs_alloc_arg_t	*args,		/* allocation argument structure */
 	xfs_agblock_t	foundbno,	/* starting block in found extent */
 	xfs_extlen_t	foundlen,	/* length in found extent */
-	xfs_extlen_t	alignment,	/* alignment for allocation */
-	xfs_extlen_t	minlen,		/* minimum length for allocation */
 	xfs_agblock_t	*resbno,	/* result block number */
 	xfs_extlen_t	*reslen)	/* result length */
 {
@@ -158,8 +157,8 @@ xfs_alloc_compute_aligned(
 	xfs_extlen_t	diff;
 	xfs_extlen_t	len;
 
-	if (alignment > 1 && foundlen >= minlen) {
-		bno = roundup(foundbno, alignment);
+	if (args->alignment > 1 && foundlen >= args->minlen) {
+		bno = roundup(foundbno, args->alignment);
 		diff = bno - foundbno;
 		len = diff >= foundlen ? 0 : foundlen - diff;
 	} else {
@@ -693,8 +692,7 @@ xfs_alloc_find_best_extent(
 		if (error)
 			goto error0;
 		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-		xfs_alloc_compute_aligned(*sbno, *slen, args->alignment,
-					  args->minlen, &bno, slena);
+		xfs_alloc_compute_aligned(args, *sbno, *slen, &bno, slena);
 
 		/*
 		 * The good extent is closer than this one.
@@ -866,8 +864,8 @@ xfs_alloc_ag_vextent_near(
 			if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
 				goto error0;
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-			xfs_alloc_compute_aligned(ltbno, ltlen, args->alignment,
-					args->minlen, &ltbnoa, &ltlena);
+			xfs_alloc_compute_aligned(args, ltbno, ltlen,
+						  &ltbnoa, &ltlena);
 			if (ltlena < args->minlen)
 				continue;
 			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
@@ -987,8 +985,8 @@ xfs_alloc_ag_vextent_near(
 			if ((error = xfs_alloc_get_rec(bno_cur_lt, &ltbno, &ltlen, &i)))
 				goto error0;
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-			xfs_alloc_compute_aligned(ltbno, ltlen, args->alignment,
-					args->minlen, &ltbnoa, &ltlena);
+			xfs_alloc_compute_aligned(args, ltbno, ltlen,
+						  &ltbnoa, &ltlena);
 			if (ltlena >= args->minlen)
 				break;
 			if ((error = xfs_btree_decrement(bno_cur_lt, 0, &i)))
@@ -1003,8 +1001,8 @@ xfs_alloc_ag_vextent_near(
 			if ((error = xfs_alloc_get_rec(bno_cur_gt, &gtbno, &gtlen, &i)))
 				goto error0;
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-			xfs_alloc_compute_aligned(gtbno, gtlen, args->alignment,
-					args->minlen, &gtbnoa, &gtlena);
+			xfs_alloc_compute_aligned(args, gtbno, gtlen,
+						  &gtbnoa, &gtlena);
 			if (gtlena >= args->minlen)
 				break;
 			if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
@@ -1183,8 +1181,7 @@ xfs_alloc_ag_vextent_size(
 	 * once aligned; if not, we search left for something better.
 	 * This can't happen in the second case above.
 	 */
-	xfs_alloc_compute_aligned(fbno, flen, args->alignment, args->minlen,
-		&rbno, &rlen);
+	xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen);
 	rlen = XFS_EXTLEN_MIN(args->maxlen, rlen);
 	XFS_WANT_CORRUPTED_GOTO(rlen == 0 ||
 			(rlen <= flen && rbno + rlen <= fbno + flen), error0);
@@ -1209,8 +1206,8 @@ xfs_alloc_ag_vextent_size(
 			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
 			if (flen < bestrlen)
 				break;
-			xfs_alloc_compute_aligned(fbno, flen, args->alignment,
-				args->minlen, &rbno, &rlen);
+			xfs_alloc_compute_aligned(args, fbno, flen,
+						  &rbno, &rlen);
 			rlen = XFS_EXTLEN_MIN(args->maxlen, rlen);
 			XFS_WANT_CORRUPTED_GOTO(rlen == 0 ||
 				(rlen <= flen && rbno + rlen <= fbno + flen),


* [PATCH 3/6] xfs: do not immediately reuse busy extent ranges
  2011-01-21  9:22 [PATCH 0/6] do not reuse busy extents Christoph Hellwig
  2011-01-21  9:22 ` [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention Christoph Hellwig
@ 2011-01-21  9:22 ` Christoph Hellwig
  2011-01-27 23:20   ` Alex Elder
  2011-01-28  1:58   ` Dave Chinner
  2011-01-21  9:22 ` [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist Christoph Hellwig
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-skip-busy-extents --]
[-- Type: text/plain, Size: 21164 bytes --]

Every time we reallocate a busy extent, we cause a synchronous log force
to occur to ensure the freeing transaction is on disk before we continue
and use the newly allocated extent.  This is extremely sub-optimal as we
have to mark every transaction with blocks that get reused as synchronous.

Instead of searching the busy extent list after deciding on the extent to
allocate, check each candidate extent during the allocation decision to
see whether it is in the busy list.  If it is, we trim the busy range out
of the extent we have found and determine whether the trimmed range is
still usable for allocation.  In many cases, this check can be
incorporated into the allocation extent alignment code, which already
trims the found extent before determining if it is a valid candidate for
allocation.
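
To make the trimming idea concrete, here is a simplified, compilable
userspace model of trimming one busy extent out of a found range
(hypothetical helper name; the real code walks the per-AG rbtree of busy
extents under pagb_lock, and the posted version is discussed further
down-thread):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t xfs_agblock_t;
typedef uint32_t xfs_extlen_t;

struct busy_extent {
	xfs_agblock_t	bno;
	xfs_extlen_t	length;
};

/*
 * Shrink the found extent [*bno, *bno + *len) so it no longer overlaps
 * the busy extent, keeping the larger piece when the busy extent sits
 * in the middle of the found range.
 */
static void
trim_busy(const struct busy_extent *busyp,
	  xfs_agblock_t *bno, xfs_extlen_t *len)
{
	xfs_agblock_t	fend = *bno + *len;
	xfs_agblock_t	bend = busyp->bno + busyp->length;

	if (fend <= busyp->bno || *bno >= bend)
		return;			/* no overlap, nothing to trim */

	if (busyp->bno <= *bno && bend >= fend) {
		*len = 0;		/* fully busy, nothing left */
	} else if (busyp->bno <= *bno) {
		*len = fend - bend;	/* start overlap: keep the tail */
		*bno = bend;
	} else if (bend >= fend) {
		*len = busyp->bno - *bno; /* end overlap: keep the head */
	} else if (busyp->bno - *bno >= fend - bend) {
		*len = busyp->bno - *bno; /* middle: head segment is larger */
	} else {
		*bno = bend;		/* middle: tail segment is larger */
		*len = fend - bend;
	}
}

int main(void)
{
	struct busy_extent busy = { .bno = 20, .length = 10 };
	xfs_agblock_t	bno = 0;
	xfs_extlen_t	len = 60;

	/* [0, 60) minus busy [20, 30) keeps the larger piece [30, 60). */
	trim_busy(&busy, &bno, &len);
	printf("bno=%u len=%u\n", bno, len);
	return 0;
}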

[hch: merged two earlier patches from Dave and fixed various bugs]

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-17 22:05:27.146004341 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-01-18 13:04:30.239023407 +0100
@@ -41,15 +41,13 @@
 #define	XFSA_FIXUP_BNO_OK	1
 #define	XFSA_FIXUP_CNT_OK	2
 
-/*
- * Prototypes for per-ag allocation routines
- */
-
 STATIC int xfs_alloc_ag_vextent_exact(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
-	xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
+		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
+STATIC int xfs_alloc_busy_search(struct xfs_mount *, xfs_agnumber_t,
+		xfs_agblock_t, xfs_extlen_t);
 
 /*
  * Internal functions.
@@ -154,19 +152,23 @@ xfs_alloc_compute_aligned(
 	xfs_extlen_t	*reslen)	/* result length */
 {
 	xfs_agblock_t	bno;
-	xfs_extlen_t	diff;
 	xfs_extlen_t	len;
 
-	if (args->alignment > 1 && foundlen >= args->minlen) {
-		bno = roundup(foundbno, args->alignment);
-		diff = bno - foundbno;
-		len = diff >= foundlen ? 0 : foundlen - diff;
+	/* Trim busy sections out of found extent */
+	xfs_alloc_busy_search_trim(args->mp, args->pag, foundbno, foundlen,
+				   &bno, &len);
+
+	if (args->alignment > 1 && len >= args->minlen) {
+		xfs_agblock_t	aligned_bno = roundup(bno, args->alignment);
+		xfs_extlen_t	diff = aligned_bno - bno;
+
+		*resbno = aligned_bno;
+		*reslen = diff >= len ? 0 : len - diff;
+
 	} else {
-		bno = foundbno;
-		len = foundlen;
+		*resbno = bno;
+		*reslen = len;
 	}
-	*resbno = bno;
-	*reslen = len;
 }
 
 /*
@@ -535,19 +537,12 @@ xfs_alloc_ag_vextent(
 	ASSERT(args->agbno % args->alignment == 0);
 
 	if (!args->wasfromfl) {
-		xfs_alloc_update_counters(args->tp, args->pag, args->agbp,
-					  -((long)(args->len)));
+		error = xfs_alloc_update_counters(args->tp, args->pag,
+						  args->agbp,
+						  -((long)(args->len)));
 
-		/*
-		 * Search the busylist for these blocks and mark the
-		 * transaction as synchronous if blocks are found. This
-		 * avoids the need to block due to a synchronous log
-		 * force to ensure correct ordering as the synchronous
-		 * transaction will guarantee that for us.
-		 */
-		if (xfs_alloc_busy_search(args->mp, args->agno,
-					args->agbno, args->len))
-			xfs_trans_set_sync(args->tp);
+		ASSERT(!xfs_alloc_busy_search(args->mp, args->agno,
+					      args->agbno, args->len));
 	}
 
 	if (!args->isfl) {
@@ -559,7 +554,7 @@ xfs_alloc_ag_vextent(
 
 	XFS_STATS_INC(xs_allocx);
 	XFS_STATS_ADD(xs_allocb, args->len);
-	return 0;
+	return error;
 }
 
 /*
@@ -574,14 +569,14 @@ xfs_alloc_ag_vextent_exact(
 {
 	xfs_btree_cur_t	*bno_cur;/* by block-number btree cursor */
 	xfs_btree_cur_t	*cnt_cur;/* by count btree cursor */
-	xfs_agblock_t	end;	/* end of allocated extent */
 	int		error;
 	xfs_agblock_t	fbno;	/* start block of found extent */
-	xfs_agblock_t	fend;	/* end block of found extent */
 	xfs_extlen_t	flen;	/* length of found extent */
+	xfs_agblock_t	tbno;	/* start block of trimmed extent */
+	xfs_extlen_t	tlen;	/* length of trimmed extent */
+	xfs_agblock_t	tend;	/* end block of trimmed extent */
+	xfs_agblock_t	end;	/* end of allocated extent */
 	int		i;	/* success/failure of operation */
-	xfs_agblock_t	maxend;	/* end of maximal extent */
-	xfs_agblock_t	minend;	/* end of minimal extent */
 	xfs_extlen_t	rlen;	/* length of returned extent */
 
 	ASSERT(args->alignment == 1);
@@ -611,14 +606,21 @@ xfs_alloc_ag_vextent_exact(
 		goto error0;
 	XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
 	ASSERT(fbno <= args->agbno);
-	minend = args->agbno + args->minlen;
-	maxend = args->agbno + args->maxlen;
-	fend = fbno + flen;
 
 	/*
-	 * Give up if the freespace isn't long enough for the minimum request.
+	 * Check for overlapping busy extents.
+	 */
+	xfs_alloc_busy_search_trim(args->mp, args->pag, fbno, flen,
+				   &tbno, &tlen);
+
+	/*
+	 * Give up if the start of the extent is busy, or the freespace isn't
+	 * long enough for the minimum request.
 	 */
-	if (fend < minend)
+	if (tbno > args->agbno)
+		goto not_found;
+	tend = tbno + tlen;
+	if (tend < args->agbno + args->minlen)
 		goto not_found;
 
 	/*
@@ -627,14 +629,14 @@ xfs_alloc_ag_vextent_exact(
 	 *
 	 * Fix the length according to mod and prod if given.
 	 */
-	end = XFS_AGBLOCK_MIN(fend, maxend);
+	end = XFS_AGBLOCK_MIN(tend, args->agbno + args->maxlen);
 	args->len = end - args->agbno;
 	xfs_alloc_fix_len(args);
 	if (!xfs_alloc_fix_minleft(args))
 		goto not_found;
 
 	rlen = args->len;
-	ASSERT(args->agbno + rlen <= fend);
+	ASSERT(args->agbno + rlen <= tend);
 	end = args->agbno + rlen;
 
 	/*
@@ -683,11 +685,11 @@ xfs_alloc_find_best_extent(
 	struct xfs_btree_cur	**scur,	/* searching cursor */
 	xfs_agblock_t		gdiff,	/* difference for search comparison */
 	xfs_agblock_t		*sbno,	/* extent found by search */
-	xfs_extlen_t		*slen,
-	xfs_extlen_t		*slena,	/* aligned length */
+	xfs_extlen_t		*slen,	/* extent length */
+	xfs_agblock_t		*sbnoa,	/* aligned extent found by search */
+	xfs_extlen_t		*slena,	/* aligned extent length */
 	int			dir)	/* 0 = search right, 1 = search left */
 {
-	xfs_agblock_t		bno;
 	xfs_agblock_t		new;
 	xfs_agblock_t		sdiff;
 	int			error;
@@ -705,16 +707,16 @@ xfs_alloc_find_best_extent(
 		if (error)
 			goto error0;
 		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-		xfs_alloc_compute_aligned(args, *sbno, *slen, &bno, slena);
+		xfs_alloc_compute_aligned(args, *sbno, *slen, sbnoa, slena);
 
 		/*
 		 * The good extent is closer than this one.
 		 */
 		if (!dir) {
-			if (bno >= args->agbno + gdiff)
+			if (*sbnoa >= args->agbno + gdiff)
 				goto out_use_good;
 		} else {
-			if (bno <= args->agbno - gdiff)
+			if (*sbnoa <= args->agbno - gdiff)
 				goto out_use_good;
 		}
 
@@ -726,8 +728,8 @@ xfs_alloc_find_best_extent(
 			xfs_alloc_fix_len(args);
 
 			sdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-						       args->alignment, *sbno,
-						       *slen, &new);
+						       args->alignment, *sbnoa,
+						       *slena, &new);
 
 			/*
 			 * Choose closer size and invalidate other cursor.
@@ -777,7 +779,7 @@ xfs_alloc_ag_vextent_near(
 	xfs_agblock_t	gtbnoa;		/* aligned ... */
 	xfs_extlen_t	gtdiff;		/* difference to right side entry */
 	xfs_extlen_t	gtlen;		/* length of right side entry */
-	xfs_extlen_t	gtlena = 0;	/* aligned ... */
+	xfs_extlen_t	gtlena;		/* aligned ... */
 	xfs_agblock_t	gtnew;		/* useful start bno of right side */
 	int		error;		/* error code */
 	int		i;		/* result code, temporary */
@@ -786,9 +788,10 @@ xfs_alloc_ag_vextent_near(
 	xfs_agblock_t	ltbnoa;		/* aligned ... */
 	xfs_extlen_t	ltdiff;		/* difference to left side entry */
 	xfs_extlen_t	ltlen;		/* length of left side entry */
-	xfs_extlen_t	ltlena = 0;	/* aligned ... */
+	xfs_extlen_t	ltlena;		/* aligned ... */
 	xfs_agblock_t	ltnew;		/* useful start bno of left side */
 	xfs_extlen_t	rlen;		/* length of returned extent */
+	int		forced = 0;
 #if defined(DEBUG) && defined(__KERNEL__)
 	/*
 	 * Randomly don't execute the first algorithm.
@@ -797,13 +800,20 @@ xfs_alloc_ag_vextent_near(
 
 	dofirst = random32() & 1;
 #endif
+
+restart:
+	bno_cur_lt = NULL;
+	bno_cur_gt = NULL;
+	ltlen = 0;
+	gtlena = 0;
+	ltlena = 0;
+
 	/*
 	 * Get a cursor for the by-size btree.
 	 */
 	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
 		args->agno, XFS_BTNUM_CNT);
-	ltlen = 0;
-	bno_cur_lt = bno_cur_gt = NULL;
+
 	/*
 	 * See if there are any free extents as big as maxlen.
 	 */
@@ -819,11 +829,13 @@ xfs_alloc_ag_vextent_near(
 			goto error0;
 		if (i == 0 || ltlen == 0) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			trace_xfs_alloc_near_noentry(args);
 			return 0;
 		}
 		ASSERT(i == 1);
 	}
 	args->wasfromfl = 0;
+
 	/*
 	 * First algorithm.
 	 * If the requested extent is large wrt the freespaces available
@@ -887,7 +899,7 @@ xfs_alloc_ag_vextent_near(
 			if (args->len < blen)
 				continue;
 			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, ltbno, ltlen, &ltnew);
+				args->alignment, ltbnoa, ltlena, &ltnew);
 			if (ltnew != NULLAGBLOCK &&
 			    (args->len > blen || ltdiff < bdiff)) {
 				bdiff = ltdiff;
@@ -1039,11 +1051,12 @@ xfs_alloc_ag_vextent_near(
 			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
 			xfs_alloc_fix_len(args);
 			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, ltbno, ltlen, &ltnew);
+				args->alignment, ltbnoa, ltlena, &ltnew);
 
 			error = xfs_alloc_find_best_extent(args,
 						&bno_cur_lt, &bno_cur_gt,
-						ltdiff, &gtbno, &gtlen, &gtlena,
+						ltdiff, &gtbno, &gtlen,
+						&gtbnoa, &gtlena,
 						0 /* search right */);
 		} else {
 			ASSERT(gtlena >= args->minlen);
@@ -1054,11 +1067,12 @@ xfs_alloc_ag_vextent_near(
 			args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen);
 			xfs_alloc_fix_len(args);
 			gtdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, gtbno, gtlen, &gtnew);
+				args->alignment, gtbnoa, gtlena, &gtnew);
 
 			error = xfs_alloc_find_best_extent(args,
 						&bno_cur_gt, &bno_cur_lt,
-						gtdiff, &ltbno, &ltlen, &ltlena,
+						gtdiff, &ltbno, &ltlen,
+						&ltbnoa, &ltlena,
 						1 /* search left */);
 		}
 
@@ -1070,6 +1084,12 @@ xfs_alloc_ag_vextent_near(
 	 * If we couldn't get anything, give up.
 	 */
 	if (bno_cur_lt == NULL && bno_cur_gt == NULL) {
+		if (!forced++) {
+			trace_xfs_alloc_near_busy(args);
+			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			goto restart;
+		}
+
 		trace_xfs_alloc_size_neither(args);
 		args->agbno = NULLAGBLOCK;
 		return 0;
@@ -1104,12 +1124,13 @@ xfs_alloc_ag_vextent_near(
 		return 0;
 	}
 	rlen = args->len;
-	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment, ltbno,
-		ltlen, &ltnew);
+	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment,
+				     ltbnoa, ltlena, &ltnew);
 	ASSERT(ltnew >= ltbno);
-	ASSERT(ltnew + rlen <= ltbno + ltlen);
+	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
 	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
 	args->agbno = ltnew;
+
 	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
 			ltnew, rlen, XFSA_FIXUP_BNO_OK)))
 		goto error0;
@@ -1152,26 +1173,35 @@ xfs_alloc_ag_vextent_size(
 	int		i;		/* temp status variable */
 	xfs_agblock_t	rbno;		/* returned block number */
 	xfs_extlen_t	rlen;		/* length of returned extent */
+	int		forced = 0;
 
+restart:
 	/*
 	 * Allocate and initialize a cursor for the by-size btree.
 	 */
 	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
 		args->agno, XFS_BTNUM_CNT);
 	bno_cur = NULL;
+
 	/*
 	 * Look for an entry >= maxlen+alignment-1 blocks.
 	 */
 	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0,
 			args->maxlen + args->alignment - 1, &i)))
 		goto error0;
+
 	/*
-	 * If none, then pick up the last entry in the tree unless the
-	 * tree is empty.
-	 */
-	if (!i) {
-		if ((error = xfs_alloc_ag_vextent_small(args, cnt_cur, &fbno,
-				&flen, &i)))
+	 * If none or we have busy extents that we cannot allocate from, then
+	 * we have to settle for a smaller extent. In the case that there are
+	 * no large extents, this will return the last entry in the tree unless
+	 * the tree is empty. In the case that there are only busy large
+	 * extents, this will return the largest small extent unless there
+	 * are no smaller extents available.
+	 */
+	if (!i || forced > 1) {
+		error = xfs_alloc_ag_vextent_small(args, cnt_cur,
+						   &fbno, &flen, &i);
+		if (error)
 			goto error0;
 		if (i == 0 || flen == 0) {
 			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
@@ -1179,22 +1209,56 @@ xfs_alloc_ag_vextent_size(
 			return 0;
 		}
 		ASSERT(i == 1);
-	}
-	/*
-	 * There's a freespace as big as maxlen+alignment-1, get it.
-	 */
-	else {
-		if ((error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i)))
-			goto error0;
-		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
-	}
+		xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen);
+	} else {
+		/*
+		 * Search for a non-busy extent that is large enough.
+		 * If we are at low space, don't check, or if we fall of
+		 * the end of the btree, turn off the busy check and
+		 * restart.
+		 */
+		for (;;) {
+			error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
+			if (error)
+				goto error0;
+			XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+			xfs_alloc_compute_aligned(args, fbno, flen,
+						  &rbno, &rlen);
+
+			if (rlen >= args->maxlen)
+				break;
+
+			error = xfs_btree_increment(cnt_cur, 0, &i);
+			if (error)
+				goto error0;
+			if (i == 0) {
+				/*
+				 * Our only valid extents must have been busy.
+				 * Make it unbusy by forcing the log out and
+				 * retrying. If we've been here before, forcing
+				 * the log isn't making the extents available,
+				 * which means they have probably been freed in
+				 * this transaction.  In that case, we have to
+				 * give up on them and we'll attempt a minlen
+				 * allocation the next time around.
+				 */
+				xfs_btree_del_cursor(cnt_cur,
+						     XFS_BTREE_NOERROR);
+				trace_xfs_alloc_size_busy(args);
+				if (!forced++)
+					xfs_log_force(args->mp, XFS_LOG_SYNC);
+				goto restart;
+			}
+		}
+ 	}
+
 	/*
 	 * In the first case above, we got the last entry in the
 	 * by-size btree.  Now we check to see if the space hits maxlen
 	 * once aligned; if not, we search left for something better.
 	 * This can't happen in the second case above.
 	 */
-	xfs_alloc_compute_aligned(args, fbno, flen, &rbno, &rlen);
 	rlen = XFS_EXTLEN_MIN(args->maxlen, rlen);
 	XFS_WANT_CORRUPTED_GOTO(rlen == 0 ||
 			(rlen <= flen && rbno + rlen <= fbno + flen), error0);
@@ -1248,13 +1312,19 @@ xfs_alloc_ag_vextent_size(
 	 * Fix up the length.
 	 */
 	args->len = rlen;
-	xfs_alloc_fix_len(args);
-	if (rlen < args->minlen || !xfs_alloc_fix_minleft(args)) {
-		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-		trace_xfs_alloc_size_nominleft(args);
-		args->agbno = NULLAGBLOCK;
-		return 0;
+	if (rlen < args->minlen) {
+		if (!forced++) {
+			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+			trace_xfs_alloc_size_busy(args);
+			xfs_log_force(args->mp, XFS_LOG_SYNC);
+			goto restart;
+		}
+		goto out_nominleft;
 	}
+	xfs_alloc_fix_len(args);
+
+	if (!xfs_alloc_fix_minleft(args))
+		goto out_nominleft;
 	rlen = args->len;
 	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0);
 	/*
@@ -1284,6 +1354,12 @@ error0:
 	if (bno_cur)
 		xfs_btree_del_cursor(bno_cur, XFS_BTREE_ERROR);
 	return error;
+
+out_nominleft:
+	xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
+	trace_xfs_alloc_size_nominleft(args);
+	args->agbno = NULLAGBLOCK;
+	return 0;
 }
 
 /*
@@ -2612,7 +2688,7 @@ restart:
  * will require a synchronous transaction, but it can still be
  * used to distinguish between a partial or exact match.
  */
-int
+STATIC int
 xfs_alloc_busy_search(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
@@ -2654,6 +2730,71 @@ xfs_alloc_busy_search(
 	return match;
 }
 
+/*
+ * For a given extent [fbno, flen], search the busy extent list
+ * to find a subset of the extent that is not busy.
+ */
+void
+xfs_alloc_busy_search_trim(
+	struct xfs_mount	*mp,
+	struct xfs_perag	*pag,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen,
+	xfs_agblock_t		*rbno,
+	xfs_extlen_t		*rlen)
+{
+	struct rb_node		*rbp;
+	xfs_agblock_t           bno = fbno;
+	xfs_extlen_t            len = flen;
+
+	spin_lock(&pag->pagb_lock);
+	rbp = pag->pagb_tree.rb_node;
+	while (rbp) {
+		struct xfs_busy_extent *busyp =
+			rb_entry(rbp, struct xfs_busy_extent, rb_node);
+		xfs_agblock_t	end = bno + len;
+		xfs_agblock_t	bend = busyp->bno + busyp->length;
+
+		if (bno + len <= busyp->bno) {
+			rbp = rbp->rb_left;
+			continue;
+		} else if (bno >= busyp->bno + busyp->length) {
+			rbp = rbp->rb_right;
+			continue;
+		}
+
+		if (busyp->bno < bno) {
+			/* start overlap */
+			ASSERT(bend >= bno);
+			ASSERT(bend <= end);
+			len -= bno - bend;
+			bno = bend;
+		} else if (bend > end) {
+			/* end overlap */
+			ASSERT(busyp->bno >= bno);
+			ASSERT(busyp->bno < end);
+			len -= bend - end;
+		} else {
+			/* middle overlap - return larger segment */
+			ASSERT(busyp->bno >= bno);
+			ASSERT(bend <= end);
+			len = busyp->bno - bno;
+			if (len >= end - bend) {
+				/* use first segment */
+				len = len;
+			} else {
+				/* use last segment */
+				bno = bend;
+				len = end - bend;
+			}
+		}
+	}
+	spin_unlock(&pag->pagb_lock);
+
+	*rbno = bno;
+	*rlen = len;
+}
+
 void
 xfs_alloc_busy_clear(
 	struct xfs_mount	*mp,
Index: xfs/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_trace.h	2011-01-17 22:05:13.686003991 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_trace.h	2011-01-18 12:52:32.773004341 +0100
@@ -1433,11 +1433,14 @@ DEFINE_ALLOC_EVENT(xfs_alloc_near_first)
 DEFINE_ALLOC_EVENT(xfs_alloc_near_greater);
 DEFINE_ALLOC_EVENT(xfs_alloc_near_lesser);
 DEFINE_ALLOC_EVENT(xfs_alloc_near_error);
+DEFINE_ALLOC_EVENT(xfs_alloc_near_noentry);
+DEFINE_ALLOC_EVENT(xfs_alloc_near_busy);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_neither);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_noentry);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_nominleft);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_done);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_error);
+DEFINE_ALLOC_EVENT(xfs_alloc_size_busy);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_freelist);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_notenough);
 DEFINE_ALLOC_EVENT(xfs_alloc_small_done);
@@ -1790,7 +1793,6 @@ DEFINE_EVENT(xfs_discard_class, name, \
 DEFINE_DISCARD_EVENT(xfs_discard_extent);
 DEFINE_DISCARD_EVENT(xfs_discard_toosmall);
 DEFINE_DISCARD_EVENT(xfs_discard_exclude);
-DEFINE_DISCARD_EVENT(xfs_discard_busy);
 
 #endif /* _TRACE_XFS_H */
 
Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
===================================================================
--- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-01-17 22:06:13.004005040 +0100
+++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-01-17 22:14:09.133005668 +0100
@@ -77,8 +77,8 @@ xfs_trim_extents(
 	 * enough to be worth discarding.
 	 */
 	while (i) {
-		xfs_agblock_t fbno;
-		xfs_extlen_t flen;
+		xfs_agblock_t	fbno, tbno;
+		xfs_extlen_t	flen, tlen;
 
 		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
 		if (error)
@@ -90,7 +90,7 @@ xfs_trim_extents(
 		 * Too small?  Give up.
 		 */
 		if (flen < minlen) {
-			trace_xfs_discard_toosmall(mp, agno, fbno, flen);
+			trace_xfs_discard_toosmall(mp, agno, tbno, flen);
 			goto out_del_cursor;
 		}
 
@@ -109,19 +109,16 @@ xfs_trim_extents(
 		 * If any blocks in the range are still busy, skip the
 		 * discard and try again the next time.
 		 */
-		if (xfs_alloc_busy_search(mp, agno, fbno, flen)) {
-			trace_xfs_discard_busy(mp, agno, fbno, flen);
-			goto next_extent;
-		}
+		xfs_alloc_busy_search_trim(mp, pag, fbno, flen, &tbno, &tlen);
 
-		trace_xfs_discard_extent(mp, agno, fbno, flen);
+		trace_xfs_discard_extent(mp, agno, tbno, tlen);
 		error = -blkdev_issue_discard(bdev,
-				XFS_AGB_TO_DADDR(mp, agno, fbno),
-				XFS_FSB_TO_BB(mp, flen),
+				XFS_AGB_TO_DADDR(mp, agno, tbno),
+				XFS_FSB_TO_BB(mp, tlen),
 				GFP_NOFS, 0);
 		if (error)
 			goto out_del_cursor;
-		*blocks_trimmed += flen;
+		*blocks_trimmed += tlen;
 
 next_extent:
 		error = xfs_btree_decrement(cur, 0, &i);
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-01-17 22:06:12.985003713 +0100
+++ xfs/fs/xfs/xfs_alloc.h	2011-01-18 13:03:59.474004132 +0100
@@ -126,9 +126,10 @@ xfs_alloc_busy_insert(struct xfs_trans *
 void
 xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp);
 
-int
-xfs_alloc_busy_search(struct xfs_mount *mp, xfs_agnumber_t agno,
-	xfs_agblock_t bno, xfs_extlen_t len);
+void
+xfs_alloc_busy_search_trim(struct xfs_mount *mp, struct xfs_perag *pag,
+	xfs_agblock_t bno, xfs_extlen_t len,
+	xfs_agblock_t *tbno, xfs_extlen_t *tlen);
 #endif	/* __KERNEL__ */
 
 /*


* [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist
  2011-01-21  9:22 [PATCH 0/6] do not reuse busy extents Christoph Hellwig
  2011-01-21  9:22 ` [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention Christoph Hellwig
  2011-01-21  9:22 ` [PATCH 3/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
@ 2011-01-21  9:22 ` Christoph Hellwig
  2011-01-28  5:36   ` Dave Chinner
  2011-01-28 22:17   ` Alex Elder
  2011-01-21  9:22 ` [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy Christoph Hellwig
  2011-01-21  9:22 ` [PATCH 6/6] xfs: remove handling of duplicates in the busy extent tree Christoph Hellwig
  4 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-avoid-sync-transaction-for-freelist-refill --]
[-- Type: text/plain, Size: 6539 bytes --]

If we shorten the freelist in xfs_alloc_fix_freelist there is no need
to wait for busy blocks as they will be marked busy again when we call
xfs_free_ag_extent.  Avoid this by not marking blocks coming from the
freelist as busy in xfs_free_ag_extent, and not marking transactions
with busy extents as synchronous in xfs_alloc_get_freelist.  Unlike
xfs_free_ag_extent which already has the isfl argument,
xfs_alloc_get_freelist needs to be told about the usage of the blocks
it returns.  For this we extend the btreeblk flag to a type argument
which specifies in detail what the block is going to be used for.
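
A compilable userspace sketch of the dispatch this introduces; the helper
names below are stand-ins for xfs_alloc_busy_search() and
xfs_trans_set_sync(), and only the type check mirrors the patch:

#include <stdbool.h>
#include <stdio.h>

#define XFS_FREELIST_ALLOC	0	/* block becomes a normal allocation */
#define XFS_FREELIST_BTREE	1	/* block becomes an AGF btree block */
#define XFS_FREELIST_BALANCE	2	/* block is freed again right away */

/* Stand-ins for the busy-list lookup and the sync-transaction marking. */
static bool busy_search(unsigned int bno) { return bno == 42; }
static void set_sync(void) { printf("transaction marked synchronous\n"); }

static void get_freelist_block(unsigned int bno, int type)
{
	/*
	 * A block taken off the freelist only to rebalance it is marked
	 * busy again by xfs_free_ag_extent, so ordering is preserved
	 * without making the whole transaction synchronous.
	 */
	if (type != XFS_FREELIST_BALANCE && busy_search(bno))
		set_sync();
}

int main(void)
{
	get_freelist_block(42, XFS_FREELIST_ALLOC);	/* forces sync */
	get_freelist_block(42, XFS_FREELIST_BALANCE);	/* no sync needed */
	return 0;
}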

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-17 22:18:20.421254585 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-01-17 22:30:54.006256540 +0100
@@ -1395,7 +1395,8 @@ xfs_alloc_ag_vextent_small(
 	else if (args->minlen == 1 && args->alignment == 1 && !args->isfl &&
 		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
 		  > args->minleft)) {
-		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
+		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno,
+					       XFS_FREELIST_ALLOC);
 		if (error)
 			goto error0;
 		if (fbno != NULLAGBLOCK) {
@@ -1683,25 +1684,20 @@ xfs_free_ag_extent(
 	if (error)
 		goto error0;
 
-	if (!isfl)
+	if (!isfl) {
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, (long)len);
+
+		/*
+		 * Mark the extent busy unless it comes from the freelist,
+		 * in which case it has already been marked busy.
+		 */
+		xfs_alloc_busy_insert(tp, agno, bno, len);
+	}
+
 	XFS_STATS_INC(xs_freex);
 	XFS_STATS_ADD(xs_freeb, len);
 
 	trace_xfs_free_extent(mp, agno, bno, len, isfl, haveleft, haveright);
-
-	/*
-	 * Since blocks move to the free list without the coordination
-	 * used in xfs_bmap_finish, we can't allow block to be available
-	 * for reallocation and non-transaction writing (user data)
-	 * until we know that the transaction that moved it to the free
-	 * list is permanently on disk.  We track the blocks by declaring
-	 * these blocks as "busy"; the busy list is maintained on a per-ag
-	 * basis and each transaction records which entries should be removed
-	 * when the iclog commits to disk.  If a busy block is allocated,
-	 * the iclog is pushed up to the LSN that freed the block.
-	 */
-	xfs_alloc_busy_insert(tp, agno, bno, len);
 	return 0;
 
  error0:
@@ -1873,11 +1869,14 @@ xfs_alloc_fix_freelist(
 	while (be32_to_cpu(agf->agf_flcount) > need) {
 		xfs_buf_t	*bp;
 
-		error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
+		error = xfs_alloc_get_freelist(tp, agbp, &bno,
+					       XFS_FREELIST_BALANCE);
 		if (error)
 			return error;
-		if ((error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1)))
+		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1, 1);
+		if (error)
 			return error;
+
 		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
 		xfs_trans_binval(tp, bp);
 	}
@@ -1944,18 +1943,18 @@ xfs_alloc_get_freelist(
 	xfs_trans_t	*tp,	/* transaction pointer */
 	xfs_buf_t	*agbp,	/* buffer containing the agf structure */
 	xfs_agblock_t	*bnop,	/* block address retrieved from freelist */
-	int		btreeblk) /* destination is a AGF btree */
+	int		type)	/* where does the allocation go to? */
 {
-	xfs_agf_t	*agf;	/* a.g. freespace structure */
+	xfs_mount_t	*mp = tp->t_mountp;
+	xfs_agf_t	*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t	agno = be32_to_cpu(agf->agf_seqno);
 	xfs_agfl_t	*agfl;	/* a.g. freelist structure */
 	xfs_buf_t	*agflbp;/* buffer for a.g. freelist structure */
 	xfs_agblock_t	bno;	/* block number returned */
 	int		error;
 	int		logflags;
-	xfs_mount_t	*mp;	/* mount structure */
 	xfs_perag_t	*pag;	/* per allocation group data */
 
-	agf = XFS_BUF_TO_AGF(agbp);
 	/*
 	 * Freelist is empty, give up.
 	 */
@@ -1963,14 +1962,15 @@ xfs_alloc_get_freelist(
 		*bnop = NULLAGBLOCK;
 		return 0;
 	}
+
 	/*
 	 * Read the array of free blocks.
 	 */
-	mp = tp->t_mountp;
-	if ((error = xfs_alloc_read_agfl(mp, tp,
-			be32_to_cpu(agf->agf_seqno), &agflbp)))
+	error = xfs_alloc_read_agfl(mp, tp, agno, &agflbp);
+	if (error)
 		return error;
 	agfl = XFS_BUF_TO_AGFL(agflbp);
+
 	/*
 	 * Get the block number and update the data structures.
 	 */
@@ -1980,14 +1980,14 @@ xfs_alloc_get_freelist(
 	if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp))
 		agf->agf_flfirst = 0;
 
-	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
+	pag = xfs_perag_get(mp, agno);
 	be32_add_cpu(&agf->agf_flcount, -1);
 	xfs_trans_agflist_delta(tp, -1);
 	pag->pagf_flcount--;
 	xfs_perag_put(pag);
 
 	logflags = XFS_AGF_FLFIRST | XFS_AGF_FLCOUNT;
-	if (btreeblk) {
+	if (type == XFS_FREELIST_BTREE) {
 		be32_add_cpu(&agf->agf_btreeblks, 1);
 		pag->pagf_btreeblks++;
 		logflags |= XFS_AGF_BTREEBLKS;
@@ -2009,8 +2009,11 @@ xfs_alloc_get_freelist(
 	 * that we don't sit and wait with the AGF locked in the transaction
 	 * during the log force.
 	 */
-	if (xfs_alloc_busy_search(mp, be32_to_cpu(agf->agf_seqno), bno, 1))
-		xfs_trans_set_sync(tp);
+	if (type != XFS_FREELIST_BALANCE) {
+		if (xfs_alloc_busy_search(mp, agno, bno, 1))
+			xfs_trans_set_sync(tp);
+	}
+
 	return 0;
 }
 
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-01-17 22:23:05.168004061 +0100
+++ xfs/fs/xfs/xfs_alloc.h	2011-01-17 22:30:06.456005528 +0100
@@ -140,6 +140,13 @@ xfs_alloc_compute_maxlevels(
 	struct xfs_mount	*mp);	/* file system mount structure */
 
 /*
+ * Destination of blocks allocated by xfs_alloc_get_freelist.
+ */
+#define XFS_FREELIST_ALLOC	0
+#define XFS_FREELIST_BTREE	1
+#define XFS_FREELIST_BALANCE	2
+
+/*
  * Get a block from the freelist.
  * Returns with the buffer for the block gotten.
  */
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-01-17 22:23:05.155004271 +0100
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-01-17 22:24:35.867255145 +0100
@@ -83,7 +83,7 @@ xfs_allocbt_alloc_block(
 
 	/* Allocate the new block from the freelist. If we can't, give up.  */
 	error = xfs_alloc_get_freelist(cur->bc_tp, cur->bc_private.a.agbp,
-				       &bno, 1);
+				       &bno, XFS_FREELIST_BTREE);
 	if (error) {
 		XFS_BTREE_TRACE_CURSOR(cur, XBT_ERROR);
 		return error;


* [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy
  2011-01-21  9:22 [PATCH 0/6] do not reuse busy extents Christoph Hellwig
                   ` (2 preceding siblings ...)
  2011-01-21  9:22 ` [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist Christoph Hellwig
@ 2011-01-21  9:22 ` Christoph Hellwig
  2011-01-28  6:33   ` Dave Chinner
                     ` (2 more replies)
  2011-01-21  9:22 ` [PATCH 6/6] xfs: remove handling of duplicates in the busy extent tree Christoph Hellwig
  4 siblings, 3 replies; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-optimize-busy-btree-blocks --]
[-- Type: text/plain, Size: 10937 bytes --]

During an allocation or freeing of an extent, we occasionally have an
interesting situation happen during a btree update.  We may merge a block
as part of a record delete, then allocate it again immediately afterwards
when inserting a modified extent.  xfs_alloc_fixup_trees() does this sort
of manipulation to the btrees, and so can merge then immediately split
the tree, resulting in the same busy block being used from the
freelist.

Previously, this would trigger a synchronous log force of the entire log
(as the current transaction had an lsn of 0 until committed), but continue
to allow the extent to be used because if it is not on disk now then it
must have been freed and marked busy in this transaction.

In the case of allocbt blocks, we log the fact that they get moved to the
freelist, and we also log them when they move off the free list back into
the free space trees.  When we move the blocks off the freelist back to
the trees, we mark the extent busy at that point so that it doesn't get
reused until the transaction is committed.

This means that as long as the block is on the AGFL and is only used
for allocbt blocks, we don't have to force the log to ensure prior
manipulating transactions are on disk as the process of log recovery
will replay the transactions in the correct order.

However, we still need to protect against the fact that as we approach
ENOSPC in an AG we can allocate data and metadata blocks direct from the
AGFL. In this case, we need to treat btree blocks just like any other
busy extent. Hence we still need to track btree blocks in the busy extent
tree, but we need to distinguish them from normal extents so we can apply
the necessary exceptions for btree block allocation.
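
A condensed, compilable model of the decision xfs_alloc_busy_force()
makes under these rules (simplified types; the real code searches the
per-AG rbtree under pagb_lock and calls xfs_log_force()):

#include <stdbool.h>
#include <stdio.h>

#define XFS_BUSY_FREELIST	0x0001	/* busy extent on freelist from abt */

struct busy_extent {
	int	flags;
};

/*
 * Returns true when the caller must force the log before reusing the
 * block.  A block that was freed from an allocation btree onto the AGFL
 * and is now allocated back into a btree is ordered by log recovery, so
 * its busy entry can simply be discarded instead.
 */
static bool busy_force(struct busy_extent *match, bool btreeblk)
{
	if (!match)
		return false;		/* block is not busy at all */
	if (btreeblk && (match->flags & XFS_BUSY_FREELIST)) {
		match->flags = 0;	/* stand-in for erasing the entry */
		return false;
	}
	return true;			/* any other reuse needs a log force */
}

int main(void)
{
	struct busy_extent a = { .flags = XFS_BUSY_FREELIST };
	struct busy_extent b = { .flags = XFS_BUSY_FREELIST };

	printf("btree realloc forces log: %d\n", busy_force(&a, true));
	printf("data alloc forces log:   %d\n", busy_force(&b, false));
	return 0;
}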

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-01-17 22:03:15.000000000 +0100
+++ xfs/fs/xfs/xfs_ag.h	2011-01-17 22:32:06.541004201 +0100
@@ -188,6 +188,11 @@ struct xfs_busy_extent {
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
 	xlog_tid_t	tid;		/* transaction that created this */
+	int		flags;
+};
+
+enum {
+	XFS_BUSY_FREELIST	= 0x0001, /* busy extent on freelist from abt */
 };
 
 /*
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-17 22:30:54.000000000 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-01-17 22:39:29.021005529 +0100
@@ -46,8 +46,12 @@ STATIC int xfs_alloc_ag_vextent_near(xfs
 STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_small(xfs_alloc_arg_t *,
 		xfs_btree_cur_t *, xfs_agblock_t *, xfs_extlen_t *, int *);
+STATIC void xfs_alloc_busy_insert(struct xfs_trans *, xfs_agnumber_t,
+		xfs_agblock_t, xfs_extlen_t, int);
 STATIC int xfs_alloc_busy_search(struct xfs_mount *, xfs_agnumber_t,
 		xfs_agblock_t, xfs_extlen_t);
+STATIC void xfs_alloc_busy_force(struct xfs_mount *, xfs_agnumber_t,
+		xfs_agblock_t, xfs_extlen_t, int);
 
 /*
  * Internal functions.
@@ -1691,7 +1695,7 @@ xfs_free_ag_extent(
 		 * Mark the extent busy unless it comes from the freelist,
 		 * in which case it has already been marked busy.
 		 */
-		xfs_alloc_busy_insert(tp, agno, bno, len);
+		xfs_alloc_busy_insert(tp, agno, bno, len, 0);
 	}
 
 	XFS_STATS_INC(xs_freex);
@@ -1996,22 +2000,23 @@ xfs_alloc_get_freelist(
 	xfs_alloc_log_agf(tp, agbp, logflags);
 	*bnop = bno;
 
-	/*
-	 * As blocks are freed, they are added to the per-ag busy list and
-	 * remain there until the freeing transaction is committed to disk.
-	 * Now that we have allocated blocks, this list must be searched to see
-	 * if a block is being reused.  If one is, then the freeing transaction
-	 * must be pushed to disk before this transaction.
-	 *
-	 * We do this by setting the current transaction to a sync transaction
-	 * which guarantees that the freeing transaction is on disk before this
-	 * transaction. This is done instead of a synchronous log force here so
-	 * that we don't sit and wait with the AGF locked in the transaction
-	 * during the log force.
-	 */
 	if (type != XFS_FREELIST_BALANCE) {
-		if (xfs_alloc_busy_search(mp, agno, bno, 1))
-			xfs_trans_set_sync(tp);
+		/*
+		 * If this block is marked busy we may have to force out the
+		 * log to prevent reuse until the freeing transaction has been
+		 * committed.
+		 *
+		 * If we're just allocating the block to rebalance the
+		 * freelist size we can skip this exercise as the block
+		 * just keeps its busy marking while migrating to the
+		 * allocation btree.
+		 *
+		 * If the block was freed from a btree and gets reallocated
+		 * to it we may skip the log force, but details are handled
+		 * by xfs_alloc_busy_force.
+		 */
+		xfs_alloc_busy_force(mp, agno, bno, 1,
+				     type == XFS_FREELIST_BTREE);
 	}
 
 	return 0;
@@ -2081,26 +2086,27 @@ xfs_alloc_put_freelist(
 	xfs_agblock_t		bno,	/* block being freed */
 	int			btreeblk) /* block came from a AGF btree */
 {
-	xfs_agf_t		*agf;	/* a.g. freespace structure */
+	xfs_mount_t		*mp = tp->t_mountp;
+	xfs_agf_t		*agf = XFS_BUF_TO_AGF(agbp);
+	xfs_agnumber_t		agno = be32_to_cpu(agf->agf_seqno);
 	xfs_agfl_t		*agfl;	/* a.g. free block array */
 	__be32			*blockp;/* pointer to array entry */
 	int			error;
 	int			logflags;
-	xfs_mount_t		*mp;	/* mount structure */
 	xfs_perag_t		*pag;	/* per allocation group data */
 
-	agf = XFS_BUF_TO_AGF(agbp);
-	mp = tp->t_mountp;
+	if (!agflbp) {
+		error = xfs_alloc_read_agfl(mp, tp, agno, &agflbp);
+		if (error)
+			return error;
+	}
 
-	if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp,
-			be32_to_cpu(agf->agf_seqno), &agflbp)))
-		return error;
 	agfl = XFS_BUF_TO_AGFL(agflbp);
 	be32_add_cpu(&agf->agf_fllast, 1);
 	if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp))
 		agf->agf_fllast = 0;
 
-	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
+	pag = xfs_perag_get(mp, agno);
 	be32_add_cpu(&agf->agf_flcount, 1);
 	xfs_trans_agflist_delta(tp, 1);
 	pag->pagf_flcount++;
@@ -2123,6 +2129,14 @@ xfs_alloc_put_freelist(
 		(int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl),
 		(int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl +
 			sizeof(xfs_agblock_t) - 1));
+
+	/*
+	 * If it's a btree block, mark it busy as it has just been freed.
+	 * Otherwise we are just replenishing the free list with extents
+	 * that are already free so we don't need to track them.
+	 */
+	if (btreeblk)
+		xfs_alloc_busy_insert(tp, agno, bno, 1, XFS_BUSY_FREELIST);
 	return 0;
 }
 
@@ -2582,12 +2596,13 @@ error0:
  * logically into the one checkpoint. If the checkpoint sequences are
  * different, however, we still need to wait on a log force.
  */
-void
+STATIC void
 xfs_alloc_busy_insert(
 	struct xfs_trans	*tp,
 	xfs_agnumber_t		agno,
 	xfs_agblock_t		bno,
-	xfs_extlen_t		len)
+	xfs_extlen_t		len,
+	int			flags)
 {
 	struct xfs_busy_extent	*new;
 	struct xfs_busy_extent	*busyp;
@@ -2612,6 +2627,7 @@ xfs_alloc_busy_insert(
 	new->agno = agno;
 	new->bno = bno;
 	new->length = len;
+	new->flags = flags;
 	new->tid = xfs_log_get_trans_ident(tp);
 
 	INIT_LIST_HEAD(&new->list);
@@ -2733,6 +2749,62 @@ xfs_alloc_busy_search(
 	return match;
 }
 
+STATIC void
+xfs_alloc_busy_force(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		bno,
+	xfs_extlen_t		len,
+	int			btreeblk)
+{
+	struct xfs_perag	*pag;
+	struct rb_node		*rbp;
+	struct xfs_busy_extent	*busyp;
+	int			match = 0;
+
+	pag = xfs_perag_get(mp, agno);
+	spin_lock(&pag->pagb_lock);
+
+	rbp = pag->pagb_tree.rb_node;
+
+	/* find closest start bno overlap */
+	while (rbp) {
+		busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node);
+		if (bno < busyp->bno) {
+			/* may overlap, but exact start block is lower */
+			if (bno + len > busyp->bno) {
+				match = 1;
+				break;
+			}
+			rbp = rbp->rb_left;
+		} else if (bno > busyp->bno) {
+			/* may overlap, but exact start block is higher */
+			if (bno < busyp->bno + busyp->length) {
+				match = 1;
+				break;
+			}
+			rbp = rbp->rb_right;
+		} else {
+			/* bno matches busyp, length determines exact match */
+			match = 1;
+			break;
+		}
+	}
+
+	if (match && btreeblk && (busyp->flags & XFS_BUSY_FREELIST)) {
+		list_del_init(&busyp->list);
+		rb_erase(&busyp->rb_node, &pag->pagb_tree);
+		kmem_free(busyp);
+		match = 0;
+	}
+	spin_unlock(&pag->pagb_lock);
+	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match);
+	xfs_perag_put(pag);
+
+	if (match)
+		xfs_log_force(mp, XFS_LOG_SYNC);
+}
+
 /*
  * For a given extent [fbno, flen], search the busy extent list
  * to find a subset of the extent that is not busy.
Index: xfs/fs/xfs/xfs_alloc_btree.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc_btree.c	2011-01-17 22:24:35.000000000 +0100
+++ xfs/fs/xfs/xfs_alloc_btree.c	2011-01-17 22:32:06.550048202 +0100
@@ -109,7 +109,6 @@ xfs_allocbt_free_block(
 	struct xfs_buf		*bp)
 {
 	struct xfs_buf		*agbp = cur->bc_private.a.agbp;
-	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
 	xfs_agblock_t		bno;
 	int			error;
 
@@ -118,18 +117,6 @@ xfs_allocbt_free_block(
 	if (error)
 		return error;
 
-	/*
-	 * Since blocks move to the free list without the coordination used in
-	 * xfs_bmap_finish, we can't allow block to be available for
-	 * reallocation and non-transaction writing (user data) until we know
-	 * that the transaction that moved it to the free list is permanently
-	 * on disk. We track the blocks by declaring these blocks as "busy";
-	 * the busy list is maintained on a per-ag basis and each transaction
-	 * records which entries should be removed when the iclog commits to
-	 * disk. If a busy block is allocated, the iclog is pushed up to the
-	 * LSN that freed the block.
-	 */
-	xfs_alloc_busy_insert(cur->bc_tp, be32_to_cpu(agf->agf_seqno), bno, 1);
 	xfs_trans_agbtree_delta(cur->bc_tp, -1);
 	return 0;
 }
Index: xfs/fs/xfs/xfs_alloc.h
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.h	2011-01-17 22:30:06.456005528 +0100
+++ xfs/fs/xfs/xfs_alloc.h	2011-01-17 22:40:14.542034513 +0100
@@ -120,10 +120,6 @@ xfs_alloc_longest_free_extent(struct xfs
 
 #ifdef __KERNEL__
 void
-xfs_alloc_busy_insert(struct xfs_trans *tp, xfs_agnumber_t agno,
-	xfs_agblock_t bno, xfs_extlen_t len);
-
-void
 xfs_alloc_busy_clear(struct xfs_mount *mp, struct xfs_busy_extent *busyp);
 
 void
@@ -140,7 +136,8 @@ xfs_alloc_compute_maxlevels(
 	struct xfs_mount	*mp);	/* file system mount structure */
 
 /*
- * Destination of blocks allocated by xfs_alloc_get_freelist.
+ * Destination of blocks allocated by xfs_alloc_get_freelist  /
+ * source of blocks freed by xfs_alloc_put_freelist.
  */
 #define XFS_FREELIST_ALLOC	0
 #define XFS_FREELIST_BTREE	1


* [PATCH 6/6] xfs: remove handling of duplicates in the busy extent tree
  2011-01-21  9:22 [PATCH 0/6] do not reuse busy extents Christoph Hellwig
                   ` (3 preceding siblings ...)
  2011-01-21  9:22 ` [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy Christoph Hellwig
@ 2011-01-21  9:22 ` Christoph Hellwig
  2011-02-01 23:02   ` Alex Elder
  4 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2011-01-21  9:22 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-remove-duplicate-extent-handling --]
[-- Type: text/plain, Size: 9754 bytes --]

With the recent changes we never reuse busy extents.  Remove all handling
of duplicates in the busy extent tree and replace it with asserts.  This
also effectively reverts commit 955833cf2ad0aa39b336e853cad212d867199984

	"xfs: make the log ticket ID available outside the log infrastructure"

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: xfs/fs/xfs/xfs_ag.h
===================================================================
--- xfs.orig/fs/xfs/xfs_ag.h	2011-01-18 13:06:47.588254235 +0100
+++ xfs/fs/xfs/xfs_ag.h	2011-01-18 13:07:05.608012022 +0100
@@ -187,7 +187,6 @@ struct xfs_busy_extent {
 	xfs_agnumber_t	agno;
 	xfs_agblock_t	bno;
 	xfs_extlen_t	length;
-	xlog_tid_t	tid;		/* transaction that created this */
 	int		flags;
 };
 
Index: xfs/fs/xfs/xfs_alloc.c
===================================================================
--- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-18 13:06:47.595254654 +0100
+++ xfs/fs/xfs/xfs_alloc.c	2011-01-18 13:08:34.000000000 +0100
@@ -2518,83 +2518,7 @@ error0:
 /*
  * Insert a new extent into the busy tree.
  *
- * The busy extent tree is indexed by the start block of the busy extent.
- * there can be multiple overlapping ranges in the busy extent tree but only
- * ever one entry at a given start block. The reason for this is that
- * multi-block extents can be freed, then smaller chunks of that extent
- * allocated and freed again before the first transaction commit is on disk.
- * If the exact same start block is freed a second time, we have to wait for
- * that busy extent to pass out of the tree before the new extent is inserted.
- * There are two main cases we have to handle here.
- *
- * The first case is a transaction that triggers a "free - allocate - free"
- * cycle. This can occur during btree manipulations as a btree block is freed
- * to the freelist, then allocated from the free list, then freed again. In
- * this case, the second extxpnet free is what triggers the duplicate and as
- * such the transaction IDs should match. Because the extent was allocated in
- * this transaction, the transaction must be marked as synchronous. This is
- * true for all cases where the free/alloc/free occurs in the one transaction,
- * hence the addition of the ASSERT(tp->t_flags & XFS_TRANS_SYNC) to this case.
- * This serves to catch violations of the second case quite effectively.
- *
- * The second case is where the free/alloc/free occur in different
- * transactions. In this case, the thread freeing the extent the second time
- * can't mark the extent busy immediately because it is already tracked in a
- * transaction that may be committing.  When the log commit for the existing
- * busy extent completes, the busy extent will be removed from the tree. If we
- * allow the second busy insert to continue using that busy extent structure,
- * it can be freed before this transaction is safely in the log.  Hence our
- * only option in this case is to force the log to remove the existing busy
- * extent from the list before we insert the new one with the current
- * transaction ID.
- *
- * The problem we are trying to avoid in the free-alloc-free in separate
- * transactions is most easily described with a timeline:
- *
- *      Thread 1	Thread 2	Thread 3	xfslogd
- *	xact alloc
- *	free X
- *	mark busy
- *	commit xact
- *	free xact
- *			xact alloc
- *			alloc X
- *			busy search
- *			mark xact sync
- *			commit xact
- *			free xact
- *			force log
- *			checkpoint starts
- *			....
- *					xact alloc
- *					free X
- *					mark busy
- *					finds match
- *					*** KABOOM! ***
- *					....
- *							log IO completes
- *							unbusy X
- *			checkpoint completes
- *
- * By issuing a log force in thread 3 @ "KABOOM", the thread will block until
- * the checkpoint completes, and the busy extent it matched will have been
- * removed from the tree when it is woken. Hence it can then continue safely.
- *
- * However, to ensure this matching process is robust, we need to use the
- * transaction ID for identifying transaction, as delayed logging results in
- * the busy extent and transaction lifecycles being different. i.e. the busy
- * extent is active for a lot longer than the transaction.  Hence the
- * transaction structure can be freed and reallocated, then mark the same
- * extent busy again in the new transaction. In this case the new transaction
- * will have a different tid but can have the same address, and hence we need
- * to check against the tid.
- *
- * Future: for delayed logging, we could avoid the log force if the extent was
- * first freed in the current checkpoint sequence. This, however, requires the
- * ability to pin the current checkpoint in memory until this transaction
- * commits to ensure that both the original free and the current one combine
- * logically into the one checkpoint. If the checkpoint sequences are
- * different, however, we still need to wait on a log force.
+ * The busy extent tree is indexed by the start block of the busy extent
  */
 STATIC void
 xfs_alloc_busy_insert(
@@ -2609,8 +2533,6 @@ xfs_alloc_busy_insert(
 	struct xfs_perag	*pag;
 	struct rb_node		**rbp;
 	struct rb_node		*parent;
-	int			match;
-
 
 	new = kmem_zalloc(sizeof(struct xfs_busy_extent), KM_MAYFAIL);
 	if (!new) {
@@ -2628,74 +2550,35 @@ xfs_alloc_busy_insert(
 	new->bno = bno;
 	new->length = len;
 	new->flags = flags;
-	new->tid = xfs_log_get_trans_ident(tp);
-
 	INIT_LIST_HEAD(&new->list);
 
 	/* trace before insert to be able to see failed inserts */
 	trace_xfs_alloc_busy(tp, agno, bno, len, 0);
 
 	pag = xfs_perag_get(tp->t_mountp, new->agno);
-restart:
 	spin_lock(&pag->pagb_lock);
 	rbp = &pag->pagb_tree.rb_node;
 	parent = NULL;
-	busyp = NULL;
-	match = 0;
-	while (*rbp && match >= 0) {
+	while (*rbp) {
 		parent = *rbp;
 		busyp = rb_entry(parent, struct xfs_busy_extent, rb_node);
 
 		if (new->bno < busyp->bno) {
-			/* may overlap, but exact start block is lower */
+			ASSERT(new->bno + new->length <= busyp->bno);
 			rbp = &(*rbp)->rb_left;
-			if (new->bno + new->length > busyp->bno)
-				match = busyp->tid == new->tid ? 1 : -1;
-		} else if (new->bno > busyp->bno) {
-			/* may overlap, but exact start block is higher */
-			rbp = &(*rbp)->rb_right;
-			if (bno < busyp->bno + busyp->length)
-				match = busyp->tid == new->tid ? 1 : -1;
 		} else {
-			match = busyp->tid == new->tid ? 1 : -1;
-			break;
+			ASSERT(new->bno > busyp->bno);
+			ASSERT(bno >= busyp->bno + busyp->length);
+			rbp = &(*rbp)->rb_right;
 		}
 	}
-	if (match < 0) {
-		/* overlap marked busy in different transaction */
-		spin_unlock(&pag->pagb_lock);
-		xfs_log_force(tp->t_mountp, XFS_LOG_SYNC);
-		goto restart;
-	}
-	if (match > 0) {
-		/*
-		 * overlap marked busy in same transaction. Update if exact
-		 * start block match, otherwise combine the busy extents into
-		 * a single range.
-		 */
-		if (busyp->bno == new->bno) {
-			busyp->length = max(busyp->length, new->length);
-			spin_unlock(&pag->pagb_lock);
-			ASSERT(tp->t_flags & XFS_TRANS_SYNC);
-			xfs_perag_put(pag);
-			kmem_free(new);
-			return;
-		}
-		rb_erase(&busyp->rb_node, &pag->pagb_tree);
-		new->length = max(busyp->bno + busyp->length,
-					new->bno + new->length) -
-				min(busyp->bno, new->bno);
-		new->bno = min(busyp->bno, new->bno);
-	} else
-		busyp = NULL;
 
 	rb_link_node(&new->rb_node, parent, rbp);
 	rb_insert_color(&new->rb_node, &pag->pagb_tree);
-
 	list_add(&new->list, &tp->t_busy);
+
 	spin_unlock(&pag->pagb_lock);
 	xfs_perag_put(pag);
-	kmem_free(busyp);
 }
 
 /*
Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2011-01-18 13:06:47.612254724 +0100
+++ xfs/fs/xfs/xfs_log.c	2011-01-18 13:07:05.929013420 +0100
@@ -3256,13 +3256,6 @@ xfs_log_ticket_get(
 	return ticket;
 }
 
-xlog_tid_t
-xfs_log_get_trans_ident(
-	struct xfs_trans	*tp)
-{
-	return tp->t_ticket->t_tid;
-}
-
 /*
  * Allocate and initialise a new log ticket.
  */
Index: xfs/fs/xfs/xfs_log.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log.h	2011-01-18 13:06:47.618254514 +0100
+++ xfs/fs/xfs/xfs_log.h	2011-01-18 13:07:05.936017889 +0100
@@ -189,8 +189,6 @@ void	  xlog_iodone(struct xfs_buf *);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-xlog_tid_t xfs_log_get_trans_ident(struct xfs_trans *tp);
-
 int	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
 				struct xfs_log_vec *log_vector,
 				xfs_lsn_t *commit_lsn, int flags);
Index: xfs/fs/xfs/xfs_log_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log_priv.h	2011-01-18 13:06:47.625253956 +0100
+++ xfs/fs/xfs/xfs_log_priv.h	2011-01-18 13:07:05.951006924 +0100
@@ -148,6 +148,8 @@ static inline uint xlog_get_client_id(__
 #define	XLOG_RECOVERY_NEEDED	0x4	/* log was recovered */
 #define XLOG_IO_ERROR		0x8	/* log hit an I/O error, and being
 					   shutdown */
+typedef __uint32_t xlog_tid_t;
+
 
 #ifdef __KERNEL__
 /*
Index: xfs/fs/xfs/xfs_types.h
===================================================================
--- xfs.orig/fs/xfs/xfs_types.h	2011-01-18 13:06:47.645254026 +0100
+++ xfs/fs/xfs/xfs_types.h	2011-01-18 13:07:05.958005458 +0100
@@ -73,8 +73,6 @@ typedef	__int32_t	xfs_tid_t;	/* transact
 typedef	__uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
 typedef	__uint32_t	xfs_dahash_t;	/* dir/attr hash value */
 
-typedef __uint32_t	xlog_tid_t;	/* transaction ID type */
-
 /*
  * These types are 64 bits on disk but are either 32 or 64 bits in memory.
  * Disk based types:


* Re: [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention
  2011-01-21  9:22 ` [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention Christoph Hellwig
@ 2011-01-25  4:23   ` Dave Chinner
  2011-01-27 23:21   ` Alex Elder
  1 sibling, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2011-01-25  4:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jan 21, 2011 at 04:22:28AM -0500, Christoph Hellwig wrote:
> Pass a xfs_alloc_arg structure to xfs_alloc_compute_aligned and derive
> the alignment and minlen paramters from it.  This cleans up the existing
> callers, and we'll need even more information from the xfs_alloc_arg
> in subsequent patches.  Based on a patch from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Good cleanup.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/6] xfs: do not immediately reuse busy extent ranges
  2011-01-21  9:22 ` [PATCH 3/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
@ 2011-01-27 23:20   ` Alex Elder
  2011-01-28  1:58   ` Dave Chinner
  1 sibling, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-01-27 23:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> Every time we reallocate a busy extent, we cause a synchronous log force
> to occur to ensure the freeing transaction is on disk before we continue
> and use the newly allocated extent.  This is extremely sub-optimal as we
> have to mark every transaction with blocks that get reused as synchronous.
> 
> Instead of searching the busy extent list after deciding on the extent to
> allocate, check each candidate extent during the allocation decisions as
> to whether they are in the busy list.  If they are in the busy list, we
> trim the busy range out of the extent we have found and determine if that
> trimmed range is still OK for allocation. In many cases, this check can
> be incorporated into the allocation extent alignment code which already
> does trimming of the found extent before determining if it is a valid
> candidate for allocation.
> 
> [hch: merged two earlier patches from Dave and fixed various bugs]
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

You know, I must really not be looking at this right, because
the way I am interpreting your xfs_alloc_busy_search_trim(),
it's just plain wrong.  Perhaps it arrives at an OK result
anyway, but please take a look to see if I'm just confused.

I have a few other comments, not as important.

Generally the rest of it looks good.

I'll pick up with the rest of the series tomorrow.

					-Alex


> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-17 22:05:27.146004341 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-01-18 13:04:30.239023407 +0100

. . .

> @@ -2654,6 +2730,71 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  
> +/*
> + * For a given extent [fbno, flen], search the busy extent list
> + * to find a subset of the extent that is not busy.
> + */
> +void
> +xfs_alloc_busy_search_trim(
> +	struct xfs_mount	*mp,
> +	struct xfs_perag	*pag,
> +	xfs_agblock_t		fbno,
> +	xfs_extlen_t		flen,
> +	xfs_agblock_t		*rbno,
> +	xfs_extlen_t		*rlen)
> +{
> +	struct rb_node		*rbp;
> +	xfs_agblock_t           bno = fbno;
> +	xfs_extlen_t            len = flen;
> +

I don't know if it's important, but you could ASSERT(flen > 0) here.

> +	spin_lock(&pag->pagb_lock);
> +	rbp = pag->pagb_tree.rb_node;
> +	while (rbp) {

	while (rbp && len) {

> +		struct xfs_busy_extent *busyp =
> +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> +		xfs_agblock_t	end = bno + len;
> +		xfs_agblock_t	bend = busyp->bno + busyp->length;
> +
> +		if (bno + len <= busyp->bno) {
> +			rbp = rbp->rb_left;
> +			continue;
> +		} else if (bno >= busyp->bno + busyp->length) {
> +			rbp = rbp->rb_right;
> +			continue;
> +		}
> +
> +		if (busyp->bno < bno) {
> +			/* start overlap */
> +			ASSERT(bend >= bno);
			ASSERT(bend > bno);

> +			ASSERT(bend <= end);
> +			len -= bno - bend;
       NO:		len -= bend - bno;

> +			bno = bend;
> +		} else if (bend > end) {
> +			/* end overlap */
> +			ASSERT(busyp->bno >= bno);
> +			ASSERT(busyp->bno < end);
> +			len -= bend - end;
       NO:		len -= end - busyp->bno;

> +		} else {
> +			/* middle overlap - return larger segment */
> +			ASSERT(busyp->bno >= bno);
> +			ASSERT(bend <= end);
> +			len = busyp->bno - bno;
> +			if (len >= end - bend) {
> +				/* use first segment */
> +				len = len;
> +			} else {
> +				/* use last segment */
> +				bno = bend;
> +				len = end - bend;
> +			}

 			/* Use the first segment... */
			len = busyp->bno - bno;
			if (len < end - bend) {
				/* unless the second is larger */
				bno = bend;
				len = end - bend;
			}


> +		}
> +	}
> +	spin_unlock(&pag->pagb_lock);
> +
> +	*rbno = bno;
> +	*rlen = len;
> +}
> +
>  void
>  xfs_alloc_busy_clear(
>  	struct xfs_mount	*mp,

. . .
 
> Index: xfs/fs/xfs/linux-2.6/xfs_discard.c
> ===================================================================
> --- xfs.orig/fs/xfs/linux-2.6/xfs_discard.c	2011-01-17 22:06:13.004005040 +0100
> +++ xfs/fs/xfs/linux-2.6/xfs_discard.c	2011-01-17 22:14:09.133005668 +0100
> @@ -77,8 +77,8 @@ xfs_trim_extents(
>  	 * enough to be worth discarding.
>  	 */
>  	while (i) {
> -		xfs_agblock_t fbno;
> -		xfs_extlen_t flen;
> +		xfs_agblock_t	fbno, tbno;
> +		xfs_extlen_t	flen, tlen;

Does "f" represent "found" and "t" represent "trimmed" here?
(Just curious--it's fine.)

>  
>  		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
>  		if (error)
> @@ -90,7 +90,7 @@ xfs_trim_extents(
>  		 * Too small?  Give up.
>  		 */
>  		if (flen < minlen) {
> -			trace_xfs_discard_toosmall(mp, agno, fbno, flen);
> +			trace_xfs_discard_toosmall(mp, agno, tbno, flen);
"tbno" appears to be possibly used before set here.  At this point
don't you actually want the found block number anyway?

>  			goto out_del_cursor;
>  		}
>  

. . .
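
Pulling those corrections together, the overlap handling in
xfs_alloc_busy_search_trim() would read roughly like this - a sketch
only, not the patch as posted, and it still ignores the case where
the extent trims down to nothing (which comes up later in the
thread):

	if (busyp->bno < bno) {
		/* start overlap: drop [bno, bend) */
		ASSERT(bend > bno);
		ASSERT(bend <= end);
		len -= bend - bno;
		bno = bend;
	} else if (bend > end) {
		/* end overlap: drop [busyp->bno, end) */
		ASSERT(busyp->bno >= bno);
		ASSERT(busyp->bno < end);
		len -= end - busyp->bno;
	} else {
		/* middle overlap: keep the larger remaining piece */
		len = busyp->bno - bno;
		if (len < end - bend) {
			bno = bend;
			len = end - bend;
		}
	}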


* Re: [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention
  2011-01-21  9:22 ` [PATCH 1/6] xfs: clean up the xfs_alloc_compute_aligned calling convention Christoph Hellwig
  2011-01-25  4:23   ` Dave Chinner
@ 2011-01-27 23:21   ` Alex Elder
  1 sibling, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-01-27 23:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> plain text document attachment (xfs-cleanup-xfs_alloc_compute_aligned)
> Pass a xfs_alloc_arg structure to xfs_alloc_compute_aligned and derive
> the alignment and minlen paramters from it.  This cleans up the existing
> callers, and we'll need even more information from the xfs_alloc_arg
> in subsequent patches.  Based on a patch from Dave Chinner.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>




_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] xfs: do not immediately reuse busy extent ranges
  2011-01-21  9:22 ` [PATCH 3/6] xfs: do not immediately reuse busy extent ranges Christoph Hellwig
  2011-01-27 23:20   ` Alex Elder
@ 2011-01-28  1:58   ` Dave Chinner
  2011-01-28 16:19     ` Alex Elder
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2011-01-28  1:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jan 21, 2011 at 04:22:30AM -0500, Christoph Hellwig wrote:
> Every time we reallocate a busy extent, we cause a synchronous log force
> to occur to ensure the freeing transaction is on disk before we continue
> and use the newly allocated extent.  This is extremely sub-optimal as we
> have to mark every transaction with blocks that get reused as synchronous.
> 
> Instead of searching the busy extent list after deciding on the extent to
> allocate, check each candidate extent during the allocation decisions as
> to whether they are in the busy list.  If they are in the busy list, we
> trim the busy range out of the extent we have found and determine if that
> trimmed range is still OK for allocation. In many cases, this check can
> be incorporated into the allocation extent alignment code which already
> does trimming of the found extent before determining if it is a valid
> candidate for allocation.
> 
> [hch: merged two earlier patches from Dave and fixed various bugs]
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
....
>  	ASSERT(args->agbno % args->alignment == 0);
>  
>  	if (!args->wasfromfl) {
> -		xfs_alloc_update_counters(args->tp, args->pag, args->agbp,
> -					  -((long)(args->len)));
> +		error = xfs_alloc_update_counters(args->tp, args->pag,
> +						  args->agbp,
> +						  -((long)(args->len)));
>  
> -		/*
> -		 * Search the busylist for these blocks and mark the
> -		 * transaction as synchronous if blocks are found. This
> -		 * avoids the need to block due to a synchronous log
> -		 * force to ensure correct ordering as the synchronous
> -		 * transaction will guarantee that for us.
> -		 */
> -		if (xfs_alloc_busy_search(args->mp, args->agno,
> -					args->agbno, args->len))
> -			xfs_trans_set_sync(args->tp);
> +		ASSERT(!xfs_alloc_busy_search(args->mp, args->agno,
> +					      args->agbno, args->len));
>  	}
>  
>  	if (!args->isfl) {
> @@ -559,7 +554,7 @@ xfs_alloc_ag_vextent(
>  
>  	XFS_STATS_INC(xs_allocx);
>  	XFS_STATS_ADD(xs_allocb, args->len);
> -	return 0;
> +	return error;
>  }

These error handling changes could go in the previous patch.

> @@ -2612,7 +2688,7 @@ restart:
>   * will require a synchronous transaction, but it can still be
>   * used to distinguish between a partial or exact match.
>   */
> -int
> +STATIC int
>  xfs_alloc_busy_search(
>  	struct xfs_mount	*mp,
>  	xfs_agnumber_t		agno,
> @@ -2654,6 +2730,71 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  
> +/*
> + * For a given extent [fbno, flen], search the busy extent list
> + * to find a subset of the extent that is not busy.
> + */

How does this function return an indication that the extent selected
is entirely unsuitable? e.g. the extent lies wholly within
the busy range and gets trimmed to zero length? Perhaps it would be
good to return an error in this case?

> +void
> +xfs_alloc_busy_search_trim(
> +	struct xfs_mount	*mp,
> +	struct xfs_perag	*pag,
> +	xfs_agblock_t		fbno,
> +	xfs_extlen_t		flen,
> +	xfs_agblock_t		*rbno,
> +	xfs_extlen_t		*rlen)
> +{
> +	struct rb_node		*rbp;
> +	xfs_agblock_t           bno = fbno;
> +	xfs_extlen_t            len = flen;
> +
> +	spin_lock(&pag->pagb_lock);
> +	rbp = pag->pagb_tree.rb_node;
> +	while (rbp) {
> +		struct xfs_busy_extent *busyp =
> +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> +		xfs_agblock_t	end = bno + len;
> +		xfs_agblock_t	bend = busyp->bno + busyp->length;
> +
> +		if (bno + len <= busyp->bno) {
> +			rbp = rbp->rb_left;
> +			continue;
> +		} else if (bno >= busyp->bno + busyp->length) {
> +			rbp = rbp->rb_right;
> +			continue;
> +		}

		if (end <= bbno)
			left;
		else if (bno > bend)
			right;

		/* overlap */

> +
> +		if (busyp->bno < bno) {
> +			/* start overlap */
> +			ASSERT(bend >= bno);
> +			ASSERT(bend <= end);
> +			len -= bno - bend;
> +			bno = bend;

		if (bbno < bno) {

		bbno           bend
		+-----------------+
Case 1:
		   +---------+
		   bno     end

		   No unbusy region in extent, return failure

Case 2:
		   +------------------------+
		   bno                    end

		   Needs to be trimmed to:
		                    +-------+
		                    bno   end
		   bno = bend;
		   len = end - bno;

> +		} else if (bend > end) {
> +			/* end overlap */
> +			ASSERT(busyp->bno >= bno);
> +			ASSERT(busyp->bno < end);
> +			len -= bend - end;

		} else if (bend > end) {

bno must be <= bbno in this case, so:

		bbno           bend
		+-----------------+

Case 3: bno == bbno:
		+-------------+
		bno         end

		No unbusy region in extent, return failure.

Case 4:
        +------------------+
        bno              end

   Needs to be trimmed to:
        +-------+
        bno   end

		end = bbno;
		len = end - bno;



> +		} else {
> +			/* middle overlap - return larger segment */
> +			ASSERT(busyp->bno >= bno);
> +			ASSERT(bend <= end);
> +			len = busyp->bno - bno;
> +			if (len >= end - bend) {
> +				/* use first segment */
> +				len = len;
> +			} else {
> +				/* use last segment */
> +				bno = bend;
> +				len = end - bend;
> +			}

(bno <= bbno && bend <= end) in this case, so:

		bbno           bend
		+-----------------+

Case 5: Exact match: (bno == bbno && bend == end)
		+-----------------+
		bno             end

		No unbusy region in extent, return failure.

Case 6: (bno == bbno && bend < end)

		+-------------------------+
		bno                     end

		Needs to be trimmed to:
		                  +-------+
		                  bno   end

		Same as Case 2.
			- case 2 can be extended to bbno <= bno.

Case 7: (bno < bbno && bend == end)

        +-------------------------+
        bno                     end

   Needs to be trimmed to:
        +-------+
        bno   end

		Same as case 4.
			- Case 4 can be extended to bend >= end

Case 8: (bno < bbno && bend < end)

        +---------------------------------+
        bno                             end

can be trimmed to:
        +-------+        OR       +-------+
        bno   end                 bno   end

	- Should always want to choose the lower bno extent:
		- next allocation in file will use "end" as
		  the target for first block
		- if busy segment has cleared, will get contiguous
		  allocation
		- if busy segment has not cleared, get allocation at
		  bend, which is forwards allocation.

		- if we choose segment at bend, and this remains the
		  best extent for the next allocation (e.g.
		  NEAR_BNO allocation) we'll next allocate at bno,
		  which will give us backwards allocation.

		- always choose the option that produces forwards
		  allocation patterns so that sequential reads and
		  writes only ever seek in one direction.

		- we already know that backwards allocation
		  direction (due to NEAR_BNO second algorithm)
		  causes significant fragmentation of directories
		  and degradation of directory performance, so I
		  think we should avoid introducing a new allocation
		  algorithm that can lead to backwards allocation
		  around frequently reused extents.

	- only choose the right-hand edge if the remaining unused extent
	  length is much larger than the current allocation request.

	- otherwise return failure if we can't use lower bno due to
	  minlen restrictions so a new extent is chosen by the
	  higher level algorithm.


So, it looks to me like the "overlap found" algorithm should be
something like:

		if (bbno <= bno) {
			if (end <= bend) {
				/* case 1, 3, 5 */
				return failure;
			}
			/* case 2, 6 */
			bno = bend;
			len = end - bno;
		} else if (bend >= end) {
			ASSERT(bbno > bno);
			/* case 4, 7 */
			end = bbno;
			len = end - bno;
		} else {
			ASSERT(bbno > bno);
			ASSERT(bend < end);
			/* case 8 */
			if (bbno - bno >= args->minlen) {
				/* left candidate OK */
				left = 1;
			}
			if (end - bend >= args->maxlen * 4) {
				/* right candidate OK */
				right = 1;
			}
			if (left && right) {
				/* take right if left is not a
				 * maximal allocation */
				if (bbno - bno < args->maxlen)
					left = 0;
			}
			if (left) {
				end = bbno;
				len = end - bno;
			} else if (right) {
				bno = bend;
				len = end - bno;
			} else {
				return failure;
			}
		}

> @@ -109,19 +109,16 @@ xfs_trim_extents(
>  		 * If any blocks in the range are still busy, skip the
>  		 * discard and try again the next time.
>  		 */
> -		if (xfs_alloc_busy_search(mp, agno, fbno, flen)) {
> -			trace_xfs_discard_busy(mp, agno, fbno, flen);
> -			goto next_extent;
> -		}
> +		xfs_alloc_busy_search_trim(mp, pag, fbno, flen, &tbno, &tlen);
>  
> -		trace_xfs_discard_extent(mp, agno, fbno, flen);
> +		trace_xfs_discard_extent(mp, agno, tbno, tlen);
>  		error = -blkdev_issue_discard(bdev,
> -				XFS_AGB_TO_DADDR(mp, agno, fbno),
> -				XFS_FSB_TO_BB(mp, flen),
> +				XFS_AGB_TO_DADDR(mp, agno, tbno),
> +				XFS_FSB_TO_BB(mp, tlen),
>  				GFP_NOFS, 0);
>  		if (error)
>  			goto out_del_cursor;
> -		*blocks_trimmed += flen;
> +		*blocks_trimmed += tlen;

Hmmm - that means if we get a case 8 overlap, we'll only trim one
side of the extent. That's probably not a big deal. However, perhaps
this should check the size of the trimmed extent before issuing the
discard against it in case we've reduced it to something smaller
than the minimum requested trim size....
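
A minimal sketch of that check in xfs_trim_extents(), assuming the
pre-patch next_extent label and busy trace point are kept:

	xfs_alloc_busy_search_trim(mp, pag, fbno, flen, &tbno, &tlen);
	if (tlen < minlen) {
		/* trimmed down below the caller's minimum, skip it */
		trace_xfs_discard_busy(mp, agno, fbno, flen);
		goto next_extent;
	}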

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist
  2011-01-21  9:22 ` [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist Christoph Hellwig
@ 2011-01-28  5:36   ` Dave Chinner
  2011-01-28  5:51     ` Dave Chinner
  2011-01-28 22:17   ` Alex Elder
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2011-01-28  5:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jan 21, 2011 at 04:22:31AM -0500, Christoph Hellwig wrote:
> If we shorten the freelist in xfs_alloc_fix_freelist there is no need
> to wait for busy blocks as they will be marked busy again when we call
> xfs_free_ag_extent.  Avoid this by not marking blocks coming from the
> freelist as busy in xfs_free_ag_extent, and not marking transactions
> with busy extents as synchronous in xfs_alloc_get_freelist.  Unlike
> xfs_free_ag_extent which already has the isfl argument,
> xfs_alloc_get_freelist needs to be told about the usage of the blocks
> it returns.  For this we extend the btreeblk flag to a type argument
> which specifies in detail what the block is going to be used for.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

(This is mostly notes I wrote as I analysed the change)

Ok, so prior to this change we added extents to the busy list:

	- unconditionally in xfs_free_ag_extent(). i.e. freeing an
	  extent back into the freespace btree
	- unconditionally in xfs_allocbt_free_block() when freeing a
	  btree block back to the freelist

And we search for busy extents in:

	- unconditionally in xfs_alloc_get_freelist()
	- unconditionally in xfs_alloc_ag_vextent()
	- unconditionally in xfs_trim_extents()

So, for blocks on the freelist, they may or may not be in the busy
list depending on whether they were placed there via
xfs_alloc_fix_freelist() or xfs_allocbt_free_block().

In the case they were put there via xfs_alloc_fix_freelist(), the
extent will not be in the busy list so this change should be a
no-op.

In the case they were put there via xfs_allocbt_free_block(), they
will already be in the busy list when we call
xfs_alloc_get_freelist() and hence the transaction will always be
marked synchronous. The change here is to avoid marking transactions
which "allocate" blocks via xfs_alloc_get_freelist() synchronous if
possible.

So, if we get a block via xfs_alloc_ag_vextent_small(), it is tagged
XFS_FREELIST_ALLOC. If we get a block via xfs_alloc_fix_freelist, it
is tagged XFS_FREELIST_BALANCE, and if we are allocating a block for
the alloc btree, it is tagged XFS_FREELIST_BTREE. If the tag is
XFS_FREELIST_BALANCE, we do not do a busy list search as we are not
actually allocating the block for use, otherwise we do the search.

Then, in xfs_free_ag_extent() itself, when the block is coming from
the freelist, we do not add it to the busy extent list as it is
already in the busy list if it is necessary for it to be there. i.e.
if a block goes freespace -> freelist -> freespace, then there is no
need for it to be marked busy.
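
In code, the xfs_free_ag_extent() side of this boils down to
something like the following sketch (four-argument busy insert
signature as of this patch):

	/*
	 * Blocks freed from the freelist were already marked busy
	 * when they went onto the freelist, if that was needed at
	 * all, so don't track them a second time.
	 */
	if (!isfl)
		xfs_alloc_busy_insert(tp, agno, bno, len);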

----

Ok, I've convinced myself that the changes are sane, though I think
it could do with a bit more explanation in the changelog and a
description of what the XFS_FREELIST_* tags do where they are
declared.

Thinking through this a bit more - we don't need to do a busy search
for metadata allocations - it's only necessary for metadata -> data
extent transitions. Metadata modifications are all logged, so there
is no need to force out the busy extent transaction if the next
allocation is for metadata. If the new transaction hits the disk,
then it will only be replayed if the previous transaction to free
the extent is on the disk and has already been replayed.

Hence I think we only need to do a busy search in these cases for
XFS_FREELIST_ALLOC when args->userdata != 0. The XFS_FREELIST_BTREE
case is definitely a metadata allocation, so it seems to me that
there is further scope for optimisation here. i.e. that allocbt ->
freelist -> allocbt doesn't require a sync transaction to be issued.

The code as it stands looks good; the question is should we take
this next step as well? Have I missed anything that makes this a bad
thing to do, Christoph?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist
  2011-01-28  5:36   ` Dave Chinner
@ 2011-01-28  5:51     ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2011-01-28  5:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jan 28, 2011 at 04:36:53PM +1100, Dave Chinner wrote:
> Thinking through this a bit more - we don't need to do a busy search
> for metadata allocations - it's only necessary for metadata -> data
> extent transitions. Metadata modifications are all logged, so there
> is no need to force out the busy extent transaction if the next
> allocation is for metadata. If the new transaction hits the disk,
> then it will only be replayed if the previous transaction to free
> the extent is on the disk and has already been replayed.
> 
> Hence I think we only need to do a busy search in these cases for
> XFS_FREELIST_ALLOC when args->userdata != 0. The XFS_FREELIST_BTREE
> case is definitely a metadata allocation, so it seems to me that
> there is further scope for optimisation here. i.e. that allocbt ->
> freelist -> allocbt doesn't require a sync transaction to be issued.
> 
> The code as it stands looks good; the question is should we take
> this next step as well? Have I missed anything that makes this a bad
> thing to do, Christoph?

Oh, I see the next patch tries to address this. That'll learn me to
read the entire patch series through before commenting on any of
it. I'll take this discussion to that patch.... ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy
  2011-01-21  9:22 ` [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy Christoph Hellwig
@ 2011-01-28  6:33   ` Dave Chinner
  2011-01-28 22:17   ` Alex Elder
  2011-02-01 23:02   ` Alex Elder
  2 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2011-01-28  6:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jan 21, 2011 at 04:22:32AM -0500, Christoph Hellwig wrote:
> During an allocation or freeing of an extent, we occasionally have an
> interesting situation happen during a btree update.  We may merge a block
> as part of a record delete, then allocate it again immediately afterwards
> when inserting a modified extent.  xfs_alloc_fixup_trees() does this sort
> of manipulation to the btrees, and so can merge then immediately split
> the tree resulting in the same busy block being used from the
> freelist.
> 
> Previously, this would trigger a synchronous log force of the entire log
> (as the current transaction had a lsn of 0 until committed), but continue
> to allow the extent to be used because if it is not on disk now then it
> must have been freed and marked busy in this transaction.
> 
> In the case of allocbt blocks, we log the fact that they get moved to the
> freelist and we also log them when they move off the free list back into
> the free space trees.  When we move the blocks off the freelist back to
> the trees, we mark the extent busy at that point so that it doesn't get
> reused until the transaction is committed.

I'm not sure this is correct - it's when they are put on the
freelist that they are marked busy (xfs_allocbt_free_block), and
with the previous patch they now don't get marked busy when freed
back to the trees.

> This means that as long as the block is on the AGFL and is only used
> for allocbt blocks, we don't have to force the log to ensure prior
> manipulating transactions are on disk as the process of log recovery
> will replay the transactions in the correct order.

I think that as long as the busy extent is reallocated as metadata
of any kind, it does not require us to force the log/mark the
transaction synchronous. That is, the metadata modifications will
not be written to the extent until the allocation transaction has
hit the disk regardless of whether we force the log here or not.
Hence log recovery will always do the right thing here.

It's the metadata -> data extent re-allocation that we need the
protection for, as we cannot serialise data extent writes against log
recovery requirements. That is, we have no idea if the data extent
contents have been written or not when we run log recovery, so we
cannot allow data to be written until we are certain that the
data extent allocation overrides all the recorded metadata changes
in the log.

For data -> metadata reallocation, this is not a problem as we know
the metadata writes will not be done (and therefore overwrite the
data on disk) until the transactions are on disk and the metadata
buffers unpinned. The same goes for metadata->metadata reallocation.

Hence I don't think we specifically need to track allocbt blocks in
the busy list - we track all freed blocks like we currently do, but
we only need to search the busy block list and mark transactions
synchronous for data extent allocations.
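
As a sketch, the rule described above would gate the busy handling
on the allocation type, roughly like this (not the patch as posted):

	/*
	 * Only a busy extent being handed out as user data needs
	 * ordering against the freeing transaction; metadata reuse
	 * is ordered by the log itself.
	 */
	if (args->userdata &&
	    xfs_alloc_busy_search(args->mp, args->agno,
				  args->agbno, args->len))
		xfs_trans_set_sync(args->tp);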

This removes the complexity of removing busy allocbt blocks from the
busy tree when they get reused - if they remain allocated they will
be removed from the tree as transactions/checkpoints are committed
to disk just like they currently are. That also avoids the problem
of a transaction commit trying to remove a busy extent from the
tree that isn't in there anymore, as well as potential races between
transactions adding and removing busy blocks and log IO completion
trying to remove them.

Does this make sense? I understand this code a lot better than I did
when I originally wrote these patches, and to tell the truth I can't
remember exactly why I took this approach. Looking over the patches
makes me think that I was trying to do the minimum possible semantic
change to solve the specific problem I was seeing, rather than
understanding the entire workings of the busy extent list and how to
optimise from there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/6] xfs: do not immediately reuse busy extent ranges
  2011-01-28  1:58   ` Dave Chinner
@ 2011-01-28 16:19     ` Alex Elder
  2011-01-29  0:25       ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Elder @ 2011-01-28 16:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Fri, 2011-01-28 at 12:58 +1100, Dave Chinner wrote:
> On Fri, Jan 21, 2011 at 04:22:30AM -0500, Christoph Hellwig wrote:
> > Every time we reallocate a busy extent, we cause a synchronous log force
> > to occur to ensure the freeing transaction is on disk before we continue
. . .
> > +
> > +	spin_lock(&pag->pagb_lock);
> > +	rbp = pag->pagb_tree.rb_node;
> > +	while (rbp) {

I will amend the loop termination condition I suggested
before to be this:

	while (rbp && len >= args->minlen) {

> > +		struct xfs_busy_extent *busyp =
> > +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> > +		xfs_agblock_t	end = bno + len;
> > +		xfs_agblock_t	bend = busyp->bno + busyp->length;
> > +
> > +		if (bno + len <= busyp->bno) {
> > +			rbp = rbp->rb_left;
> > +			continue;
> > +		} else if (bno >= busyp->bno + busyp->length) {
> > +			rbp = rbp->rb_right;
> > +			continue;
> > +		}
> 
> 		if (end <= bbno)
> 			left;
> 		else if (bno > bend)
> 			right;

I think the original code is right in this case.
The value of "bend" is the offset *following* the
end of the range.  So if "bno" equals that, we
want to move Right.  (Same reason <= is correct
for the first condition here.)

> 		/* overlap */
> 
> > +
> > +		if (busyp->bno < bno) {
> > +			/* start overlap */
> > +			ASSERT(bend >= bno);
> > +			ASSERT(bend <= end);
> > +			len -= bno - bend;
> > +			bno = bend;
> 
> 		if (bbno < bno) {
> 
> 		bbno           bend
> 		+-----------------+
> Case 1:
> 		   +---------+
> 		   bno     end
> 
> 		   No unbusy region in extent, return failure

Yes, that's right, I missed that.  My suggestion goes
negative in this case.

> Case 2:
> 		   +------------------------+
> 		   bno                    end
> 
> 		   Needs to be trimmed to:
> 		                    +-------+
> 		                    bno   end
> 		   bno = bend;
> 		   len = end - bno;

I like defining len in terms of the updated bno as
you have suggested here.

> > +		} else if (bend > end) {
> > +			/* end overlap */
> > +			ASSERT(busyp->bno >= bno);
> > +			ASSERT(busyp->bno < end);
> > +			len -= bend - end;
> 
. . .


> > So, it looks to me like the "overlap found" algorithm should be
> something like:

For this algorithm, updating the value of len can be done
once, at the bottom (or top) of the loop, based simply on
the (updated) value of end and bno:

	len = end - bno;

You could rearrange things a bit so this gets done at
the top--instead of computing the value of end based
on bno and len.


> 		if (bbno <= bno) {
> 			if (end <= bend) {
> 				/* case 1, 3, 5 */
> 				return failure;
> 			}
> 			/* case 2, 6 */
> 			bno = bend;
> 			len = end - bno;
> 		} else if (bend >= end) {
> 			ASSERT(bbno > bno);
> 			/* case 4, 7 */
> 			end = bbno;
> 			len = end - bno;
> 		} else {
> 			ASSERT(bbno > bno);
> 			ASSERT(bend < end);
> 			/* case 8 */
> 			if (bbno - bno >= args->minlen) {
> 				/* left candidate OK */
> 				left = 1;
> 			}
> 			if (end - bend >= args->maxlen * 4) {

The "4" here I understand, but it's arbitrary (based
on an educated guess) so it needs to at least be explained
here with a comment.  Making it symbolic might make it
something one could search for at some future date.

> 				/* right candidate OK */
> 				right = 1;
> 			}
> 			if (left && right) {
> 				/* take right if left is not a
> 				 * maximal allocation */
> 				if (bbno - bno < args->maxlen)
> 					left = 0;
> 			}
> 			if (left) {
> 				end = bbno;
> 				len = end - bno;
> 			} else if (right) {
> 				bno = bend;
> 				len = end - bno;
> 			} else {
> 				return failure;
> 			}
> 		}
> 
> > @@ -109,19 +109,16 @@ xfs_trim_extents(
> >  		 * If any blocks in the range are still busy, skip the
> >  		 * discard and try again the next time.
> >  		 */
> > -		if (xfs_alloc_busy_search(mp, agno, fbno, flen)) {
> > -			trace_xfs_discard_busy(mp, agno, fbno, flen);
> > -			goto next_extent;
> > -		}
> > +		xfs_alloc_busy_search_trim(mp, pag, fbno, flen, &tbno, &tlen);
> >  
> > -		trace_xfs_discard_extent(mp, agno, fbno, flen);
> > +		trace_xfs_discard_extent(mp, agno, tbno, tlen);
> >  		error = -blkdev_issue_discard(bdev,
> > -				XFS_AGB_TO_DADDR(mp, agno, fbno),
> > -				XFS_FSB_TO_BB(mp, flen),
> > +				XFS_AGB_TO_DADDR(mp, agno, tbno),
> > +				XFS_FSB_TO_BB(mp, tlen),
> >  				GFP_NOFS, 0);
> >  		if (error)
> >  			goto out_del_cursor;
> > -		*blocks_trimmed += flen;
> > +		*blocks_trimmed += tlen;
> 
> Hmmm - that means if we get a case 8 overlap, we'll only trim one
> side of the extent. That's probably not a big deal. However, perhaps
> this should check the size of the trimmed extent before issuing the
> discard against it in case we've reduced it to something smaller
> thanthe minimum requested trim size....

I think all of the places that (ultimately) call this function
need to be looked at to make sure they handle the "error" case
properly--either checking for a returned error or verifying the
returned length is at least the minimum.

					-Alex



* Re: [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist
  2011-01-21  9:22 ` [PATCH 4/6] xfs: optimize xfs_alloc_fix_freelist Christoph Hellwig
  2011-01-28  5:36   ` Dave Chinner
@ 2011-01-28 22:17   ` Alex Elder
  1 sibling, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-01-28 22:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> plain text document attachment
> (xfs-avoid-sync-transaction-for-freelist-refill)
> If we shorten the freelist in xfs_alloc_fix_freelist there is no need
> to wait for busy blocks as they will be marked busy again when we call
> xfs_free_ag_extent.  Avoid this by not marking blocks coming from the
> freelist as busy in xfs_free_ag_extent, and not marking transactions
> with busy extents as synchronous in xfs_alloc_get_freelist.  Unlike
> xfs_free_ag_extent which already has the isfl argument,
> xfs_alloc_get_freelist needs to be told about the usage of the blocks
> it returns.  For this we extend the btreeblk flag to a type argument
> which specifies in detail what the block is going to be used for.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Now that I've found my way through the paths that this
code touches (Dave's analysis helped a lot), I understand
what you're doing and I think it looks good.

I agree that the XFS_FREELIST_* symbols need a little
explanation.

And on a broader subject, it might be nice if the sort
of lifecycle of this kind of thing was documented
somewhere (maybe it is).  It would be really cool if
certain changes could point to a Wiki page with a
diagram of what's going on or something.  Nah...

Reviewed-by: Alex Elder <aelder@sgi.com>



* Re: [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy
  2011-01-21  9:22 ` [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy Christoph Hellwig
  2011-01-28  6:33   ` Dave Chinner
@ 2011-01-28 22:17   ` Alex Elder
  2011-02-01 23:02   ` Alex Elder
  2 siblings, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-01-28 22:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> During an allocation or freeing of an extent, we occasionally have an
> interesting situation happen during a btree update.  We may merge a block
> as part of a record delete, then allocate it again immediately afterwards
> when inserting a modified extent.  xfs_alloc_fixup_trees() does this sort
> of manipulation to the btrees, and so can merge then immediately split
> the tree resulting in the same busy block being used from the
> freelist.

Methinks someone must have observed this stuff by tracing...

> Previously, this would trigger a synchronous log force of the entire log
> (as the current transaction had a lsn of 0 until committed), but continue
> to allow the extent to be used because if it is not on disk now then it
> must have been freed and marked busy in this transaction.

This could explain a mysterious slowdown I was looking at
a while ago.

> In the case of allocbt blocks, we log the fact that they get moved to the
> freelist and we also log them when they move off the free list back into
> the free space trees.  When we move the blocks off the freelist back to
> the trees, we mark the extent busy at that point so that it doesn't get
> reused until the transaction is committed.
> 
> This means that as long as the block is on the AGFL and is only used
> for allocbt blocks, we don't have to force the log to ensure prior
> manipulating transactions are on disk as the process of log recovery
> will replay the transactions in the correct order.
> 
> However, we still need to protect against the fact that as we approach
> ENOSPC in an AG we can allocate data and metadata blocks direct from the
> AGFL. In this case, we need to treat btree blocks just like any other
> busy extent. Hence we still need to track btree blocks in the busy extent
> tree, but we need to distinguish them from normal extents so we can apply
> the necessary exceptions for btree block allocation.
> 
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Sorry, this ended up being a pretty pointless note...
I see what you're doing here but have not completed
my detailed review.  I also have read Dave's comments
and I think he's right that what the block is being
allocated *for* (metadata or data) determines whether
the log force is required.

Anyway, I need to go now unfortunately.  I'll try to
finish reviewing this and the last one by Monday.

					-Alex



* Re: [PATCH 3/6] xfs: do not immediately reuse busy extent ranges
  2011-01-28 16:19     ` Alex Elder
@ 2011-01-29  0:25       ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2011-01-29  0:25 UTC (permalink / raw)
  To: Alex Elder; +Cc: Christoph Hellwig, xfs

On Fri, Jan 28, 2011 at 10:19:51AM -0600, Alex Elder wrote:
> On Fri, 2011-01-28 at 12:58 +1100, Dave Chinner wrote:
> > On Fri, Jan 21, 2011 at 04:22:30AM -0500, Christoph Hellwig wrote:
> > > Every time we reallocate a busy extent, we cause a synchronous log force
> > > to occur to ensure the freeing transaction is on disk before we continue
> . . .
> > > +
> > > +	spin_lock(&pag->pagb_lock);
> > > +	rbp = pag->pagb_tree.rb_node;
> > > +	while (rbp) {
> 
> I will amend the loop termination condition I suggested
> before to be this:
> 
> 	while (rbp && len >= args->minlen) {

Makes sense.

> > > +		struct xfs_busy_extent *busyp =
> > > +			rb_entry(rbp, struct xfs_busy_extent, rb_node);
> > > +		xfs_agblock_t	end = bno + len;
> > > +		xfs_agblock_t	bend = busyp->bno + busyp->length;
> > > +
> > > +		if (bno + len <= busyp->bno) {
> > > +			rbp = rbp->rb_left;
> > > +			continue;
> > > +		} else if (bno >= busyp->bno + busyp->length) {
> > > +			rbp = rbp->rb_right;
> > > +			continue;
> > > +		}
> > 
> > 		if (end <= bbno)
> > 			left;
> > 		else if (bno > bend)
> > 			right;
> 
> I think the original code is right in this case.
> The value of "bend" is the offset *following* the
> end of the range.  So if "bno" equals that, we
> want to move Right.  (Same reason <= is correct
> for the first condition here.)

Oops, yes, you are right. Good catch - I failed to copy that code
into pseudo-code correctly.

> 
> > 		/* overlap */
> > 
> > > +
> > > +		if (busyp->bno < bno) {
> > > +			/* start overlap */
> > > +			ASSERT(bend >= bno);
> > > +			ASSERT(bend <= end);
> > > +			len -= bno - bend;
> > > +			bno = bend;
> > 
> > 		if (bbno < bno) {
> > 
> > 		bbno           bend
> > 		+-----------------+
> > Case 1:
> > 		   +---------+
> > 		   bno     end
> > 
> > 		   No unbusy region in extent, return failure
> 
> Yes, that's right, I missed that.  My suggestion goes
> negative in this case.
> 
> > Case 2:
> > 		   +------------------------+
> > 		   bno                    end
> > 
> > 		   Needs to be trimmed to:
> > 		                    +-------+
> > 		                    bno   end
> > 		   bno = bend;
> > 		   len = end - bno;
> 
> I like defining len in terms of the updated bno as
> you have suggested here.
> 
> > > +		} else if (bend > end) {
> > > +			/* end overlap */
> > > +			ASSERT(busyp->bno >= bno);
> > > +			ASSERT(busyp->bno < end);
> > > +			len -= bend - end;
> > 
> . . .
> 
> 
> > So, it looks to me like the "overlap found" algorithm should be
> > something like:
> 
> For this algorithm, updating the value of len can be done
> once, at the bottom (or top) of the loop, based simply on
> the (updated) value of end and bno:
> 
> 	len = end - bno;
> 
> You could rearrange things a bit so this gets done at
> the top--instead of computing the value of end based
> on bno and len.

Quite possibly - I just wanted to enumerate what I thought the code
should do rather than present an optimal, completed function ;)

> > 		if (bbno <= bno) {
> > 			if (end <= bend) {
> > 				/* case 1, 3, 5 */
> > 				return failure;
> > 			}
> > 			/* case 2, 6 */
> > 			bno = bend;
> > 			len = end - bno;
> > 		} else if (bend >= end) {
> > 			ASSERT(bbno > bno);
> > 			/* case 4, 7 */
> > 			end = bbno;
> > 			len = end - bno;
> > 		} else {
> > 			ASSERT(bbno > bno);
> > 			ASSERT(bend < end);
> > 			/* case 8 */
> > 			if (bbno - bno >= args->minlen) {
> > 				/* left candidate OK */
> > 				left = 1;
> > 			}
> > 			if (end - bend >= args->maxlen * 4) {
> 
> The "4" here I understand, but it's arbitrary (based
> on an educated guess) so it needs to at least be explained
> here with a comment.  Making it symbolic might make it
> something one could search for at some future date.

Yup. That value of "4" was simply a SWAG - I haven't really thought
about the best way to determine if the "remaining free space is much
larger than the allocation request" reliably, but I needed
something there to demonstrate what I was thinking....

....

> > > -		if (xfs_alloc_busy_search(mp, agno, fbno, flen)) {
> > > -			trace_xfs_discard_busy(mp, agno, fbno, flen);
> > > -			goto next_extent;
> > > -		}
> > > +		xfs_alloc_busy_search_trim(mp, pag, fbno, flen, &tbno, &tlen);
> > >  
> > > -		trace_xfs_discard_extent(mp, agno, fbno, flen);
> > > +		trace_xfs_discard_extent(mp, agno, tbno, tlen);
> > >  		error = -blkdev_issue_discard(bdev,
> > > -				XFS_AGB_TO_DADDR(mp, agno, fbno),
> > > -				XFS_FSB_TO_BB(mp, flen),
> > > +				XFS_AGB_TO_DADDR(mp, agno, tbno),
> > > +				XFS_FSB_TO_BB(mp, tlen),
> > >  				GFP_NOFS, 0);
> > >  		if (error)
> > >  			goto out_del_cursor;
> > > -		*blocks_trimmed += flen;
> > > +		*blocks_trimmed += tlen;
> > 
> > Hmmm - that means if we get a case 8 overlap, we'll only trim one
> > side of the extent. That's probably not a big deal. However, perhaps
> > this should check the size of the trimmed extent before issuing the
> > discard against it in case we've reduced it to something smaller
> > thanthe minimum requested trim size....
> 
> I think all of the places that (ultimately) call this function
> need to be looked at to make sure they handle the "error" case
> properly--either checking for a returned error or verifying the
> returned length is at least the minimum.

*nods vigorously*

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy
  2011-01-21  9:22 ` [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy Christoph Hellwig
  2011-01-28  6:33   ` Dave Chinner
  2011-01-28 22:17   ` Alex Elder
@ 2011-02-01 23:02   ` Alex Elder
  2 siblings, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-02-01 23:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> During an allocation or freeing of an extent, we occasionally have an
> interesting situation happen during a btree update.  We may merge a block
> as part of a record delete, then allocate it again immediately afterwards
> when inserting a modified extent.  xfs_alloc_fixup_trees() does this sort
> of manipulation to the btrees, and so can merge then immediately split
> the tree resulting in the same busy block being used from the
> freelist.
> 
> Previously, this would trigger a synchronous log force of the entire log
> (as the current transaction had a lsn of 0 until committed), but continue
> to allow the extent to be used because if it is not on disk now then it
> must have been freed and marked busy in this transaction.
> 
> In the case of allocbt blocks, we log the fact that they get moved to the
> freelist and we also log them when they move off the free list back into
> the free space trees.  When we move the blocks off the freelist back to
> the trees, we mark the extent busy at that point so that it doesn't get
> reused until the transaction is committed.
> 
> This means that as long as the block is on the AGFL and is only used
> for allocbt blocks, we don't have to force the log to ensure prior
> manipulating transactions are on disk as the process of log recovery
> will replay the transactions in the correct order.
> 
> However, we still need to protect against the fact that as we approach
> ENOSPC in an AG we can allocate data and metadata blocks direct from the
> AGFL. In this case, we need to treat btree blocks just like any other
> busy extent. Hence we still need to track btree blocks in the busy extent
> tree, but we need to distinguish them from normal extents so we can apply
> the necessary exceptions for btree block allocation.

OK, it's taken a while but I'm finally ready to comment on
these last two patches.  I've been thinking about how these
freed blocks ought to be handled, but that's really a separate
discussion so I'll stick to the patch at hand for now.

It has a few problems.  I also think it may be making an
already confusing situation even less obvious.
What I mean is, there are some rules we're depending on
here and those rules aren't documented
very well in the code.  (Like, "if a btree node block
is freed it goes into the freelist, and in that case it
needs to be marked busy; normally the log will need to
be forced if that block gets reallocated later, but not
if it is getting reallocated later for use in the
allocation btree.")  It's not the nicest thing to
have to try to explain, but that means it's even
more important to do so.

I think some of the commit explanations help, but I
think the code itself ought to be more clear about
what's going on.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/xfs_ag.h
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_ag.h	2011-01-17 22:03:15.000000000 +0100
> +++ xfs/fs/xfs/xfs_ag.h	2011-01-17 22:32:06.541004201 +0100
> @@ -188,6 +188,11 @@ struct xfs_busy_extent {
>  	xfs_agblock_t	bno;
>  	xfs_extlen_t	length;
>  	xlog_tid_t	tid;		/* transaction that created this */
> +	int		flags;
> +};
> +
> +enum {
> +	XFS_BUSY_FREELIST	= 0x0001, /* busy extent on freelist from abt */

I don't really like the name here.  It emphasizes the freelist
rather than the fact that the extent was freed from the
allocation btree, which I think is the more important bit.

I think it's named this because it represents who called
xfs_alloc_busy_insert() (xfs_alloc_put_freelist(), as
opposed to xfs_free_ag_extent() which would pass 0 for
its "flags" value).  There's probably not a good concise
name to represent that call site, but I think I'd prefer
to have one rather than just passing 0.

In any case, the real distinction is whether the extent
being marked busy is a block previously used for the
allocation btree or not.

>  };
>  
>  /*
> Index: xfs/fs/xfs/xfs_alloc.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_alloc.c	2011-01-17 22:30:54.000000000 +0100
> +++ xfs/fs/xfs/xfs_alloc.c	2011-01-17 22:39:29.021005529 +0100

Note that there are some comments in this file that begin
with:
 * AG Busy list management
They ought to be updated to give a one-line description of
the new functions.

. . .

> @@ -1996,22 +2000,23 @@ xfs_alloc_get_freelist(
>  	xfs_alloc_log_agf(tp, agbp, logflags);
>  	*bnop = bno;
>  
> -	/*
> -	 * As blocks are freed, they are added to the per-ag busy list and
> -	 * remain there until the freeing transaction is committed to disk.
> -	 * Now that we have allocated blocks, this list must be searched to see
> -	 * if a block is being reused.  If one is, then the freeing transaction
> -	 * must be pushed to disk before this transaction.
> -	 *
> -	 * We do this by setting the current transaction to a sync transaction
> -	 * which guarantees that the freeing transaction is on disk before this
> -	 * transaction. This is done instead of a synchronous log force here so
> -	 * that we don't sit and wait with the AGF locked in the transaction
> -	 * during the log force.
> -	 */
>  	if (type != XFS_FREELIST_BALANCE) {
> -		if (xfs_alloc_busy_search(mp, agno, bno, 1))
> -			xfs_trans_set_sync(tp);
> +		/*
> +		 * If this block is marked busy we may have to force out the
> +		 * log to prevent reuse until the freeing transaction has been
> +		 * committed.
> +		 *
> +		 * If we're just allocating the block to rebalance the
> +		 * freelist size we can skip this exercise as the block
> +		 * just keeps its busy marking while migrating to the
> +		 * allocation btree.
> +		 *
> +		 * If the block was freed from a btree and gets reallocated
> +		 * to it we may skip the log force, but details are handled
> +		 * by xfs_alloc_busy_force.
> +		 */
> +		xfs_alloc_busy_force(mp, agno, bno, 1,
> +				     type == XFS_FREELIST_BTREE);

With the "if (type...)" above and the conditional expression
being passed as the last argument here you're basically
enumerating all possible values of "type".  I think it might
look cleaner to just pass type as-is, and let xfs_alloc_busy_force()
hide this part of the logic also.

>  	}
>  
>  	return 0;

. . .

> @@ -2123,6 +2129,14 @@ xfs_alloc_put_freelist(
>  		(int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl),
>  		(int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl +
>  			sizeof(xfs_agblock_t) - 1));
> +
> +	/*
> +	 * If it's a btree block, mark it busy as it has just been freed.
> +	 * Otherwise we are just replenishing the free list with extents
> +	 * that are already free so we don't need to track them.
> +	 */
> +	if (btreeblk)
> +		xfs_alloc_busy_insert(tp, agno, bno, 1, XFS_BUSY_FREELIST);

On second thought...  You could just pass the value of btreeblk
as the last argument here and let xfs_alloc_busy_insert() decide
whether inserting it is needed or not.

>  	return 0;
>  }
>  
. . .
> @@ -2733,6 +2749,62 @@ xfs_alloc_busy_search(
>  	return match;
>  }
>  
> +STATIC void
> +xfs_alloc_busy_force(

It's a shame this and xfs_alloc_busy_search() (as well
as the *_trim() variant) can't share a common helper
function, since they rely on basically the same logic
for searching the tree.

> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_agblock_t		bno,
> +	xfs_extlen_t		len,
> +	int			btreeblk)
> +{
> +	struct xfs_perag	*pag;
> +	struct rb_node		*rbp;
> +	struct xfs_busy_extent	*busyp;
> +	int			match = 0;
> +
> +	pag = xfs_perag_get(mp, agno);
> +	spin_lock(&pag->pagb_lock);
> +
> +	rbp = pag->pagb_tree.rb_node;
> +
> +	/* find closest start bno overlap */
> +	while (rbp) {
> +		busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node);
> +		if (bno < busyp->bno) {
> +			/* may overlap, but exact start block is lower */
> +			if (bno + len > busyp->bno) {
> +				match = 1;
> +				break;
> +			}
> +			rbp = rbp->rb_left;
> +		} else if (bno > busyp->bno) {
> +			/* may overlap, but exact start block is higher */
> +			if (bno < busyp->bno + busyp->length) {
> +				match = 1;
> +				break;
> +			}
> +			rbp = rbp->rb_right;
> +		} else {
> +			/* bno matches busyp, length determines exact match */
> +			match = 1;

I believe this is not sufficient.  You need to check for
an exact match.  More below.

> +			break;
> +		}
> +	}
> +
> +	if (match && btreeblk && (busyp->flags & XFS_BUSY_FREELIST)) {
> +		list_del_init(&busyp->list);

If xfs_alloc_busy_insert() finds that an extent being marked
busy abuts or overlaps an existing busy extent entry, it
combines the two.  Because of this, I think it's possible that
a busy entry found to overlap the range provided here (i.e.,
[bno, bno+len)) may contain more than just the overlapping
region.  Specifically this case will be freeing a single block,
and what I'm saying is that this single block could have been
merged with another by a different call to xfs_alloc_busy_insert().

So simply deleting the entire entry is wrong here.  You have
to see if we're in the situation I describe, and handle it
accordingly.  That gets messy, unfortunately.

This also highlights another issue with the addition of
the "flags" value to the xfs_busy_extent structure.  It
is no longer valid to simply combine two busy entries.
You only can combine them if their flag values match.
Otherwise, when combining them you could either gain
or lose flag values associated with the involved extents.
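
As a sketch, the merge in xfs_alloc_busy_insert() would have to
become conditional on the flags matching, along these lines:

	/*
	 * Only coalesce busy extents whose flags agree; merging
	 * entries with different flags would gain or lose state.
	 */
	if (busyp->flags == new->flags) {
		xfs_agblock_t	merged_end;

		merged_end = max(busyp->bno + busyp->length,
				 new->bno + new->length);
		new->bno = min(busyp->bno, new->bno);
		new->length = merged_end - new->bno;
	}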

> +		rb_erase(&busyp->rb_node, &pag->pagb_tree);
> +		kmem_free(busyp);
> +		match = 0;
> +	}
> +	spin_unlock(&pag->pagb_lock);
> +	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match);
> +	xfs_perag_put(pag);
> +
> +	if (match)
> +		xfs_log_force(mp, XFS_LOG_SYNC);

I'm not sure how much difference it makes, but why can't
we just call xfs_log_force_lsn(mp, lsn, flags), where
"lsn" is the sequence number for the operation that
freed the extent that's being re-allocated.  (This
applies to the existing code as well, not just this
change.)  This suggestion also gets tricky due to
merging busy extents (similar to what I said about
"flags" above, but in this case it's the lsn).

> +}
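
A sketch of that lsn suggestion - the busyp->lsn field here is
hypothetical (the structure as posted only records a tid), and
merged entries would have to keep the highest such lsn:

	if (match)
		xfs_log_force_lsn(mp, busyp->lsn, XFS_LOG_SYNC);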
> +
>  /*
>   * For a given extent [fbno, flen], search the busy extent list
>   * to find a subset of the extent that is not busy.
. . .

					-Alex


* Re: [PATCH 6/6] xfs: remove handling of duplicates in the busy extent tree
  2011-01-21  9:22 ` [PATCH 6/6] xfs: remove handling of duplicates in the busy extent tree Christoph Hellwig
@ 2011-02-01 23:02   ` Alex Elder
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Elder @ 2011-02-01 23:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote:
> plain text document attachment (xfs-remove-duplicate-extent-handling)
> With the recent changes we never re-use busy extents.  Remove all handling
> of them and replace them with asserts.  This also effectively reverts
> commit 955833cf2ad0aa39b336e853cad212d867199984
> 
> 	"xfs: make the log ticket ID available outside the log infrastructure"

I have a hunch that this, for you, was an itch that
needed scratching.


Anyway, I was wondering about the following when
xfs_alloc_busy_search_trim() was added.  Maybe you
could just offer a quick explanation.  If we always
skip busy extents when allocating blocks, will we
still satisfy allocation requests when we're almost
out of space?  Will a log flush cause busy extents
to become non-busy when we find nothing available,
thereby always satifying requests we would have
satisfied (perhaps with busy extents) previously?

Second, is there *anything* that could be said about
the resulting allocation patterns?  Will they be
affected in any adverse (or beneficial) way by
skipping busy extents?  (Will we get more fragmented
earlier, for a contrived example?)  I have no
feeling for it but thought the question ought
to be asked...

Outside of the above, I think this patch looks fine.
The xfs_alloc_busy_search() call becomes a debug-only
function (and might as well be surrounded by DEBUG
ifdef's, accordingly), ensuring an allocated extent
is never found in the busy list.
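
For instance, a sketch of the debug-only check:

#ifdef DEBUG
	/* an extent we just allocated must never still be busy */
	ASSERT(!xfs_alloc_busy_search(args->mp, args->agno,
				      args->agbno, args->len));
#endif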

I'm not signing off on this for now, though, because
I want to hear back on the comments from the previous
patch, and I think you're going to be re-posting the
series anyway.

					-Alex

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: xfs/fs/xfs/xfs_ag.h
> ==========================================
