From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id p11Mxw45249441 for ; Tue, 1 Feb 2011 17:00:00 -0600 Subject: Re: [PATCH 5/6] xfs: do not classify freed allocation btree blocks as busy From: Alex Elder In-Reply-To: <20110121092551.402449845@bombadil.infradead.org> References: <20110121092227.115815324@bombadil.infradead.org> <20110121092551.402449845@bombadil.infradead.org> Date: Tue, 01 Feb 2011 17:02:22 -0600 Message-ID: <1296601342.2350.129.camel@doink> Mime-Version: 1.0 Reply-To: aelder@sgi.com List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: xfs-bounces@oss.sgi.com Errors-To: xfs-bounces@oss.sgi.com To: Christoph Hellwig Cc: xfs@oss.sgi.com On Fri, 2011-01-21 at 04:22 -0500, Christoph Hellwig wrote: > During an allocation or freeing of an extent, we occasionally have an > interesting situation happen during a btree update. We may merge a block > as part of a record delete, then allocate it again immediately afterwards > when inserting a modified extent. xfs_alloc_fixup_trees() does this sort > of manipulation to the btrees, and so can merge then immediately split > the tree resulting the in the same busy block being used from the > freelist. > > Previously, this would trigger a synchronous log force of the entire log > (as the current transaction had a lsn of 0 until committed), but continue > to allow the extent to be used because if it is not on disk now then it > must have been freed and marked busy in this transaction. > > In the case of allocbt blocks, we log the fact that they get moved to the > freelist and we also log them when the move off the free list back into > the free space trees. When we move the blocks off the freelist back to > the trees, we mark the extent busy at that point so that it doesn't get > reused until the transaction is committed. > > This means that as long as the block is on the AGFL and is only used > for allocbt blocks, we don't have to force the log to ensure prior > manipulating transactions are on disk as the process of log recovery > will replay the transactions in the correct order. > > However, we still need to protect against the fact that as we approach > ENOSPC in an AG we can allocate data and metadata blocks direct from the > AGFL. In this case, we need to treat btree blocks just like any other > busy extent. Hence we still need to track btree blocks in the busy extent > tree, but we need to distinguish them from normal extents so we can apply > the necessary exceptions for btree block allocation. OK, it's taken a while but I'm finally ready to comment on these last two patches. I've been thinking about how these freed blocks ought to be handled, but that's really a separate discussion so I'll stick to the patch at hand for now. It has a few problems. I also think it may be making an already a bit confusing situation possibly less obvious. What I mean is, there are some rules we're depending on here and those rules aren't very explicitly documented very well in the code. (Like, "if a btree node block is freed it goes into the freelist, and in that case it needs to be marked busy; normally the log will need to be forced if that block gets reallocated later, but not if it is getting reallocated later for use in the allocation btree.") It's not the nicest thing to have to try to explain, but that means it's even more important to do so. I think some of the commit explanations help, but I think the code itself ought to be more clear about what's going on. > Signed-off-by: Dave Chinner > Signed-off-by: Christoph Hellwig > > Index: xfs/fs/xfs/xfs_ag.h > =================================================================== > --- xfs.orig/fs/xfs/xfs_ag.h 2011-01-17 22:03:15.000000000 +0100 > +++ xfs/fs/xfs/xfs_ag.h 2011-01-17 22:32:06.541004201 +0100 > @@ -188,6 +188,11 @@ struct xfs_busy_extent { > xfs_agblock_t bno; > xfs_extlen_t length; > xlog_tid_t tid; /* transaction that created this */ > + int flags; > +}; > + > +enum { > + XFS_BUSY_FREELIST = 0x0001, /* busy extent on freelist from abt */ I don't really like the name here. It emphasizes the freelist rather than the fact that the extent was freed from the allocation btree, which is I think the more important bit. I think it's named this because it represents who called xfs_alloc_busy_insert() (xfs_alloc_put_freelist(), as opposed to xfs_free_ag_extent() which would pass 0 for its "flags" value). There's probably not a good concise name to represent that call site, but I think I'd prefer to have one rather than just passing 0. In any case, the real distinction is whether the extent being marked busy is a block previously used for the allocation btree or not. > }; > > /* > Index: xfs/fs/xfs/xfs_alloc.c > =================================================================== > --- xfs.orig/fs/xfs/xfs_alloc.c 2011-01-17 22:30:54.000000000 +0100 > +++ xfs/fs/xfs/xfs_alloc.c 2011-01-17 22:39:29.021005529 +0100 Note that there are some comments in this file that begin with: * AG Busy list management They ought to be updated to give a one-line description of the new functions. . . . > @@ -1996,22 +2000,23 @@ xfs_alloc_get_freelist( > xfs_alloc_log_agf(tp, agbp, logflags); > *bnop = bno; > > - /* > - * As blocks are freed, they are added to the per-ag busy list and > - * remain there until the freeing transaction is committed to disk. > - * Now that we have allocated blocks, this list must be searched to see > - * if a block is being reused. If one is, then the freeing transaction > - * must be pushed to disk before this transaction. > - * > - * We do this by setting the current transaction to a sync transaction > - * which guarantees that the freeing transaction is on disk before this > - * transaction. This is done instead of a synchronous log force here so > - * that we don't sit and wait with the AGF locked in the transaction > - * during the log force. > - */ > if (type != XFS_FREELIST_BALANCE) { > - if (xfs_alloc_busy_search(mp, agno, bno, 1)) > - xfs_trans_set_sync(tp); > + /* > + * If this block is marked busy we may have to force out the > + * log to prevent reuse until the freeing transaction has been > + * commited. > + * > + * If we're just allocating the block to rebalance the the > + * freelist size we can skip this exercise as the block > + * just keeps its busy marking while migrating to the > + * allocation btree. > + * > + * If the block was freed from a btree and gets reallocated > + * to it we may skip the log force, but details are handled > + * by xfs_alloc_busy_force. > + */ > + xfs_alloc_busy_force(mp, agno, bno, 1, > + type == XFS_FREELIST_BTREE); With the "if (type...)" above and the conditional expression being passed as the last argument here you're basically enumerating all possible values of "type". I think it might look cleaner to just pass type as-is, and let xfs_alloc_busy_force() hide this part of the logic also. > } > > return 0; . . . > @@ -2123,6 +2129,14 @@ xfs_alloc_put_freelist( > (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl), > (int)((xfs_caddr_t)blockp - (xfs_caddr_t)agfl + > sizeof(xfs_agblock_t) - 1)); > + > + /* > + * If it's a btree block, mark it busy as it has just been freed. > + * Otherwise we are just replenishing the free list with extents > + * that are already free so we don't need to track them. > + */ > + if (btreeblk) > + xfs_alloc_busy_insert(tp, agno, bno, 1, XFS_BUSY_FREELIST); On second thought... You could just pass the value of btreeblk as the last argument here and let xfs_alloc_busy_insert() decide whether inserting it is needed or not. > return 0; > } > . . . > @@ -2733,6 +2749,62 @@ xfs_alloc_busy_search( > return match; > } > > +STATIC void > +xfs_alloc_busy_force( It's a shame this and xfs_alloc_busy_search() (as well as the *_trim() variant) can't rely on a common helper function, since they rely on basically the same logic for searching the tree. > + struct xfs_mount *mp, > + xfs_agnumber_t agno, > + xfs_agblock_t bno, > + xfs_extlen_t len, > + int btreeblk) > +{ > + struct xfs_perag *pag; > + struct rb_node *rbp; > + struct xfs_busy_extent *busyp; > + int match = 0; > + > + pag = xfs_perag_get(mp, agno); > + spin_lock(&pag->pagb_lock); > + > + rbp = pag->pagb_tree.rb_node; > + > + /* find closest start bno overlap */ > + while (rbp) { > + busyp = rb_entry(rbp, struct xfs_busy_extent, rb_node); > + if (bno < busyp->bno) { > + /* may overlap, but exact start block is lower */ > + if (bno + len > busyp->bno) { > + match = 1; > + break; > + } > + rbp = rbp->rb_left; > + } else if (bno > busyp->bno) { > + /* may overlap, but exact start block is higher */ > + if (bno < busyp->bno + busyp->length) { > + match = 1; > + break; > + } > + rbp = rbp->rb_right; > + } else { > + /* bno matches busyp, length determines exact match */ > + match = 1; I believe this is not sufficient. You need to check for an exact match. More below. > + break; > + } > + } > + > + if (match && btreeblk && (busyp->flags & XFS_BUSY_FREELIST)) { > + list_del_init(&busyp->list); If xfs_alloc_busy_insert() finds that an extent being marked busy abuts or overlaps an existing busy extent entry, it combines the two. Because of this, I think it's possible that a busy entry found to overlap the range provided here (i.e., [bno, bno+len)) may contain more than just the overlapping region. Specifically this case will be freeing a single block, and what I'm saying is that this single block could have been merged with another by a different call to xfs_alloc_busy_insert(). So simply deleting the entire entry is wrong here. You have to see if we're in the situation I describe, and handle it accordingly. That gets messy, unfortunately. This also highlights another issue with the addition of the "flags" value to the xfs_busy_extent structure. It is no longer valid to simply combine two busy entries. You only can combine them if their flag values match. Otherwise, when combining them you could either gain or lose flag values associated with the involved extents. > + rb_erase(&busyp->rb_node, &pag->pagb_tree); > + kmem_free(busyp); > + match = 0; > + } > + spin_unlock(&pag->pagb_lock); > + trace_xfs_alloc_busysearch(mp, agno, bno, len, !!match); > + xfs_perag_put(pag); > + > + if (match) > + xfs_log_force(mp, XFS_LOG_SYNC); I'm not sure how much difference it makes, but why can't we just call xfs_log_force_lsn(mp, lsn, flags), where "lsn" is the sequence number for the operation that freed the extent that's being re-allocated. (This applies to the existing code as well, not just this change.) This suggestion also gets tricky due to merging busy extents (similar to what I said about "flags" above, but in this case it's the lsn). > +} > + > /* > * For a given extent [fbno, flen], search the busy extent list > * to find a subset of the extent that is no . . . -Alex _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs