* [PATCH] XFS: Don't flush stale inodes
@ 2010-01-02  2:39 Dave Chinner
  2010-01-02 12:00 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Dave Chinner @ 2010-01-02  2:39 UTC (permalink / raw)
  To: xfs

Because inodes remain in cache much longer than inode buffers do
under memory pressure, we can get the situation where we have stale,
dirty inodes being reclaimed but the backing storage has been freed.
Hence we should never, ever flush XFS_ISTALE inodes to disk as
there is no guarantee that the backing buffer is in cache and
still marked stale when the flush occurs.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_inode.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c2618db..e5c9953 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2846,10 +2846,14 @@ xfs_iflush(
 	mp = ip->i_mount;
 
 	/*
-	 * If the inode isn't dirty, then just release the inode
-	 * flush lock and do nothing.
+	 * If the inode isn't dirty, then just release the inode flush lock and
+	 * do nothing. Treat stale inodes the same; we cannot rely on the
+	 * backing buffer remaining stale in cache for the remaining life of
+	 * the stale inode and so xfs_itobp() below may give us a buffer that
+	 * no longer contains inodes below. Doing this stale check here also
+	 * avoids forcing the log on pinned, stale inodes.
 	 */
-	if (xfs_inode_clean(ip)) {
+	if (xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE)) {
 		xfs_ifunlock(ip);
 		return 0;
 	}
-- 
1.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

* Re: [PATCH] XFS: Don't flush stale inodes
  2010-01-02  2:39 [PATCH] XFS: Don't flush stale inodes Dave Chinner
@ 2010-01-02 12:00 ` Christoph Hellwig
  2010-01-02 12:14   ` Dave Chinner
  2010-01-02 12:24   ` Dave Chinner
  2010-01-08 20:59 ` Alex Elder
  2010-01-08 23:14 ` [PATCH v2] xfs: " Alex Elder
  2 siblings, 2 replies; 15+ messages in thread
From: Christoph Hellwig @ 2010-01-02 12:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Sat, Jan 02, 2010 at 01:39:40PM +1100, Dave Chinner wrote:
> Because inodes remain in cache much longer than inode buffers do
> under memory pressure, we can get the situation where we have stale,
> dirty inodes being reclaimed but the backing storage has been freed.
> Hence we should never, ever flush XFS_ISTALE inodes to disk as
> there is no guarantee that the backing buffer is in cache and
> still marked stale when the flush occurs.

We should not flush stale inodes.  But how do we even end up calling
xfs_iflush with a stale inode?


* Re: [PATCH] XFS: Don't flush stale inodes
  2010-01-02 12:00 ` Christoph Hellwig
@ 2010-01-02 12:14   ` Dave Chinner
  2010-01-02 12:24   ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2010-01-02 12:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, Jan 02, 2010 at 07:00:53AM -0500, Christoph Hellwig wrote:
> On Sat, Jan 02, 2010 at 01:39:40PM +1100, Dave Chinner wrote:
> > Because inodes remain in cache much longer than inode buffers do
> > under memory pressure, we can get the situation where we have stale,
> > dirty inodes being reclaimed but the backing storage has been freed.
> > Hence we should never, ever flush XFS_ISTALE inodes to disk as
> > there is no guarantee that the backing buffer is in cache and
> > still marked stale when the flush occurs.
> 
> We should not flush stale inodes.  But how do we even end up calling
> xfs_iflush with a stale inode?

xfs_reclaim_inode() -> xfs_iflush() according to the stack traces
that found bad inode magic numbers in xfs_itobp().
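For illustration, the guard the patch adds to xfs_iflush() boils down to a simple predicate. The struct, flag value, and helper names below are simplified userspace stand-ins, not the real kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define XFS_ISTALE 0x01                 /* stand-in flag value */

struct inode_model {                    /* stand-in for struct xfs_inode */
	unsigned int	flags;          /* models ip->i_flags */
	bool		dirty;          /* models !xfs_inode_clean(ip) */
};

/*
 * Returns true when xfs_iflush() would now take the early-return path:
 * release the flush lock and write nothing. Clean inodes have nothing
 * to flush; stale inodes must never be flushed because the backing
 * buffer may no longer be in cache, or may no longer contain inodes.
 */
static bool iflush_skips(const struct inode_model *ip)
{
	return !ip->dirty || (ip->flags & XFS_ISTALE);
}
```

In the real code the flush lock is already held when this check runs and xfs_ifunlock() releases it; the model omits all locking.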

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] XFS: Don't flush stale inodes
  2010-01-02 12:00 ` Christoph Hellwig
  2010-01-02 12:14   ` Dave Chinner
@ 2010-01-02 12:24   ` Dave Chinner
  2010-01-02 13:17     ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-01-02 12:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, Jan 02, 2010 at 07:00:53AM -0500, Christoph Hellwig wrote:
> On Sat, Jan 02, 2010 at 01:39:40PM +1100, Dave Chinner wrote:
> > Because inodes remain in cache much longer than inode buffers do
> > under memory pressure, we can get the situation where we have stale,
> > dirty inodes being reclaimed but the backing storage has been freed.
> > Hence we should never, ever flush XFS_ISTALE inodes to disk as
> > there is no guarantee that the backing buffer is in cache and
> > still marked stale when the flush occurs.
> 
> We should not flush stale inodes.  But how do we even end up calling
> xfs_iflush with a stale inode?

Actually, here's most of the failure trace (unlimited scrollback buffers are
great):

[ 5703.683858] Device sdb2 - bad inode magic/vsn daddr 16129976 #0 (magic=0)
[ 5703.690689] ------------[ cut here ]------------
[ 5703.691665] kernel BUG at fs/xfs/support/debug.c:62!
[ 5703.691665] invalid opcode: 0000 [#1] SMP
[ 5703.691665] last sysfs file: /sys/devices/virtual/net/lo/operstate
[ 5703.691665] CPU 1
[ 5703.691665] Modules linked in:
[ 5703.691665] Pid: 4017, comm: xfssyncd Not tainted 2.6.32-dgc #73 IBM eServer 326m -[796955M]-
[ 5703.691665] RIP: 0010:[<ffffffff813925a1>]  [<ffffffff813925a1>] cmn_err+0x101/0x110
[ 5703.691665] RSP: 0018:ffff8800a8cfdaa0  EFLAGS: 00010246
[ 5703.691665] RAX: 0000000002deff6d RBX: ffffffff819102d0 RCX: 0000000000000006
[ 5703.691665] RDX: ffffffff81fb5130 RSI: ffff8800ae2d9ba0 RDI: ffff8800ae2d9440
[ 5703.691665] RBP: ffff8800a8cfdb90 R08: 0000000000000000 R09: 0000000000000001
[ 5703.691665] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[ 5703.691665] R13: 0000000000000282 R14: 0000000000000000 R15: ffff8800ae34ca88
[ 5703.691665] FS:  00007f64efe476f0(0000) GS:ffff880007600000(0000) knlGS:0000000000000000
[ 5703.691665] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 5703.691665] CR2: 00007f64efe4b000 CR3: 00000000ad04f000 CR4: 00000000000006e0
[ 5703.691665] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5703.691665] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 5703.691665] Process xfssyncd (pid: 4017, threadinfo ffff8800a8cfc000, task ffff8800ae2d9440)
[ 5703.691665] Stack:
[ 5703.691665]  0000003000000030 ffff8800a8cfdba0 ffff8800a8cfdac0 ffff8800ad585b24
[ 5703.691665] <0> 0000000000000002 ffffffff8135e77d ffff8800a8cfdbe0 0000000000f61fb8
[ 5703.691665] <0> 0000000000000000 0000000000000000 ffff8800a8cfdb10 ffffffff81388330
[ 5703.691665] Call Trace:
[ 5703.691665]  [<ffffffff8135e77d>] ? xfs_itobp+0x6d/0x100
[ 5703.691665]  [<ffffffff81388330>] ? _xfs_buf_read+0x90/0xa0
[ 5703.691665]  [<ffffffff81388a8c>] ? xfs_buf_read+0xdc/0x110
[ 5703.691665]  [<ffffffff8137a9cf>] ? xfs_trans_read_buf+0x43f/0x680
[ 5703.691665]  [<ffffffff8117b723>] ? disk_name+0x63/0xc0
[ 5703.691665]  [<ffffffff8135e62a>] xfs_imap_to_bp+0x15a/0x240
[ 5703.691665]  [<ffffffff8135e77d>] ? xfs_itobp+0x6d/0x100
[ 5703.691665]  [<ffffffff8135e77d>] xfs_itobp+0x6d/0x100
[ 5703.691665]  [<ffffffff81360397>] xfs_iflush+0x207/0x380
[ 5703.691665]  [<ffffffff81390d7f>] xfs_reclaim_inode+0x15f/0x1b0
[ 5703.691665]  [<ffffffff81390e38>] xfs_reclaim_inode_now+0x68/0x90
[ 5703.691665]  [<ffffffff81390dd0>] ? xfs_reclaim_inode_now+0x0/0x90
[ 5703.691665]  [<ffffffff81391744>] xfs_inode_ag_walk+0x64/0xc0
[ 5703.691665]  [<ffffffff81373552>] ? xfs_perag_get+0xe2/0x110
[ 5703.691665]  [<ffffffff81391817>] xfs_inode_ag_iterator+0x77/0xc0
[ 5703.691665]  [<ffffffff81390dd0>] ? xfs_reclaim_inode_now+0x0/0x90

I was hitting this regularly with workloads creating then removing
hundreds of thousands of small files, and the patch I sent stopped
them from occurring...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] XFS: Don't flush stale inodes
  2010-01-02 12:24   ` Dave Chinner
@ 2010-01-02 13:17     ` Christoph Hellwig
  2010-01-02 13:39       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2010-01-02 13:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

This looks like a bigger problem to me.  We only mark inodes as stale
from xfs_ifree_cluster, which via xfs_ifree and xfs_inactive gets
called from xfs_fs_clear_inode.  Given that the inode has now been
deleted we should not mark it as reclaimable in xfs_fs_destroy_inode
but go on to reap it given that there is nothing to reclaim.


* Re: [PATCH] XFS: Don't flush stale inodes
  2010-01-02 13:17     ` Christoph Hellwig
@ 2010-01-02 13:39       ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2010-01-02 13:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Sat, Jan 02, 2010 at 08:17:01AM -0500, Christoph Hellwig wrote:
> This looks like a bigger problem to me.  We only mark inodes as stale
> from xfs_ifree_cluster, which via xfs_ifree and xfs_inactive gets
> called from xfs_fs_clear_inode.  Given that the inode has now been
> deleted we should not mark it as reclaimable in xfs_fs_destroy_inode
> but go on to reap it given that there is nothing to reclaim.

We can't reap it immediately as the inodes are usually dirty and
pinned in memory at this point by transactions that have not yet
been written to disk and completed. We need the inodes in core for
log IO completion processing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: [PATCH] XFS: Don't flush stale inodes
  2010-01-02  2:39 [PATCH] XFS: Don't flush stale inodes Dave Chinner
  2010-01-02 12:00 ` Christoph Hellwig
@ 2010-01-08 20:59 ` Alex Elder
  2010-01-08 23:14 ` [PATCH v2] xfs: " Alex Elder
  2 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2010-01-08 20:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave Chinner wrote:
> Because inodes remain in cache much longer than inode buffers do
> under memory pressure, we can get the situation where we have stale,
> dirty inodes being reclaimed but the backing storage has been freed.
> Hence we should never, ever flush XFS_ISTALE inodes to disk as
> there is no guarantee that the backing buffer is in cache and
> still marked stale when the flush occurs.

Looks good.  Seems like this could be the source of
some pretty gnarly problems.

> Signed-off-by: Dave Chinner <david@fromorbit.com>

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/xfs_inode.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c2618db..e5c9953 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2846,10 +2846,14 @@ xfs_iflush(
>  	mp = ip->i_mount;
> 
>  	/*
> -	 * If the inode isn't dirty, then just release the inode
> -	 * flush lock and do nothing.
> +	 * If the inode isn't dirty, then just release the inode flush lock and
> +	 * do nothing. Treat stale inodes the same; we cannot rely on the
> +	 * backing buffer remaining stale in cache for the remaining life of
> +	 * the stale inode and so xfs_itobp() below may give us a buffer that
> +	 * no longer contains inodes below. Doing this stale check here also
> +	 * avoids forcing the log on pinned, stale inodes.
>  	 */
> -	if (xfs_inode_clean(ip)) {
> +	if (xfs_inode_clean(ip) || xfs_iflags_test(ip, XFS_ISTALE)) {
>  		xfs_ifunlock(ip);
>  		return 0;
>  	}


* [PATCH v2] xfs: Don't flush stale inodes
  2010-01-02  2:39 [PATCH] XFS: Don't flush stale inodes Dave Chinner
  2010-01-02 12:00 ` Christoph Hellwig
  2010-01-08 20:59 ` Alex Elder
@ 2010-01-08 23:14 ` Alex Elder
  2010-01-09  0:09   ` Dave Chinner
  2 siblings, 1 reply; 15+ messages in thread
From: Alex Elder @ 2010-01-08 23:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave Chinner wrote:
> Because inodes remain in cache much longer than inode buffers do
> under memory pressure, we can get the situation where we have stale,
> dirty inodes being reclaimed but the backing storage has been freed.
> Hence we should never, ever flush XFS_ISTALE inodes to disk as
> there is no guarantee that the backing buffer is in cache and
> still marked stale when the flush occurs.
>
> Signed-off-by: Dave Chinner <david@fromorbit.com>

Dave, I'm putting this patch in before your perag radix tree
patch series, so I had to port it.  I am submitting here on
your behalf, but would like you to review it if possible.
It builds fine, and I've run it through a quick test.
I will retain all of your original commentary, etc.  I did
*not* implement Christoph's recommendation that you change
the "for" loop in xfs_alloc_search_busy() to a more typical
form.

For reference, the original patch is here:
    http://patchwork.xfs.org/patch/382/

Signed-off-by: Alex Elder <aelder@sgi.com>


---
 fs/xfs/linux-2.6/xfs_trace.h |    9 +++++---
 fs/xfs/xfs_alloc.c           |   44 ++++++++++++++++++++-----------------------
 2 files changed, 27 insertions(+), 26 deletions(-)

Index: b/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -1079,14 +1079,15 @@ TRACE_EVENT(xfs_alloc_unbusy,
 
 TRACE_EVENT(xfs_alloc_busysearch,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_extlen_t len, int found),
-	TP_ARGS(mp, agno, agbno, len, found),
+		 xfs_extlen_t len, int found, xfs_lsn_t lsn),
+	TP_ARGS(mp, agno, agbno, len, found, lsn),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
 		__field(int, found)
+		__field(xfs_lsn_t, lsn)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
@@ -1094,12 +1095,14 @@ TRACE_EVENT(xfs_alloc_busysearch,
 		__entry->agbno = agbno;
 		__entry->len = len;
 		__entry->found = found;
+		__entry->lsn = lsn;
 	),
-	TP_printk("dev %d:%d agno %u agbno %u len %u %s",
+	TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->agbno,
 		  __entry->len,
+		  __entry->lsn,
 		  __print_symbolic(__entry->found, XFS_BUSY_STATES))
 );
 
Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2563,7 +2563,7 @@ xfs_alloc_search_busy(xfs_trans_t *tp,
 	xfs_mount_t		*mp;
 	xfs_perag_busy_t	*bsy;
 	xfs_agblock_t		uend, bend;
-	xfs_lsn_t		lsn;
+	xfs_lsn_t		lsn = 0;
 	int			cnt;
 
 	mp = tp->t_mountp;
@@ -2573,33 +2573,31 @@ xfs_alloc_search_busy(xfs_trans_t *tp,
 
 	uend = bno + len - 1;
 
-	/* search pagb_list for this slot, skipping open slots */
-	for (bsy = mp->m_perag[agno].pagb_list; cnt; bsy++) {
-
-		/*
-		 * (start1,length1) within (start2, length2)
-		 */
-		if (bsy->busy_tp != NULL) {
-			bend = bsy->busy_start + bsy->busy_length - 1;
-			if ((bno > bend) || (uend < bsy->busy_start)) {
-				cnt--;
-			} else {
-				break;
-			}
-		}
+	/*
+	 * search pagb_list for this slot, skipping open slots. We have to
+	 * search the entire array as there may be multiple overlaps and
+	 * we have to get the most recent LSN for the log force to push out
+	 * all the transactions that span the range.
+	 */
+	for (bsy = mp->m_perag[agno].pagb_list; cnt; bsy++, cnt--) {
+		if (!bsy->busy_tp)
+			continue;
+
+		bend = bsy->busy_start + bsy->busy_length - 1;
+		if ((bno > bend) || (uend < bsy->busy_start))
+			continue;
+
+		/* (start1,length1) within (start2, length2) */
+		if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0)
+			lsn = bsy->busy_tp->t_commit_lsn;
 	}
-
-	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!cnt);
+	spin_unlock(&mp->m_perag[agno].pagb_lock);
+	trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, !!cnt, lsn);
 
 	/*
 	 * If a block was found, force the log through the LSN of the
 	 * transaction that freed the block
 	 */
-	if (cnt) {
-		lsn = bsy->busy_tp->t_commit_lsn;
-		spin_unlock(&mp->m_perag[agno].pagb_lock);
+	if (lsn)
 		xfs_log_force(mp, lsn, XFS_LOG_FORCE|XFS_LOG_SYNC);
-	} else {
-		spin_unlock(&mp->m_perag[agno].pagb_lock);
-	}
 }


* Re: [PATCH v2] xfs: Don't flush stale inodes
  2010-01-08 23:14 ` [PATCH v2] xfs: " Alex Elder
@ 2010-01-09  0:09   ` Dave Chinner
  2010-01-09 16:22     ` Alex Elder
  2010-01-09 18:10     ` [PATCH, updated] xfs: Ensure we force all busy extents in range to disk Alex Elder
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2010-01-09  0:09 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Jan 08, 2010 at 05:14:25PM -0600, Alex Elder wrote:
> Dave Chinner wrote:
> > Because inodes remain in cache much longer than inode buffers do
> > under memory pressure, we can get the situation where we have stale,
> > dirty inodes being reclaimed but the backing storage has been freed.
> > Hence we should never, ever flush XFS_ISTALE inodes to disk as
> > there is no guarantee that the backing buffer is in cache and
> > still marked stale when the flush occurs.
> >
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
> 
> Dave, I'm putting this patch in before your perag radix tree
> patch series, so I had to port it.  I am submitting here on
> your behalf, but would like you to review it if possible.
> It builds fine, and I've run it through a quick test.
> I will retain all of your original commentary, etc.  I did
> *not* implement Christoph's recommendation that you change
> the "for" loop in xfs_alloc_search_busy() to a more typical
> form.
> 
> For reference, the original patch is here:
>     http://patchwork.xfs.org/patch/382/
> 
> Signed-off-by: Alex Elder <aelder@sgi.com>

Ah, which patch are you talking about? The title says stale inodes,
the patch is searching busy extents.

Assuming you meant the busy extent search fix, you've ported the
wrong version of the busy extent search fix. I posted an updated
version after Christoph reviewed that one (in my response to
Christoph's review).

Because it was a new patch, patchwork doesn't add the email to the
original thread - it starts a new one. It's a real PITA, IMO, but
that is the way patchwork does stuff. The patch that should be
checked in is this one:

http://patchwork.xfs.org/patch/387/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE:  Re: [PATCH v2] xfs: Don't flush stale inodes
  2010-01-09  0:09   ` Dave Chinner
@ 2010-01-09 16:22     ` Alex Elder
  2010-01-09 22:39       ` Dave Chinner
  2010-01-09 18:10     ` [PATCH, updated] xfs: Ensure we force all busy extents in range to disk Alex Elder
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Elder @ 2010-01-09 16:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave Chinner wrote:
> On Fri, Jan 08, 2010 at 05:14:25PM -0600, Alex Elder wrote:
>> Dave Chinner wrote:
>>> Because inodes remain in cache much longer than inode buffers do
>>> under memory pressure, we can get the situation where we have stale,
>>> dirty inodes being reclaimed but the backing storage has been freed.
>>> Hence we should never, ever flush XFS_ISTALE inodes to disk as
>>> there is no guarantee that the backing buffer is in cache and
>>> still marked stale when the flush occurs.
>>> 
>>> Signed-off-by: Dave Chinner <david@fromorbit.com>
>> 
>> Dave, I'm putting this patch in before your perag radix tree
>> patch series, so I had to port it.  I am submitting here on
>> your behalf, but would like you to review it if possible.
>> It builds fine, and I've run it through a quick test.
>> I will retain all of your original commentary, etc.  I did
>> *not* implement Christoph's recommendation that you change
>> the "for" loop in xfs_alloc_search_busy() to a more typical form.
>> 
>> For reference, the original patch is here:
>>     http://patchwork.xfs.org/patch/382/
>> 
>> Signed-off-by: Alex Elder <aelder@sgi.com>
> 
> Ah, which patch are you talking about? The title says stale inodes,
> the patch is searching busy extents.

Oops.  I blame my Left-handed mouse clicking skills, which are
improving but are still unreliable.  I'm working mostly without
my Right hand this week because of an injury...

> Assuming you meant the busy extent search fix, you've ported the
> wrong version of the busy extent search fix. I posted an updated
> version after Christoph reviewed that one (in my response to
> Christoph's review).

Yes, I see that now.  I had marked your older one "Reviewed" and
didn't look further.  I will re-do the port and will re-submit.
BTW I also plan to port the perag radix tree patch(es) over this
for you after I get this one published.

> Because it was a new patch, patchwork doesn't add the email to the
> original thread - it starts a new one. It's a real PITA, IMO, but
> that is the way patchwork does stuff. The patch that should be
> checked in is this one:
> 
> http://patchwork.xfs.org/patch/387/

Yes, I agree.  I find it helpful when updates are flagged with
something like "[PATCH v2]" because it somehow catches my eye
better *and* makes it explicit that the message contains a patch
that obsoletes a previous one.

					-Alex


* [PATCH, updated]  xfs: Ensure we force all busy extents in range to disk
  2010-01-09  0:09   ` Dave Chinner
  2010-01-09 16:22     ` Alex Elder
@ 2010-01-09 18:10     ` Alex Elder
  2010-01-09 19:35       ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Alex Elder @ 2010-01-09 18:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Port of this patch to the current XFS master top-of-tree,
which does not yet have the perag radix-tree changes.

    http://patchwork.xfs.org/patch/387/

					-Alex

When we search for and find a busy extent during allocation we
force the log out to ensure the extent free transaction is on
disk before the allocation transaction. The current implementation
has a subtle bug in it--it does not handle multiple overlapping
ranges.

That is, if we free lots of little extents into a single
contiguous extent, then allocate the contiguous extent, the busy
search code stops searching at the first extent it finds that
overlaps the allocated range. It then uses the commit LSN of the
transaction to force the log out to.

Unfortunately, the other busy ranges might have more recent
commit LSNs than the first busy extent that is found, and this
results in xfs_alloc_search_busy() returning before all the
extent free transactions are on disk for the range being
allocated. This can lead to potential metadata corruption or
stale data exposure after a crash because log replay won't replay
all the extent free transactions that cover the allocation range.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>

---
 fs/xfs/linux-2.6/xfs_trace.h |    9 +++++---
 fs/xfs/xfs_alloc.c           |   46 ++++++++++++++++++++-----------------------
 2 files changed, 28 insertions(+), 27 deletions(-)

Index: b/fs/xfs/linux-2.6/xfs_trace.h
===================================================================
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -1079,14 +1079,15 @@ TRACE_EVENT(xfs_alloc_unbusy,
 
 TRACE_EVENT(xfs_alloc_busysearch,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-		 xfs_extlen_t len, int found),
-	TP_ARGS(mp, agno, agbno, len, found),
+		 xfs_extlen_t len, int found, xfs_lsn_t lsn),
+	TP_ARGS(mp, agno, agbno, len, found, lsn),
 	TP_STRUCT__entry(
 		__field(dev_t, dev)
 		__field(xfs_agnumber_t, agno)
 		__field(xfs_agblock_t, agbno)
 		__field(xfs_extlen_t, len)
 		__field(int, found)
+		__field(xfs_lsn_t, lsn)
 	),
 	TP_fast_assign(
 		__entry->dev = mp->m_super->s_dev;
@@ -1094,12 +1095,14 @@ TRACE_EVENT(xfs_alloc_busysearch,
 		__entry->agbno = agbno;
 		__entry->len = len;
 		__entry->found = found;
+		__entry->lsn = lsn;
 	),
-	TP_printk("dev %d:%d agno %u agbno %u len %u %s",
+	TP_printk("dev %d:%d agno %u agbno %u len %u force lsn 0x%llx %s",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->agno,
 		  __entry->agbno,
 		  __entry->len,
+		  __entry->lsn,
 		  __print_symbolic(__entry->found, XFS_BUSY_STATES))
 );
 
Index: b/fs/xfs/xfs_alloc.c
===================================================================
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -2563,43 +2563,41 @@ xfs_alloc_search_busy(xfs_trans_t *tp,
 	xfs_mount_t		*mp;
 	xfs_perag_busy_t	*bsy;
 	xfs_agblock_t		uend, bend;
-	xfs_lsn_t		lsn;
+	xfs_lsn_t		lsn = 0;
 	int			cnt;
 
 	mp = tp->t_mountp;
 
 	spin_lock(&mp->m_perag[agno].pagb_lock);
-	cnt = mp->m_perag[agno].pagb_count;
 
 	uend = bno + len - 1;
 
-	/* search pagb_list for this slot, skipping open slots */
-	for (bsy = mp->m_perag[agno].pagb_list; cnt; bsy++) {
-
-		/*
-		 * (start1,length1) within (start2, length2)
-		 */
-		if (bsy->busy_tp != NULL) {
-			bend = bsy->busy_start + bsy->busy_length - 1;
-			if ((bno > bend) || (uend < bsy->busy_start)) {
-				cnt--;
-			} else {
-				break;
-			}
-		}
+	/*
+	 * search pagb_list for this slot, skipping open slots. We have to
+	 * search the entire array as there may be multiple overlaps and
+	 * we have to get the most recent LSN for the log force to push out
+	 * all the transactions that span the range.
+	 */
+	for (cnt = 0; cnt < mp->m_perag[agno].pagb_count; cnt++) {
+		bsy = &mp->m_perag[agno].pagb_list[cnt];
+		if (!bsy->busy_tp)
+			continue;
+
+		bend = bsy->busy_start + bsy->busy_length - 1;
+		if (bno > bend || uend < bsy->busy_start)
+			continue;
+
+		/* (start1,length1) within (start2, length2) */
+		if (XFS_LSN_CMP(bsy->busy_tp->t_commit_lsn, lsn) > 0)
+			lsn = bsy->busy_tp->t_commit_lsn;
 	}
-
-	trace_xfs_alloc_busysearch(mp, agno, bno, len, !!cnt);
+	spin_unlock(&mp->m_perag[agno].pagb_lock);
+	trace_xfs_alloc_busysearch(tp->t_mountp, agno, bno, len, !!cnt, lsn);
 
 	/*
 	 * If a block was found, force the log through the LSN of the
 	 * transaction that freed the block
 	 */
-	if (cnt) {
-		lsn = bsy->busy_tp->t_commit_lsn;
-		spin_unlock(&mp->m_perag[agno].pagb_lock);
+	if (lsn)
 		xfs_log_force(mp, lsn, XFS_LOG_FORCE|XFS_LOG_SYNC);
-	} else {
-		spin_unlock(&mp->m_perag[agno].pagb_lock);
-	}
 }


* Re: [PATCH, updated]  xfs: Ensure we force all busy extents in range to disk
  2010-01-09 18:10     ` [PATCH, updated] xfs: Ensure we force all busy extents in range to disk Alex Elder
@ 2010-01-09 19:35       ` Christoph Hellwig
  2010-01-10 18:28         ` [PATCH, updated] xfs: Ensure we force all busy extents in range " Alex Elder
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2010-01-09 19:35 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs


Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

>  TRACE_EVENT(xfs_alloc_busysearch,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
> -		 xfs_extlen_t len, int found),
> -	TP_ARGS(mp, agno, agbno, len, found),
> +		 xfs_extlen_t len, int found, xfs_lsn_t lsn),
> +	TP_ARGS(mp, agno, agbno, len, found, lsn),

The found argument to this tracepoint can be dropped - with the new loop
style it doesn't work anymore, and a non-zero lsn provides the same
information.


* Re: Re: [PATCH v2] xfs: Don't flush stale inodes
  2010-01-09 16:22     ` Alex Elder
@ 2010-01-09 22:39       ` Dave Chinner
  2010-01-10 16:43         ` Alex Elder
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-01-09 22:39 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Sat, Jan 09, 2010 at 10:22:14AM -0600, Alex Elder wrote:
> Yes, I see that now.  I had marked your older one "Reviewed" and
> didn't look further.  I will re-do the port and will re-submit.
> BTW I also plan to port the perag radix tree patch(es) over this
> for you after I get this one published.

Don't make extra work for yourself that slows down the rate at which
you can merge stuff. When the tree is updated with these patches,
I'll fix any conflicts in my patches and retest them, then send them
to the list again.

As a maintainer, manually munging a bunch patches around to fit,
then asking if they are correct, then finally merging them doesn't
scale - push that work back out to the patch authors where it does
scale and who have probably already done the work.... ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* RE: Re: [PATCH v2] xfs: Don't flush stale inodes
  2010-01-09 22:39       ` Dave Chinner
@ 2010-01-10 16:43         ` Alex Elder
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2010-01-10 16:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Dave Chinner wrote:
> On Sat, Jan 09, 2010 at 10:22:14AM -0600, Alex Elder wrote:
>> Yes, I see that now.  I had marked your older one "Reviewed" and
>> didn't look further.  I will re-do the port and will re-submit.
>> BTW I also plan to port the perag radix tree patch(es) over this
>> for you after I get this one published.
> 
> Don't make extra work for yourself that slows down the rate at which
> you can merge stuff. When the tree is updated with these patches,
> I'll fix any conflicts in my patches and retest them, then send them
> to the list again.

You are right.  And I do very much want to keep improving
the turnaround for merges.

> As a maintainer, manually munging a bunch patches around to fit,
> then asking if they are correct, then finally merging them doesn't
> scale - push that work back out to the patch authors where it does
> scale and who have probably already done the work.... ;)

Right again, and I will gladly do this from here forward.
Remind me if I stray off course again...

Thanks a lot Dave.

					-Alex


* RE: [PATCH, updated]  xfs: Ensure we force all busy extents in range to disk
  2010-01-09 19:35       ` Christoph Hellwig
@ 2010-01-10 18:28         ` Alex Elder
  0 siblings, 0 replies; 15+ messages in thread
From: Alex Elder @ 2010-01-10 18:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

Christoph Hellwig wrote:
> Looks good,

Thanks, I will publish this today.

> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
>>  TRACE_EVENT(xfs_alloc_busysearch,
>>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
>> -		 xfs_extlen_t len, int found),
>> -	TP_ARGS(mp, agno, agbno, len, found),
>> +		 xfs_extlen_t len, int found, xfs_lsn_t lsn),
>> +	TP_ARGS(mp, agno, agbno, len, found, lsn),
> 
> The found argument to this tracepoint can be dropped - with the new loop
> style it doesn't work anymore, and a non-zero lsn provides the same
> information.

I agree with that, and have now removed that argument
from the tracepoint, with the net effect that the lsn
argument simply replaces it.

					-Alex
