All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] xfs: reclaim bug fixes
@ 2010-07-15  0:38 Dave Chinner
  2010-07-15  0:38 ` [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

The following patches fix excessive CPU consumption during inode
cache shrinking when filesystems have lots of allocation groups as
well as prevent a couple of lockdep reports that were found during
testing. Also included is a fix for a reclaim recursion deadlock
when allocating memory during inode initialisation.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
@ 2010-07-15  0:38 ` Dave Chinner
  2010-07-15 18:01   ` Alex Elder
  2010-07-15  0:38 ` [PATCH 2/5] xfs: simplify and remove xfs_ireclaim Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

https://bugzilla.kernel.org/show_bug.cgi?id=16348

When the filesystem grows to a large number of allocation groups,
the summing of recalimable inodes gets expensive. In many cases,
most AGs won't have any reclaimable inodes and so we are wasting CPU
time aggregating over these AGs. This is particularly important for
the inode shrinker that gets called frequently under memory
pressure.

To avoid the overhead, track AGs with reclaimable inodes in the
per-ag radix tree so that we can find all the AGs with reclaimable
inodes via a simple gang tag lookup. This involves setting the tag
when the first reclaimable inode is tracked in the AG, and removing
the tag when the last reclaimable inode is removed from the tree.
Then the summation process becomes a loop walking the radix tree
summing AGs with the reclaim tag set.

This significantly reduces the overhead of scanning - a 6400 AG
filesystea now only uses about 25% of a cpu in kswapd while slab reclaim
progresses instead of being permanently stuck at 100% CPU and making little
progress. Clean filesystems filesystems will see no overhead and the
overhead only increases linearly with the number of dirty AGs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c  |   71 +++++++++++++++++++++++++++++++++++++----
 fs/xfs/linux-2.6/xfs_trace.h |    3 ++
 2 files changed, 67 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 56fed91..51da595 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -133,6 +133,41 @@ restart:
 	return last_error;
 }
 
+/*
+ * Select the next per-ag structure to iterate during the walk. The reclaim
+ * walk is optimised only to walk AGs with reclaimable inodes in them.
+ */
+static struct xfs_perag *
+xfs_inode_ag_iter_next_pag(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		*first,
+	int			tag)
+{
+	struct xfs_perag	*pag = NULL;
+
+	if (tag == XFS_ICI_RECLAIM_TAG) {
+		int found;
+		int ref;
+
+		spin_lock(&mp->m_perag_lock);
+		found = radix_tree_gang_lookup_tag(&mp->m_perag_tree,
+				(void **)&pag, *first, 1, tag);
+		if (found <= 0) {
+			spin_unlock(&mp->m_perag_lock);
+			return NULL;
+		}
+		*first = pag->pag_agno + 1;
+		/* open coded pag reference increment */
+		ref = atomic_inc_return(&pag->pag_ref);
+		spin_unlock(&mp->m_perag_lock);
+		trace_xfs_perag_get_reclaim(mp, pag->pag_agno, ref, _RET_IP_);
+	} else {
+		pag = xfs_perag_get(mp, *first);
+		(*first)++;
+	}
+	return pag;
+}
+
 int
 xfs_inode_ag_iterator(
 	struct xfs_mount	*mp,
@@ -143,16 +178,15 @@ xfs_inode_ag_iterator(
 	int			exclusive,
 	int			*nr_to_scan)
 {
+	struct xfs_perag	*pag;
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
 	int			nr;
 
 	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
-	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
-		struct xfs_perag	*pag;
-
-		pag = xfs_perag_get(mp, ag);
+	ag = 0;
+	while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag, tag))) {
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
 						exclusive, &nr);
 		xfs_perag_put(pag);
@@ -628,6 +662,17 @@ __xfs_inode_set_reclaim_tag(
 	radix_tree_tag_set(&pag->pag_ici_root,
 			   XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino),
 			   XFS_ICI_RECLAIM_TAG);
+
+	if (!pag->pag_ici_reclaimable) {
+		/* propagate the reclaim tag up into the perag radix tree */
+		spin_lock(&ip->i_mount->m_perag_lock);
+		radix_tree_tag_set(&ip->i_mount->m_perag_tree,
+				XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+				XFS_ICI_RECLAIM_TAG);
+		spin_unlock(&ip->i_mount->m_perag_lock);
+		trace_xfs_perag_set_reclaim(ip->i_mount, pag->pag_agno,
+							-1, _RET_IP_);
+	}
 	pag->pag_ici_reclaimable++;
 }
 
@@ -662,6 +707,16 @@ __xfs_inode_clear_reclaim_tag(
 	radix_tree_tag_clear(&pag->pag_ici_root,
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
 	pag->pag_ici_reclaimable--;
+	if (!pag->pag_ici_reclaimable) {
+		/* clear the reclaim tag from the perag radix tree */
+		spin_lock(&ip->i_mount->m_perag_lock);
+		radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
+				XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
+				XFS_ICI_RECLAIM_TAG);
+		spin_unlock(&ip->i_mount->m_perag_lock);
+		trace_xfs_perag_clear_reclaim(ip->i_mount, pag->pag_agno,
+							-1, _RET_IP_);
+	}
 }
 
 /*
@@ -832,7 +887,7 @@ xfs_reclaim_inode_shrink(
 	struct xfs_mount *mp;
 	struct xfs_perag *pag;
 	xfs_agnumber_t	ag;
-	int		reclaimable = 0;
+	int		reclaimable;
 
 	if (nr_to_scan) {
 		if (!(gfp_mask & __GFP_FS))
@@ -849,9 +904,11 @@ xfs_reclaim_inode_shrink(
 	}
 
 	down_read(&xfs_mount_list_lock);
+	reclaimable = 0;
 	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
-		for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
-			pag = xfs_perag_get(mp, ag);
+		ag = 0;
+		while ((pag = xfs_inode_ag_iter_next_pag(mp, &ag,
+						XFS_ICI_RECLAIM_TAG))) {
 			reclaimable += pag->pag_ici_reclaimable;
 			xfs_perag_put(pag);
 		}
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index d506753..24e5580 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -124,7 +124,10 @@ DEFINE_EVENT(xfs_perag_class, name,	\
 		 unsigned long caller_ip),					\
 	TP_ARGS(mp, agno, refcount, caller_ip))
 DEFINE_PERAG_REF_EVENT(xfs_perag_get);
+DEFINE_PERAG_REF_EVENT(xfs_perag_get_reclaim);
 DEFINE_PERAG_REF_EVENT(xfs_perag_put);
+DEFINE_PERAG_REF_EVENT(xfs_perag_set_reclaim);
+DEFINE_PERAG_REF_EVENT(xfs_perag_clear_reclaim);
 
 TRACE_EVENT(xfs_attr_list_node_descend,
 	TP_PROTO(struct xfs_attr_list_context *ctx,
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/5] xfs: simplify and remove xfs_ireclaim
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
  2010-07-15  0:38 ` [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree Dave Chinner
@ 2010-07-15  0:38 ` Dave Chinner
  2010-07-15 18:07   ` Alex Elder
  2010-07-16  5:16   ` Christoph Hellwig
  2010-07-15  0:38 ` [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_ireclaim has to get and put te pag structure because it is only
called with the inode to reclaim. The one caller of this function
already has a reference on the pag and a pointer to is, so move the
radix tree delete to the caller and remove xfs_ireclaim completely.
This avoids a xfs_perag_get/put on every inode being reclaimed.

The overhead was noticed in a bug report at:

https://bugzilla.kernel.org/show_bug.cgi?id=16348

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   31 ++++++++++++++++++++++++-
 fs/xfs/xfs_iget.c           |   53 +------------------------------------------
 fs/xfs/xfs_inode.h          |    2 +-
 3 files changed, 32 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 51da595..80ae18a 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -855,7 +855,36 @@ out:
 reclaim:
 	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	xfs_ireclaim(ip);
+
+	XFS_STATS_INC(xs_ig_reclaims);
+	/*
+	 * Remove the inode from the per-AG radix tree.
+	 *
+	 * Because radix_tree_delete won't complain even if the item was never
+	 * added to the tree assert that it's been there before to catch
+	 * problems with the inode life time early on.
+	 */
+	write_lock(&pag->pag_ici_lock);
+	if (!radix_tree_delete(&pag->pag_ici_root,
+				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
+		ASSERT(0);
+	write_unlock(&pag->pag_ici_lock);
+
+	/*
+	 * Here we do an (almost) spurious inode lock in order to coordinate
+	 * with inode cache radix tree lookups.  This is because the lookup
+	 * can reference the inodes in the cache without taking references.
+	 *
+	 * We make that OK here by ensuring that we wait until the inode is
+	 * unlocked after the lookup before we go ahead and free it.  We get
+	 * both the ilock and the iolock because the code may need to drop the
+	 * ilock one but will still hold the iolock.
+	 */
+	xfs_ilock(ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_qm_dqdetach(ip);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	xfs_inode_free(ip);
 	return error;
 
 }
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index 9e86f21..eba5ae6 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -91,7 +91,7 @@ xfs_inode_alloc(
 	return ip;
 }
 
-STATIC void
+void
 xfs_inode_free(
 	struct xfs_inode	*ip)
 {
@@ -418,57 +418,6 @@ out_error_or_again:
 }
 
 /*
- * This is called free all the memory associated with an inode.
- * It must free the inode itself and any buffers allocated for
- * if_extents/if_data and if_broot.  It must also free the lock
- * associated with the inode.
- *
- * Note: because we don't initialise everything on reallocation out
- * of the zone, we must ensure we nullify everything correctly before
- * freeing the structure.
- */
-void
-xfs_ireclaim(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_perag	*pag;
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
-
-	XFS_STATS_INC(xs_ig_reclaims);
-
-	/*
-	 * Remove the inode from the per-AG radix tree.
-	 *
-	 * Because radix_tree_delete won't complain even if the item was never
-	 * added to the tree assert that it's been there before to catch
-	 * problems with the inode life time early on.
-	 */
-	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	write_lock(&pag->pag_ici_lock);
-	if (!radix_tree_delete(&pag->pag_ici_root, agino))
-		ASSERT(0);
-	write_unlock(&pag->pag_ici_lock);
-	xfs_perag_put(pag);
-
-	/*
-	 * Here we do an (almost) spurious inode lock in order to coordinate
-	 * with inode cache radix tree lookups.  This is because the lookup
-	 * can reference the inodes in the cache without taking references.
-	 *
-	 * We make that OK here by ensuring that we wait until the inode is
-	 * unlocked after the lookup before we go ahead and free it.  We get
-	 * both the ilock and the iolock because the code may need to drop the
-	 * ilock one but will still hold the iolock.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
-	xfs_qm_dqdetach(ip);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL);
-
-	xfs_inode_free(ip);
-}
-
-/*
  * This is a wrapper routine around the xfs_ilock() routine
  * used to centralize some grungy code.  It is used in places
  * that wish to lock the inode solely for reading the extents.
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index eb41559..0898c54 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -450,7 +450,7 @@ void		xfs_ilock_demote(xfs_inode_t *, uint);
 int		xfs_isilocked(xfs_inode_t *, uint);
 uint		xfs_ilock_map_shared(xfs_inode_t *);
 void		xfs_iunlock_map_shared(xfs_inode_t *, uint);
-void		xfs_ireclaim(xfs_inode_t *);
+void		xfs_inode_free(struct xfs_inode *ip);
 
 /*
  * xfs_inode.c prototypes.
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
  2010-07-15  0:38 ` [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree Dave Chinner
  2010-07-15  0:38 ` [PATCH 2/5] xfs: simplify and remove xfs_ireclaim Dave Chinner
@ 2010-07-15  0:38 ` Dave Chinner
  2010-07-15 18:09   ` Alex Elder
  2010-07-16  5:19   ` Christoph Hellwig
  2010-07-15  0:38 ` [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_trans_add_item() is called with ip->i_ilock held, which means it
is unsafe for memory reclaim to recurse back into the filesystem
(ilock is required in writeback). Hence the allocation needs to be
KM_NOFS to avoid recursion.

Lockdep report indicating memory allocation being called with the
ip->i_ilock held is as follows:

[ 1749.866796] =================================
[ 1749.867788] [ INFO: inconsistent lock state ]
[ 1749.868327] 2.6.35-rc3-dgc+ #25
[ 1749.868741] ---------------------------------
[ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1749.868741]  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
[ 1749.868741]   [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450
[ 1749.868741]   [<ffffffff810b4b56>] lock_acquire+0xa6/0x160
[ 1749.868741]   [<ffffffff810a20b5>] down_write_nested+0x65/0xb0
[ 1749.868741]   [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741]   [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310
[ 1749.868741]   [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150
[ 1749.868741]   [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0
[ 1749.868741]   [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90
[ 1749.868741]   [<ffffffff81119d07>] shrink_slab+0x137/0x1a0
[ 1749.868741]   [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0
[ 1749.868741]   [<ffffffff8111bf7d>] kswapd+0x11d/0x320
[ 1749.868741]   [<ffffffff8109ce56>] kthread+0x96/0xa0
[ 1749.868741]   [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 1749.868741] irq event stamp: 4234335
[ 1749.868741] hardirqs last  enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220
[ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220
[ 1749.868741] softirqs last  enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260
[ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 1749.868741]
[ 1749.868741] other info that might help us debug this:
[ 1749.868741] 2 locks held by dd/2835:
[ 1749.868741]  #0:  (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200
[ 1749.868741]  #1:  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741]
[ 1749.868741] stack backtrace:
[ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
[ 1749.868741] Call Trace:
[ 1749.868741]  [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190
[ 1749.868741]  [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50
[ 1749.868741]  [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0
[ 1749.868741]  [<ffffffff810b2f11>] mark_lock+0x331/0x400
[ 1749.868741]  [<ffffffff810b3047>] mark_held_locks+0x67/0x90
[ 1749.868741]  [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0
[ 1749.868741]  [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0
[ 1749.868741]  [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0
[ 1749.868741]  [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50
[ 1749.868741]  [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0
[ 1749.868741]  [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0
[ 1749.868741]  [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0
[ 1749.868741]  [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280
[ 1749.868741]  [<ffffffff8133cd18>] xfs_release+0x138/0x190
[ 1749.868741]  [<ffffffff813464c5>] xfs_file_release+0x15/0x20
[ 1749.868741]  [<ffffffff81150ebf>] fput+0x13f/0x260
[ 1749.868741]  [<ffffffff8114d8c2>] filp_close+0x52/0x80
[ 1749.868741]  [<ffffffff8114d9a9>] sys_close+0xb9/0x120
[ 1749.868741]  [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index f2065cc..3680a47 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -1134,7 +1134,7 @@ xfs_trans_add_item(
 	ASSERT(lip->li_mountp = tp->t_mountp);
 	ASSERT(lip->li_ailp = tp->t_mountp->m_ail);
 
-	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP);
+	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP|KM_NOFS);
 
 	lidp->lid_item = lip;
 	lidp->lid_flags = 0;
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2010-07-15  0:38 ` [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings Dave Chinner
@ 2010-07-15  0:38 ` Dave Chinner
  2010-07-15 18:10   ` Alex Elder
  2010-07-16  5:21   ` Christoph Hellwig
  2010-07-15  0:38 ` [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer Dave Chinner
  2010-07-16  5:23 ` [PATCH 0/5] xfs: reclaim bug fixes Christoph Hellwig
  5 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Avoid a lockdep warning by preventing page cache allocation from
recursing back into the filesystem during memory reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index ed9c3db..1075791 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1501,8 +1501,9 @@ xfs_vm_write_begin(
 	void			**fsdata)
 {
 	*pagep = NULL;
-	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
-								xfs_get_blocks);
+	return block_write_begin(file, mapping, pos, len,
+				 (flags | AOP_FLAG_NOFS),
+				 pagep, fsdata, xfs_get_blocks);
 }
 
 STATIC sector_t
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
                   ` (3 preceding siblings ...)
  2010-07-15  0:38 ` [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation Dave Chinner
@ 2010-07-15  0:38 ` Dave Chinner
  2010-07-15 18:42   ` Alex Elder
  2010-07-16  5:22   ` Christoph Hellwig
  2010-07-16  5:23 ` [PATCH 0/5] xfs: reclaim bug fixes Christoph Hellwig
  5 siblings, 2 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-15  0:38 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Calling into memory reclaim with a locked inode buffer can deadlock
if memory reclaim tries to lock the inode buffer during inode
teardown. Convert the relevant memory allocations to use KM_NOFS to
avoid this deadlock condition.

Reported-by: Peter Watkins <treestem@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c |   20 +++++++++++---------
 1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 5715a9d..3c206b3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -422,7 +422,7 @@ xfs_iformat(
 	if (!XFS_DFORK_Q(dip))
 		return 0;
 	ASSERT(ip->i_afp == NULL);
-	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP);
+	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP|KM_NOFS);
 	ip->i_afp->if_ext_max =
 		XFS_IFORK_ASIZE(ip) / (uint)sizeof(xfs_bmbt_rec_t);
 	switch (dip->di_aformat) {
@@ -505,7 +505,7 @@ xfs_iformat_local(
 		ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
 	else {
 		real_size = roundup(size, 4);
-		ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
+		ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP|KM_NOFS);
 	}
 	ifp->if_bytes = size;
 	ifp->if_real_bytes = real_size;
@@ -632,7 +632,7 @@ xfs_iformat_btree(
 	}
 
 	ifp->if_broot_bytes = size;
-	ifp->if_broot = kmem_alloc(size, KM_SLEEP);
+	ifp->if_broot = kmem_alloc(size, KM_SLEEP|KM_NOFS);
 	ASSERT(ifp->if_broot != NULL);
 	/*
 	 * Copy and convert from the on-disk structure
@@ -2191,7 +2191,7 @@ xfs_iroot_realloc(
 		 */
 		if (ifp->if_broot_bytes == 0) {
 			new_size = (size_t)XFS_BMAP_BROOT_SPACE_CALC(rec_diff);
-			ifp->if_broot = kmem_alloc(new_size, KM_SLEEP);
+			ifp->if_broot = kmem_alloc(new_size, KM_SLEEP|KM_NOFS);
 			ifp->if_broot_bytes = (int)new_size;
 			return;
 		}
@@ -2207,7 +2207,7 @@ xfs_iroot_realloc(
 		new_size = (size_t)XFS_BMAP_BROOT_SPACE_CALC(new_max);
 		ifp->if_broot = kmem_realloc(ifp->if_broot, new_size,
 				(size_t)XFS_BMAP_BROOT_SPACE_CALC(cur_max), /* old size */
-				KM_SLEEP);
+				KM_SLEEP|KM_NOFS);
 		op = (char *)XFS_BMAP_BROOT_PTR_ADDR(mp, ifp->if_broot, 1,
 						     ifp->if_broot_bytes);
 		np = (char *)XFS_BMAP_BROOT_PTR_ADDR(mp, ifp->if_broot, 1,
@@ -2233,7 +2233,7 @@ xfs_iroot_realloc(
 	else
 		new_size = 0;
 	if (new_size > 0) {
-		new_broot = kmem_alloc(new_size, KM_SLEEP);
+		new_broot = kmem_alloc(new_size, KM_SLEEP|KM_NOFS);
 		/*
 		 * First copy over the btree block header.
 		 */
@@ -2337,7 +2337,8 @@ xfs_idata_realloc(
 		real_size = roundup(new_size, 4);
 		if (ifp->if_u1.if_data == NULL) {
 			ASSERT(ifp->if_real_bytes == 0);
-			ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
+			ifp->if_u1.if_data = kmem_alloc(real_size,
+							KM_SLEEP|KM_NOFS);
 		} else if (ifp->if_u1.if_data != ifp->if_u2.if_inline_data) {
 			/*
 			 * Only do the realloc if the underlying size
@@ -2348,11 +2349,12 @@ xfs_idata_realloc(
 					kmem_realloc(ifp->if_u1.if_data,
 							real_size,
 							ifp->if_real_bytes,
-							KM_SLEEP);
+							KM_SLEEP|KM_NOFS);
 			}
 		} else {
 			ASSERT(ifp->if_real_bytes == 0);
-			ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
+			ifp->if_u1.if_data = kmem_alloc(real_size,
+							KM_SLEEP|KM_NOFS);
 			memcpy(ifp->if_u1.if_data, ifp->if_u2.if_inline_data,
 				ifp->if_bytes);
 		}
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree
  2010-07-15  0:38 ` [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree Dave Chinner
@ 2010-07-15 18:01   ` Alex Elder
  2010-07-16  5:17     ` Christoph Hellwig
  2010-07-19  0:24     ` Dave Chinner
  0 siblings, 2 replies; 21+ messages in thread
From: Alex Elder @ 2010-07-15 18:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: XFS Mailing List

On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=16348
> 
> When the filesystem grows to a large number of allocation groups,
> the summing of recalimable inodes gets expensive. In many cases,
> most AGs won't have any reclaimable inodes and so we are wasting CPU
> time aggregating over these AGs. This is particularly important for
> the inode shrinker that gets called frequently under memory
> pressure.
> 
> To avoid the overhead, track AGs with reclaimable inodes in the
> per-ag radix tree so that we can find all the AGs with reclaimable
> inodes via a simple gang tag lookup. This involves setting the tag
> when the first reclaimable inode is tracked in the AG, and removing
> the tag when the last reclaimable inode is removed from the tree.
> Then the summation process becomes a loop walking the radix tree
> summing AGs with the reclaim tag set.
> 
> This significantly reduces the overhead of scanning - a 6400 AG
> filesystea now only uses about 25% of a cpu in kswapd while slab reclaim
> progresses instead of being permanently stuck at 100% CPU and making little
> progress. Clean filesystems filesystems will see no overhead and the
> overhead only increases linearly with the number of dirty AGs.

Using the same tag represent a perag with reclaimable
inodes as the tag representing an inode is reclaimable
is nicely consistent...

I have a few comments below for your consideration
but they are minor (and don't even require a response).

Overall this looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c  |   71 +++++++++++++++++++++++++++++++++++++----
>  fs/xfs/linux-2.6/xfs_trace.h |    3 ++
>  2 files changed, 67 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 56fed91..51da595 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -133,6 +133,41 @@ restart:
>  	return last_error;
>  }
>  
> +/*
> + * Select the next per-ag structure to iterate during the walk. The reclaim
> + * walk is optimised only to walk AGs with reclaimable inodes in them.
> + */
> +static struct xfs_perag *
> +xfs_inode_ag_iter_next_pag(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		*first,
> +	int			tag)
> +{
> +	struct xfs_perag	*pag = NULL;
> +
> +	if (tag == XFS_ICI_RECLAIM_TAG) {
> +		int found;
> +		int ref;
> +
> +		spin_lock(&mp->m_perag_lock);
> +		found = radix_tree_gang_lookup_tag(&mp->m_perag_tree,
> +				(void **)&pag, *first, 1, tag);
> +		if (found <= 0) {
> +			spin_unlock(&mp->m_perag_lock);
> +			return NULL;
> +		}
> +		*first = pag->pag_agno + 1;

Maybe move this below, just before the return.  I.e.:
	if (pag)
		*first = pag->pag_agno + 1;

To me it's slightly clearer that we're just setting
up to search next time with the perag following the
one returned.

> +		/* open coded pag reference increment */

By open-coding here you miss the assertions in xfs_perag_get().
(Though a common inline encapsulating them would have to be
in a header because the two functions reside in different files.)

> +		ref = atomic_inc_return(&pag->pag_ref);
> +		spin_unlock(&mp->m_perag_lock);
> +		trace_xfs_perag_get_reclaim(mp, pag->pag_agno, ref, _RET_IP_);
> +	} else {
> +		pag = xfs_perag_get(mp, *first);
> +		(*first)++;
> +	}
> +	return pag;
> +}
> +
>  int
>  xfs_inode_ag_iterator(
>  	struct xfs_mount	*mp,

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] xfs: simplify and remove xfs_ireclaim
  2010-07-15  0:38 ` [PATCH 2/5] xfs: simplify and remove xfs_ireclaim Dave Chinner
@ 2010-07-15 18:07   ` Alex Elder
  2010-07-16  5:16   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2010-07-15 18:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ireclaim has to get and put te pag structure because it is only
> called with the inode to reclaim. The one caller of this function
> already has a reference on the pag and a pointer to is, so move the
> radix tree delete to the caller and remove xfs_ireclaim completely.
> This avoids a xfs_perag_get/put on every inode being reclaimed.
> 
> The overhead was noticed in a bug report at:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=16348

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   31 ++++++++++++++++++++++++-
>  fs/xfs/xfs_iget.c           |   53 +------------------------------------------
>  fs/xfs/xfs_inode.h          |    2 +-
>  3 files changed, 32 insertions(+), 54 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings
  2010-07-15  0:38 ` [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings Dave Chinner
@ 2010-07-15 18:09   ` Alex Elder
  2010-07-16  5:19   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2010-07-15 18:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: XFS Mailing List

On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_trans_add_item() is called with ip->i_ilock held, which means it
> is unsafe for memory reclaim to recurse back into the filesystem
> (ilock is required in writeback). Hence the allocation needs to be
> KM_NOFS to avoid recursion.

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Lockdep report indicating memory allocation being called with the
> ip->i_ilock held is as follows:
. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation
  2010-07-15  0:38 ` [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation Dave Chinner
@ 2010-07-15 18:10   ` Alex Elder
  2010-07-16  5:21   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2010-07-15 18:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Avoid a lockdep warning by preventing page cache allocation from
> recursing back into the filesystem during memory reclaim.

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_aops.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
> index ed9c3db..1075791 100644
> --- a/fs/xfs/linux-2.6/xfs_aops.c
> +++ b/fs/xfs/linux-2.6/xfs_aops.c
> @@ -1501,8 +1501,9 @@ xfs_vm_write_begin(
>  	void			**fsdata)
>  {
>  	*pagep = NULL;
> -	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
> -								xfs_get_blocks);
> +	return block_write_begin(file, mapping, pos, len,
> +				 (flags | AOP_FLAG_NOFS),

                            Why the parentheses?

> +				 pagep, fsdata, xfs_get_blocks);
>  }
>  
>  STATIC sector_t



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer
  2010-07-15  0:38 ` [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer Dave Chinner
@ 2010-07-15 18:42   ` Alex Elder
  2010-07-16  5:22   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Elder @ 2010-07-15 18:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: XFS Mailing List

On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Calling into memory reclaim with a locked inode buffer can deadlock
> if memory reclaim tries to lock the inode buffer during inode
> teardown. Convert the relevant memory allocations to use KM_NOFS to
> avoid this deadlock condition.

Looks good.  I wasn't able to find the paths that
led to xfs_iroot_realloc() and xfs_idata_realloc()
but I'm sure they're there...

Reviewed-by: Alex Elder <aelder@sgi.com>

> Reported-by: Peter Watkins <treestem@gmail.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_inode.c |   20 +++++++++++---------
>  1 files changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 5715a9d..3c206b3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -422,7 +422,7 @@ xfs_iformat(
>  	if (!XFS_DFORK_Q(dip))
>  		return 0;
>  	ASSERT(ip->i_afp == NULL);
> -	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP);
> +	ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone, KM_SLEEP|KM_NOFS);
>  	ip->i_afp->if_ext_max =
>  		XFS_IFORK_ASIZE(ip) / (uint)sizeof(xfs_bmbt_rec_t);
>  	switch (dip->di_aformat) {
> @@ -505,7 +505,7 @@ xfs_iformat_local(
>  		ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
>  	else {
>  		real_size = roundup(size, 4);
> -		ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
> +		ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP|KM_NOFS);
>  	}
>  	ifp->if_bytes = size;
>  	ifp->if_real_bytes = real_size;
> @@ -632,7 +632,7 @@ xfs_iformat_btree(
>  	}
>  
>  	ifp->if_broot_bytes = size;
> -	ifp->if_broot = kmem_alloc(size, KM_SLEEP);
> +	ifp->if_broot = kmem_alloc(size, KM_SLEEP|KM_NOFS);
>  	ASSERT(ifp->if_broot != NULL);
>  	/*
>  	 * Copy and convert from the on-disk structure
> @@ -2191,7 +2191,7 @@ xfs_iroot_realloc(
>  		 */
>  		if (ifp->if_broot_bytes == 0) {
>  			new_size = (size_t)XFS_BMAP_BROOT_SPACE_CALC(rec_diff);
> -			ifp->if_broot = kmem_alloc(new_size, KM_SLEEP);
> +			ifp->if_broot = kmem_alloc(new_size, KM_SLEEP|KM_NOFS);
>  			ifp->if_broot_bytes = (int)new_size;
>  			return;
>  		}
> @@ -2207,7 +2207,7 @@ xfs_iroot_realloc(
>  		new_size = (size_t)XFS_BMAP_BROOT_SPACE_CALC(new_max);
>  		ifp->if_broot = kmem_realloc(ifp->if_broot, new_size,
>  				(size_t)XFS_BMAP_BROOT_SPACE_CALC(cur_max), /* old size */
> -				KM_SLEEP);
> +				KM_SLEEP|KM_NOFS);
>  		op = (char *)XFS_BMAP_BROOT_PTR_ADDR(mp, ifp->if_broot, 1,
>  						     ifp->if_broot_bytes);
>  		np = (char *)XFS_BMAP_BROOT_PTR_ADDR(mp, ifp->if_broot, 1,
> @@ -2233,7 +2233,7 @@ xfs_iroot_realloc(
>  	else
>  		new_size = 0;
>  	if (new_size > 0) {
> -		new_broot = kmem_alloc(new_size, KM_SLEEP);
> +		new_broot = kmem_alloc(new_size, KM_SLEEP|KM_NOFS);
>  		/*
>  		 * First copy over the btree block header.
>  		 */
> @@ -2337,7 +2337,8 @@ xfs_idata_realloc(
>  		real_size = roundup(new_size, 4);
>  		if (ifp->if_u1.if_data == NULL) {
>  			ASSERT(ifp->if_real_bytes == 0);
> -			ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
> +			ifp->if_u1.if_data = kmem_alloc(real_size,
> +							KM_SLEEP|KM_NOFS);
>  		} else if (ifp->if_u1.if_data != ifp->if_u2.if_inline_data) {
>  			/*
>  			 * Only do the realloc if the underlying size
> @@ -2348,11 +2349,12 @@ xfs_idata_realloc(
>  					kmem_realloc(ifp->if_u1.if_data,
>  							real_size,
>  							ifp->if_real_bytes,
> -							KM_SLEEP);
> +							KM_SLEEP|KM_NOFS);
>  			}
>  		} else {
>  			ASSERT(ifp->if_real_bytes == 0);
> -			ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
> +			ifp->if_u1.if_data = kmem_alloc(real_size,
> +							KM_SLEEP|KM_NOFS);
>  			memcpy(ifp->if_u1.if_data, ifp->if_u2.if_inline_data,
>  				ifp->if_bytes);
>  		}



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] xfs: simplify and remove xfs_ireclaim
  2010-07-15  0:38 ` [PATCH 2/5] xfs: simplify and remove xfs_ireclaim Dave Chinner
  2010-07-15 18:07   ` Alex Elder
@ 2010-07-16  5:16   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jul 15, 2010 at 10:38:17AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_ireclaim has to get and put te pag structure because it is only
> called with the inode to reclaim. The one caller of this function
> already has a reference on the pag and a pointer to is, so move the
> radix tree delete to the caller and remove xfs_ireclaim completely.
> This avoids a xfs_perag_get/put on every inode being reclaimed.
> 
> The overhead was noticed in a bug report at:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=16348

Looks good, I had looked into this earlier but never finished it.

There's also some opportunity to optimize roundtrips on the ilock in
this path.

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree
  2010-07-15 18:01   ` Alex Elder
@ 2010-07-16  5:17     ` Christoph Hellwig
  2010-07-19  0:17       ` Dave Chinner
  2010-07-19  0:24     ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:17 UTC (permalink / raw)
  To: Alex Elder; +Cc: XFS Mailing List

On Thu, Jul 15, 2010 at 01:01:40PM -0500, Alex Elder wrote:
> Using the same tag represent a perag with reclaimable
> inodes as the tag representing an inode is reclaimable
> is nicely consistent...

Oh, didn't notice that when reviewing it.  Not sure I like it
too much.  If we ever grow more tags on the perag tree this
might get rather confusing.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings
  2010-07-15  0:38 ` [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings Dave Chinner
  2010-07-15 18:09   ` Alex Elder
@ 2010-07-16  5:19   ` Christoph Hellwig
  2010-07-19  0:24     ` Dave Chinner
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> -	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP);
> +	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP|KM_NOFS);

KM_NOFS by itself actually is sufficiant.  If you want to keep the
KM_SLEEP for documentation purposes at least add some spaces to make
it more redable.

	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP | KM_NOFS);

Once getting outside of nitpicking territory this looks good to me:


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation
  2010-07-15  0:38 ` [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation Dave Chinner
  2010-07-15 18:10   ` Alex Elder
@ 2010-07-16  5:21   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jul 15, 2010 at 10:38:19AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Avoid a lockdep warning by preventing page cache allocation from
> recursing back into the filesystem during memory reclaim.

Yeah, I already have this in my queue too.  Noticed that we don't
need it because of i_mutex as suggested in the stack overflow thread,
but for the XFS ilock.

> @@ -1501,8 +1501,9 @@ xfs_vm_write_begin(
>  	void			**fsdata)
>  {
>  	*pagep = NULL;
> -	return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata,
> -								xfs_get_blocks);
> +	return block_write_begin(file, mapping, pos, len,
> +				 (flags | AOP_FLAG_NOFS),
> +				 pagep, fsdata, xfs_get_blocks);


No need for the bracing:

	return block_write_begin(file, mapping, pos, len, flags | AOP_FLAG_NOFS,
				 pagep, fsdata, xfs_get_blocks);


And with my truncate rework in the vfs tree we'll get rejection galore in
linux-next, but Stephen has been pretty good at handling these..

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer
  2010-07-15  0:38 ` [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer Dave Chinner
  2010-07-15 18:42   ` Alex Elder
@ 2010-07-16  5:22   ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jul 15, 2010 at 10:38:20AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Calling into memory reclaim with a locked inode buffer can deadlock
> if memory reclaim tries to lock the inode buffer during inode
> teardown. Convert the relevant memory allocations to use KM_NOFS to
> avoid this deadlock condition.

Looks good, although the same style nitpicks as for patch 3 apply.

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] xfs: reclaim bug fixes
  2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
                   ` (4 preceding siblings ...)
  2010-07-15  0:38 ` [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer Dave Chinner
@ 2010-07-16  5:23 ` Christoph Hellwig
  2010-07-19  0:30   ` Dave Chinner
  5 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2010-07-16  5:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Jul 15, 2010 at 10:38:15AM +1000, Dave Chinner wrote:
> The following patches fix excessive CPU consumption during inode
> cache shrinking when filesystems have lots of allocation groups as
> well as prevent a couple of lockdep reports that were found during
> testing. Also included is a fix for a reclaim recursion deadlock
> when allocating memory during inode initialisation.

Wa the overlap of patch 1 with the for-2.6.35 shrinker series
intentional?  In addition to patch 1 patches 3 to 5 are also for-2.6.35
material in my opinion.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree
  2010-07-16  5:17     ` Christoph Hellwig
@ 2010-07-19  0:17       ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-19  0:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: XFS Mailing List, Alex Elder

On Fri, Jul 16, 2010 at 01:17:00AM -0400, Christoph Hellwig wrote:
> On Thu, Jul 15, 2010 at 01:01:40PM -0500, Alex Elder wrote:
> > Using the same tag represent a perag with reclaimable
> > inodes as the tag representing an inode is reclaimable
> > is nicely consistent...
> 
> Oh, didn't notice that when reviewing it.  Not sure I like it
> too much.  If we ever grow more tags on the perag tree this
> might get rather confusing.

Sure. But it's an in-memory flag so we can change it whenever we
need to. If we grow another tag on the perag tree, we can sort it
out then....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree
  2010-07-15 18:01   ` Alex Elder
  2010-07-16  5:17     ` Christoph Hellwig
@ 2010-07-19  0:24     ` Dave Chinner
  1 sibling, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-19  0:24 UTC (permalink / raw)
  To: Alex Elder; +Cc: XFS Mailing List

On Thu, Jul 15, 2010 at 01:01:40PM -0500, Alex Elder wrote:
> On Thu, 2010-07-15 at 10:38 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=16348
> > 
> > When the filesystem grows to a large number of allocation groups,
> > the summing of recalimable inodes gets expensive. In many cases,
> > most AGs won't have any reclaimable inodes and so we are wasting CPU
> > time aggregating over these AGs. This is particularly important for
> > the inode shrinker that gets called frequently under memory
> > pressure.
> > 
> > To avoid the overhead, track AGs with reclaimable inodes in the
> > per-ag radix tree so that we can find all the AGs with reclaimable
> > inodes via a simple gang tag lookup. This involves setting the tag
> > when the first reclaimable inode is tracked in the AG, and removing
> > the tag when the last reclaimable inode is removed from the tree.
> > Then the summation process becomes a loop walking the radix tree
> > summing AGs with the reclaim tag set.
> > 
> > This significantly reduces the overhead of scanning - a 6400 AG
> > filesystea now only uses about 25% of a cpu in kswapd while slab reclaim
> > progresses instead of being permanently stuck at 100% CPU and making little
> > progress. Clean filesystems filesystems will see no overhead and the
> > overhead only increases linearly with the number of dirty AGs.
> 
> Using the same tag represent a perag with reclaimable
> inodes as the tag representing an inode is reclaimable
> is nicely consistent...
> 
> I have a few comments below for your consideration
> but they are minor (and don't even require a response).
> 
> Overall this looks good.
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/linux-2.6/xfs_sync.c  |   71 +++++++++++++++++++++++++++++++++++++----
> >  fs/xfs/linux-2.6/xfs_trace.h |    3 ++
> >  2 files changed, 67 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> > index 56fed91..51da595 100644
> > --- a/fs/xfs/linux-2.6/xfs_sync.c
> > +++ b/fs/xfs/linux-2.6/xfs_sync.c
> > @@ -133,6 +133,41 @@ restart:
> >  	return last_error;
> >  }
> >  
> > +/*
> > + * Select the next per-ag structure to iterate during the walk. The reclaim
> > + * walk is optimised only to walk AGs with reclaimable inodes in them.
> > + */
> > +static struct xfs_perag *
> > +xfs_inode_ag_iter_next_pag(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		*first,
> > +	int			tag)
> > +{
> > +	struct xfs_perag	*pag = NULL;
> > +
> > +	if (tag == XFS_ICI_RECLAIM_TAG) {
> > +		int found;
> > +		int ref;
> > +
> > +		spin_lock(&mp->m_perag_lock);
> > +		found = radix_tree_gang_lookup_tag(&mp->m_perag_tree,
> > +				(void **)&pag, *first, 1, tag);
> > +		if (found <= 0) {
> > +			spin_unlock(&mp->m_perag_lock);
> > +			return NULL;
> > +		}
> > +		*first = pag->pag_agno + 1;
> 
> Maybe move this below, just before the return.  I.e.:
> 	if (pag)
> 		*first = pag->pag_agno + 1;
> 
> To me it's slightly clearer that we're just setting
> up to search next time with the perag following the
> one returned.

If found <=0, we didn't find anything in the tree. I don't know what
the value of pag is going to be in this case, so we have to check
the return value to determine if pag is valid or not.
The other gang lookups in XFS use the same convention...
>
> > +		/* open coded pag reference increment */
> 
> By open-coding here you miss the assertions in xfs_perag_get().
> (Though a common inline encapsulating them would have to be
> in a header because the two functions reside in different files.)

Yes, it misses them, but we do sufficient calls to xfs_perag_get()
elsewhere that I don't think it makes much difference - we'll still
get the ASSERT()s tripping, and the tracing will still point out
where the problem is...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings
  2010-07-16  5:19   ` Christoph Hellwig
@ 2010-07-19  0:24     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-19  0:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jul 16, 2010 at 01:19:34AM -0400, Christoph Hellwig wrote:
> > -	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP);
> > +	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP|KM_NOFS);
> 
> KM_NOFS by itself actually is sufficiant.  If you want to keep the
> KM_SLEEP for documentation purposes at least add some spaces to make
> it more redable.
> 
> 	lidp = kmem_zone_zalloc(xfs_log_item_desc_zone, KM_SLEEP | KM_NOFS);

Will do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] xfs: reclaim bug fixes
  2010-07-16  5:23 ` [PATCH 0/5] xfs: reclaim bug fixes Christoph Hellwig
@ 2010-07-19  0:30   ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2010-07-19  0:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Fri, Jul 16, 2010 at 01:23:29AM -0400, Christoph Hellwig wrote:
> On Thu, Jul 15, 2010 at 10:38:15AM +1000, Dave Chinner wrote:
> > The following patches fix excessive CPU consumption during inode
> > cache shrinking when filesystems have lots of allocation groups as
> > well as prevent a couple of lockdep reports that were found during
> > testing. Also included is a fix for a reclaim recursion deadlock
> > when allocating memory during inode initialisation.
> 
> Wa the overlap of patch 1 with the for-2.6.35 shrinker series
> intentional?  In addition to patch 1 patches 3 to 5 are also for-2.6.35
> material in my opinion.

I realised that we'd get a messy, messy conflict if I separated the
per-ag tree reclaim tracking from the shrinker patchset, so I
included it in that one as it was also part of fixing reported
XFS shrinker regressions to avoid such conflicts.

But yes, i think that all the lockdep fixes are probably 2.6.35
material.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-07-19  0:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-15  0:38 [PATCH 0/5] xfs: reclaim bug fixes Dave Chinner
2010-07-15  0:38 ` [PATCH 1/5] xfs: track AGs with reclaimable inodes in per-ag radix tree Dave Chinner
2010-07-15 18:01   ` Alex Elder
2010-07-16  5:17     ` Christoph Hellwig
2010-07-19  0:17       ` Dave Chinner
2010-07-19  0:24     ` Dave Chinner
2010-07-15  0:38 ` [PATCH 2/5] xfs: simplify and remove xfs_ireclaim Dave Chinner
2010-07-15 18:07   ` Alex Elder
2010-07-16  5:16   ` Christoph Hellwig
2010-07-15  0:38 ` [PATCH 3/5] xfs: fix xfs_trans_add_item() lockdep warnings Dave Chinner
2010-07-15 18:09   ` Alex Elder
2010-07-16  5:19   ` Christoph Hellwig
2010-07-19  0:24     ` Dave Chinner
2010-07-15  0:38 ` [PATCH 4/5] xfs: use GFP_NOFS for page cache allocation Dave Chinner
2010-07-15 18:10   ` Alex Elder
2010-07-16  5:21   ` Christoph Hellwig
2010-07-15  0:38 ` [PATCH 5/5] xfs: fix memory reclaim recursion deadlock on locked inode buffer Dave Chinner
2010-07-15 18:42   ` Alex Elder
2010-07-16  5:22   ` Christoph Hellwig
2010-07-16  5:23 ` [PATCH 0/5] xfs: reclaim bug fixes Christoph Hellwig
2010-07-19  0:30   ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.