* [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
@ 2010-09-14 10:55 Dave Chinner
  2010-09-14 10:56 ` [PATCH 01/18] xfs: single thread inode cache shrinking Dave Chinner
                   ` (19 more replies)
  0 siblings, 20 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:55 UTC (permalink / raw)
  To: xfs

This patchset has grown quite a bit - it started out as a "convert
the buffer cache to rbtrees" patch, and has gotten bigger as I
peeled the onion from one bottleneck to another.

Performance numbers here are 8-way fs_mark create to 50M files, and
8-way rm -rf to remove the files created.

			wall time	fs_mark rate
2.6.36-rc4:
	create:		13m10s		65k files/s
	unlink:		23m58s		N/A

The first set of patches are generic infrastructure changes that
address pain points the rbtree based buffer cache introduces. I've
put them first because they are simpler to review and have immediate
impact on performance. These patches address lock contention as
measured by the kernel lockstat infrastructure.

xfs: single thread inode cache shrinking.
	- prevents per-ag contention during cache shrinking

xfs: reduce the number of CIL lock round trips during commit
	- reduces lock traffic on the xc_cil_lock by two orders of
	  magnitude

xfs: remove debug assert for per-ag reference counting
xfs: lockless per-ag lookups
	- hottest lock in the system with buffer cache rbtree path
	- converted to use RCU.

xfs: convert inode cache lookups to use RCU locking
xfs: convert pag_ici_lock to a spin lock
	- addresses lookup vs reclaim contention on pag_ici_lock
	- converted to use RCU.

xfs: don't use vfs writeback for pure metadata modifications
	- inode writeback does not keep up with dirtying 100,000
	  inodes a second. Avoids the superblock dirty list where
	  possible by using the AIL as the age-order flusher.

Performance with these patches:

2.6.36-rc4 + shrinker + CIL + RCU:
	create:		11m38s		80k files/s
	unlink:		14m29s		N/A

Create rate has improved by 20%, unlink time has almost halved. On
large numbers of inodes, the unlink rate improves even more
dramatically.

The buffer cache to rbtree series currently stands at:

xfs: rename xfs_buf_get_nodaddr to be more appropriate
xfs: introduced uncached buffer read primitve
xfs: store xfs_mount in the buftarg instead of in the xfs_buf
xfs: kill XBF_FS_MANAGED buffers
xfs: use unhashed buffers for size checks
xfs: remove buftarg hash for external devices
	- preparatory buffer cache API cleanup patches

xfs: convert buffer cache hash to rbtree
	- what it says ;)
	- includes changes based on Alex's review.

xfs: pack xfs_buf structure more tightly
	- memory usage reduction, means adding the LRU list head is
	  effectively memory usage neutral.

xfs: convert xfsbud shrinker to a per-buftarg shrinker.
xfs: add a lru to the XFS buffer cache
	- Add an LRU for reclaim

xfs: stop using the page cache to back the buffer cache
	- kill all the page cache code

2.6.36-rc4 + shrinker + CIL + RCU + rbtree:
	create:		 9m47s		95k files/s
	unlink:		14m16s		N/A

Create rate has improved by another 20%, unlink rate has improved
marginally (noise, really).

There are two remaining parts to the buffer cache conversions:

	1. work out how to efficiently support block sizes smaller
	than page size. The current code works, but uses a full page per
	sub-page buffer.  A set of slab caches would be perfect for
	this use, but I'm not sure that we are allowed to use them
	for IO anymore. Christoph?

	2. Connect up the buffer type specific reclaim priority
	reference counting and convert the LRU reclaim to a cursor
	based walk that simply drops reclaim reference counts and
	frees anything that has a zero reclaim reference.

Overall, I can swap the order of the two patch sets, and the
incremental performance increases for create are pretty much
identical. For unlink, the benefit comes from the shrinker
modification. For those that care, the rbtree patch set in isolation
results in a time of 4h38m to create 1 billion inodes on my 8p/4GB
RAM test VM. I haven't run this test with the RCU and writeback
modifications yet.

The next step from this point is to start testing against Nick Piggin's
VFS scalability tree, as the inode_lock and dcache_lock are now the
performance limiting factors. That will, without doubt, bring new
hotspots out in XFS so I'll be starting this cycle over again soon.

Overall diffstat at this point is:

 fs/xfs/linux-2.6/kmem.h        |    1 +
 fs/xfs/linux-2.6/xfs_buf.c     |  588 ++++++++++++++--------------------------
 fs/xfs/linux-2.6/xfs_buf.h     |   61 +++--
 fs/xfs/linux-2.6/xfs_iops.c    |   18 +-
 fs/xfs/linux-2.6/xfs_super.c   |   11 +-
 fs/xfs/linux-2.6/xfs_sync.c    |   49 +++-
 fs/xfs/linux-2.6/xfs_trace.h   |    2 +-
 fs/xfs/quota/xfs_qm_syscalls.c |    4 +-
 fs/xfs/xfs_ag.h                |    9 +-
 fs/xfs/xfs_buf_item.c          |    3 +-
 fs/xfs/xfs_fsops.c             |   11 +-
 fs/xfs/xfs_iget.c              |   46 +++-
 fs/xfs/xfs_inode.c             |   22 +-
 fs/xfs/xfs_inode_item.c        |    9 -
 fs/xfs/xfs_log.c               |    3 +-
 fs/xfs/xfs_log_cil.c           |  116 +++++----
 fs/xfs/xfs_log_recover.c       |   18 +-
 fs/xfs/xfs_mount.c             |  126 ++++-----
 fs/xfs/xfs_mount.h             |    2 +
 fs/xfs/xfs_rtalloc.c           |   29 +-
 fs/xfs/xfs_vnodeops.c          |    2 +-
 21 files changed, 502 insertions(+), 628 deletions(-)

So it is improving performance, removing code and fixing
longstanding bugs all at the same time. ;)

* [PATCH 01/18] xfs: single thread inode cache shrinking.
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 18:48   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Having multiple CPUs trying to do the same cache shrinking work can
be actively harmful to performance when the shrinkers land in the
same AGs.  They then lockstep on perag locks, causing contention and
slowing each other down. Reclaim walking is sufficiently efficient
that we do not need parallelism to make significant progress, so stop
parallel access at the door.

Instead, keep track of the number of objects the shrinkers want
cleaned and make sure the single running shrinker does not stop
until it has hit the threshold that the other shrinker calls have
built up.
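
In outline, the pattern looks like the sketch below. This is not the
patch itself - pending_nr, shrink_lock and do_reclaim_walk() are
illustrative stand-ins, and the real code uses an atomic64_cmpxchg()
loop rather than atomic64_xchg() - but it shows the idea of banking
work and draining it:

	/*
	 * Sketch of the single-threaded shrinker pattern (illustrative
	 * names, not the applied patch). Losers of the trylock bank
	 * their scan count; the single winner drains that count before
	 * deciding whether the scan target was met.
	 */
	static atomic64_t pending_nr = ATOMIC64_INIT(0);
	static DEFINE_MUTEX(shrink_lock);

	static int shrink_cache(int64_t nr_to_scan)
	{
		if (!mutex_trylock(&shrink_lock)) {
			/* someone else is shrinking - leave our work for them */
			atomic64_add(nr_to_scan, &pending_nr);
			return -1;
		}

		/* pick up everything banked while we were not looking */
		nr_to_scan += atomic64_xchg(&pending_nr, 0);
		do_reclaim_walk(&nr_to_scan);	/* hypothetical reclaim walker */
		mutex_unlock(&shrink_lock);

		/* if we didn't exhaust the scan, don't bother coming back */
		return nr_to_scan > 0 ? -1 : 0;
	}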

This increases the cold-cache unlink rate of an 8-way parallel unlink
workload from about 15,000 unlinks/s to 60-70,000 unlinks/s for the
same CPU usage (~700%), resulting in the runtime for a 200M inode
unlink workload dropping from 4h50m to just under 1 hour.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_sync.c |   21 +++++++++++++++++++--
 fs/xfs/xfs_mount.h          |    2 ++
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index d59c4a6..bc54cd6 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -869,7 +869,9 @@ xfs_reclaim_inodes(
 }
 
 /*
- * Shrinker infrastructure.
+ * Shrinker infrastructure. We allow filesystem reclaim recursion here because
+ * we trylock everything in the reclaim path and hence will not deadlock with
+ * locks that may already be held through direct reclaim.
  */
 static int
 xfs_reclaim_inode_shrink(
@@ -883,12 +885,25 @@ xfs_reclaim_inode_shrink(
 	int		reclaimable;
 
 	mp = container_of(shrink, struct xfs_mount, m_inode_shrink);
+
 	if (nr_to_scan) {
-		if (!(gfp_mask & __GFP_FS))
+		int64_t scan_cnt;
+
+		if (!mutex_trylock(&mp->m_ino_shrink_lock)) {
+			atomic64_add(nr_to_scan, &mp->m_ino_shrink_nr);
 			return -1;
+		}
 
+		do {
+			scan_cnt = atomic64_read(&mp->m_ino_shrink_nr);
+		} while (scan_cnt !=
+			atomic64_cmpxchg(&mp->m_ino_shrink_nr, scan_cnt, 0));
+
+		nr_to_scan += scan_cnt;
 		xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
 					XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
+		mutex_unlock(&mp->m_ino_shrink_lock);
+
 		/* if we don't exhaust the scan, don't bother coming back */
 		if (nr_to_scan > 0)
 			return -1;
@@ -910,6 +925,8 @@ xfs_inode_shrinker_register(
 {
 	mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
 	mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
+	atomic64_set(&mp->m_ino_shrink_nr, 0);
+	mutex_init(&mp->m_ino_shrink_lock);
 	register_shrinker(&mp->m_inode_shrink);
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 622da21..57b5644 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -199,6 +199,8 @@ typedef struct xfs_mount {
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
 	struct shrinker		m_inode_shrink;	/* inode reclaim shrinker */
+	atomic64_t		m_ino_shrink_nr;
+	struct mutex		m_ino_shrink_lock;
 } xfs_mount_t;
 
 /*
-- 
1.7.1

* [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
  2010-09-14 10:56 ` [PATCH 01/18] xfs: single thread inode cache shrinking Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:48   ` Christoph Hellwig
  2010-09-14 17:21   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When committing a transaction, we do a CIL state lock round trip on
every single log vector we insert into the CIL. This results in the
lock being as hot as the inode and dcache locks on 8-way create
workloads. Rework the insertion loops to bring the number of lock
round trips down to one per transaction for log vectors, and one
more for the busy extents.

Also change the allocation of the log vector buffer not to zero it
as we copy over the entire allocated buffer anyway.
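
Reduced to a sketch, the shape of the change is: do all the per-vector
accounting and pinning with no locks held, then take the CIL lock once
to splice every item onto the list and apply the aggregated totals.
The structure and helper names below are approximations, not the real
xfs_log_cil.c code:

	/* Sketch only: one CIL lock round trip per transaction commit. */
	static void cil_insert_items(struct cil *cil, struct log_vec *vectors)
	{
		struct log_vec	*lv;
		int		len = 0, diff_iovecs = 0;

		/* pass 1: lock-free - swap log vectors, pin items, sum deltas */
		for (lv = vectors; lv; lv = lv->lv_next)
			account_and_pin(lv, &len, &diff_iovecs);

		/* pass 2: a single lock/unlock pair covers every item */
		spin_lock(&cil->lock);
		for (lv = vectors; lv; lv = lv->lv_next)
			list_move_tail(&lv->lv_item->li_cil, &cil->items);
		account_checkpoint_space(cil, len, diff_iovecs);
		spin_unlock(&cil->lock);
	}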

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c |  116 +++++++++++++++++++++++++++++--------------------
 1 files changed, 69 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index ed575fb..f1e6184 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -146,68 +146,37 @@ xlog_cil_init_post_recovery(
 }
 
 /*
- * Insert the log item into the CIL and calculate the difference in space
+ * Insert the log items into the CIL and calculate the difference in space
  * consumed by the item. Add the space to the checkpoint ticket and calculate
  * if the change requires additional log metadata. If it does, take that space
  * as well. Remove the amount of space we addded to the checkpoint ticket from
  * the current transaction ticket so that the accounting works out correctly.
- *
- * If this is the first time the item is being placed into the CIL in this
- * context, pin it so it can't be written to disk until the CIL is flushed to
- * the iclog and the iclog written to disk.
  */
 static void
 xlog_cil_insert(
 	struct log		*log,
 	struct xlog_ticket	*ticket,
-	struct xfs_log_item	*item,
-	struct xfs_log_vec	*lv)
+	struct xfs_log_vec	*log_vector,
+	int			diff_length,
+	int			diff_iovecs)
 {
 	struct xfs_cil		*cil = log->l_cilp;
-	struct xfs_log_vec	*old = lv->lv_item->li_lv;
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
-	int			len;
-	int			diff_iovecs;
 	int			iclog_space;
+	int			len = diff_length;
+	struct xfs_log_vec	*lv;
 
-	if (old) {
-		/* existing lv on log item, space used is a delta */
-		ASSERT(!list_empty(&item->li_cil));
-		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
-
-		len = lv->lv_buf_len - old->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs - old->lv_niovecs;
-		kmem_free(old->lv_buf);
-		kmem_free(old);
-	} else {
-		/* new lv, must pin the log item */
-		ASSERT(!lv->lv_item->li_lv);
-		ASSERT(list_empty(&item->li_cil));
+	spin_lock(&cil->xc_cil_lock);
 
-		len = lv->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs;
-		IOP_PIN(lv->lv_item);
+	/* move the items to the tail of the CIL */
+	for (lv = log_vector; lv; lv = lv->lv_next)
+		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
 
-	}
+	/* account for space used by new iovec headers  */
 	len += diff_iovecs * sizeof(xlog_op_header_t);
-
-	/* attach new log vector to log item */
-	lv->lv_item->li_lv = lv;
-
-	spin_lock(&cil->xc_cil_lock);
-	list_move_tail(&item->li_cil, &cil->xc_cil);
 	ctx->nvecs += diff_iovecs;
 
 	/*
-	 * If this is the first time the item is being committed to the CIL,
-	 * store the sequence number on the log item so we can tell
-	 * in future commits whether this is the first checkpoint the item is
-	 * being committed into.
-	 */
-	if (!item->li_seq)
-		item->li_seq = ctx->sequence;
-
-	/*
 	 * Now transfer enough transaction reservation to the context ticket
 	 * for the checkpoint. The context ticket is special - the unit
 	 * reservation has to grow as well as the current reservation as we
@@ -286,7 +255,7 @@ xlog_cil_format_items(
 			len += lv->lv_iovecp[index].i_len;
 
 		lv->lv_buf_len = len;
-		lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
+		lv->lv_buf = kmem_alloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
 		ptr = lv->lv_buf;
 
 		for (index = 0; index < lv->lv_niovecs; index++) {
@@ -307,14 +276,67 @@ xlog_cil_insert_items(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn)
 {
-	struct xfs_log_vec *lv;
+	struct xfs_log_vec	*lv;
+	int			len = 0;
+	int			diff_iovecs = 0;
+
+	ASSERT(log_vector);
 
 	if (start_lsn)
 		*start_lsn = log->l_cilp->xc_ctx->sequence;
 
-	ASSERT(log_vector);
-	for (lv = log_vector; lv; lv = lv->lv_next)
-		xlog_cil_insert(log, ticket, lv->lv_item, lv);
+	/*
+	 * Do all the accounting aggregation and switching of log vectors
+	 * around in a separate loop to the insertion of items into the CIL.
+	 * Then we can do a separate loop to update the CIL within a single
+	 * lock/unlock pair. This reduces the number of round trips on the CIL
+	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
+	 * hold time for the transaction commit.
+	 *
+	 * If this is the first time the item is being placed into the CIL in
+	 * this context, pin it so it can't be written to disk until the CIL is
+	 * flushed to the iclog and the iclog written to disk.
+	 *
+	 * We can do this safely because the context can't checkpoint until we
+	 * are done so it doesn't matter exactly how we update the CIL.
+	 */
+	for (lv = log_vector; lv; lv = lv->lv_next) {
+		struct xfs_log_vec	*old = lv->lv_item->li_lv;
+
+		if (old) {
+			/* existing lv on log item, space used is a delta */
+			ASSERT(!list_empty(&lv->lv_item->li_cil));
+			ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
+
+			len += lv->lv_buf_len - old->lv_buf_len;
+			diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
+			kmem_free(old->lv_buf);
+			kmem_free(old);
+		} else {
+			/* new lv, must pin the log item */
+			ASSERT(!lv->lv_item->li_lv);
+			ASSERT(list_empty(&lv->lv_item->li_cil));
+
+			len += lv->lv_buf_len;
+			diff_iovecs += lv->lv_niovecs;
+			IOP_PIN(lv->lv_item);
+
+		}
+
+		/* attach new log vector to log item */
+		lv->lv_item->li_lv = lv;
+
+		/*
+		 * If this is the first time the item is being committed to the
+		 * CIL, store the sequence number on the log item so we can
+		 * tell in future commits whether this is the first checkpoint
+		 * the item is being committed into.
+		 */
+		if (!lv->lv_item->li_seq)
+			lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
+	}
+
+	xlog_cil_insert(log, ticket, log_vector, len, diff_iovecs);
 }
 
 static void
-- 
1.7.1

* [PATCH 03/18] xfs: remove debug assert for per-ag reference counting
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
  2010-09-14 10:56 ` [PATCH 01/18] xfs: single thread inode cache shrinking Dave Chinner
  2010-09-14 10:56 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:48   ` Christoph Hellwig
  2010-09-14 17:22   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
                   ` (16 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we start taking references per cached buffer to the perag
it is cached on, it will blow the current debug maximum reference
count assert out of the water. The assert has never caught a bug,
and we have tracing to track changes if there ever is a problem,
so just remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aeb9d72..00c7a87 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -210,8 +210,6 @@ xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno)
 	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
 	if (pag) {
 		ASSERT(atomic_read(&pag->pag_ref) >= 0);
-		/* catch leaks in the positive direction during testing */
-		ASSERT(atomic_read(&pag->pag_ref) < 1000);
 		ref = atomic_inc_return(&pag->pag_ref);
 	}
 	spin_unlock(&mp->m_perag_lock);
-- 
1.7.1

* [PATCH 04/18] xfs: lockless per-ag lookups
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (2 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 12:35   ` Dave Chinner
                     ` (2 more replies)
  2010-09-14 10:56 ` [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking Dave Chinner
                   ` (15 subsequent siblings)
  19 siblings, 3 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we start taking a reference to the per-ag for every cached
buffer in the system, kernel lockstat profiling on an 8-way create
workload shows the mp->m_perag_lock has higher acquisition rates
than the inode lock and has significantly more contention. That is,
it becomes the highest contended lock in the system.

The perag lookup is trivial to convert to lock-less RCU lookups
because perag structures never go away. Hence the only thing we need
to protect against is tree structure changes during a grow. This can
be done simply by replacing the locking in xfs_perag_get() with RCU
read locking. This removes the mp->m_perag_lock completely from this
path.
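
Condensed from the diff below, the read and free sides pair up in the
usual RCU way: the lookup only needs rcu_read_lock() because anything
it can find is guaranteed not to be freed until a grace period has
passed, and the (unmount-time) free side enforces that with call_rcu():

	/* lookup side - no m_perag_lock needed any more */
	rcu_read_lock();
	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
	if (pag)
		atomic_inc(&pag->pag_ref);	/* pin before leaving RCU */
	rcu_read_unlock();

	/* free side (unmount only) - defer the kmem_free to a grace period */
	spin_lock(&mp->m_perag_lock);
	pag = radix_tree_delete(&mp->m_perag_tree, agno);
	spin_unlock(&mp->m_perag_lock);
	call_rcu(&pag->rcu_head, __xfs_free_perag);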

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_ag.h    |    3 +++
 fs/xfs/xfs_mount.c |   25 +++++++++++++++++--------
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 4917d4e..51c42c2 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -230,6 +230,9 @@ typedef struct xfs_perag {
 	rwlock_t	pag_ici_lock;	/* incore inode lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
+
+	/* for rcu-safe freeing */
+	struct rcu_head	rcu_head;
 #endif
 	int		pagb_count;	/* pagb slots in use */
 } xfs_perag_t;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 00c7a87..14fc6e9 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -199,6 +199,8 @@ xfs_uuid_unmount(
 
 /*
  * Reference counting access wrappers to the perag structures.
+ * Because we never free per-ag structures, the only thing we
+ * have to protect against changes is the tree structure itself.
  */
 struct xfs_perag *
 xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno)
@@ -206,13 +208,13 @@ xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno)
 	struct xfs_perag	*pag;
 	int			ref = 0;
 
-	spin_lock(&mp->m_perag_lock);
+	rcu_read_lock();
 	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
 	if (pag) {
 		ASSERT(atomic_read(&pag->pag_ref) >= 0);
 		ref = atomic_inc_return(&pag->pag_ref);
 	}
-	spin_unlock(&mp->m_perag_lock);
+	rcu_read_unlock();
 	trace_xfs_perag_get(mp, agno, ref, _RET_IP_);
 	return pag;
 }
@@ -227,10 +229,18 @@ xfs_perag_put(struct xfs_perag *pag)
 	trace_xfs_perag_put(pag->pag_mount, pag->pag_agno, ref, _RET_IP_);
 }
 
+STATIC void
+__xfs_free_perag(
+	struct rcu_head	*head)
+{
+	struct xfs_perag *pag = container_of(head, struct xfs_perag, rcu_head);
+
+	ASSERT(atomic_read(&pag->pag_ref) == 0);
+	kmem_free(pag);
+}
+
 /*
- * Free up the resources associated with a mount structure.  Assume that
- * the structure was initially zeroed, so we can tell which fields got
- * initialized.
+ * Free up the per-ag resources associated with the mount structure.
  */
 STATIC void
 xfs_free_perag(
@@ -242,10 +252,9 @@ xfs_free_perag(
 	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
 		spin_lock(&mp->m_perag_lock);
 		pag = radix_tree_delete(&mp->m_perag_tree, agno);
-		ASSERT(pag);
-		ASSERT(atomic_read(&pag->pag_ref) == 0);
 		spin_unlock(&mp->m_perag_lock);
-		kmem_free(pag);
+		ASSERT(pag);
+		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
 
-- 
1.7.1

* [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (3 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 16:27   ` Christoph Hellwig
  2010-09-14 21:23   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock Dave Chinner
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

With delayed logging greatly increasing the sustained parallelism of inode
operations, the inode cache locking is showing significant read vs write
contention when inode reclaim runs at the same time as lookups. There are
also a lot more write lock acquisitions than read locks (a 4:1 ratio), so
the read locking is not really buying us much in the way of parallelism.

To avoid the read vs write contention, change the cache to use RCU locking on
the read side. To avoid needing to RCU free every single inode, use the built-in
slab RCU freeing mechanism. This requires us to be able to detect lookups of
freed inodes, so ensure that every freed inode has an inode number of zero and
the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in the cache
hit lookup path, but also add a check for a zero inode number as well.

We can then convert all the read-locking lookups to use RCU read-side locking
and hence remove all the read-side locking on the pag_ici_lock.
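
The subtlety with SLAB_DESTROY_BY_RCU is that the memory is not held
back for a full grace period - it can be reused for a new inode
immediately - so an RCU lookup has to re-validate whatever it finds.
Condensed from the patch below, the hit path becomes roughly:

	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (ip && ip->i_ino == 0) {
		/*
		 * The inode was freed but the radix tree node has not
		 * been updated yet (we are inside the RCU grace period).
		 * Back off, let the grace period expire and retry.
		 */
		rcu_read_unlock();
		synchronize_rcu();
		goto again;
	}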

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/kmem.h        |    1 +
 fs/xfs/linux-2.6/xfs_super.c   |    3 ++-
 fs/xfs/linux-2.6/xfs_sync.c    |   12 ++++++------
 fs/xfs/quota/xfs_qm_syscalls.c |    4 ++--
 fs/xfs/xfs_iget.c              |   36 +++++++++++++++++++++++++++++-------
 fs/xfs/xfs_inode.c             |   22 ++++++++++++++--------
 6 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/linux-2.6/kmem.h b/fs/xfs/linux-2.6/kmem.h
index f7c8f7a..c0fe7ef 100644
--- a/fs/xfs/linux-2.6/kmem.h
+++ b/fs/xfs/linux-2.6/kmem.h
@@ -82,6 +82,7 @@ extern void *kmem_zalloc_greedy(size_t *, size_t, size_t);
 #define KM_ZONE_HWALIGN	SLAB_HWCACHE_ALIGN
 #define KM_ZONE_RECLAIM	SLAB_RECLAIM_ACCOUNT
 #define KM_ZONE_SPREAD	SLAB_MEM_SPREAD
+#define KM_ZONE_RCU	SLAB_DESTROY_BY_RCU
 
 #define kmem_zone	kmem_cache
 #define kmem_zone_t	struct kmem_cache
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index a4e0797..6205eb8 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1723,7 +1723,8 @@ xfs_init_zones(void)
 
 	xfs_inode_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
-			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
+			KM_ZONE_RCU | KM_ZONE_HWALIGN |
+			KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
 			xfs_fs_inode_init_once);
 	if (!xfs_inode_zone)
 		goto out_destroy_efi_zone;
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index bc54cd6..e549d67 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -102,13 +102,13 @@ restart:
 		if (exclusive)
 			write_lock(&pag->pag_ici_lock);
 		else
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
 				write_unlock(&pag->pag_ici_lock);
 			else
-				read_unlock(&pag->pag_ici_lock);
+				rcu_read_unlock();
 			break;
 		}
 
@@ -204,11 +204,10 @@ xfs_inode_ag_iterator(
 	return XFS_ERROR(last_error);
 }
 
-/* must be called with pag_ici_lock held and releases it */
 int
 xfs_sync_inode_valid(
 	struct xfs_inode	*ip,
-	struct xfs_perag	*pag)
+	struct xfs_perag	*pag) __releases(RCU)
 {
 	struct inode		*inode = VFS_I(ip);
 	int			error = EFSCORRUPTED;
@@ -219,7 +218,8 @@ xfs_sync_inode_valid(
 
 	/* avoid new or reclaimable inodes. Leave for reclaim code to flush */
 	error = ENOENT;
-	if (xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
+	if (ip->i_ino == 0 ||
+	    xfs_iflags_test(ip, XFS_INEW | XFS_IRECLAIMABLE | XFS_IRECLAIM))
 		goto out_unlock;
 
 	/* If we can't grab the inode, it must on it's way to reclaim. */
@@ -234,7 +234,7 @@ xfs_sync_inode_valid(
 	/* inode is valid */
 	error = 0;
 out_unlock:
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	return error;
 }
 
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index 45e5849..ab9cafc 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -873,7 +873,7 @@ STATIC int
 xfs_dqrele_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
-	int			flags)
+	int			flags) __releases(RCU)
 {
 	int			error;
 
@@ -882,7 +882,7 @@ xfs_dqrele_inode(
 	    ip == ip->i_mount->m_quotainfo->qi_gquotaip) {
 		ASSERT(ip->i_udquot == NULL);
 		ASSERT(ip->i_gdquot == NULL);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		return 0;
 	}
 
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index b1ecc6f..f3a46b6 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -69,6 +69,7 @@ xfs_inode_alloc(
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
+	ASSERT(ip->i_ino == 0);
 
 	mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
 
@@ -134,6 +135,13 @@ xfs_inode_free(
 	ASSERT(!spin_is_locked(&ip->i_flags_lock));
 	ASSERT(completion_done(&ip->i_flush));
 
+	/*
+	 * because we use SLAB_DESTROY_BY_RCU freeing, ensure the inode
+	 * always appears to be reclaimed with an invalid inode number
+	 * when in the free state.
+	 */
+	ip->i_flags = XFS_IRECLAIM;
+	ip->i_ino = 0;
 	kmem_zone_free(xfs_inode_zone, ip);
 }
 
@@ -145,12 +153,26 @@ xfs_iget_cache_hit(
 	struct xfs_perag	*pag,
 	struct xfs_inode	*ip,
 	int			flags,
-	int			lock_flags) __releases(pag->pag_ici_lock)
+	int			lock_flags) __releases(RCU)
 {
 	struct inode		*inode = VFS_I(ip);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error;
 
+	/*
+	 * check for re-use of an inode within an RCU grace period due to the
+	 * radix tree nodes not being updated yet. We monitor for this by
+	 * setting the inode number to zero before freeing the inode structure.
+	 */
+	if (ip->i_ino == 0) {
+		trace_xfs_iget_skip(ip);
+		XFS_STATS_INC(xs_ig_frecycle);
+		rcu_read_unlock();
+		/* Expire the grace period so we don't trip over it again. */
+		synchronize_rcu();
+		return EAGAIN;
+	}
+
 	spin_lock(&ip->i_flags_lock);
 
 	/*
@@ -194,7 +216,7 @@ xfs_iget_cache_hit(
 		ip->i_flags |= XFS_IRECLAIM;
 
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 
 		error = -inode_init_always(mp->m_super, inode);
 		if (error) {
@@ -202,7 +224,7 @@ xfs_iget_cache_hit(
 			 * Re-initializing the inode failed, and we are in deep
 			 * trouble.  Try to re-add it to the reclaim list.
 			 */
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			spin_lock(&ip->i_flags_lock);
 
 			ip->i_flags &= ~XFS_INEW;
@@ -230,7 +252,7 @@ xfs_iget_cache_hit(
 
 		/* We've got a live one. */
 		spin_unlock(&ip->i_flags_lock);
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		trace_xfs_iget_hit(ip);
 	}
 
@@ -244,7 +266,7 @@ xfs_iget_cache_hit(
 
 out_error:
 	spin_unlock(&ip->i_flags_lock);
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	return error;
 }
 
@@ -375,7 +397,7 @@ xfs_iget(
 
 again:
 	error = 0;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
 
 	if (ip) {
@@ -383,7 +405,7 @@ again:
 		if (error)
 			goto out_error_or_again;
 	} else {
-		read_unlock(&pag->pag_ici_lock);
+		rcu_read_unlock();
 		XFS_STATS_INC(xs_ig_missed);
 
 		error = xfs_iget_cache_miss(mp, pag, tp, ino, &ip,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 34798f3..6927699 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1999,13 +1999,14 @@ xfs_ifree_cluster(
 		 */
 		for (i = 0; i < ninodes; i++) {
 retry:
-			read_lock(&pag->pag_ici_lock);
+			rcu_read_lock();
 			ip = radix_tree_lookup(&pag->pag_ici_root,
 					XFS_INO_TO_AGINO(mp, (inum + i)));
 
 			/* Inode not in memory or stale, nothing to do */
-			if (!ip || xfs_iflags_test(ip, XFS_ISTALE)) {
-				read_unlock(&pag->pag_ici_lock);
+			if (!ip || !ip->i_ino ||
+			    xfs_iflags_test(ip, XFS_ISTALE)) {
+				rcu_read_unlock();
 				continue;
 			}
 
@@ -2018,11 +2019,11 @@ retry:
 			 */
 			if (ip != free_ip &&
 			    !xfs_ilock_nowait(ip, XFS_ILOCK_EXCL)) {
-				read_unlock(&pag->pag_ici_lock);
+				rcu_read_unlock();
 				delay(1);
 				goto retry;
 			}
-			read_unlock(&pag->pag_ici_lock);
+			rcu_read_unlock();
 
 			xfs_iflock(ip);
 			xfs_iflags_set(ip, XFS_ISTALE);
@@ -2628,7 +2629,7 @@ xfs_iflush_cluster(
 
 	mask = ~(((XFS_INODE_CLUSTER_SIZE(mp) >> mp->m_sb.sb_inodelog)) - 1);
 	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
-	read_lock(&pag->pag_ici_lock);
+	rcu_read_lock();
 	/* really need a gang lookup range call here */
 	nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)ilist,
 					first_index, inodes_per_cluster);
@@ -2639,6 +2640,11 @@ xfs_iflush_cluster(
 		iq = ilist[i];
 		if (iq == ip)
 			continue;
+
+		/* check we've got a valid inode */
+		if (!iq->i_ino)
+			continue;
+
 		/* if the inode lies outside this cluster, we're done. */
 		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) != first_index)
 			break;
@@ -2691,7 +2697,7 @@ xfs_iflush_cluster(
 	}
 
 out_free:
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	kmem_free(ilist);
 out_put:
 	xfs_perag_put(pag);
@@ -2703,7 +2709,7 @@ cluster_corrupt_out:
 	 * Corruption detected in the clustering loop.  Invalidate the
 	 * inode buffer and shut down the filesystem.
 	 */
-	read_unlock(&pag->pag_ici_lock);
+	rcu_read_unlock();
 	/*
 	 * Clean up the buffer.  If it was B_DELWRI, just release it --
 	 * brelse can handle it with no problems.  If not, shut down the
-- 
1.7.1

* [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (4 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 21:26   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we are using RCU protection for the inode cache lookups,
the lock is only needed on the modification side. Hence it is not
necessary for the lock to be an rwlock as there are no read-side
holders anymore. Convert it to a spin lock to reflect its exclusive
nature.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |   16 ++++++++--------
 fs/xfs/xfs_ag.h             |    2 +-
 fs/xfs/xfs_iget.c           |   10 +++++-----
 fs/xfs/xfs_mount.c          |    2 +-
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index e549d67..c093e91 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -100,13 +100,13 @@ restart:
 		xfs_inode_t	*ip;
 
 		if (exclusive)
-			write_lock(&pag->pag_ici_lock);
+			spin_lock(&pag->pag_ici_lock);
 		else
 			rcu_read_lock();
 		ip = xfs_inode_ag_lookup(mp, pag, &first_index, tag);
 		if (!ip) {
 			if (exclusive)
-				write_unlock(&pag->pag_ici_lock);
+				spin_unlock(&pag->pag_ici_lock);
 			else
 				rcu_read_unlock();
 			break;
@@ -659,12 +659,12 @@ xfs_inode_set_reclaim_tag(
 	struct xfs_perag *pag;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	spin_lock(&ip->i_flags_lock);
 	__xfs_inode_set_reclaim_tag(pag, ip);
 	__xfs_iflags_set(ip, XFS_IRECLAIMABLE);
 	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	xfs_perag_put(pag);
 }
 
@@ -757,12 +757,12 @@ xfs_reclaim_inode(
 	if (__xfs_iflags_test(ip, XFS_IRECLAIM)) {
 		/* ignore as it is already under reclaim */
 		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
+		spin_unlock(&pag->pag_ici_lock);
 		return 0;
 	}
 	__xfs_iflags_set(ip, XFS_IRECLAIM);
 	spin_unlock(&ip->i_flags_lock);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	if (!xfs_iflock_nowait(ip)) {
@@ -834,11 +834,11 @@ reclaim:
 	 * added to the tree assert that it's been there before to catch
 	 * problems with the inode life time early on.
 	 */
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 	if (!radix_tree_delete(&pag->pag_ici_root,
 				XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino)))
 		ASSERT(0);
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 
 	/*
 	 * Here we do an (almost) spurious inode lock in order to coordinate
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 51c42c2..6de9128 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -227,7 +227,7 @@ typedef struct xfs_perag {
 
 	atomic_t        pagf_fstrms;    /* # of filestreams active in this AG */
 
-	rwlock_t	pag_ici_lock;	/* incore inode lock */
+	spinlock_t	pag_ici_lock;	/* incore inode cache lock */
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 
diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index f3a46b6..c46ce03 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -234,14 +234,14 @@ xfs_iget_cache_hit(
 			goto out_error;
 		}
 
-		write_lock(&pag->pag_ici_lock);
+		spin_lock(&pag->pag_ici_lock);
 		spin_lock(&ip->i_flags_lock);
 		ip->i_flags &= ~(XFS_IRECLAIMABLE | XFS_IRECLAIM);
 		ip->i_flags |= XFS_INEW;
 		__xfs_inode_clear_reclaim_tag(mp, pag, ip);
 		inode->i_state = I_NEW;
 		spin_unlock(&ip->i_flags_lock);
-		write_unlock(&pag->pag_ici_lock);
+		spin_unlock(&pag->pag_ici_lock);
 	} else {
 		/* If the VFS inode is being torn down, pause and try again. */
 		if (!igrab(inode)) {
@@ -319,7 +319,7 @@ xfs_iget_cache_miss(
 			BUG();
 	}
 
-	write_lock(&pag->pag_ici_lock);
+	spin_lock(&pag->pag_ici_lock);
 
 	/* insert the new inode */
 	error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
@@ -334,14 +334,14 @@ xfs_iget_cache_miss(
 	ip->i_udquot = ip->i_gdquot = NULL;
 	xfs_iflags_set(ip, XFS_INEW);
 
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 
 	*ipp = ip;
 	return 0;
 
 out_preload_end:
-	write_unlock(&pag->pag_ici_lock);
+	spin_unlock(&pag->pag_ici_lock);
 	radix_tree_preload_end();
 	if (lock_flags)
 		xfs_iunlock(ip, lock_flags);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 14fc6e9..546eb1f 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -450,7 +450,7 @@ xfs_initialize_perag(
 			goto out_unwind;
 		pag->pag_agno = index;
 		pag->pag_mount = mp;
-		rwlock_init(&pag->pag_ici_lock);
+		spin_lock_init(&pag->pag_ici_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
 
 		if (radix_tree_preload(GFP_NOFS))
-- 
1.7.1

* [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (5 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:54   ` Christoph Hellwig
                     ` (2 more replies)
  2010-09-14 10:56 ` [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
                   ` (12 subsequent siblings)
  19 siblings, 3 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Under heavy multi-way parallel create workloads, the VFS struggles to write
back all the inodes that have been changed in age order. The bdi flusher thread
becomes CPU bound, spending 85% of its time in the VFS code, mostly traversing
the superblock dirty inode list to separate dirty inodes old enough to flush.

We already keep an index of all metadata changes in age order - in the AIL -
and continued log pressure will do age ordered writeback without any extra
overhead at all. If there is no pressure on the log, the xfssyncd will
periodically write back metadata in ascending disk address offset order, so it
will be very efficient.

Hence we can stop marking VFS inodes dirty during transaction commit or when
changing timestamps during transactions. This will limit the inodes on the
superblock dirty list to those containing data or unlogged metadata changes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_iops.c |   18 +++++-------------
 fs/xfs/xfs_inode_item.c     |    9 ---------
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index 1e084ff..8f21765 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -95,9 +95,11 @@ xfs_mark_inode_dirty(
 }
 
 /*
- * Change the requested timestamp in the given inode.
- * We don't lock across timestamp updates, and we don't log them but
- * we do record the fact that there is dirty information in core.
+ * Change the requested timestamp in the given inode.  We don't lock across
+ * timestamp updates, and we don't log them directly.  However, all timestamp
+ * changes occur within transactions that log the inode core, so the timestamp
+ * changes will be copied back into the XFS inode during transaction commit.
+ * Hence we do not need to dirty the inode here.
  */
 void
 xfs_ichgtime(
@@ -106,27 +108,17 @@ xfs_ichgtime(
 {
 	struct inode	*inode = VFS_I(ip);
 	timespec_t	tv;
-	int		sync_it = 0;
 
 	tv = current_fs_time(inode->i_sb);
 
 	if ((flags & XFS_ICHGTIME_MOD) &&
 	    !timespec_equal(&inode->i_mtime, &tv)) {
 		inode->i_mtime = tv;
-		sync_it = 1;
 	}
 	if ((flags & XFS_ICHGTIME_CHG) &&
 	    !timespec_equal(&inode->i_ctime, &tv)) {
 		inode->i_ctime = tv;
-		sync_it = 1;
 	}
-
-	/*
-	 * Update complete - now make sure everyone knows that the inode
-	 * is dirty.
-	 */
-	if (sync_it)
-		xfs_mark_inode_dirty_sync(ip);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index fe00777..c7ac020 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -223,15 +223,6 @@ xfs_inode_item_format(
 	nvecs	     = 1;
 
 	/*
-	 * Make sure the linux inode is dirty. We do this before
-	 * clearing i_update_core as the VFS will call back into
-	 * XFS here and set i_update_core, so we need to dirty the
-	 * inode first so that the ordering of i_update_core and
-	 * unlogged modifications still works as described below.
-	 */
-	xfs_mark_inode_dirty_sync(ip);
-
-	/*
 	 * Clear i_update_core if the timestamps (or any other
 	 * non-transactional modification) need flushing/logging
 	 * and we're about to log them with the rest of the core.
-- 
1.7.1

* [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (6 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:56   ` Christoph Hellwig
  2010-09-14 22:14   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 09/18] xfs: introduced uncached buffer read primitve Dave Chinner
                   ` (11 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_buf_get_noaddr() is really used to allocate a buffer that is
uncached. While such a buffer is not directly assigned a disk
address, the fact that it is not cached is the more important
distinction. With the upcoming uncached buffer read primitive, we
should be consistent with this distinction.

While there, make page allocation in xfs_buf_get_noaddr() safe
against memory reclaim re-entrancy into the filesystem by changing
the allocation to GFP_NOFS.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c   |    6 +++---
 fs/xfs/linux-2.6/xfs_buf.h   |    2 +-
 fs/xfs/linux-2.6/xfs_trace.h |    2 +-
 fs/xfs/xfs_log.c             |    3 ++-
 fs/xfs/xfs_log_recover.c     |    2 +-
 fs/xfs/xfs_vnodeops.c        |    2 +-
 6 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 286e36e..03563d4 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -707,7 +707,7 @@ xfs_buf_associate_memory(
 }
 
 xfs_buf_t *
-xfs_buf_get_noaddr(
+xfs_buf_get_uncached(
 	size_t			len,
 	xfs_buftarg_t		*target)
 {
@@ -725,7 +725,7 @@ xfs_buf_get_noaddr(
 		goto fail_free_buf;
 
 	for (i = 0; i < page_count; i++) {
-		bp->b_pages[i] = alloc_page(GFP_KERNEL);
+		bp->b_pages[i] = alloc_page(GFP_NOFS);
 		if (!bp->b_pages[i])
 			goto fail_free_mem;
 	}
@@ -740,7 +740,7 @@ xfs_buf_get_noaddr(
 
 	xfs_buf_unlock(bp);
 
-	trace_xfs_buf_get_noaddr(bp, _RET_IP_);
+	trace_xfs_buf_get_uncached(bp, _RET_IP_);
 	return bp;
 
  fail_free_mem:
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 2a05614..73a3a8e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -213,7 +213,7 @@ extern xfs_buf_t *xfs_buf_read(xfs_buftarg_t *, xfs_off_t, size_t,
 				xfs_buf_flags_t);
 
 extern xfs_buf_t *xfs_buf_get_empty(size_t, xfs_buftarg_t *);
-extern xfs_buf_t *xfs_buf_get_noaddr(size_t, xfs_buftarg_t *);
+extern xfs_buf_t *xfs_buf_get_uncached(size_t, xfs_buftarg_t *);
 extern int xfs_buf_associate_memory(xfs_buf_t *, void *, size_t);
 extern void xfs_buf_hold(xfs_buf_t *);
 extern void xfs_buf_readahead(xfs_buftarg_t *, xfs_off_t, size_t,
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..2a1d4fb 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -331,7 +331,7 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
 DEFINE_BUF_EVENT(xfs_buf_delwri_dequeue);
 DEFINE_BUF_EVENT(xfs_buf_delwri_split);
-DEFINE_BUF_EVENT(xfs_buf_get_noaddr);
+DEFINE_BUF_EVENT(xfs_buf_get_uncached);
 DEFINE_BUF_EVENT(xfs_bdstrat_shut);
 DEFINE_BUF_EVENT(xfs_buf_item_relse);
 DEFINE_BUF_EVENT(xfs_buf_item_iodone);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 33f718f..90dd3f9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1131,7 +1131,8 @@ xlog_alloc_log(xfs_mount_t	*mp,
 		iclog->ic_prev = prev_iclog;
 		prev_iclog = iclog;
 
-		bp = xfs_buf_get_noaddr(log->l_iclog_size, mp->m_logdev_targp);
+		bp = xfs_buf_get_uncached(log->l_iclog_size,
+						mp->m_logdev_targp);
 		if (!bp)
 			goto out_free_iclog;
 		if (!XFS_BUF_CPSEMA(bp))
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 6f3f5fa..f1e64db 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -107,7 +107,7 @@ xlog_get_bp(
 		nbblks += log->l_sectBBsize;
 	nbblks = round_up(nbblks, log->l_sectBBsize);
 
-	return xfs_buf_get_noaddr(BBTOB(nbblks), log->l_mp->m_logdev_targp);
+	return xfs_buf_get_uncached(BBTOB(nbblks), log->l_mp->m_logdev_targp);
 }
 
 STATIC void
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index dc6e4fb..76aed73 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -2431,7 +2431,7 @@ xfs_zero_remaining_bytes(
 	if (endoff > ip->i_size)
 		endoff = ip->i_size;
 
-	bp = xfs_buf_get_noaddr(mp->m_sb.sb_blocksize,
+	bp = xfs_buf_get_uncached(mp->m_sb.sb_blocksize,
 				XFS_IS_REALTIME_INODE(ip) ?
 				mp->m_rtdev_targp : mp->m_ddev_targp);
 	if (!bp)
-- 
1.7.1

* [PATCH 09/18] xfs: introduced uncached buffer read primitve
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (7 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:56   ` Christoph Hellwig
  2010-09-14 22:16   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
                   ` (10 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

To avoid the need to use cached buffers for single-shot reads, or for
buffers cached at the filesystem level, introduce a new buffer read
primitive that bypasses the cache and reads directly from disk.
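
As a hedged usage sketch (not code from this series), a caller doing a
single-shot read - say a size check against the underlying device, with
daddr being a disk address the caller already knows - would look
something like:

	struct xfs_buf	*bp;

	/* read one basic block at daddr, bypassing the buffer cache */
	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp, daddr, BBTOB(1));
	if (!bp)
		return EIO;		/* XFS-style positive errno */

	/* ... examine the buffer contents, then release it ... */
	xfs_buf_relse(bp);
	return 0;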

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   33 +++++++++++++++++++++++++++++++++
 fs/xfs/linux-2.6/xfs_buf.h |    3 +++
 2 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 03563d4..8555974 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -638,6 +638,39 @@ xfs_buf_readahead(
 	xfs_buf_read(target, ioff, isize, flags);
 }
 
+/*
+ * Read an uncached buffer from disk. Allocates and returns a locked
+ * buffer containing the disk contents or nothing.
+ */
+struct xfs_buf *
+xfs_buf_read_uncached(
+	struct xfs_mount	*mp,
+	struct xfs_buftarg	*target,
+	xfs_daddr_t		daddr,
+	size_t			length)
+{
+	xfs_buf_t		*bp;
+	int			error;
+
+	bp = xfs_buf_get_uncached(length, target);
+	if (!bp)
+		return NULL;
+
+	/* set up the buffer for a read IO */
+	xfs_buf_lock(bp);
+	XFS_BUF_SET_ADDR(bp, daddr);
+	XFS_BUF_READ(bp);
+	XFS_BUF_BUSY(bp);
+
+	xfsbdstrat(mp, bp);
+	error = xfs_iowait(bp);
+	if (error || bp->b_error) {
+		xfs_buf_relse(bp);
+		return NULL;
+	}
+	return bp;
+}
+
 xfs_buf_t *
 xfs_buf_get_empty(
 	size_t			len,
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 73a3a8e..4d5937d 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -218,6 +218,9 @@ extern int xfs_buf_associate_memory(xfs_buf_t *, void *, size_t);
 extern void xfs_buf_hold(xfs_buf_t *);
 extern void xfs_buf_readahead(xfs_buftarg_t *, xfs_off_t, size_t,
 				xfs_buf_flags_t);
+struct xfs_buf *xfs_buf_read_uncached(struct xfs_mount *mp,
+				struct xfs_buftarg *target,
+				xfs_daddr_t daddr, size_t length);
 
 /* Releasing Buffers */
 extern void xfs_buf_free(xfs_buf_t *);
-- 
1.7.1

* [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (8 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 09/18] xfs: introduced uncached buffer read primitve Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:57   ` Christoph Hellwig
  2010-09-14 22:21   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
                   ` (9 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Each buffer contains both a buftarg pointer and a mount pointer. If
we add a mount pointer into the buftarg, we can avoid needing the
b_mount field in every buffer and grab it from the buftarg when
needed instead. This shrinks the xfs_buf by 8 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c   |    9 ++++-----
 fs/xfs/linux-2.6/xfs_buf.h   |    5 +++--
 fs/xfs/linux-2.6/xfs_super.c |    8 +++++---
 fs/xfs/xfs_buf_item.c        |    3 +--
 fs/xfs/xfs_log_recover.c     |   16 +++++++---------
 5 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 8555974..726a6dc 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -892,7 +892,7 @@ xfs_buf_lock(
 	trace_xfs_buf_lock(bp, _RET_IP_);
 
 	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
-		xfs_log_force(bp->b_mount, 0);
+		xfs_log_force(bp->b_target->bt_mount, 0);
 	if (atomic_read(&bp->b_io_remaining))
 		blk_run_address_space(bp->b_target->bt_mapping);
 	down(&bp->b_sema);
@@ -1015,7 +1015,6 @@ xfs_bwrite(
 {
 	int			error;
 
-	bp->b_mount = mp;
 	bp->b_flags |= XBF_WRITE;
 	bp->b_flags &= ~(XBF_ASYNC | XBF_READ);
 
@@ -1036,8 +1035,6 @@ xfs_bdwrite(
 {
 	trace_xfs_buf_bdwrite(bp, _RET_IP_);
 
-	bp->b_mount = mp;
-
 	bp->b_flags &= ~XBF_READ;
 	bp->b_flags |= (XBF_DELWRI | XBF_ASYNC);
 
@@ -1126,7 +1123,7 @@ int
 xfs_bdstrat_cb(
 	struct xfs_buf	*bp)
 {
-	if (XFS_FORCED_SHUTDOWN(bp->b_mount)) {
+	if (XFS_FORCED_SHUTDOWN(bp->b_target->bt_mount)) {
 		trace_xfs_bdstrat_shut(bp, _RET_IP_);
 		/*
 		 * Metadata write that didn't get logged but
@@ -1642,6 +1639,7 @@ out_error:
 
 xfs_buftarg_t *
 xfs_alloc_buftarg(
+	struct xfs_mount	*mp,
 	struct block_device	*bdev,
 	int			external,
 	const char		*fsname)
@@ -1650,6 +1648,7 @@ xfs_alloc_buftarg(
 
 	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
 
+	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
 	if (xfs_setsize_buftarg_early(btp, bdev))
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 4d5937d..086d2bc 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -132,6 +132,7 @@ typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;
 	struct address_space	*bt_mapping;
+	struct xfs_mount	*bt_mount;
 	unsigned int		bt_bsize;
 	unsigned int		bt_sshift;
 	size_t			bt_smask;
@@ -189,7 +190,6 @@ typedef struct xfs_buf {
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	void			*b_fspriv;
 	void			*b_fspriv2;
-	struct xfs_mount	*b_mount;
 	unsigned short		b_error;	/* error code on I/O */
 	unsigned int		b_page_count;	/* size of page array */
 	unsigned int		b_offset;	/* page offset in first page */
@@ -377,7 +377,8 @@ static inline void xfs_buf_relse(xfs_buf_t *bp)
 /*
  *	Handling of buftargs.
  */
-extern xfs_buftarg_t *xfs_alloc_buftarg(struct block_device *, int, const char *);
+extern xfs_buftarg_t *xfs_alloc_buftarg(struct xfs_mount *,
+			struct block_device *, int, const char *);
 extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 6205eb8..02c7a7d 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -758,18 +758,20 @@ xfs_open_devices(
 	 * Setup xfs_mount buffer target pointers
 	 */
 	error = ENOMEM;
-	mp->m_ddev_targp = xfs_alloc_buftarg(ddev, 0, mp->m_fsname);
+	mp->m_ddev_targp = xfs_alloc_buftarg(mp, ddev, 0, mp->m_fsname);
 	if (!mp->m_ddev_targp)
 		goto out_close_rtdev;
 
 	if (rtdev) {
-		mp->m_rtdev_targp = xfs_alloc_buftarg(rtdev, 1, mp->m_fsname);
+		mp->m_rtdev_targp = xfs_alloc_buftarg(mp, rtdev, 1,
+							mp->m_fsname);
 		if (!mp->m_rtdev_targp)
 			goto out_free_ddev_targ;
 	}
 
 	if (logdev && logdev != ddev) {
-		mp->m_logdev_targp = xfs_alloc_buftarg(logdev, 1, mp->m_fsname);
+		mp->m_logdev_targp = xfs_alloc_buftarg(mp, logdev, 1,
+							mp->m_fsname);
 		if (!mp->m_logdev_targp)
 			goto out_free_rtdev_targ;
 	} else {
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 1b09d7a..ee75576 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -692,8 +692,7 @@ xfs_buf_item_init(
 	 * the first.  If we do already have one, there is
 	 * nothing to do here so return.
 	 */
-	if (bp->b_mount != mp)
-		bp->b_mount = mp;
+	ASSERT(bp->b_target->bt_mount == mp);
 	if (XFS_BUF_FSPRIVATE(bp, void *) != NULL) {
 		lip = XFS_BUF_FSPRIVATE(bp, xfs_log_item_t *);
 		if (lip->li_type == XFS_LI_BUF) {
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index f1e64db..eb03e9d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -321,10 +321,11 @@ xlog_recover_iodone(
 		 * this during recovery. One strike!
 		 */
 		xfs_ioerror_alert("xlog_recover_iodone",
-				  bp->b_mount, bp, XFS_BUF_ADDR(bp));
-		xfs_force_shutdown(bp->b_mount, SHUTDOWN_META_IO_ERROR);
+					bp->b_target->bt_mount, bp,
+					XFS_BUF_ADDR(bp));
+		xfs_force_shutdown(bp->b_target->bt_mount,
+					SHUTDOWN_META_IO_ERROR);
 	}
-	bp->b_mount = NULL;
 	XFS_BUF_CLR_IODONE_FUNC(bp);
 	xfs_biodone(bp);
 }
@@ -2275,8 +2276,7 @@ xlog_recover_do_buffer_trans(
 		XFS_BUF_STALE(bp);
 		error = xfs_bwrite(mp, bp);
 	} else {
-		ASSERT(bp->b_mount == NULL || bp->b_mount == mp);
-		bp->b_mount = mp;
+		ASSERT(bp->b_target->bt_mount == mp);
 		XFS_BUF_SET_IODONE_FUNC(bp, xlog_recover_iodone);
 		xfs_bdwrite(mp, bp);
 	}
@@ -2540,8 +2540,7 @@ xlog_recover_do_inode_trans(
 	}
 
 write_inode_buffer:
-	ASSERT(bp->b_mount == NULL || bp->b_mount == mp);
-	bp->b_mount = mp;
+	ASSERT(bp->b_target->bt_mount == mp);
 	XFS_BUF_SET_IODONE_FUNC(bp, xlog_recover_iodone);
 	xfs_bdwrite(mp, bp);
 error:
@@ -2678,8 +2677,7 @@ xlog_recover_do_dquot_trans(
 	memcpy(ddq, recddq, item->ri_buf[1].i_len);
 
 	ASSERT(dq_f->qlf_size == 2);
-	ASSERT(bp->b_mount == NULL || bp->b_mount == mp);
-	bp->b_mount = mp;
+	ASSERT(bp->b_target->bt_mount == mp);
 	XFS_BUF_SET_IODONE_FUNC(bp, xlog_recover_iodone);
 	xfs_bdwrite(mp, bp);
 
-- 
1.7.1


* [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (9 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 14:59   ` Christoph Hellwig
  2010-09-14 22:26   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 12/18] xfs: use unhashed buffers for size checks Dave Chinner
                   ` (8 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Filesystem level managed buffers are buffers that have their
lifecycle controlled by the filesystem layer, not the buffer cache.
We currently cache these buffers, which makes cleanup and cache
walking somewhat troublesome. Convert the fs managed buffers to
uncached buffers obtained via xfs_buf_get_uncached(), and remove
the XBF_FS_MANAGED special cases from the buffer cache.
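
For reference, here is a condensed sketch of the superblock buffer
lifecycle after this patch, boiled down from the diff below. The
sector-size re-read loop and the validation calls are omitted, and these
are fragments of xfs_readsb() and xfs_freesb(), not standalone code:

	/* xfs_readsb(): read the superblock as an uncached buffer */
	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp, XFS_SB_DADDR,
				   sector_size);
	if (!bp)
		return EIO;		/* nothing was hashed, nothing to unwind */

	xfs_sb_from_disk(&mp->m_sb, XFS_BUF_TO_SBP(bp));
	mp->m_sb_bp = bp;		/* the fs now holds the only reference */
	xfs_buf_unlock(bp);

	/* xfs_freesb(): drop that reference at unmount */
	xfs_buf_lock(bp);
	mp->m_sb_bp = NULL;
	xfs_buf_relse(bp);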

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   20 +++------------
 fs/xfs/linux-2.6/xfs_buf.h |    4 ---
 fs/xfs/xfs_mount.c         |   56 ++++++++++++-------------------------------
 3 files changed, 20 insertions(+), 60 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 726a6dc..304515b 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -824,8 +824,6 @@ xfs_buf_rele(
 			atomic_inc(&bp->b_hold);
 			spin_unlock(&hash->bh_lock);
 			(*(bp->b_relse)) (bp);
-		} else if (bp->b_flags & XBF_FS_MANAGED) {
-			spin_unlock(&hash->bh_lock);
 		} else {
 			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
 			list_del_init(&bp->b_hash_list);
@@ -1431,26 +1429,16 @@ void
 xfs_wait_buftarg(
 	xfs_buftarg_t	*btp)
 {
-	xfs_buf_t	*bp, *n;
 	xfs_bufhash_t	*hash;
 	uint		i;
 
 	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
 		hash = &btp->bt_hash[i];
-again:
 		spin_lock(&hash->bh_lock);
-		list_for_each_entry_safe(bp, n, &hash->bh_list, b_hash_list) {
-			ASSERT(btp == bp->b_target);
-			if (!(bp->b_flags & XBF_FS_MANAGED)) {
-				spin_unlock(&hash->bh_lock);
-				/*
-				 * Catch superblock reference count leaks
-				 * immediately
-				 */
-				BUG_ON(bp->b_bn == 0);
-				delay(100);
-				goto again;
-			}
+		while (!list_empty(&hash->bh_list)) {
+			spin_unlock(&hash->bh_lock);
+			delay(100);
+			spin_lock(&hash->bh_lock);
 		}
 		spin_unlock(&hash->bh_lock);
 	}
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 086d2bc..9517387 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -51,7 +51,6 @@ typedef enum {
 #define XBF_DONE	(1 << 5) /* all pages in the buffer uptodate */
 #define XBF_DELWRI	(1 << 6) /* buffer has dirty pages */
 #define XBF_STALE	(1 << 7) /* buffer has been staled, do not find it */
-#define XBF_FS_MANAGED	(1 << 8) /* filesystem controls freeing memory */
 #define XBF_ORDERED	(1 << 11)/* use ordered writes */
 #define XBF_READ_AHEAD	(1 << 12)/* asynchronous read-ahead */
 #define XBF_LOG_BUFFER	(1 << 13)/* this is a buffer used for the log */
@@ -104,7 +103,6 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_DONE,		"DONE" }, \
 	{ XBF_DELWRI,		"DELWRI" }, \
 	{ XBF_STALE,		"STALE" }, \
-	{ XBF_FS_MANAGED,	"FS_MANAGED" }, \
 	{ XBF_ORDERED,		"ORDERED" }, \
 	{ XBF_READ_AHEAD,	"READ_AHEAD" }, \
 	{ XBF_LOCK,		"LOCK" },  	/* should never be set */\
@@ -279,8 +277,6 @@ extern void xfs_buf_terminate(void);
 					XFS_BUF_DONE(bp);	\
 				} while (0)
 
-#define XFS_BUF_UNMANAGE(bp)	((bp)->b_flags &= ~XBF_FS_MANAGED)
-
 #define XFS_BUF_DELAYWRITE(bp)		((bp)->b_flags |= XBF_DELWRI)
 #define XFS_BUF_UNDELAYWRITE(bp)	xfs_buf_delwri_dequeue(bp)
 #define XFS_BUF_ISDELAYWRITE(bp)	((bp)->b_flags & XBF_DELWRI)
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 546eb1f..4d727d0 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -646,7 +646,6 @@ int
 xfs_readsb(xfs_mount_t *mp, int flags)
 {
 	unsigned int	sector_size;
-	unsigned int	extra_flags;
 	xfs_buf_t	*bp;
 	int		error;
 
@@ -659,28 +658,23 @@ xfs_readsb(xfs_mount_t *mp, int flags)
 	 * access to the superblock.
 	 */
 	sector_size = xfs_getsize_buftarg(mp->m_ddev_targp);
-	extra_flags = XBF_LOCK | XBF_FS_MANAGED | XBF_MAPPED;
 
-	bp = xfs_buf_read(mp->m_ddev_targp, XFS_SB_DADDR, BTOBB(sector_size),
-			  extra_flags);
-	if (!bp || XFS_BUF_ISERROR(bp)) {
-		xfs_fs_mount_cmn_err(flags, "SB read failed");
-		error = bp ? XFS_BUF_GETERROR(bp) : ENOMEM;
-		goto fail;
+reread:
+	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp, XFS_SB_DADDR, sector_size);
+	if (!bp) {
+		xfs_fs_mount_cmn_err(flags, "SB buffer read failed");
+		return EIO;
 	}
-	ASSERT(XFS_BUF_ISBUSY(bp));
-	ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
 
 	/*
 	 * Initialize the mount structure from the superblock.
 	 * But first do some basic consistency checking.
 	 */
 	xfs_sb_from_disk(&mp->m_sb, XFS_BUF_TO_SBP(bp));
-
 	error = xfs_mount_validate_sb(mp, &(mp->m_sb), flags);
 	if (error) {
 		xfs_fs_mount_cmn_err(flags, "SB validate failed");
-		goto fail;
+		goto release_buf;
 	}
 
 	/*
@@ -691,7 +685,7 @@ xfs_readsb(xfs_mount_t *mp, int flags)
 			"device supports only %u byte sectors (not %u)",
 			sector_size, mp->m_sb.sb_sectsize);
 		error = ENOSYS;
-		goto fail;
+		goto release_buf;
 	}
 
 	/*
@@ -699,33 +693,20 @@ xfs_readsb(xfs_mount_t *mp, int flags)
 	 * re-read the superblock so the buffer is correctly sized.
 	 */
 	if (sector_size < mp->m_sb.sb_sectsize) {
-		XFS_BUF_UNMANAGE(bp);
 		xfs_buf_relse(bp);
 		sector_size = mp->m_sb.sb_sectsize;
-		bp = xfs_buf_read(mp->m_ddev_targp, XFS_SB_DADDR,
-				  BTOBB(sector_size), extra_flags);
-		if (!bp || XFS_BUF_ISERROR(bp)) {
-			xfs_fs_mount_cmn_err(flags, "SB re-read failed");
-			error = bp ? XFS_BUF_GETERROR(bp) : ENOMEM;
-			goto fail;
-		}
-		ASSERT(XFS_BUF_ISBUSY(bp));
-		ASSERT(XFS_BUF_VALUSEMA(bp) <= 0);
+		goto reread;
 	}
 
 	/* Initialize per-cpu counters */
 	xfs_icsb_reinit_counters(mp);
 
 	mp->m_sb_bp = bp;
-	xfs_buf_relse(bp);
-	ASSERT(XFS_BUF_VALUSEMA(bp) > 0);
+	xfs_buf_unlock(bp);
 	return 0;
 
- fail:
-	if (bp) {
-		XFS_BUF_UNMANAGE(bp);
-		xfs_buf_relse(bp);
-	}
+release_buf:
+	xfs_buf_relse(bp);
 	return error;
 }
 
@@ -2005,18 +1986,13 @@ xfs_getsb(
  */
 void
 xfs_freesb(
-	xfs_mount_t	*mp)
+	struct xfs_mount	*mp)
 {
-	xfs_buf_t	*bp;
+	struct xfs_buf		*bp = mp->m_sb_bp;
 
-	/*
-	 * Use xfs_getsb() so that the buffer will be locked
-	 * when we call xfs_buf_relse().
-	 */
-	bp = xfs_getsb(mp, 0);
-	XFS_BUF_UNMANAGE(bp);
-	xfs_buf_relse(bp);
+	xfs_buf_lock(bp);
 	mp->m_sb_bp = NULL;
+	xfs_buf_relse(bp);
 }
 
 /*
-- 
1.7.1


* [PATCH 12/18] xfs: use unhashed buffers for size checks
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (10 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 15:00   ` Christoph Hellwig
  2010-09-14 22:29   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 13/18] xfs: remove buftarg hash for external devices Dave Chinner
                   ` (7 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we are checking that we can access the last block of each device, we
do not need to use cached buffers as they will be tossed away
immediately. Use uncached buffers for size checks so that all IO
prior to full in-memory structure initialisation does not use the
buffer cache.
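
The pattern is the same for all three devices. As a rough sketch,
condensed from the xfs_check_sizes() hunk below (with the block count
validation omitted), each size check now boils down to:

	d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp,
				   d - XFS_FSS_TO_BB(mp, 1),
				   BBTOB(XFS_FSS_TO_BB(mp, 1)));
	if (!bp) {
		cmn_err(CE_WARN, "XFS: last sector read failed");
		return EIO;
	}
	xfs_buf_relse(bp);	/* uncached, so the buffer is torn down here */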

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c   |   11 +++++------
 fs/xfs/xfs_mount.c   |   39 ++++++++++++++++-----------------------
 fs/xfs/xfs_rtalloc.c |   29 +++++++++++++----------------
 3 files changed, 34 insertions(+), 45 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 43b1d56..158d5ab 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -144,12 +144,11 @@ xfs_growfs_data_private(
 	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
 		return error;
 	dpct = pct - mp->m_sb.sb_imax_pct;
-	error = xfs_read_buf(mp, mp->m_ddev_targp,
-			XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
-			XFS_FSS_TO_BB(mp, 1), 0, &bp);
-	if (error)
-		return error;
-	ASSERT(bp);
+	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp,
+				XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
+				BBTOB(XFS_FSS_TO_BB(mp, 1)));
+	if (!bp)
+		return EIO;
 	xfs_buf_relse(bp);
 
 	new = nb;	/* use new as a temporary here */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 4d727d0..d1498a6 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -979,42 +979,35 @@ xfs_check_sizes(xfs_mount_t *mp)
 {
 	xfs_buf_t	*bp;
 	xfs_daddr_t	d;
-	int		error;
 
 	d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
 	if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_dblocks) {
-		cmn_err(CE_WARN, "XFS: size check 1 failed");
+		cmn_err(CE_WARN, "XFS: filesystem size mismatch detected");
 		return XFS_ERROR(EFBIG);
 	}
-	error = xfs_read_buf(mp, mp->m_ddev_targp,
-			     d - XFS_FSS_TO_BB(mp, 1),
-			     XFS_FSS_TO_BB(mp, 1), 0, &bp);
-	if (!error) {
-		xfs_buf_relse(bp);
-	} else {
-		cmn_err(CE_WARN, "XFS: size check 2 failed");
-		if (error == ENOSPC)
-			error = XFS_ERROR(EFBIG);
-		return error;
+	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp,
+					d - XFS_FSS_TO_BB(mp, 1),
+					BBTOB(XFS_FSS_TO_BB(mp, 1)));
+	if (!bp) {
+		cmn_err(CE_WARN, "XFS: last sector read failed");
+		return EIO;
 	}
+	xfs_buf_relse(bp);
 
 	if (mp->m_logdev_targp != mp->m_ddev_targp) {
 		d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_logblocks);
 		if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_logblocks) {
-			cmn_err(CE_WARN, "XFS: size check 3 failed");
+			cmn_err(CE_WARN, "XFS: log size mismatch detected");
 			return XFS_ERROR(EFBIG);
 		}
-		error = xfs_read_buf(mp, mp->m_logdev_targp,
-				     d - XFS_FSB_TO_BB(mp, 1),
-				     XFS_FSB_TO_BB(mp, 1), 0, &bp);
-		if (!error) {
-			xfs_buf_relse(bp);
-		} else {
-			cmn_err(CE_WARN, "XFS: size check 3 failed");
-			if (error == ENOSPC)
-				error = XFS_ERROR(EFBIG);
-			return error;
+		bp = xfs_buf_read_uncached(mp, mp->m_logdev_targp,
+					d - XFS_FSB_TO_BB(mp, 1),
+					XFS_FSB_TO_B(mp, 1));
+		if (!bp) {
+			cmn_err(CE_WARN, "XFS: log device read failed");
+			return EIO;
 		}
+		xfs_buf_relse(bp);
 	}
 	return 0;
 }
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 891260f..5c5a4c4 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -39,6 +39,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_utils.h"
 #include "xfs_trace.h"
+#include "xfs_buf.h"
 
 
 /*
@@ -1883,13 +1884,13 @@ xfs_growfs_rt(
 	/*
 	 * Read in the last block of the device, make sure it exists.
 	 */
-	error = xfs_read_buf(mp, mp->m_rtdev_targp,
-			XFS_FSB_TO_BB(mp, nrblocks - 1),
-			XFS_FSB_TO_BB(mp, 1), 0, &bp);
-	if (error)
-		return error;
-	ASSERT(bp);
+	bp = xfs_buf_read_uncached(mp, mp->m_rtdev_targp,
+				XFS_FSB_TO_BB(mp, nrblocks - 1),
+				XFS_FSB_TO_B(mp, 1));
+	if (!bp)
+		return EIO;
 	xfs_buf_relse(bp);
+
 	/*
 	 * Calculate new parameters.  These are the final values to be reached.
 	 */
@@ -2215,7 +2216,6 @@ xfs_rtmount_init(
 {
 	xfs_buf_t	*bp;	/* buffer for last block of subvolume */
 	xfs_daddr_t	d;	/* address of last block of subvolume */
-	int		error;	/* error return value */
 	xfs_sb_t	*sbp;	/* filesystem superblock copy in mount */
 
 	sbp = &mp->m_sb;
@@ -2242,15 +2242,12 @@ xfs_rtmount_init(
 			(unsigned long long) mp->m_sb.sb_rblocks);
 		return XFS_ERROR(EFBIG);
 	}
-	error = xfs_read_buf(mp, mp->m_rtdev_targp,
-				d - XFS_FSB_TO_BB(mp, 1),
-				XFS_FSB_TO_BB(mp, 1), 0, &bp);
-	if (error) {
-		cmn_err(CE_WARN,
-	"XFS: realtime mount -- xfs_read_buf failed, returned %d", error);
-		if (error == ENOSPC)
-			return XFS_ERROR(EFBIG);
-		return error;
+	bp = xfs_buf_read_uncached(mp, mp->m_rtdev_targp,
+					d - XFS_FSB_TO_BB(mp, 1),
+					XFS_FSB_TO_B(mp, 1));
+	if (!bp) {
+		cmn_err(CE_WARN, "XFS: realtime device size check failed");
+		return EIO;
 	}
 	xfs_buf_relse(bp);
 	return 0;
-- 
1.7.1


* [PATCH 13/18] xfs: remove buftarg hash for external devices
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (11 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 12/18] xfs: use unhashed buffers for size checks Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 22:29   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 14/18] xfs: convert buffer cache hash to rbtree Dave Chinner
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We no longer use hashed buffers on the RT and external log devices,
so remove the buftarg hash tables that are set up for them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_buf.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 304515b..1c6206e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1456,7 +1456,11 @@ xfs_alloc_bufhash(
 {
 	unsigned int		i;
 
-	btp->bt_hashshift = external ? 3 : 12;	/* 8 or 4096 buckets */
+	if (external) {
+		btp->bt_hash = NULL;
+		return;
+	}
+	btp->bt_hashshift = 12;	/* 4096 buckets */
 	btp->bt_hash = kmem_zalloc_large((1 << btp->bt_hashshift) *
 					 sizeof(xfs_bufhash_t));
 	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
-- 
1.7.1


* [PATCH 14/18] xfs: convert buffer cache hash to rbtree
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (12 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 13/18] xfs: remove buftarg hash for external devices Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 16:29   ` Christoph Hellwig
  2010-09-15 17:46   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 15/18] xfs; pack xfs_buf structure more tightly Dave Chinner
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The buffer cache hash is showing typical hash scalability problems.
In large scale testing the number of cached items grows far larger
than the hash can efficiently handle. Hence we need to move to a
self-scaling cache indexing mechanism.

I have selected rbtrees for indexing because they have O(log n)
search scalability, and insert and remove cost is not excessive,
even on large trees. Hence we should be able to cache large numbers
of buffers without incurring the excessive cache miss search
penalties that the hash is imposing on us.

To ensure we still have parallel access to the cache, we need
multiple trees. Rather than hashing the buffers by disk address to
select a tree, it seems more sensible to separate trees by typical
access patterns. Most operations use buffers from within a single AG
at a time, so rather than searching lots of different lists,
separate the buffer indexes out into per-AG rbtrees. This means that
searches during metadata operation have a much higher chance of
hitting cache resident nodes, and that updates of the tree are less
likely to disturb trees being accessed on other CPUs doing
independent operations.
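
For reference, a stripped-down sketch of the lookup path this introduces
in _xfs_buf_find() is below; the stale buffer handling and the cache
miss/hit statistics are omitted:

	/* get the per-AG tree root and walk it under the pag_buf_lock */
	pag = xfs_perag_get(btp->bt_mount,
				xfs_daddr_to_agno(btp->bt_mount, ioff));

	spin_lock(&pag->pag_buf_lock);
	rbp = &pag->pag_buf_tree.rb_node;
	parent = NULL;
	while (*rbp) {
		parent = *rbp;
		bp = rb_entry(parent, struct xfs_buf, b_rbnode);

		if (range_base < bp->b_file_offset)
			rbp = &(*rbp)->rb_left;
		else if (range_base > bp->b_file_offset)
			rbp = &(*rbp)->rb_right;
		else {
			atomic_inc(&bp->b_hold);	/* cache hit */
			goto found;
		}
	}

	/* cache miss: link the new buffer where the search terminated */
	rb_link_node(&new_bp->b_rbnode, parent, rbp);
	rb_insert_color(&new_bp->b_rbnode, &pag->pag_buf_tree);
	new_bp->b_pag = pag;		/* buffer keeps the perag reference */
	spin_unlock(&pag->pag_buf_lock);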

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |  138 +++++++++++++++++++++----------------------
 fs/xfs/linux-2.6/xfs_buf.h |    8 +--
 fs/xfs/xfs_ag.h            |    4 +
 fs/xfs/xfs_mount.c         |    2 +
 4 files changed, 75 insertions(+), 77 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1c6206e..cce427d 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -188,7 +188,7 @@ _xfs_buf_initialize(
 	atomic_set(&bp->b_hold, 1);
 	init_completion(&bp->b_iowait);
 	INIT_LIST_HEAD(&bp->b_list);
-	INIT_LIST_HEAD(&bp->b_hash_list);
+	RB_CLEAR_NODE(&bp->b_rbnode);
 	init_MUTEX_LOCKED(&bp->b_sema); /* held, no waiters */
 	XB_SET_OWNER(bp);
 	bp->b_target = target;
@@ -262,8 +262,6 @@ xfs_buf_free(
 {
 	trace_xfs_buf_free(bp, _RET_IP_);
 
-	ASSERT(list_empty(&bp->b_hash_list));
-
 	if (bp->b_flags & (_XBF_PAGE_CACHE|_XBF_PAGES)) {
 		uint		i;
 
@@ -422,8 +420,10 @@ _xfs_buf_find(
 {
 	xfs_off_t		range_base;
 	size_t			range_length;
-	xfs_bufhash_t		*hash;
-	xfs_buf_t		*bp, *n;
+	struct xfs_perag	*pag;
+	struct rb_node		**rbp;
+	struct rb_node		*parent;
+	xfs_buf_t		*bp;
 
 	range_base = (ioff << BBSHIFT);
 	range_length = (isize << BBSHIFT);
@@ -432,14 +432,37 @@ _xfs_buf_find(
 	ASSERT(!(range_length < (1 << btp->bt_sshift)));
 	ASSERT(!(range_base & (xfs_off_t)btp->bt_smask));
 
-	hash = &btp->bt_hash[hash_long((unsigned long)ioff, btp->bt_hashshift)];
-
-	spin_lock(&hash->bh_lock);
-
-	list_for_each_entry_safe(bp, n, &hash->bh_list, b_hash_list) {
-		ASSERT(btp == bp->b_target);
-		if (bp->b_file_offset == range_base &&
-		    bp->b_buffer_length == range_length) {
+	/* get tree root */
+	pag = xfs_perag_get(btp->bt_mount,
+				xfs_daddr_to_agno(btp->bt_mount, ioff));
+
+	/* walk tree */
+	spin_lock(&pag->pag_buf_lock);
+	rbp = &pag->pag_buf_tree.rb_node;
+	parent = NULL;
+	bp = NULL;
+	while (*rbp) {
+		parent = *rbp;
+		bp = rb_entry(parent, struct xfs_buf, b_rbnode);
+
+		if (range_base < bp->b_file_offset)
+			rbp = &(*rbp)->rb_left;
+		else if (range_base > bp->b_file_offset)
+			rbp = &(*rbp)->rb_right;
+		else {
+			/*
+			 * found a block offset match. If the range doesn't
+			 * match, the only way this is allowed is if the buffer
+			 * in the cache is stale and the transaction that made
+			 * it stale has not yet committed. i.e. we are
+			 * reallocating a busy extent. Skip this buffer and
+			 * continue searching to the right for an exact match.
+			 */
+			if (bp->b_buffer_length != range_length) {
+				ASSERT(bp->b_flags & XBF_STALE);
+				rbp = &(*rbp)->rb_right;
+				continue;
+			}
 			atomic_inc(&bp->b_hold);
 			goto found;
 		}
@@ -449,17 +472,21 @@ _xfs_buf_find(
 	if (new_bp) {
 		_xfs_buf_initialize(new_bp, btp, range_base,
 				range_length, flags);
-		new_bp->b_hash = hash;
-		list_add(&new_bp->b_hash_list, &hash->bh_list);
+		rb_link_node(&new_bp->b_rbnode, parent, rbp);
+		rb_insert_color(&new_bp->b_rbnode, &pag->pag_buf_tree);
+		/* the buffer keeps the perag reference until it is freed */
+		new_bp->b_pag = pag;
+		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		XFS_STATS_INC(xb_miss_locked);
+		spin_unlock(&pag->pag_buf_lock);
+		xfs_perag_put(pag);
 	}
-
-	spin_unlock(&hash->bh_lock);
 	return new_bp;
 
 found:
-	spin_unlock(&hash->bh_lock);
+	spin_unlock(&pag->pag_buf_lock);
+	xfs_perag_put(pag);
 
 	/* Attempt to get the semaphore without sleeping,
 	 * if this does not work then we need to drop the
@@ -807,27 +834,30 @@ void
 xfs_buf_rele(
 	xfs_buf_t		*bp)
 {
-	xfs_bufhash_t		*hash = bp->b_hash;
+	struct xfs_perag	*pag = bp->b_pag;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
 
-	if (unlikely(!hash)) {
+	if (!pag) {
 		ASSERT(!bp->b_relse);
+		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
 		if (atomic_dec_and_test(&bp->b_hold))
 			xfs_buf_free(bp);
 		return;
 	}
 
+	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
 	ASSERT(atomic_read(&bp->b_hold) > 0);
-	if (atomic_dec_and_lock(&bp->b_hold, &hash->bh_lock)) {
+	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
 		if (bp->b_relse) {
 			atomic_inc(&bp->b_hold);
-			spin_unlock(&hash->bh_lock);
-			(*(bp->b_relse)) (bp);
+			spin_unlock(&pag->pag_buf_lock);
+			bp->b_relse(bp);
 		} else {
 			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
-			list_del_init(&bp->b_hash_list);
-			spin_unlock(&hash->bh_lock);
+			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
+			spin_unlock(&pag->pag_buf_lock);
+			xfs_perag_put(pag);
 			xfs_buf_free(bp);
 		}
 	}
@@ -1427,56 +1457,24 @@ xfs_buf_iomove(
  */
 void
 xfs_wait_buftarg(
-	xfs_buftarg_t	*btp)
+	struct xfs_buftarg	*btp)
 {
-	xfs_bufhash_t	*hash;
-	uint		i;
+	struct xfs_perag	*pag;
+	uint			i;
 
-	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
-		hash = &btp->bt_hash[i];
-		spin_lock(&hash->bh_lock);
-		while (!list_empty(&hash->bh_list)) {
-			spin_unlock(&hash->bh_lock);
+	for (i = 0; i < btp->bt_mount->m_sb.sb_agcount; i++) {
+		pag = xfs_perag_get(btp->bt_mount, i);
+		spin_lock(&pag->pag_buf_lock);
+		while (rb_first(&pag->pag_buf_tree)) {
+			spin_unlock(&pag->pag_buf_lock);
 			delay(100);
-			spin_lock(&hash->bh_lock);
+			spin_lock(&pag->pag_buf_lock);
 		}
-		spin_unlock(&hash->bh_lock);
-	}
-}
-
-/*
- *	Allocate buffer hash table for a given target.
- *	For devices containing metadata (i.e. not the log/realtime devices)
- *	we need to allocate a much larger hash table.
- */
-STATIC void
-xfs_alloc_bufhash(
-	xfs_buftarg_t		*btp,
-	int			external)
-{
-	unsigned int		i;
-
-	if (external) {
-		btp->bt_hash = NULL;
-		return;
-	}
-	btp->bt_hashshift = 12;	/* 4096 buckets */
-	btp->bt_hash = kmem_zalloc_large((1 << btp->bt_hashshift) *
-					 sizeof(xfs_bufhash_t));
-	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
-		spin_lock_init(&btp->bt_hash[i].bh_lock);
-		INIT_LIST_HEAD(&btp->bt_hash[i].bh_list);
+		spin_unlock(&pag->pag_buf_lock);
+		xfs_perag_put(pag);
 	}
 }
 
-STATIC void
-xfs_free_bufhash(
-	xfs_buftarg_t		*btp)
-{
-	kmem_free_large(btp->bt_hash);
-	btp->bt_hash = NULL;
-}
-
 /*
  *	buftarg list for delwrite queue processing
  */
@@ -1509,7 +1507,6 @@ xfs_free_buftarg(
 	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
-	xfs_free_bufhash(btp);
 	iput(btp->bt_mapping->host);
 
 	/* Unregister the buftarg first so that we don't get a
@@ -1649,7 +1646,6 @@ xfs_alloc_buftarg(
 		goto error;
 	if (xfs_alloc_delwrite_queue(btp, fsname))
 		goto error;
-	xfs_alloc_bufhash(btp, external);
 	return btp;
 
 error:
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 9517387..8d846e0 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -135,10 +135,6 @@ typedef struct xfs_buftarg {
 	unsigned int		bt_sshift;
 	size_t			bt_smask;
 
-	/* per device buffer hash table */
-	uint			bt_hashshift;
-	xfs_bufhash_t		*bt_hash;
-
 	/* per device delwri queue */
 	struct task_struct	*bt_task;
 	struct list_head	bt_list;
@@ -172,8 +168,8 @@ typedef struct xfs_buf {
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;
 	xfs_buf_flags_t		b_flags;	/* status flags */
-	struct list_head	b_hash_list;	/* hash table list */
-	xfs_bufhash_t		*b_hash;	/* hash table list start */
+	struct rb_node		b_rbnode;	/* rbtree node */
+	struct xfs_perag	*b_pag;		/* contains rbtree root */
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
 	atomic_t		b_hold;		/* reference count */
 	xfs_daddr_t		b_bn;		/* block number for I/O */
diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index 6de9128..7015217 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -231,6 +231,10 @@ typedef struct xfs_perag {
 	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
 	int		pag_ici_reclaimable;	/* reclaimable inodes */
 
+	/* buffer cache index */
+	spinlock_t	pag_buf_lock;	/* lock for pag_buf_tree */
+	struct rb_root	pag_buf_tree;	/* ordered tree of active buffers */
+
 	/* for rcu-safe freeing */
 	struct rcu_head	rcu_head;
 #endif
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d1498a6..b579258 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -452,6 +452,8 @@ xfs_initialize_perag(
 		pag->pag_mount = mp;
 		spin_lock_init(&pag->pag_ici_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
+		spin_lock_init(&pag->pag_buf_lock);
+		pag->pag_buf_tree = RB_ROOT;
 
 		if (radix_tree_preload(GFP_NOFS))
 			goto out_unwind;
-- 
1.7.1


* [PATCH 15/18] xfs; pack xfs_buf structure more tightly
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (13 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 14/18] xfs: convert buffer cache hash to rbtree Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 16:30   ` Christoph Hellwig
  2010-09-15 18:01   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
                   ` (4 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

pahole reports that struct xfs_buf has quite a few holes in it, so
packing the structure more tightly reduces its size by 16 bytes.
Also, move all the fields used in cache lookups into the first
cacheline.

Before on x86_64:

        /* size: 320, cachelines: 5 */
	/* sum members: 298, holes: 6, sum holes: 22 */

After on x86_64:

        /* size: 304, cachelines: 5 */
	/* padding: 6 */
	/* last cacheline: 48 bytes */
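
The layout numbers above come from pahole; an invocation along the lines
of the following should reproduce them on a build with debug info (the
object file path is only an example):

	pahole -C xfs_buf fs/xfs/linux-2.6/xfs_buf.o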


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.h |   30 +++++++++++++++++++-----------
 1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 8d846e0..44f66cd 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -162,33 +162,41 @@ typedef int (*xfs_buf_bdstrat_t)(struct xfs_buf *);
 #define XB_PAGES	2
 
 typedef struct xfs_buf {
+	/*
+	 * first cacheline holds all the fields needed for an uncontended cache
+	 * hit to be fully processed. The semaphore straddles the cacheline
+	 * boundary, but the counter and lock sits on the first cacheline,
+	 * which is the only bit that is touched if we hit the semaphore
+	 * fast-path on locking.
+	 */
+	struct rb_node		b_rbnode;	/* rbtree node */
+	xfs_off_t		b_file_offset;	/* offset in file */
+	size_t			b_buffer_length;/* size of buffer in bytes */
+	atomic_t		b_hold;		/* reference count */
+	xfs_buf_flags_t		b_flags;	/* status flags */
 	struct semaphore	b_sema;		/* semaphore for lockables */
-	unsigned long		b_queuetime;	/* time buffer was queued */
-	atomic_t		b_pin_count;	/* pin count */
+
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;
-	xfs_buf_flags_t		b_flags;	/* status flags */
-	struct rb_node		b_rbnode;	/* rbtree node */
 	struct xfs_perag	*b_pag;		/* contains rbtree root */
 	xfs_buftarg_t		*b_target;	/* buffer target (device) */
-	atomic_t		b_hold;		/* reference count */
 	xfs_daddr_t		b_bn;		/* block number for I/O */
-	xfs_off_t		b_file_offset;	/* offset in file */
-	size_t			b_buffer_length;/* size of buffer in bytes */
 	size_t			b_count_desired;/* desired transfer size */
 	void			*b_addr;	/* virtual address of buffer */
 	struct work_struct	b_iodone_work;
-	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
 	xfs_buf_iodone_t	b_iodone;	/* I/O completion function */
 	xfs_buf_relse_t		b_relse;	/* releasing function */
 	struct completion	b_iowait;	/* queue for I/O waiters */
 	void			*b_fspriv;
 	void			*b_fspriv2;
-	unsigned short		b_error;	/* error code on I/O */
-	unsigned int		b_page_count;	/* size of page array */
-	unsigned int		b_offset;	/* page offset in first page */
 	struct page		**b_pages;	/* array of page pointers */
 	struct page		*b_page_array[XB_PAGES]; /* inline pages */
+	unsigned long		b_queuetime;	/* time buffer was queued */
+	atomic_t		b_pin_count;	/* pin count */
+	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
+	unsigned int		b_page_count;	/* size of page array */
+	unsigned int		b_offset;	/* page offset in first page */
+	unsigned short		b_error;	/* error code on I/O */
 #ifdef XFS_BUF_LOCK_TRACKING
 	int			b_last_holder;
 #endif
-- 
1.7.1


* [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (14 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 15/18] xfs; pack xfs_buf structure more tightly Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 16:32   ` Christoph Hellwig
  2010-09-15 20:19   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 17/18] xfs: add a lru to the XFS buffer cache Dave Chinner
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Before we introduce per-buftarg LRU lists, split the shrinker
implementation into per-buftarg shrinker callbacks. At the moment
we wake all the xfsbufds to run the delayed write queues to free
the dirty buffers and make their pages available for reclaim.
However, with an LRU, we want to be able to free clean, unused
buffers as well, so we need to separate the xfsbufd from the
shrinker callbacks.
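
The wiring is straightforward. As a rough sketch condensed from the diff
below (the XBT_FORCE_SLEEP check is omitted), the per-buftarg shrinker
ends up looking like this:

	/* shrinker callback: recover the buftarg via container_of() */
	int
	xfs_buftarg_shrink(
		struct shrinker		*shrink,
		int			nr_to_scan,
		gfp_t			mask)
	{
		struct xfs_buftarg	*btp = container_of(shrink,
						struct xfs_buftarg, bt_shrinker);

		if (nr_to_scan && !list_empty(&btp->bt_delwrite_queue)) {
			set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
			wake_up_process(btp->bt_task);
		}
		return list_empty(&btp->bt_delwrite_queue) ? -1 : 1;
	}

	/* xfs_alloc_buftarg(): one shrinker per buftarg */
	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
	register_shrinker(&btp->bt_shrinker);

	/* xfs_free_buftarg(): unregister before tearing the buftarg down */
	unregister_shrinker(&btp->bt_shrinker);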

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   89 ++++++++++++--------------------------------
 fs/xfs/linux-2.6/xfs_buf.h |    4 +-
 2 files changed, 27 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index cce427d..3b54fee 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -44,12 +44,7 @@
 
 static kmem_zone_t *xfs_buf_zone;
 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(struct shrinker *, int, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
-static struct shrinker xfs_buf_shake = {
-	.shrink = xfsbufd_wakeup,
-	.seeks = DEFAULT_SEEKS,
-};
 
 static struct workqueue_struct *xfslogd_workqueue;
 struct workqueue_struct *xfsdatad_workqueue;
@@ -337,7 +332,6 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(NULL, 0, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1475,28 +1469,23 @@ xfs_wait_buftarg(
 	}
 }
 
-/*
- *	buftarg list for delwrite queue processing
- */
-static LIST_HEAD(xfs_buftarg_list);
-static DEFINE_SPINLOCK(xfs_buftarg_lock);
-
-STATIC void
-xfs_register_buftarg(
-	xfs_buftarg_t           *btp)
-{
-	spin_lock(&xfs_buftarg_lock);
-	list_add(&btp->bt_list, &xfs_buftarg_list);
-	spin_unlock(&xfs_buftarg_lock);
-}
-
-STATIC void
-xfs_unregister_buftarg(
-	xfs_buftarg_t           *btp)
+int
+xfs_buftarg_shrink(
+	struct shrinker		*shrink,
+	int			nr_to_scan,
+	gfp_t			mask)
 {
-	spin_lock(&xfs_buftarg_lock);
-	list_del(&btp->bt_list);
-	spin_unlock(&xfs_buftarg_lock);
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
+	if (nr_to_scan) {
+		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
+			return -1;
+		if (list_empty(&btp->bt_delwrite_queue))
+			return -1;
+		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
+		wake_up_process(btp->bt_task);
+	}
+	return list_empty(&btp->bt_delwrite_queue) ? -1 : 1;
 }
 
 void
@@ -1504,17 +1493,14 @@ xfs_free_buftarg(
 	struct xfs_mount	*mp,
 	struct xfs_buftarg	*btp)
 {
+	unregister_shrinker(&btp->bt_shrinker);
+
 	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
 	iput(btp->bt_mapping->host);
 
-	/* Unregister the buftarg first so that we don't get a
-	 * wakeup finding a non-existent task
-	 */
-	xfs_unregister_buftarg(btp);
 	kthread_stop(btp->bt_task);
-
 	kmem_free(btp);
 }
 
@@ -1610,20 +1596,13 @@ xfs_alloc_delwrite_queue(
 	xfs_buftarg_t		*btp,
 	const char		*fsname)
 {
-	int	error = 0;
-
-	INIT_LIST_HEAD(&btp->bt_list);
 	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
 	spin_lock_init(&btp->bt_delwrite_lock);
 	btp->bt_flags = 0;
 	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
-	if (IS_ERR(btp->bt_task)) {
-		error = PTR_ERR(btp->bt_task);
-		goto out_error;
-	}
-	xfs_register_buftarg(btp);
-out_error:
-	return error;
+	if (IS_ERR(btp->bt_task))
+		return PTR_ERR(btp->bt_task);
+	return 0;
 }
 
 xfs_buftarg_t *
@@ -1646,6 +1625,9 @@ xfs_alloc_buftarg(
 		goto error;
 	if (xfs_alloc_delwrite_queue(btp, fsname))
 		goto error;
+	btp->bt_shrinker.shrink = xfs_buftarg_shrink,
+	btp->bt_shrinker.seeks = DEFAULT_SEEKS,
+	register_shrinker(&btp->bt_shrinker);
 	return btp;
 
 error:
@@ -1750,27 +1732,6 @@ xfs_buf_runall_queues(
 	flush_workqueue(queue);
 }
 
-STATIC int
-xfsbufd_wakeup(
-	struct shrinker		*shrink,
-	int			priority,
-	gfp_t			mask)
-{
-	xfs_buftarg_t		*btp;
-
-	spin_lock(&xfs_buftarg_lock);
-	list_for_each_entry(btp, &xfs_buftarg_list, bt_list) {
-		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
-			continue;
-		if (list_empty(&btp->bt_delwrite_queue))
-			continue;
-		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
-		wake_up_process(btp->bt_task);
-	}
-	spin_unlock(&xfs_buftarg_lock);
-	return 0;
-}
-
 /*
  * Move as many buffers as specified to the supplied list
  * idicating if we skipped any buffers to prevent deadlocks.
@@ -1965,7 +1926,6 @@ xfs_buf_init(void)
 	if (!xfsconvertd_workqueue)
 		goto out_destroy_xfsdatad_workqueue;
 
-	register_shrinker(&xfs_buf_shake);
 	return 0;
 
  out_destroy_xfsdatad_workqueue:
@@ -1981,7 +1941,6 @@ xfs_buf_init(void)
 void
 xfs_buf_terminate(void)
 {
-	unregister_shrinker(&xfs_buf_shake);
 	destroy_workqueue(xfsconvertd_workqueue);
 	destroy_workqueue(xfsdatad_workqueue);
 	destroy_workqueue(xfslogd_workqueue);
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 44f66cd..6e9310b 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -137,10 +137,12 @@ typedef struct xfs_buftarg {
 
 	/* per device delwri queue */
 	struct task_struct	*bt_task;
-	struct list_head	bt_list;
 	struct list_head	bt_delwrite_queue;
 	spinlock_t		bt_delwrite_lock;
 	unsigned long		bt_flags;
+
+	/* LRU control structures */
+	struct shrinker		bt_shrinker;
 } xfs_buftarg_t;
 
 /*
-- 
1.7.1


* [PATCH 17/18] xfs: add a lru to the XFS buffer cache
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (15 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 23:16   ` Christoph Hellwig
  2010-09-15 21:28   ` Alex Elder
  2010-09-14 10:56 ` [PATCH 18/18] xfs: stop using the page cache to back the " Dave Chinner
                   ` (2 subsequent siblings)
  19 siblings, 2 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Introduce a per-buftarg LRU for memory reclaim to operate on. This
is the last piece we need to put in place so that we can fully
control the buffer lifecycle. This allows XFS to be responsible for
maintaining the working set of buffers under memory pressure instead
of relying on the VM reclaim not to take pages we need out from
underneath us.

The implementation is currently a bit naive - it does not rotate
buffers on the LRU when they are accessed multiple times. Solving
this problem is for a later patch series that re-introduces the
buffer type specific reclaim reference counts to prioritise reclaim
more effectively.
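
To summarise the mechanism without the diff noise, the patch adds two LRU
hooks, roughly:

	/* cache miss in _xfs_buf_find(): new buffers go onto the LRU */
	spin_lock(&btp->bt_lru_lock);
	list_add_tail(&new_bp->b_lru, &btp->bt_lru);
	btp->bt_lru_nr++;
	atomic_inc(&new_bp->b_hold);	/* the LRU holds its own reference */
	spin_unlock(&btp->bt_lru_lock);

	/* final release in xfs_buf_rele(): take the buffer off the LRU */
	spin_lock(&btp->bt_lru_lock);
	list_del_init(&bp->b_lru);
	btp->bt_lru_nr--;
	spin_unlock(&btp->bt_lru_lock);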

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   91 ++++++++++++++++++++++++++++++++++---------
 fs/xfs/linux-2.6/xfs_buf.h |    5 ++
 2 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 3b54fee..12b37c6 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -182,6 +182,7 @@ _xfs_buf_initialize(
 	memset(bp, 0, sizeof(xfs_buf_t));
 	atomic_set(&bp->b_hold, 1);
 	init_completion(&bp->b_iowait);
+	INIT_LIST_HEAD(&bp->b_lru);
 	INIT_LIST_HEAD(&bp->b_list);
 	RB_CLEAR_NODE(&bp->b_rbnode);
 	init_MUTEX_LOCKED(&bp->b_sema); /* held, no waiters */
@@ -257,6 +258,8 @@ xfs_buf_free(
 {
 	trace_xfs_buf_free(bp, _RET_IP_);
 
+	ASSERT(list_empty(&bp->b_lru));
+
 	if (bp->b_flags & (_XBF_PAGE_CACHE|_XBF_PAGES)) {
 		uint		i;
 
@@ -471,6 +474,13 @@ _xfs_buf_find(
 		/* the buffer keeps the perag reference until it is freed */
 		new_bp->b_pag = pag;
 		spin_unlock(&pag->pag_buf_lock);
+
+		/* add to LRU */
+		spin_lock(&btp->bt_lru_lock);
+		list_add_tail(&new_bp->b_lru, &btp->bt_lru);
+		btp->bt_lru_nr++;
+		atomic_inc(&new_bp->b_hold);
+		spin_unlock(&btp->bt_lru_lock);
 	} else {
 		XFS_STATS_INC(xb_miss_locked);
 		spin_unlock(&pag->pag_buf_lock);
@@ -834,12 +844,14 @@ xfs_buf_rele(
 
 	if (!pag) {
 		ASSERT(!bp->b_relse);
+		ASSERT(list_empty(&bp->b_lru));
 		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
 		if (atomic_dec_and_test(&bp->b_hold))
 			xfs_buf_free(bp);
 		return;
 	}
 
+	ASSERT(!list_empty(&bp->b_lru));
 	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 	if (atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock)) {
@@ -848,6 +860,14 @@ xfs_buf_rele(
 			spin_unlock(&pag->pag_buf_lock);
 			bp->b_relse(bp);
 		} else {
+			struct xfs_buftarg *btp = bp->b_target;
+
+			/* remove from LRU */
+			spin_lock(&btp->bt_lru_lock);
+			list_del_init(&bp->b_lru);
+			btp->bt_lru_nr--;
+			spin_unlock(&btp->bt_lru_lock);
+
 			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
@@ -1446,27 +1466,29 @@ xfs_buf_iomove(
  */
 
 /*
- *	Wait for any bufs with callbacks that have been submitted but
- *	have not yet returned... walk the hash list for the target.
+ * Wait for any bufs with callbacks that have been submitted but have not yet
+ * returned. These buffers will have an elevated hold count, so wait on those
+ * while freeing all the buffers only held by the LRU.
  */
 void
 xfs_wait_buftarg(
 	struct xfs_buftarg	*btp)
 {
-	struct xfs_perag	*pag;
-	uint			i;
-
-	for (i = 0; i < btp->bt_mount->m_sb.sb_agcount; i++) {
-		pag = xfs_perag_get(btp->bt_mount, i);
-		spin_lock(&pag->pag_buf_lock);
-		while (rb_first(&pag->pag_buf_tree)) {
-			spin_unlock(&pag->pag_buf_lock);
+	struct xfs_buf		*bp;
+restart:
+	spin_lock(&btp->bt_lru_lock);
+	while (!list_empty(&btp->bt_lru)) {
+		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
+		if (atomic_read(&bp->b_hold) > 1) {
+			spin_unlock(&btp->bt_lru_lock);
 			delay(100);
-			spin_lock(&pag->pag_buf_lock);
+			goto restart;
 		}
-		spin_unlock(&pag->pag_buf_lock);
-		xfs_perag_put(pag);
+		spin_unlock(&btp->bt_lru_lock);
+		xfs_buf_rele(bp);
+		spin_lock(&btp->bt_lru_lock);
 	}
+	spin_unlock(&btp->bt_lru_lock);
 }
 
 int
@@ -1477,15 +1499,44 @@ xfs_buftarg_shrink(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	if (nr_to_scan) {
-		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
-			return -1;
-		if (list_empty(&btp->bt_delwrite_queue))
-			return -1;
+	struct xfs_buf		*bp, *n;
+
+	if (!nr_to_scan)
+		return btp->bt_lru_nr;
+
+	spin_lock(&btp->bt_lru_lock);
+	if (test_and_set_bit(XBT_SHRINKER_ACTIVE, &btp->bt_flags)) {
+		/* LRU walk already in progress */
+		spin_unlock(&btp->bt_lru_lock);
+		return -1;
+	}
+
+	list_for_each_entry_safe(bp, n, &btp->bt_lru, b_lru) {
+		if (nr_to_scan-- <= 0)
+			break;
+		/*
+		 * If the lru holds the only reference count on the buffer,
+		 * release it. Otherwise there is another user of the buffer
+		 * and it will be getting repositioned real soon.
+		 */
+		if (atomic_read(&bp->b_hold) > 1)
+			continue;
+		spin_unlock(&btp->bt_lru_lock);
+		xfs_buf_rele(bp);
+		spin_lock(&btp->bt_lru_lock);
+	}
+	clear_bit(XBT_SHRINKER_ACTIVE, &btp->bt_flags);
+	spin_unlock(&btp->bt_lru_lock);
+
+	/* kick the xfsbufd to write and release dirty buffers */
+	if (!test_bit(XBT_FORCE_SLEEP, &btp->bt_flags) &&
+	    !test_bit(XBT_FORCE_FLUSH, &btp->bt_flags) &&
+	    !list_empty(&btp->bt_delwrite_queue)) {
 		set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
 		wake_up_process(btp->bt_task);
 	}
-	return list_empty(&btp->bt_delwrite_queue) ? -1 : 1;
+
+	return btp->bt_lru_nr;
 }
 
 void
@@ -1619,6 +1670,8 @@ xfs_alloc_buftarg(
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
+	INIT_LIST_HEAD(&btp->bt_lru);
+	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
 	if (xfs_mapping_buftarg(btp, bdev))
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 6e9310b..36f71aa 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -119,6 +119,7 @@ typedef unsigned int xfs_buf_flags_t;
 typedef enum {
 	XBT_FORCE_SLEEP = 0,
 	XBT_FORCE_FLUSH = 1,
+	XBT_SHRINKER_ACTIVE = 2,
 } xfs_buftarg_flags_t;
 
 typedef struct xfs_bufhash {
@@ -143,6 +144,9 @@ typedef struct xfs_buftarg {
 
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
+	struct list_head	bt_lru;
+	spinlock_t		bt_lru_lock;
+	unsigned int		bt_lru_nr;
 } xfs_buftarg_t;
 
 /*
@@ -178,6 +182,7 @@ typedef struct xfs_buf {
 	xfs_buf_flags_t		b_flags;	/* status flags */
 	struct semaphore	b_sema;		/* semaphore for lockables */
 
+	struct list_head	b_lru;		/* lru list */
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;
 	struct xfs_perag	*b_pag;		/* contains rbtree root */
-- 
1.7.1


* [PATCH 18/18] xfs: stop using the page cache to back the buffer cache
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (16 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 17/18] xfs: add a lru to the XFS buffer cache Dave Chinner
@ 2010-09-14 10:56 ` Dave Chinner
  2010-09-14 23:20   ` Christoph Hellwig
  2010-09-14 14:25 ` [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Christoph Hellwig
  2010-09-17 13:21 ` Alex Elder
  19 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 10:56 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that the buffer cache has its own LRU, we do not need to use the page cache
to provide persistent caching and reclaim infrastructure. Convert the buffer
cache to use alloc_pages() instead of the page cache. This will remove all the
overhead of page cache management from setup and teardown of the buffers, as
well as the need to mark pages accessed as we find buffers in the buffer cache.

By avoiding the page cache, we also remove the need to keep state in the
page_private(page) field so that it persists across buffer free/buffer rebuild,
and so all that code can be removed. This also fixes the long-standing problem
of not having enough bits in the page_private field to track all the state
needed for a 512 sector/64k page setup.

It also removes the need for page locking during reads as the pages are unique
to the buffer and nobody else will be attempting to access them.

The only open question is how to best handle sub-page buffers - can we use
kmalloc/slab memory for sub-page sized buffers, or do we need to split up
pages ourselves? It is worth noting that the current code still works on
sub-page block size filesystems; it is just inefficient w.r.t. memory usage.
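
As a rough sketch (condensed from the diff below; the XBF_READ_AHEAD
failure path and the retry warning are omitted), the page population loop
in _xfs_buf_lookup_pages() reduces to:

	for (i = 0; i < bp->b_page_count; i++) {
		struct page	*page;
retry:
		page = alloc_page(gfp_mask);
		if (unlikely(!page)) {
			/* no page cache to lean on: back off and retry */
			XFS_STATS_INC(xb_page_retries);
			congestion_wait(BLK_RW_ASYNC, HZ/50);
			goto retry;
		}
		bp->b_pages[i] = page;
	}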

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |  256 +++++---------------------------------------
 fs/xfs/linux-2.6/xfs_buf.h |    2 +-
 2 files changed, 27 insertions(+), 231 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 12b37c6..c3b1f4a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -94,75 +94,6 @@ xfs_buf_vmap_len(
 }
 
 /*
- *	Page Region interfaces.
- *
- *	For pages in filesystems where the blocksize is smaller than the
- *	pagesize, we use the page->private field (long) to hold a bitmap
- * 	of uptodate regions within the page.
- *
- *	Each such region is "bytes per page / bits per long" bytes long.
- *
- *	NBPPR == number-of-bytes-per-page-region
- *	BTOPR == bytes-to-page-region (rounded up)
- *	BTOPRT == bytes-to-page-region-truncated (rounded down)
- */
-#if (BITS_PER_LONG == 32)
-#define PRSHIFT		(PAGE_CACHE_SHIFT - 5)	/* (32 == 1<<5) */
-#elif (BITS_PER_LONG == 64)
-#define PRSHIFT		(PAGE_CACHE_SHIFT - 6)	/* (64 == 1<<6) */
-#else
-#error BITS_PER_LONG must be 32 or 64
-#endif
-#define NBPPR		(PAGE_CACHE_SIZE/BITS_PER_LONG)
-#define BTOPR(b)	(((unsigned int)(b) + (NBPPR - 1)) >> PRSHIFT)
-#define BTOPRT(b)	(((unsigned int)(b) >> PRSHIFT))
-
-STATIC unsigned long
-page_region_mask(
-	size_t		offset,
-	size_t		length)
-{
-	unsigned long	mask;
-	int		first, final;
-
-	first = BTOPR(offset);
-	final = BTOPRT(offset + length - 1);
-	first = min(first, final);
-
-	mask = ~0UL;
-	mask <<= BITS_PER_LONG - (final - first);
-	mask >>= BITS_PER_LONG - (final);
-
-	ASSERT(offset + length <= PAGE_CACHE_SIZE);
-	ASSERT((final - first) < BITS_PER_LONG && (final - first) >= 0);
-
-	return mask;
-}
-
-STATIC void
-set_page_region(
-	struct page	*page,
-	size_t		offset,
-	size_t		length)
-{
-	set_page_private(page,
-		page_private(page) | page_region_mask(offset, length));
-	if (page_private(page) == ~0UL)
-		SetPageUptodate(page);
-}
-
-STATIC int
-test_page_region(
-	struct page	*page,
-	size_t		offset,
-	size_t		length)
-{
-	unsigned long	mask = page_region_mask(offset, length);
-
-	return (mask && (page_private(page) & mask) == mask);
-}
-
-/*
  *	Internal xfs_buf_t object manipulation
  */
 
@@ -260,7 +191,7 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (bp->b_flags & (_XBF_PAGE_CACHE|_XBF_PAGES)) {
+	if (bp->b_flags & _XBF_PAGES) {
 		uint		i;
 
 		if (xfs_buf_is_vmapped(bp))
@@ -270,9 +201,7 @@ xfs_buf_free(
 		for (i = 0; i < bp->b_page_count; i++) {
 			struct page	*page = bp->b_pages[i];
 
-			if (bp->b_flags & _XBF_PAGE_CACHE)
-				ASSERT(!PagePrivate(page));
-			page_cache_release(page);
+			__free_page(page);
 		}
 	}
 	_xfs_buf_free_pages(bp);
@@ -287,8 +216,6 @@ _xfs_buf_lookup_pages(
 	xfs_buf_t		*bp,
 	uint			flags)
 {
-	struct address_space	*mapping = bp->b_target->bt_mapping;
-	size_t			blocksize = bp->b_target->bt_bsize;
 	size_t			size = bp->b_count_desired;
 	size_t			nbytes, offset;
 	gfp_t			gfp_mask = xb_to_gfp(flags);
@@ -303,22 +230,21 @@ _xfs_buf_lookup_pages(
 	error = _xfs_buf_get_pages(bp, page_count, flags);
 	if (unlikely(error))
 		return error;
-	bp->b_flags |= _XBF_PAGE_CACHE;
+	bp->b_flags |= _XBF_PAGES;
 
 	offset = bp->b_offset;
-	first = bp->b_file_offset >> PAGE_CACHE_SHIFT;
+	first = bp->b_file_offset >> PAGE_SHIFT;
 
 	for (i = 0; i < bp->b_page_count; i++) {
 		struct page	*page;
 		uint		retries = 0;
-
-	      retry:
-		page = find_or_create_page(mapping, first + i, gfp_mask);
+retry:
+		page = alloc_page(gfp_mask);
 		if (unlikely(page == NULL)) {
 			if (flags & XBF_READ_AHEAD) {
 				bp->b_page_count = i;
 				for (i = 0; i < bp->b_page_count; i++)
-					unlock_page(bp->b_pages[i]);
+					__free_page(bp->b_pages[i]);
 				return -ENOMEM;
 			}
 
@@ -341,33 +267,11 @@ _xfs_buf_lookup_pages(
 
 		XFS_STATS_INC(xb_page_found);
 
-		nbytes = min_t(size_t, size, PAGE_CACHE_SIZE - offset);
+		nbytes = min_t(size_t, size, PAGE_SIZE - offset);
 		size -= nbytes;
-
-		ASSERT(!PagePrivate(page));
-		if (!PageUptodate(page)) {
-			page_count--;
-			if (blocksize >= PAGE_CACHE_SIZE) {
-				if (flags & XBF_READ)
-					bp->b_flags |= _XBF_PAGE_LOCKED;
-			} else if (!PagePrivate(page)) {
-				if (test_page_region(page, offset, nbytes))
-					page_count++;
-			}
-		}
-
 		bp->b_pages[i] = page;
 		offset = 0;
 	}
-
-	if (!(bp->b_flags & _XBF_PAGE_LOCKED)) {
-		for (i = 0; i < bp->b_page_count; i++)
-			unlock_page(bp->b_pages[i]);
-	}
-
-	if (page_count == bp->b_page_count)
-		bp->b_flags |= XBF_DONE;
-
 	return error;
 }
 
@@ -661,7 +565,7 @@ xfs_buf_readahead(
 {
 	struct backing_dev_info *bdi;
 
-	bdi = target->bt_mapping->backing_dev_info;
+	bdi = blk_get_backing_dev_info(target->bt_bdev);
 	if (bdi_read_congested(bdi))
 		return;
 
@@ -739,10 +643,10 @@ xfs_buf_associate_memory(
 	size_t			buflen;
 	int			page_count;
 
-	pageaddr = (unsigned long)mem & PAGE_CACHE_MASK;
+	pageaddr = (unsigned long)mem & PAGE_MASK;
 	offset = (unsigned long)mem - pageaddr;
-	buflen = PAGE_CACHE_ALIGN(len + offset);
-	page_count = buflen >> PAGE_CACHE_SHIFT;
+	buflen = PAGE_ALIGN(len + offset);
+	page_count = buflen >> PAGE_SHIFT;
 
 	/* Free any previous set of page pointers */
 	if (bp->b_pages)
@@ -759,13 +663,12 @@ xfs_buf_associate_memory(
 
 	for (i = 0; i < bp->b_page_count; i++) {
 		bp->b_pages[i] = mem_to_page((void *)pageaddr);
-		pageaddr += PAGE_CACHE_SIZE;
+		pageaddr += PAGE_SIZE;
 	}
 
 	bp->b_count_desired = len;
 	bp->b_buffer_length = buflen;
 	bp->b_flags |= XBF_MAPPED;
-	bp->b_flags &= ~_XBF_PAGE_LOCKED;
 
 	return 0;
 }
@@ -936,7 +839,7 @@ xfs_buf_lock(
 	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_target->bt_mount, 0);
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 	down(&bp->b_sema);
 	XB_SET_OWNER(bp);
 
@@ -981,7 +884,7 @@ xfs_buf_wait_unpin(
 		if (atomic_read(&bp->b_pin_count) == 0)
 			break;
 		if (atomic_read(&bp->b_io_remaining))
-			blk_run_address_space(bp->b_target->bt_mapping);
+			blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 		schedule();
 	}
 	remove_wait_queue(&bp->b_waiters, &wait);
@@ -1206,10 +1109,8 @@ _xfs_buf_ioend(
 	xfs_buf_t		*bp,
 	int			schedule)
 {
-	if (atomic_dec_and_test(&bp->b_io_remaining) == 1) {
-		bp->b_flags &= ~_XBF_PAGE_LOCKED;
+	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
 		xfs_buf_ioend(bp, schedule);
-	}
 }
 
 STATIC void
@@ -1218,35 +1119,12 @@ xfs_buf_bio_end_io(
 	int			error)
 {
 	xfs_buf_t		*bp = (xfs_buf_t *)bio->bi_private;
-	unsigned int		blocksize = bp->b_target->bt_bsize;
-	struct bio_vec		*bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
 
 	xfs_buf_ioerror(bp, -error);
 
 	if (!error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
 		invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
 
-	do {
-		struct page	*page = bvec->bv_page;
-
-		ASSERT(!PagePrivate(page));
-		if (unlikely(bp->b_error)) {
-			if (bp->b_flags & XBF_READ)
-				ClearPageUptodate(page);
-		} else if (blocksize >= PAGE_CACHE_SIZE) {
-			SetPageUptodate(page);
-		} else if (!PagePrivate(page) &&
-				(bp->b_flags & _XBF_PAGE_CACHE)) {
-			set_page_region(page, bvec->bv_offset, bvec->bv_len);
-		}
-
-		if (--bvec >= bio->bi_io_vec)
-			prefetchw(&bvec->bv_page->flags);
-
-		if (bp->b_flags & _XBF_PAGE_LOCKED)
-			unlock_page(page);
-	} while (bvec >= bio->bi_io_vec);
-
 	_xfs_buf_ioend(bp, 1);
 	bio_put(bio);
 }
@@ -1260,7 +1138,6 @@ _xfs_buf_ioapply(
 	int			offset = bp->b_offset;
 	int			size = bp->b_count_desired;
 	sector_t		sector = bp->b_bn;
-	unsigned int		blocksize = bp->b_target->bt_bsize;
 
 	total_nr_pages = bp->b_page_count;
 	map_i = 0;
@@ -1281,30 +1158,6 @@ _xfs_buf_ioapply(
 		     (bp->b_flags & XBF_READ_AHEAD) ? READA : READ;
 	}
 
-	/* Special code path for reading a sub page size buffer in --
-	 * we populate up the whole page, and hence the other metadata
-	 * in the same page.  This optimization is only valid when the
-	 * filesystem block size is not smaller than the page size.
-	 */
-	if ((bp->b_buffer_length < PAGE_CACHE_SIZE) &&
-	    ((bp->b_flags & (XBF_READ|_XBF_PAGE_LOCKED)) ==
-	      (XBF_READ|_XBF_PAGE_LOCKED)) &&
-	    (blocksize >= PAGE_CACHE_SIZE)) {
-		bio = bio_alloc(GFP_NOIO, 1);
-
-		bio->bi_bdev = bp->b_target->bt_bdev;
-		bio->bi_sector = sector - (offset >> BBSHIFT);
-		bio->bi_end_io = xfs_buf_bio_end_io;
-		bio->bi_private = bp;
-
-		bio_add_page(bio, bp->b_pages[0], PAGE_CACHE_SIZE, 0);
-		size = 0;
-
-		atomic_inc(&bp->b_io_remaining);
-
-		goto submit_io;
-	}
-
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
 	nr_pages = BIO_MAX_SECTORS >> (PAGE_SHIFT - BBSHIFT);
@@ -1318,7 +1171,7 @@ next_chunk:
 	bio->bi_private = bp;
 
 	for (; size && nr_pages; nr_pages--, map_i++) {
-		int	rbytes, nbytes = PAGE_CACHE_SIZE - offset;
+		int	rbytes, nbytes = PAGE_SIZE - offset;
 
 		if (nbytes > size)
 			nbytes = size;
@@ -1333,7 +1186,6 @@ next_chunk:
 		total_nr_pages--;
 	}
 
-submit_io:
 	if (likely(bio->bi_size)) {
 		if (xfs_buf_is_vmapped(bp)) {
 			flush_kernel_vmap_range(bp->b_addr,
@@ -1343,18 +1195,7 @@ submit_io:
 		if (size)
 			goto next_chunk;
 	} else {
-		/*
-		 * if we get here, no pages were added to the bio. However,
-		 * we can't just error out here - if the pages are locked then
-		 * we have to unlock them otherwise we can hang on a later
-		 * access to the page.
-		 */
 		xfs_buf_ioerror(bp, EIO);
-		if (bp->b_flags & _XBF_PAGE_LOCKED) {
-			int i;
-			for (i = 0; i < bp->b_page_count; i++)
-				unlock_page(bp->b_pages[i]);
-		}
 		bio_put(bio);
 	}
 }
@@ -1400,7 +1241,7 @@ xfs_buf_iowait(
 	trace_xfs_buf_iowait(bp, _RET_IP_);
 
 	if (atomic_read(&bp->b_io_remaining))
-		blk_run_address_space(bp->b_target->bt_mapping);
+		blk_run_backing_dev(bp->b_target->bt_bdi, NULL);
 	wait_for_completion(&bp->b_iowait);
 
 	trace_xfs_buf_iowait_done(bp, _RET_IP_);
@@ -1418,8 +1259,8 @@ xfs_buf_offset(
 		return XFS_BUF_PTR(bp) + offset;
 
 	offset += bp->b_offset;
-	page = bp->b_pages[offset >> PAGE_CACHE_SHIFT];
-	return (xfs_caddr_t)page_address(page) + (offset & (PAGE_CACHE_SIZE-1));
+	page = bp->b_pages[offset >> PAGE_SHIFT];
+	return (xfs_caddr_t)page_address(page) + (offset & (PAGE_SIZE-1));
 }
 
 /*
@@ -1441,9 +1282,9 @@ xfs_buf_iomove(
 		page = bp->b_pages[xfs_buf_btoct(boff + bp->b_offset)];
 		cpoff = xfs_buf_poff(boff + bp->b_offset);
 		csize = min_t(size_t,
-			      PAGE_CACHE_SIZE-cpoff, bp->b_count_desired-boff);
+			      PAGE_SIZE-cpoff, bp->b_count_desired-boff);
 
-		ASSERT(((csize + cpoff) <= PAGE_CACHE_SIZE));
+		ASSERT(((csize + cpoff) <= PAGE_SIZE));
 
 		switch (mode) {
 		case XBRW_ZERO:
@@ -1549,7 +1390,6 @@ xfs_free_buftarg(
 	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
-	iput(btp->bt_mapping->host);
 
 	kthread_stop(btp->bt_task);
 	kmem_free(btp);
@@ -1573,15 +1413,6 @@ xfs_setsize_buftarg_flags(
 		return EINVAL;
 	}
 
-	if (verbose &&
-	    (PAGE_CACHE_SIZE / BITS_PER_LONG) > sectorsize) {
-		printk(KERN_WARNING
-			"XFS: %u byte sectors in use on device %s.  "
-			"This is suboptimal; %u or greater is ideal.\n",
-			sectorsize, XFS_BUFTARG_NAME(btp),
-			(unsigned int)PAGE_CACHE_SIZE / BITS_PER_LONG);
-	}
-
 	return 0;
 }
 
@@ -1596,7 +1427,7 @@ xfs_setsize_buftarg_early(
 	struct block_device	*bdev)
 {
 	return xfs_setsize_buftarg_flags(btp,
-			PAGE_CACHE_SIZE, bdev_logical_block_size(bdev), 0);
+			PAGE_SIZE, bdev_logical_block_size(bdev), 0);
 }
 
 int
@@ -1609,40 +1440,6 @@ xfs_setsize_buftarg(
 }
 
 STATIC int
-xfs_mapping_buftarg(
-	xfs_buftarg_t		*btp,
-	struct block_device	*bdev)
-{
-	struct backing_dev_info	*bdi;
-	struct inode		*inode;
-	struct address_space	*mapping;
-	static const struct address_space_operations mapping_aops = {
-		.sync_page = block_sync_page,
-		.migratepage = fail_migrate_page,
-	};
-
-	inode = new_inode(bdev->bd_inode->i_sb);
-	if (!inode) {
-		printk(KERN_WARNING
-			"XFS: Cannot allocate mapping inode for device %s\n",
-			XFS_BUFTARG_NAME(btp));
-		return ENOMEM;
-	}
-	inode->i_mode = S_IFBLK;
-	inode->i_bdev = bdev;
-	inode->i_rdev = bdev->bd_dev;
-	bdi = blk_get_backing_dev_info(bdev);
-	if (!bdi)
-		bdi = &default_backing_dev_info;
-	mapping = &inode->i_data;
-	mapping->a_ops = &mapping_aops;
-	mapping->backing_dev_info = bdi;
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
-	btp->bt_mapping = mapping;
-	return 0;
-}
-
-STATIC int
 xfs_alloc_delwrite_queue(
 	xfs_buftarg_t		*btp,
 	const char		*fsname)
@@ -1670,12 +1467,11 @@ xfs_alloc_buftarg(
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
+	btp->bt_bdi = blk_get_backing_dev_info(bdev);
 	INIT_LIST_HEAD(&btp->bt_lru);
 	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	if (xfs_mapping_buftarg(btp, bdev))
-		goto error;
 	if (xfs_alloc_delwrite_queue(btp, fsname))
 		goto error;
 	btp->bt_shrinker.shrink = xfs_buftarg_shrink,
@@ -1897,7 +1693,7 @@ xfsbufd(
 			count++;
 		}
 		if (count)
-			blk_run_address_space(target->bt_mapping);
+			blk_run_backing_dev(target->bt_bdi, NULL);
 
 	} while (!kthread_should_stop());
 
@@ -1945,7 +1741,7 @@ xfs_flush_buftarg(
 
 	if (wait) {
 		/* Expedite and wait for IO to complete. */
-		blk_run_address_space(target->bt_mapping);
+		blk_run_backing_dev(target->bt_bdi, NULL);
 		while (!list_empty(&wait_list)) {
 			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 36f71aa..1ec3d28 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -130,7 +130,7 @@ typedef struct xfs_bufhash {
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;
-	struct address_space	*bt_mapping;
+	struct backing_dev_info	*bt_bdi;
 	struct xfs_mount	*bt_mount;
 	unsigned int		bt_bsize;
 	unsigned int		bt_sshift;
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 04/18] xfs: lockless per-ag lookups
  2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
@ 2010-09-14 12:35   ` Dave Chinner
  2010-09-14 14:50   ` Christoph Hellwig
  2010-09-14 17:28   ` Alex Elder
  2 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 12:35 UTC (permalink / raw)
  To: xfs

On Tue, Sep 14, 2010 at 08:56:03PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we start taking a reference to the per-ag for every cached
> buffer in the system, kernel lockstat profiling on an 8-way create
> workload shows the mp->m_perag_lock has higher acquisition rates
> than the inode lock and has significantly more contention. That is,
> it becomes the highest contended lock in the system.
> 
> The perag lookup is trivial to convert to lock-less RCU lookups
> because perag structures never go away. Hence the only thing we need
> to protect against is tree structure changes during a grow. This can
> be done simply by replacing the locking in xfs_perag_get() with RCU
> read locking. This removes the mp->m_perag_lock completely from this
> path.
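
For reference, a minimal sketch of the converted lookup (illustrative
only - it assumes the existing m_perag_tree radix tree and pag_ref
counting and may differ in detail from the posted patch; grow-time
tree modifications still happen under mp->m_perag_lock):

struct xfs_perag *
xfs_perag_get(
	struct xfs_mount	*mp,
	xfs_agnumber_t		agno)
{
	struct xfs_perag	*pag;

	/*
	 * The radix tree publishes its nodes with RCU-safe pointer
	 * updates, so rcu_read_lock() is enough to walk it; perag
	 * structures never go away while the filesystem is mounted.
	 */
	rcu_read_lock();
	pag = radix_tree_lookup(&mp->m_perag_tree, agno);
	if (pag) {
		ASSERT(atomic_read(&pag->pag_ref) >= 0);
		atomic_inc(&pag->pag_ref);
	}
	rcu_read_unlock();
	return pag;
}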

I just noticed that I missed a lookup conversion in reclaim code, so
the next version will include that as well.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (17 preceding siblings ...)
  2010-09-14 10:56 ` [PATCH 18/18] xfs: stop using the page cache to back the " Dave Chinner
@ 2010-09-14 14:25 ` Christoph Hellwig
  2010-09-17 13:21 ` Alex Elder
  19 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

> 	1. work out how to efficiently support block size smaller
> 	than page size. The current code works, but uses a page per
> 	sub-page buffer.  A set of slab caches would be perfect for
> 	this use, but I'm not sure that we are allowed to use them
> 	for IO anymore. Christoph?

Using slab pages for I/O is fine again.  Back when we used them we
couldn't get agreement from driver authors that they need to support
them, but now that ext4 has started using them they are fine.

I'm not even sure we'll need separate slab caches, the normal kmalloc
caches probably are good enough.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit
  2010-09-14 10:56 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
@ 2010-09-14 14:48   ` Christoph Hellwig
  2010-09-14 17:21   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When committing a transaction, we do a CIL state lock round trip
> on every single log vector we insert into the CIL. This is resulting
> in the lock being as hot as the inode and dcache locks on 8-way
> create workloads. Rework the insertion loops to bring the number
> of lock round trips to one per transaction for log vectors, and one
> more do the busy extents.
> 
> Also change the allocation of the log vector buffer not to zero it
> as we copy over the entire allocated buffer anyway.

The change looks good, but I think the functions names and splits in
the CIL insert code are now a bit confusing.  What about mergeing
something like the patch below into yours?


Index: xfs/fs/xfs/xfs_log_cil.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_cil.c	2010-09-14 11:32:36.365935029 -0300
+++ xfs/fs/xfs/xfs_log_cil.c	2010-09-14 11:43:58.046935056 -0300
@@ -145,6 +145,47 @@ xlog_cil_init_post_recovery(
 								log->l_curr_block);
 }
 
+static void
+xlog_cil_prepare_item(
+	struct log		*log,
+	struct xfs_log_vec	*lv,
+	int			*len,
+	int			*diff_iovecs)
+{
+	struct xfs_log_vec	*old = lv->lv_item->li_lv;
+
+	if (old) {
+		/* existing lv on log item, space used is a delta */
+		ASSERT(!list_empty(&lv->lv_item->li_cil));
+		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
+
+		*len += lv->lv_buf_len - old->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
+		kmem_free(old->lv_buf);
+		kmem_free(old);
+	} else {
+		/* new lv, must pin the log item */
+		ASSERT(!lv->lv_item->li_lv);
+		ASSERT(list_empty(&lv->lv_item->li_cil));
+
+		*len += lv->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs;
+		IOP_PIN(lv->lv_item);
+	}
+
+	/* attach new log vector to log item */
+	lv->lv_item->li_lv = lv;
+
+	/*
+	 * If this is the first time the item is being committed to the
+	 * CIL, store the sequence number on the log item so we can
+	 * tell in future commits whether this is the first checkpoint
+	 * the item is being committed into.
+	 */
+	if (!lv->lv_item->li_seq)
+		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
+}
+
 /*
  * Insert the log items into the CIL and calculate the difference in space
  * consumed by the item. Add the space to the checkpoint ticket and calculate
@@ -153,27 +194,46 @@ xlog_cil_init_post_recovery(
  * the current transaction ticket so that the accounting works out correctly.
  */
 static void
-xlog_cil_insert(
+xlog_cil_insert_items(
 	struct log		*log,
-	struct xlog_ticket	*ticket,
 	struct xfs_log_vec	*log_vector,
-	int			diff_length,
-	int			diff_iovecs)
+	struct xlog_ticket	*ticket)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
-	int			iclog_space;
-	int			len = diff_length;
 	struct xfs_log_vec	*lv;
+	int			len = 0;
+	int			diff_iovecs = 0;
+	int			iclog_space;
 
-	spin_lock(&cil->xc_cil_lock);
+	ASSERT(log_vector);
 
-	/* move the items to the tail of the CIL */
+	/*
+	 * Do all the accounting aggregation and switching of log vectors
+	 * around in a separate loop to the insertion of items into the CIL.
+	 * Then we can do a separate loop to update the CIL within a single
+	 * lock/unlock pair. This reduces the number of round trips on the CIL
+	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
+	 * hold time for the transaction commit.
+	 *
+	 * If this is the first time the item is being placed into the CIL in
+	 * this context, pin it so it can't be written to disk until the CIL is
+	 * flushed to the iclog and the iclog written to disk.
+	 *
+	 * We can do this safely because the context can't checkpoint until we
+	 * are done so it doesn't matter exactly how we update the CIL.
+	 */
 	for (lv = log_vector; lv; lv = lv->lv_next)
-		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
+		xlog_cil_prepare_item(log, lv, &len, &diff_iovecs);
 
 	/* account for space used by new iovec headers  */
 	len += diff_iovecs * sizeof(xlog_op_header_t);
+
+	spin_lock(&cil->xc_cil_lock);
+	/* move the items to the tail of the CIL */
+	for (lv = log_vector; lv; lv = lv->lv_next)
+		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
+
 	ctx->nvecs += diff_iovecs;
 
 	/*
@@ -270,76 +330,6 @@ xlog_cil_format_items(
 }
 
 static void
-xlog_cil_insert_items(
-	struct log		*log,
-	struct xfs_log_vec	*log_vector,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*start_lsn)
-{
-	struct xfs_log_vec	*lv;
-	int			len = 0;
-	int			diff_iovecs = 0;
-
-	ASSERT(log_vector);
-
-	if (start_lsn)
-		*start_lsn = log->l_cilp->xc_ctx->sequence;
-
-	/*
-	 * Do all the accounting aggregation and switching of log vectors
-	 * around in a separate loop to the insertion of items into the CIL.
-	 * Then we can do a separate loop to update the CIL within a single
-	 * lock/unlock pair. This reduces the number of round trips on the CIL
-	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
-	 * hold time for the transaction commit.
-	 *
-	 * If this is the first time the item is being placed into the CIL in
-	 * this context, pin it so it can't be written to disk until the CIL is
-	 * flushed to the iclog and the iclog written to disk.
-	 *
-	 * We can do this safely because the context can't checkpoint until we
-	 * are done so it doesn't matter exactly how we update the CIL.
-	 */
-	for (lv = log_vector; lv; lv = lv->lv_next) {
-		struct xfs_log_vec	*old = lv->lv_item->li_lv;
-
-		if (old) {
-			/* existing lv on log item, space used is a delta */
-			ASSERT(!list_empty(&lv->lv_item->li_cil));
-			ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
-
-			len += lv->lv_buf_len - old->lv_buf_len;
-			diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
-			kmem_free(old->lv_buf);
-			kmem_free(old);
-		} else {
-			/* new lv, must pin the log item */
-			ASSERT(!lv->lv_item->li_lv);
-			ASSERT(list_empty(&lv->lv_item->li_cil));
-
-			len += lv->lv_buf_len;
-			diff_iovecs += lv->lv_niovecs;
-			IOP_PIN(lv->lv_item);
-
-		}
-
-		/* attach new log vector to log item */
-		lv->lv_item->li_lv = lv;
-
-		/*
-		 * If this is the first time the item is being committed to the
-		 * CIL, store the sequence number on the log item so we can
-		 * tell in future commits whether this is the first checkpoint
-		 * the item is being committed into.
-		 */
-		if (!lv->lv_item->li_seq)
-			lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
-	}
-
-	xlog_cil_insert(log, ticket, log_vector, len, diff_iovecs);
-}
-
-static void
 xlog_cil_free_logvec(
 	struct xfs_log_vec	*log_vector)
 {
@@ -654,7 +644,11 @@ xfs_log_commit_cil(
 
 	/* lock out background commit */
 	down_read(&log->l_cilp->xc_ctx_lock);
-	xlog_cil_insert_items(log, log_vector, tp->t_ticket, commit_lsn);
+
+	if (commit_lsn)
+		*commit_lsn = log->l_cilp->xc_ctx->sequence;
+
+	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
 
 	/* check we didn't blow the reservation */
 	if (tp->t_ticket->t_curr_res < 0)

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/18] xfs: remove debug assert for per-ag reference counting
  2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
@ 2010-09-14 14:48   ` Christoph Hellwig
  2010-09-14 17:22   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:02PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we start taking references per cached buffer to the perag
> it is cached on, it will blow the current debug maximum reference
> count assert out of the water. The assert has never caught a bug,
> and we have tracing to track changes if there ever is a problem,
> so just remove it.

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 04/18] xfs: lockless per-ag lookups
  2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
  2010-09-14 12:35   ` Dave Chinner
@ 2010-09-14 14:50   ` Christoph Hellwig
  2010-09-14 17:28   ` Alex Elder
  2 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:03PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we start taking a reference to the per-ag for every cached
> buffer in the system, kernel lockstat profiling on an 8-way create
> workload shows the mp->m_perag_lock has higher acquisition rates
> than the inode lock and has significantly more contention. That is,
> it becomes the highest contended lock in the system.
> 
> The perag lookup is trivial to convert to lock-less RCU lookups
> because perag structures never go away. Hence the only thing we need
> to protect against is tree structure changes during a grow. This can
> be done simply by replacing the locking in xfs_perag_get() with RCU
> read locking. This removes the mp->m_perag_lock completely from this
> path.

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
@ 2010-09-14 14:54   ` Christoph Hellwig
  2010-09-15  0:14     ` Dave Chinner
  2010-09-14 22:12   ` Alex Elder
  2010-11-08 10:47   ` Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:06PM +1000, Dave Chinner wrote:
> Hence we can stop marking VFS inodes dirty during transaction commit or when
> changing timestamps during transactions. This will keep the inodes in the
> superblock dirty list to those containing data or unlogged metadata changes.

But we also use xfs_ichgtime for non-transaction size updates, e.g.
for truncate of zero length files.  With this patch we lose track of
the timestamp updates for this case.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate
  2010-09-14 10:56 ` [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
@ 2010-09-14 14:56   ` Christoph Hellwig
  2010-09-14 22:14   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

>  xfs_buf_t *
> -xfs_buf_get_noaddr(
> +xfs_buf_get_uncached(
>  	size_t			len,
>  	xfs_buftarg_t		*target)
>  {
> @@ -725,7 +725,7 @@ xfs_buf_get_noaddr(
>  		goto fail_free_buf;
>  
>  	for (i = 0; i < page_count; i++) {
> -		bp->b_pages[i] = alloc_page(GFP_KERNEL);
> +		bp->b_pages[i] = alloc_page(GFP_NOFS);

Instead of doing this unconditionally I think it's better to add a flags
argument which can contain XBF_DONT_BLOCK, and then use xb_to_gfp for
the flags.
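
Something along these lines, as a sketch only (the surrounding function
is elided, and the XBF_DONT_BLOCK -> GFP_NOFS mapping is assumed to
come from the existing xb_to_gfp() macro in xfs_buf.c):

	/*
	 * Derive the gfp mask from caller-supplied buffer flags instead
	 * of hard-coding GFP_NOFS; callers that may recurse into the
	 * filesystem pass XBF_DONT_BLOCK.
	 */
	gfp_t		gfp_mask = xb_to_gfp(flags);

	for (i = 0; i < page_count; i++) {
		bp->b_pages[i] = alloc_page(gfp_mask);
		if (unlikely(!bp->b_pages[i]))
			goto fail_free_mem;
	}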

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 09/18] xfs: introduced uncached buffer read primitve
  2010-09-14 10:56 ` [PATCH 09/18] xfs: introduced uncached buffer read primitve Dave Chinner
@ 2010-09-14 14:56   ` Christoph Hellwig
  2010-09-14 22:16   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To avoid the need to use cached buffers for single-shot or buffers
> cached at the filesystem level, introduce a new buffer read
> primitive that bypasses the cache and reads directly from disk.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
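
For reference, a rough sketch of the shape such a primitive can take
(illustrative only - the buffer macros and the xfsbdstrat() call are
assumptions about the surrounding code, not necessarily the posted
patch):

struct xfs_buf *
xfs_buf_read_uncached(
	struct xfs_mount	*mp,
	struct xfs_buftarg	*target,
	xfs_daddr_t		daddr,
	size_t			length)
{
	struct xfs_buf		*bp;

	bp = xfs_buf_get_uncached(length, target);
	if (!bp)
		return NULL;

	/* point the uncached buffer at the disk address and read it in */
	XFS_BUF_SET_ADDR(bp, daddr);
	XFS_BUF_READ(bp);

	xfsbdstrat(mp, bp);
	xfs_buf_iowait(bp);
	if (bp->b_error) {
		xfs_buf_relse(bp);
		return NULL;
	}
	return bp;		/* caller releases with xfs_buf_relse() */
}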

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf
  2010-09-14 10:56 ` [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
@ 2010-09-14 14:57   ` Christoph Hellwig
  2010-09-14 22:21   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Each buffer contains both a buftarg pointer and a mount pointer. If
> we add a mount pointer into the buftarg, we can avoid needing the
> b_mount field in every buffer and grab it from the buftarg when
> needed instead. This shrinks the xfs_buf by 8 bytes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers
  2010-09-14 10:56 ` [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
@ 2010-09-14 14:59   ` Christoph Hellwig
  2010-09-14 22:26   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 14:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

minor nitpick below:

> +reread:
> +	bp = xfs_buf_read_uncached(mp, mp->m_ddev_targp, XFS_SB_DADDR, sector_size);

Too long line.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/18] xfs: use unhashed buffers for size checks
  2010-09-14 10:56 ` [PATCH 12/18] xfs: use unhashed buffers for size checks Dave Chinner
@ 2010-09-14 15:00   ` Christoph Hellwig
  2010-09-14 22:29   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 15:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we are checking we can access the last block of each device, we
> do not need to use cached buffers as they will be tossed away
> immediately. Use uncached buffers for size checks so that all IO
> prior to full in-memory structure initialisation does not use the
> buffer cache.

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
  2010-09-14 10:56 ` [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking Dave Chinner
@ 2010-09-14 16:27   ` Christoph Hellwig
  2010-09-14 23:17     ` Dave Chinner
  2010-09-14 21:23   ` Alex Elder
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 16:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:04PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With delayed logging greatly increasing the sustained parallelism of inode
> operations, the inode cache locking is showing significant read vs write
> contention when inode reclaim runs at the same time as lookups. There are
> also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> so the read locking is not really buying us much in the way of parallelism.

That's just for your parallel creates workload, isn't it?  If we'd get
that bad hit rates on normal workloads something is pretty wrong with
the inode cache.

For a workload with 4 times as many writelocks just changing the
rwlock to a spinlock should provide more benefits.  Did you test what
effect this has on other workloads?

In addition to that I feel really uneasy about changes to the inode
cache locking without really heavy NFS server testing - we just had too
many issues in this area in the past.

> To avoid the read vs write contention, change the cache to use RCU locking on
> the read side. To avoid needing to RCU free every single inode, use the built
> in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> freed inodes, so ensure that every freed inode has an inode number of zero and
> the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> lookup path, but also add a check for a zero inode number as well.

How does this interact with slab poisoning?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 14/18] xfs: convert buffer cache hash to rbtree
  2010-09-14 10:56 ` [PATCH 14/18] xfs: convert buffer cache hash to rbtree Dave Chinner
@ 2010-09-14 16:29   ` Christoph Hellwig
  2010-09-15 17:46   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/18] xfs; pack xfs_buf structure more tightly
  2010-09-14 10:56 ` [PATCH 15/18] xfs; pack xfs_buf structure more tightly Dave Chinner
@ 2010-09-14 16:30   ` Christoph Hellwig
  2010-09-15 18:01   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 16:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
  2010-09-14 10:56 ` [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
@ 2010-09-14 16:32   ` Christoph Hellwig
  2010-09-15 20:19   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit
  2010-09-14 10:56 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
  2010-09-14 14:48   ` Christoph Hellwig
@ 2010-09-14 17:21   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 17:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When committing a transaction, we do a CIL state lock round trip
> on every single log vector we insert into the CIL. This is resulting
> in the lock being as hot as the inode and dcache locks on 8-way
> create workloads. Rework the insertion loops to bring the number
> of lock round trips to one per transaction for log vectors, and one
> more for the busy extents.
> 
> Also change the allocation of the log vector buffer not to zero it
> as we copy over the entire allocated buffer anyway.
> 

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c |  116 +++++++++++++++++++++++++++++--------------------
>  1 files changed, 69 insertions(+), 47 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 03/18] xfs: remove debug assert for per-ag reference counting
  2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
  2010-09-14 14:48   ` Christoph Hellwig
@ 2010-09-14 17:22   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we start taking references per cached buffer to the perag
> it is cached on, it will blow the current debug maximum reference
> count assert out of the water. The assert has never caught a bug,
> and we have tracing to track changes if there ever is a problem,
> so just remove it.

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.c |    2 --
>  1 files changed, 0 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 04/18] xfs: lockless per-ag lookups
  2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
  2010-09-14 12:35   ` Dave Chinner
  2010-09-14 14:50   ` Christoph Hellwig
@ 2010-09-14 17:28   ` Alex Elder
  2 siblings, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 17:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we start taking a reference to the per-ag for every cached
> buffer in the system, kernel lockstat profiling on an 8-way create
> workload shows the mp->m_perag_lock has higher acquisition rates
> than the inode lock and has significantly more contention. That is,
> it becomes the highest contended lock in the system.
> 
> The perag lookup is trivial to convert to lock-less RCU lookups
> because perag structures never go away. Hence the only thing we need
> to protect against is tree structure changes during a grow. THis can
> be done simply by replacing the locking in xfs_perag_get() with RCU
> read locking. This removes the mp->m_perag_lock completely from this
> path.

Nice.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_ag.h    |    3 +++
>  fs/xfs/xfs_mount.c |   25 +++++++++++++++++--------
>  2 files changed, 20 insertions(+), 8 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/18] xfs: single thread inode cache shrinking.
  2010-09-14 10:56 ` [PATCH 01/18] xfs: single thread inode cache shrinking Dave Chinner
@ 2010-09-14 18:48   ` Alex Elder
  2010-09-14 22:48     ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-14 18:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Having multiple CPUs trying to do the same cache shrinking work can
> be actively harmful to performance when the shrinkers land in the
> same AGs.  They then lockstep on perag locks, causing contention and
> slowing each other down. Reclaim walking is sufficiently efficient
> that we do not need parallelism to make significant progress, so stop
> parallel access at the door.
> 
> Instead, keep track of the number of objects the shrinkers want
> cleaned and make sure the single running shrinker does not stop
> until it has hit the threshold that the other shrinker calls have
> built up.
> 
> This increases the cold-cache unlink rate of an 8-way parallel unlink
> workload from about 15,000 unlinks/s to 60-70,000 unlinks/s for the
> same CPU usage (~700%), resulting in the runtime for a 200M inode
> unlink workload dropping from 4h50m to just under 1 hour.

This is an aside, but...

Shrinking still hits the first AGs more than the rest,
right?  I.e. if AG 0 has nr_to_scan reclaimable inodes, no
other AGs get their inodes reclaimed?

Anyway, this change looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   21 +++++++++++++++++++--
>  fs/xfs/xfs_mount.h          |    2 ++
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index d59c4a6..bc54cd6 100644

. . .


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
  2010-09-14 10:56 ` [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking Dave Chinner
  2010-09-14 16:27   ` Christoph Hellwig
@ 2010-09-14 21:23   ` Alex Elder
  2010-09-14 23:42     ` Dave Chinner
  1 sibling, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-14 21:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With delayed logging greatly increasing the sustained parallelism of inode
> operations, the inode cache locking is showing significant read vs write
> contention when inode reclaim runs at the same time as lookups. There are
> also a lot more write lock acquisitions than there are read locks (4:1 ratio)
> so the read locking is not really buying us much in the way of parallelism.
> 
> To avoid the read vs write contention, change the cache to use RCU locking on
> the read side. To avoid needing to RCU free every single inode, use the built
> in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> freed inodes, so ensure that every freed inode has an inode number of zero and
> the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> lookup path, but also add a check for a zero inode number as well.
> 
> We can then convert all the read locking lookups to use RCU read side locking
> and hence remove all read side locking.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

I confess that I'm a little less than solid on this, but
that's a comment on me, not your code.  (After writing
all this I feel a bit better.)

I'll try to describe my understanding and you can reassure
me all is well...  It's quite a lot, but I'll call attention
to two things to look for:  a question about something in
xfs_reclaim_inode(); and a comment related to
xfs_iget_cache_hit().


First, you are replacing the use of a single rwlock for
protecting access to the per-AG in-core inode radix tree
with RCU for readers and a spinlock for writers.

This initially seemed strange to me, and unsafe, but I
now think it's OK because:
- the spinlock protects against concurrent writers
  interfering with each other
- the rcu_read_lock() is sufficient for ensuring readers
  have valid pointers, because the underlying structure
  is a radix tree, which uses rcu_update_pointer() in
  order to change anything in the tree.
I'm still unsettled about the protection readers have
against a concurrent writer, but it's probably just
because this particular usage is new to me.


Second, you are exploiting the SLAB_DESTROY_BY_RCU
feature in order to avoid having to have each inode
wait an RCU grace period when it's freed.  To use
that we need to check for and recognize a freed
inode after looking it up, since we have no guarantee
it's updated in the radix tree after it's freed until
after an RCU grace period has passed.  So zeroing the
i_ino field and setting XFS_IRECLAIM handles that.

So I see these lookups:
- Two gang lookups in xfs_inode_ag_lookup(), which
  is called only by xfs_inode_ag_walk(), in turn
  called only by xfs_inode_ag_iterator().  The
  check in this case has to happen in the "execute"
  function passed in to xfs_inode_ag_walk() via
  xfs_inode_ag_iterator().  The affected functions
  are:
    - xfs_sync_inode_data().  This one calls
      xfs_sync_inode_valid() right away, which in
      your change now checks for a zero i_ino.
    - xfs_sync_inode_attr().  Same as above,
      handled by xfs_sync_inode_valid().
    - xfs_reclaim_inode().  This one should
      be fine, because it already has a test
      for the XFS_IRECLAIM flag being set, and
      ignores the inode if it is.  However, it
      has this line also:
        ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
      Your change doesn't set XFS_IRECLAIMABLE, so
*     I imagine if we get here inside that RCU window
*     we'd have a problem.  Am I wrong about this?
    - xfs_dqrele_inode().  This one again calls
      xfs_sync_inode_valid(), so should be covered.
- A lookup in xfs_iget().  This is handled by
  your change, by looking for a zero i_ino in
* xfs_iget_cache_hit().  (Please see the comment
  on this function in-line, below.)
- A lookup in xfs_ifree_cluster().  Handled by
  your change (now checks for zero i_ino).
- And a gang lookup in xfs_iflush_cluster().  This
  one is handled by your change (now checks each
  inode for a zero i_ino field).

OK, so I think that covers everything, but I have
that one question about xfs_reclaim_inode(), and
then I have one more comment below.



Despite all my commentary above...  The patch looks
good (consistent) to me.  I'm interested to hear
your feedback though.  And unless there is something
major changed, or I'm fundamentally misguided about
this stuff, you can consider it:

Reviewed-by: Alex Elder <aelder@sgi.com>


> ---
>  fs/xfs/linux-2.6/kmem.h        |    1 +
>  fs/xfs/linux-2.6/xfs_super.c   |    3 ++-
>  fs/xfs/linux-2.6/xfs_sync.c    |   12 ++++++------
>  fs/xfs/quota/xfs_qm_syscalls.c |    4 ++--

. . .

> diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> index b1ecc6f..f3a46b6 100644
> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c

. . .

> @@ -145,12 +153,26 @@ xfs_iget_cache_hit(
>  	struct xfs_perag	*pag,
>  	struct xfs_inode	*ip,
>  	int			flags,
> -	int			lock_flags) __releases(pag->pag_ici_lock)
> +	int			lock_flags) __releases(RCU)
>  {
>  	struct inode		*inode = VFS_I(ip);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error;
>  
> +	/*
> +	 * check for re-use of an inode within an RCU grace period due to the
> +	 * radix tree nodes not being updated yet. We monitor for this by
> +	 * setting the inode number to zero before freeing the inode structure.
> +	 */
> +	if (ip->i_ino == 0) {
> +		trace_xfs_iget_skip(ip);
> +		XFS_STATS_INC(xs_ig_frecycle);
> +		rcu_read_unlock();
> +		/* Expire the grace period so we don't trip over it again. */
> +		synchronize_rcu();

Since you're waiting for the end of the grace period here,
it seems a shame that the caller (xfs_iget()) will still
end up calling delay(1) before trying again.  It would
be nice if the delay could be avoided in that case.

> +		return EAGAIN;
> +	}
> +
>  	spin_lock(&ip->i_flags_lock);
>  
>  	/*

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock
  2010-09-14 10:56 ` [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock Dave Chinner
@ 2010-09-14 21:26   ` Alex Elder
  0 siblings, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 21:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> now that we are using RCU protection for the inode cache lookups,
> the lock is only needed on the modification side. Hence it is not
> necessary for the lock to be a rwlock as there are no read side
> holders anymore. Convert it to a spin lock to reflect it's exclusive
> nature.

This is of course contingent on the correctness of
the RCU change before this.  But this one looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   16 ++++++++--------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
  2010-09-14 14:54   ` Christoph Hellwig
@ 2010-09-14 22:12   ` Alex Elder
  2010-09-15  0:28     ` Dave Chinner
  2010-11-08 10:47   ` Christoph Hellwig
  2 siblings, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Under heavy multi-way parallel create workloads, the VFS struggles to write
> back all the inodes that have been changed in age order. The bdi flusher thread
> becomes CPU bound, spending 85% of its time in the VFS code, mostly traversing
> the superblock dirty inode list to separate dirty inodes old enough to flush.
> 
> We already keep an index of all metadata changes in age order - in the AIL -
> and continued log pressure will do age ordered writeback without any extra
> overhead at all. If there is no pressure on the log, the xfssyncd will
> periodically write back metadata in ascending disk address offset order, so it will
> be very efficient.

So log pressure will cause the logged updates to the inode to be
written to disk (in order), which is all we really need.  Is that
right?  Therefore we don't need to rely on the VFS layer to get
the dirty inode pushed out?

Is writeback the only reason we should inform the VFS that an
inode is dirty?  (Sorry, I have to leave shortly and don't have
time to follow this at the moment--I may have to come back to
this later.)

> Hence we can stop marking VFS inodes dirty during transaction commit or when
> changing timestamps during transactions. This will keep the inodes in the
> superblock dirty list to those containing data or unlogged metadata changes.

The code looks fine to me, but I don't know whether the
change it implements is correct or not without digging
in a little deeper.
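
To make the mechanism concrete, here is a sketch of what the timestamp
path looks like without the VFS dirtying (illustrative only - the
helper name and flags mirror the existing xfs_ichgtime(), but this is
not necessarily the exact posted patch):

void
xfs_ichgtime(
	xfs_inode_t	*ip,
	int		flags)
{
	struct inode	*inode = VFS_I(ip);
	struct timespec	tv = current_fs_time(inode->i_sb);

	if (flags & XFS_ICHGTIME_MOD)
		inode->i_mtime = tv;
	if (flags & XFS_ICHGTIME_CHG)
		inode->i_ctime = tv;

	/*
	 * No mark_inode_dirty_sync() here: the caller logs the inode in
	 * its transaction, so the AIL rather than the superblock dirty
	 * list is responsible for writing the change back in age order.
	 */
}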

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_iops.c |   18 +++++-------------


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate
  2010-09-14 10:56 ` [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
  2010-09-14 14:56   ` Christoph Hellwig
@ 2010-09-14 22:14   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buf_get_nodaddr() is really used to allocate a buffer that is
> uncached. While it is not directly assigned a disk address, the fact
> that they are not cached is a more important distinction. With the
> upcoming uncached buffer read primitive, we should be consistent
> with this distinction.
> 
> While there, make page allocation in xfs_buf_get_nodaddr() safe
> against memory reclaim re-entrancy into the filesystem by changing
> the allocation to GFP_NOFS.

This could be spun into its own patch (#19 or more).  But
I don't think it's that important.

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c   |    6 +++---
>  fs/xfs/linux-2.6/xfs_buf.h   |    2 +-

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 09/18] xfs: introduced uncached buffer read primitve
  2010-09-14 10:56 ` [PATCH 09/18] xfs: introduced uncached buffer read primitve Dave Chinner
  2010-09-14 14:56   ` Christoph Hellwig
@ 2010-09-14 22:16   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To avoid the need to use cached buffers for single-shot or buffers
> cached at the filesystem level, introduce a new buffer read
> primitive that bypasses the cache and reads directly from disk.

Looks good (I already reviewed this, basically.)

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf
  2010-09-14 10:56 ` [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
  2010-09-14 14:57   ` Christoph Hellwig
@ 2010-09-14 22:21   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Each buffer contains both a buftarg pointer and a mount pointer. If
> we add a mount pointer into the buftarg, we can avoid needing the
> b_mount field in every buffer and grab it from the buftarg when
> needed instead. This shrinks the xfs_buf by 8 bytes.

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers
  2010-09-14 10:56 ` [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
  2010-09-14 14:59   ` Christoph Hellwig
@ 2010-09-14 22:26   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Filesystem level managed buffers are buffers that have their
> lifecycle controlled by the filesystem layer, not the buffer cache.
> We currently cache these buffers, which makes cleanup and cache
> walking somewhat troublesome. Convert the fs managed buffers to
> uncached buffers obtained via xfs_buf_get_uncached(), and remove
> the XBF_FS_MANAGED special cases from the buffer cache.

Looks good.  Nicer with the xfs_buf_read_uncached() helper.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c |   20 +++------------
>  fs/xfs/linux-2.6/xfs_buf.h |    4 ---
>  fs/xfs/xfs_mount.c         |   56 ++++++++++++-------------------------------
>  3 files changed, 20 insertions(+), 60 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 12/18] xfs: use unhashed buffers for size checks
  2010-09-14 10:56 ` [PATCH 12/18] xfs: use unhashed buffers for size checks Dave Chinner
  2010-09-14 15:00   ` Christoph Hellwig
@ 2010-09-14 22:29   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we are checking we can access the last block of each device, we
> do not need to use cached buffers as they will be tossed away
> immediately. Use uncached buffers for size checks so that all IO
> prior to full in-memory structure initialisation does not use the
> buffer cache.

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 13/18] xfs: remove buftarg hash for external devices
  2010-09-14 10:56 ` [PATCH 13/18] xfs: remove buftarg hash for external devices Dave Chinner
@ 2010-09-14 22:29   ` Alex Elder
  0 siblings, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-14 22:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For RT and external log devices, we never use hashed buffers on them
> now.  Remove the buftarg hash tables that are set up for them.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index 304515b..1c6206e 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c
> @@ -1456,7 +1456,11 @@ xfs_alloc_bufhash(
>  {
>  	unsigned int		i;
>  


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 01/18] xfs: single thread inode cache shrinking.
  2010-09-14 18:48   ` Alex Elder
@ 2010-09-14 22:48     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 22:48 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Tue, Sep 14, 2010 at 01:48:26PM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Having multiple CPUs trying to do the same cache shrinking work can
> > be actively harmful to performance when the shrinkers land in the
> > same AGs.  They then lockstep on perag locks, causing contention and
> > slowing each other down. Reclaim walking is sufficiently efficient
> > that we do not need parallelism to make significant progress, so stop
> > parallel access at the door.
> > 
> > Instead, keep track of the number of objects the shrinkers want
> > cleaned and make sure the single running shrinker does not stop
> > until it has hit the threshold that the other shrinker calls have
> > built up.
> > 
> > This increases the cold-cache unlink rate of an 8-way parallel unlink
> > workload from about 15,000 unlinks/s to 60-70,000 unlinks/s for the
> > same CPU usage (~700%), resulting in the runtime for a 200M inode
> > unlink workload dropping from 4h50m to just under 1 hour.
> 
> This is an aside, but...
> 
> Shrinking still hits the first AG's more than the rest,
> right?  I.e. if AG 0 has nr_to_scan reclaimable inodes, no
> other AG's get their inodes reclaimed?

It aggregates across all AGs, so if AG zero has none, then it moves
to AG 1...

I'm actually considering respinning this patch to be a little
different. I've got a prototype that just does a full non-blocking
reclaim run if nr_to_scan != 0 and then returns -1. It seems to
result in much better dentry/inode/xfs_cache balance, kswapd CPU
time drops dramatically, it doesn't affect create performance at all
and unlink performance becomes much, much more consistent and drops
from ~14m30s down to ~11m30s for 50M inodes.

I only made this change late last night, so I'll do some more
testing before going any further with it.
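
In rough form the prototype is essentially the following (sketch only,
against the current shrinker API; xfs_reclaim_inodes_count() is a
made-up name for "how many reclaimable inodes do we have"):

STATIC int
xfs_reclaim_inode_shrink(
	struct shrinker	*shrink,
	int		nr_to_scan,
	gfp_t		gfp_mask)
{
	struct xfs_mount *mp = container_of(shrink, struct xfs_mount,
					    m_inode_shrink);

	if (nr_to_scan) {
		/* don't recurse into the filesystem from reclaim */
		if (!(gfp_mask & __GFP_FS))
			return -1;
		/* one full non-blocking pass, then tell the VM to back off */
		xfs_reclaim_inodes(mp, SYNC_TRYLOCK);
		return -1;
	}
	return xfs_reclaim_inodes_count(mp);
}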

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 17/18] xfs: add a lru to the XFS buffer cache
  2010-09-14 10:56 ` [PATCH 17/18] xfs: add a lru to the XFS buffer cache Dave Chinner
@ 2010-09-14 23:16   ` Christoph Hellwig
  2010-09-15  0:05     ` Dave Chinner
  2010-09-15 21:28   ` Alex Elder
  1 sibling, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 23:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


Looks correct to me, although probably indeed a bit simplistic.

> +		/* add to LRU */
> +		spin_lock(&btp->bt_lru_lock);
> +		list_add_tail(&new_bp->b_lru, &btp->bt_lru);
> +		btp->bt_lru_nr++;
> +		atomic_inc(&new_bp->b_hold);
> +		spin_unlock(&btp->bt_lru_lock);

Just for clarity it would be nice to have this as an xfs_buf_lru_add
helper.

> +			/* remove from LRU */
> +			spin_lock(&btp->bt_lru_lock);
> +			list_del_init(&bp->b_lru);
> +			btp->bt_lru_nr--;
> +			spin_unlock(&btp->bt_lru_lock);

And this as xfs_buf_lru_del.
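
i.e. something along these lines (sketch only, using the locking from
the quoted hunks; not a drop-in patch):

static void
xfs_buf_lru_add(
	struct xfs_buf	*bp)
{
	struct xfs_buftarg	*btp = bp->b_target;

	spin_lock(&btp->bt_lru_lock);
	if (list_empty(&bp->b_lru)) {
		/* the LRU takes its own hold on the buffer */
		atomic_inc(&bp->b_hold);
		list_add_tail(&bp->b_lru, &btp->bt_lru);
		btp->bt_lru_nr++;
	}
	spin_unlock(&btp->bt_lru_lock);
}

static void
xfs_buf_lru_del(
	struct xfs_buf	*bp)
{
	struct xfs_buftarg	*btp = bp->b_target;

	spin_lock(&btp->bt_lru_lock);
	if (!list_empty(&bp->b_lru)) {
		list_del_init(&bp->b_lru);
		btp->bt_lru_nr--;
	}
	spin_unlock(&btp->bt_lru_lock);
}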

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
  2010-09-14 16:27   ` Christoph Hellwig
@ 2010-09-14 23:17     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 23:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Sep 14, 2010 at 12:27:42PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 14, 2010 at 08:56:04PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With delayed logging greatly increasing the sustained
> > parallelism of inode operations, the inode cache locking is
> > showing significant read vs write contention when inode reclaim
> > runs at the same time as lookups. There are also a lot more write
> > lock acquisitions than there are read locks (4:1 ratio) so the
> > read locking is not really buying us much in the way of
> > parallelism.
> 
> That's just for your parallel creates workload, isn't it?  If we'd
> get that bad hit rates on normal workloads something is pretty
> wrong with the inode cache.

I see it on the unlink workloads, as well as any use-once workload I
care to run (e.g. du, find, grep, xfsdump, xfs_fsr, etc) on a
filesystem with more inodes in it than can fit in the currently
available memory. When we have userspace memory pressure, the amount
of memory available for caching can be pretty low, so this is quite
a common situation.

> For a workload with 4 times as many writelocks just changing the
> rwlock to a spinlock should provide more benefits.  Did you test what
> effect this has on other workloads?

Not yet - I've run it through xfsqa on single CPU, 2p and 4p
machines for the past few days, but I haven't benchmarked it
comparatively on other workloads yet. The whole patchset needs
that....

> In addition to that I feel really uneasy about changes to the inode
> cache locking without really heavy NFS server testing - we just had too
> many issues in this area in the past.

Agreed, and that's something I need to test against.

> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so ensure that every freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > lookup path, but also add a check for a zero inode number as well.
> 
> How does this interact with slab poisoning?

mm/slab.c::kmem_cache_create():

        if (!(flags & SLAB_DESTROY_BY_RCU))
                flags |= SLAB_POISON;
#endif
        if (flags & SLAB_DESTROY_BY_RCU)
                BUG_ON(flags & SLAB_POISON);

and SLUB does an equivalent setup where OBJECT_POISON is not set if
SLAB_DESTROY_BY_RCU is set. So it effectively turns off slab
poisoning, but leaves all the other debug checks active.
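
For reference, the zone setup that goes with this is just a matter of
passing the flag at cache creation time - roughly (illustrative only;
the variable and ctor names are placeholders and XFS's own zone
wrappers differ):

	inode_zone = kmem_cache_create("xfs_inode",
			sizeof(struct xfs_inode), 0,
			SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU,
			xfs_inode_init_once);

With SLAB_DESTROY_BY_RCU the slab pages are only handed back to the
page allocator after a grace period, but an object can be reused
within the cache immediately, so an RCU lookup may still land on a
freed-and-reused inode - which is exactly why the lookup paths now
re-check ip->i_ino and XFS_IRECLAIM.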

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 18/18] xfs: stop using the page cache to back the buffer cache
  2010-09-14 10:56 ` [PATCH 18/18] xfs: stop using the page cache to back the " Dave Chinner
@ 2010-09-14 23:20   ` Christoph Hellwig
  2010-09-15  0:06     ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-14 23:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Sep 14, 2010 at 08:56:17PM +1000, Dave Chinner wrote:
> The only open question is how to best handle sub-page buffers - can we use
> kmalloc/slab memory for sub-page sized buffers, or do we need to split up
> pages ourselves? Worth noting is that the current code still works on sub-page
> block size filesystems, it is just inefficient w.r.t. memory usage.

As mentioned before I think we're fine to use slab/kmalloc pages now.
In fact using them will probably be more efficient than the current
code, given that at least btree blocks usually won't be close to each
other, so the old code wasted lots of memory for it, too.

>  	for (i = 0; i < bp->b_page_count; i++) {
>  		struct page	*page;
>  		uint		retries = 0;
> +retry:
> +		page = alloc_page(gfp_mask);
>  		if (unlikely(page == NULL)) {
>  			if (flags & XBF_READ_AHEAD) {
>  				bp->b_page_count = i;
>  				for (i = 0; i < bp->b_page_count; i++)
> +					__free_page(bp->b_pages[i]);
>  				return -ENOMEM;

Maybe convert this to an out_free_pages goto while you're at it?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking
  2010-09-14 21:23   ` Alex Elder
@ 2010-09-14 23:42     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-14 23:42 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Tue, Sep 14, 2010 at 04:23:41PM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With delayed logging greatly increasing the sustained parallelism of inode
> > operations, the inode cache locking is showing significant read vs write
> > contention when inode reclaim runs at the same time as lookups. There is
> > also a lot more write lock acquistions than there are read locks (4:1 ratio)
> > so the read locking is not really buying us much in the way of parallelism.
> > 
> > To avoid the read vs write contention, change the cache to use RCU locking on
> > the read side. To avoid needing to RCU free every single inode, use the built
> > in slab RCU freeing mechanism. This requires us to be able to detect lookups of
> > freed inodes, so ensure that every freed inode has an inode number of zero and
> > the XFS_IRECLAIM flag set. We already check the XFS_IRECLAIM flag in cache hit
> > lookup path, but also add a check for a zero inode number as well.
> > 
> > We can then convert all the read locking lookups to use RCU read side locking
> > and hence remove all read side locking.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> I confess that I'm a little less than solid on this, but
> that's a comment on me, not your code.  (After writing
> all this I feel a bit better.)
> 
> I'll try to describe my understanding and you can reassure
> me all is well...  It's quite a lot, but I'll call attention
> to two things to look for:  a question about something in
> xfs_reclaim_inode(); and a comment related to
> xfs_iget_cache_hit().
> 
> 
> First, you are replacing the use of a single rwlock for
> protecting access to the per-AG in-core inode radix tree
> with RCU for readers and a spinlock for writers.

Correct.

> This initially seemed strange to me, and unsafe, but I
> now think it's OK because:
> - the spinlock protects against concurrent writers
>   interfering with each other
> - the rcu_read_lock() is sufficient for ensuring readers
>   have valid pointers, because the underlying structure
>   is a radix tree, which uses rcu_assign_pointer() in
>   order to change anything in the tree.

Correct.

> I'm still unsettled about the protection readers have
> against a concurrent writer, but it's probably just
> because this particular usage is new to me.

The protection is provided by the fact that the radix tree
node connectivity is protected by RCU read locking, so the only
thing we have to worry about on lookup is whether we have a valid
inode or not.
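
To illustrate the pattern for readers following along, here is a rough
sketch of an RCU-side lookup with revalidation (not the patch itself;
the validity check stands in for the i_ino/XFS_IRECLAIM tests discussed
elsewhere in this thread):

/*
 * Sketch of an RCU-protected radix tree lookup. The tree structure is
 * safe to walk under rcu_read_lock() because the radix tree publishes
 * pointers with rcu_assign_pointer(); the object we find must still be
 * revalidated because it may have been freed (and possibly reused)
 * since the tree node we walked through was unlinked.
 */
struct xfs_inode *
example_ici_lookup(
	struct xfs_perag	*pag,
	xfs_agino_t		agino)
{
	struct xfs_inode	*ip;

	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
	if (!ip || ip->i_ino == 0) {
		/* empty slot, or an inode freed within this grace period */
		rcu_read_unlock();
		return NULL;
	}
	/*
	 * A real lookup must take a reference or a lock on the inode here,
	 * before rcu_read_unlock(), to stop it being reclaimed underneath us.
	 */
	rcu_read_unlock();
	return ip;
}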

> Second, you are exploiting the SLAB_DESTROY_BY_RCU
> feature in order to avoid having to have each inode
> wait an RCU grace period when it's freed.  To use
> that we need to check for and recognize a freed
> inode after looking it up, since we have no guarantee
> it's updated in the radix tree after it's freed until
> after an RCU grace period has passed.  So zeroing the
> i_ino field and setting XFS_IRECLAIM handles that.

Yes. However, that is not specific to the use of
SLAB_DESTROY_BY_RCU. Even just using call_rcu() to free the inodes,
we'd still need to detect freed inodes on lookup in some way because
the lookup can return those inodes due to the radix tree nodes being
RCU freed. That is, a lockless RCU cache lookup of any kind needs to
be able to safely detect a freed structure and avoid re-using it.

> So I see these lookups:
> - Two gang lookups in xfs_inode_ag_lookup(), which
>   is called only by xfs_inode_ag_walk(), in turn
>   called only by xfs_inode_ag_iterator().  The
>   check in this case has to happen in the "execute"
>   function passed in to xfs_inode_ag_walk() via
>   xfs_inode_ag_iterator().  The affected functions
>   are:
>     - xfs_sync_inode_data().  This one calls
>       xfs_sync_inode_valid() right away, which in
>       your change now checks for a zero i_ino.
>     - xfs_sync_inode_attr().  Same as above,
>       handled by xfs_sync_inode_valid().
>     - xfs_reclaim_inode().  This one should
>       be fine, because it already has a test
>       for the XFS_IRECLAIM flag being set, and
>       ignores the inode if it is.  However, it
>       has this line also:
>         ASSERT_ALWAYS(__xfs_iflags_test(ip, XFS_IRECLAIMABLE));
>       Your change doesn't set XFS_IRECLAIMABLE, so
> *     I imagine if we get here inside that RCU window
> *     we'd have a problem.  Am I wrong about this?

Good question. I think you are right - we didn't actually clear that
flag anywhere until this patch, so yes, it could trigger. I think
I'll add a check for ip->i_ino == 0 before we take the lock and
in that case I can leave the assert there.
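
i.e. something along these lines at the top of xfs_reclaim_inode()
(sketch only, not the final patch):

	/*
	 * Sketch: if this inode has already been freed and was returned by a
	 * stale RCU lookup (i_ino is zeroed on free), skip it before taking
	 * any locks. The existing XFS_IRECLAIMABLE assert can then remain
	 * further down, as it will only ever see live inodes.
	 */
	if (!ip->i_ino)
		return 0;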

>     - xfs_dqrele_inode().  This one again calls
>       xfs_sync_inode_valid(), so should be covered.
> - A lookup in xfs_iget().  This is handled by
>   your change, by looking for a zero i_ino in
> * xfs_iget_cache_hit().  (Please see the comment
>   on this function in-line, below.)
> - A lookup in xfs_ifree_cluster().  Handled by
>   your change (now checks for zero i_ino).
> - And a gang lookup in xfs_iflush_cluster().  This
>   one is handled by your change (now checks each
>   inode for a zero i_ino field).
> 
> OK, so I think that covers everything, but I have
> that one question about xfs_reclaim_inode(), and
> then I have one more comment below.
> 
> 
> 
> Despite all my commentary above...  The patch looks
> good (consistent) to me.  I'm interested to hear
> your feedback though.  And unless there is something
> major changed, or I'm fundamentally misguided about
> this stuff, you can consider it:
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>
> 
> 
> > ---
> >  fs/xfs/linux-2.6/kmem.h        |    1 +
> >  fs/xfs/linux-2.6/xfs_super.c   |    3 ++-
> >  fs/xfs/linux-2.6/xfs_sync.c    |   12 ++++++------
> >  fs/xfs/quota/xfs_qm_syscalls.c |    4 ++--
> 
> . . .
> 
> > diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
> > index b1ecc6f..f3a46b6 100644
> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> 
> . . .
> 
> > @@ -145,12 +153,26 @@ xfs_iget_cache_hit(
> >  	struct xfs_perag	*pag,
> >  	struct xfs_inode	*ip,
> >  	int			flags,
> > -	int			lock_flags) __releases(pag->pag_ici_lock)
> > +	int			lock_flags) __releases(RCU)
> >  {
> >  	struct inode		*inode = VFS_I(ip);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error;
> >  
> > +	/*
> > +	 * check for re-use of an inode within an RCU grace period due to the
> > +	 * radix tree nodes not being updated yet. We monitor for this by
> > +	 * setting the inode number to zero before freeing the inode structure.
> > +	 */
> > +	if (ip->i_ino == 0) {
> > +		trace_xfs_iget_skip(ip);
> > +		XFS_STATS_INC(xs_ig_frecycle);
> > +		rcu_read_unlock();
> > +		/* Expire the grace period so we don't trip over it again. */
> > +		synchronize_rcu();
> 
> Since you're waiting for the end of the grace period here,
> it seems a shame that the caller (xfs_iget()) will still
> end up calling delay(1) before trying again.  It would
> be nice if the delay could be avoided in that case.

True. I'll see if I can come up with a simple way of doing this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 17/18] xfs: add a lru to the XFS buffer cache
  2010-09-14 23:16   ` Christoph Hellwig
@ 2010-09-15  0:05     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-15  0:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Sep 14, 2010 at 07:16:44PM -0400, Christoph Hellwig wrote:
> 
> Looks correct to me, although probably indeed a bit simplistic.
>
> > +		/* add to LRU */
> > +		spin_lock(&btp->bt_lru_lock);
> > +		list_add_tail(&new_bp->b_lru, &btp->bt_lru);
> > +		btp->bt_lru_nr++;
> > +		atomic_inc(&new_bp->b_hold);
> > +		spin_unlock(&btp->bt_lru_lock);
> 
> Just for clarity it would be nice to have this as an xfs_buf_lru_add
> helper.
> 
> > +			/* remove from LRU */
> > +			spin_lock(&btp->bt_lru_lock);
> > +			list_del_init(&bp->b_lru);
> > +			btp->bt_lru_nr--;
> > +			spin_unlock(&btp->bt_lru_lock);
> 
> And this ad xfs_buf_lru_del.

Yes, will do.
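
A rough sketch of the helpers (derived from the quoted hunks; the field
and lock names assume this patch as posted):

/*
 * Add a buffer to the buftarg LRU, taking an extra hold reference so it
 * stays around until LRU reclaim drops it, and remove it again. Sketch
 * of the helpers Christoph suggests, not the respun patch.
 */
static void
xfs_buf_lru_add(
	struct xfs_buf		*bp)
{
	struct xfs_buftarg	*btp = bp->b_target;

	spin_lock(&btp->bt_lru_lock);
	if (list_empty(&bp->b_lru)) {
		list_add_tail(&bp->b_lru, &btp->bt_lru);
		btp->bt_lru_nr++;
		atomic_inc(&bp->b_hold);
	}
	spin_unlock(&btp->bt_lru_lock);
}

static void
xfs_buf_lru_del(
	struct xfs_buf		*bp)
{
	struct xfs_buftarg	*btp = bp->b_target;

	spin_lock(&btp->bt_lru_lock);
	if (!list_empty(&bp->b_lru)) {
		list_del_init(&bp->b_lru);
		btp->bt_lru_nr--;
	}
	spin_unlock(&btp->bt_lru_lock);
}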

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 18/18] xfs: stop using the page cache to back the buffer cache
  2010-09-14 23:20   ` Christoph Hellwig
@ 2010-09-15  0:06     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-15  0:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Sep 14, 2010 at 07:20:04PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 14, 2010 at 08:56:17PM +1000, Dave Chinner wrote:
> > The only open question is how to best handle sub-page buffers - can we use
> > kmalloc/slab memory for sub-page sized buffers, or do we need to split up
> > pages ourselves? Worth noting is that the current code still works on sub-page
> > block size filesystems, it is just inefficient w.r.t. memory usage.
> 
> As mentioned before I think we're fine to use slab/kmalloc pages now.
> In fact using them will probably be more efficient than the current
> code, given that at least btree blocks usually won't be close to each
> other, so the old code wasted lots of memory for it, too.

Good point. I'll start working on this later today.

> 
> >  	for (i = 0; i < bp->b_page_count; i++) {
> >  		struct page	*page;
> >  		uint		retries = 0;
> > +retry:
> > +		page = alloc_page(gfp_mask);
> >  		if (unlikely(page == NULL)) {
> >  			if (flags & XBF_READ_AHEAD) {
> >  				bp->b_page_count = i;
> >  				for (i = 0; i < bp->b_page_count; i++)
> > +					__free_page(bp->b_pages[i]);
> >  				return -ENOMEM;
> 
> Maybe convert this to an out_free_pages goto while you're at it?

Yes, sounds reasonable. Will do.
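
For reference, the restructuring Christoph suggests would presumably end
up looking something like this (a sketch against the quoted hunk, not
the actual respin):

	for (i = 0; i < bp->b_page_count; i++) {
		struct page	*page;
retry:
		page = alloc_page(gfp_mask);
		if (unlikely(page == NULL)) {
			if (flags & XBF_READ_AHEAD) {
				bp->b_page_count = i;
				goto out_free_pages;
			}
			/* ... existing retry/congestion_wait path ... */
			goto retry;
		}
		/* ... rest of the per-page setup as before ... */
		bp->b_pages[i] = page;
	}
	return 0;

out_free_pages:
	/* unwind any pages we managed to allocate before failing */
	for (i = 0; i < bp->b_page_count; i++)
		__free_page(bp->b_pages[i]);
	return -ENOMEM;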

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 14:54   ` Christoph Hellwig
@ 2010-09-15  0:14     ` Dave Chinner
  2010-09-15  0:17       ` Christoph Hellwig
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-15  0:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Sep 14, 2010 at 10:54:47AM -0400, Christoph Hellwig wrote:
> On Tue, Sep 14, 2010 at 08:56:06PM +1000, Dave Chinner wrote:
> > Hence we can stop marking VFS inodes dirty during transaction commit or when
> > changing timestamps during transactions. This will keep the inodes in the
> > superblock dirty list to those containing data or unlogged metadata changes.
> 
> But we also use xfs_ichgtime for non-transaction size updates, e.g.
> for truncate of zero length files.  With this patch we lose track of
> the timestamp updates for this case.

Oh, I missed that one in the twisty passages of the setattr code.
I'll do another (more careful) pass across all the callers, and for
those that are outside a transaction I'll add a separate call to
xfs_mark_inode_dirty_sync() for them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-15  0:14     ` Dave Chinner
@ 2010-09-15  0:17       ` Christoph Hellwig
  0 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-09-15  0:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Sep 15, 2010 at 10:14:23AM +1000, Dave Chinner wrote:
> Oh, I missed that one in the twisty passages of the setattr code.
> I'll do another (more careful) pass across all the callers, and for
> those that are outside a transaction I'll add a separate call to
> xfs_mark_inode_dirty_sync() for them.

I'd prefer to be on the safe side - rename xfs_ichgtime to
xfs_trans_inode_chgtime or similar, and assert that the added
transaction-argument is non-zero and the inode is locked and added
to the transaction.  That way it's easier to spot which timestamp
updates are transactional and which are not.
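
i.e. something like the following (a sketch of the suggestion; the name
follows the "or similar" above and the body mirrors the existing
xfs_ichgtime()):

/*
 * Sketch of the suggested transactional timestamp helper. The asserts
 * document the new contract: always called with a transaction, with the
 * inode locked and joined to it. The caller still logs the inode.
 */
void
xfs_trans_inode_chgtime(
	struct xfs_trans	*tp,
	struct xfs_inode	*ip,
	int			flags)
{
	struct inode		*inode = VFS_I(ip);
	timespec_t		tv;

	ASSERT(tp);
	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
	ASSERT(ip->i_transp == tp);

	tv = current_fs_time(inode->i_sb);

	if ((flags & XFS_ICHGTIME_MOD) &&
	    !timespec_equal(&inode->i_mtime, &tv)) {
		inode->i_mtime = tv;
		ip->i_d.di_mtime.t_sec = (__int32_t)tv.tv_sec;
		ip->i_d.di_mtime.t_nsec = (__int32_t)tv.tv_nsec;
	}
	if ((flags & XFS_ICHGTIME_CHG) &&
	    !timespec_equal(&inode->i_ctime, &tv)) {
		inode->i_ctime = tv;
		ip->i_d.di_ctime.t_sec = (__int32_t)tv.tv_sec;
		ip->i_d.di_ctime.t_nsec = (__int32_t)tv.tv_nsec;
	}
}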

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 22:12   ` Alex Elder
@ 2010-09-15  0:28     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-15  0:28 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Tue, Sep 14, 2010 at 05:12:17PM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Under heavy multi-way parallel create workloads, the VFS struggles to write
> > back all the inodes that have been changed in age order. The bdi flusher thread
> > becomes CPU bound, spending 85% of its time in the VFS code, mostly traversing
> > the superblock dirty inode list to separate dirty inodes old enough to flush.
> > 
> > We already keep an index of all metadata changes in age order - in the AIL -
> > and continued log pressure will do age ordered writeback without any extra
> > overhead at all. If there is no pressure on the log, the xfssyncd will
> > periodically write back metadata in ascending disk address offset order so will
> > be very efficient.
> 
> So log pressure will cause the logged updates to the inode to be
> written to disk (in order), which is all we really need.  Is that
> right?

Yes. And if there is no log pressure, xfssyncd will do the writeback
in a disk-order-efficient manner.

> Therefore we don't need to rely on the VFS layer to get
> the dirty inode pushed out?

No. Indeed, for all other types of metadata (btree blocks,
directory/attribute blocks, etc) we already rely on the
xfsaild/xfsbufd to write them out in a timely manner because the VFS
knows nothing about them.

> Is writeback the only reason we should inform the VFS that an
> inode is dirty?  (Sorry, I have to leave shortly and don't have
> time to follow this at the moment--I may have to come back to
> this later.)

Yes, pretty much. Take your time - this is one of the more radical
changes in the patch set...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 14/18] xfs: convert buffer cache hash to rbtree
  2010-09-14 10:56 ` [PATCH 14/18] xfs: convert buffer cache hash to rbtree Dave Chinner
  2010-09-14 16:29   ` Christoph Hellwig
@ 2010-09-15 17:46   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-15 17:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The buffer cache hash is showing typical hash scalability problems.
> In large scale testing the number of cached items grows far larger
> than the hash can efficiently handle. Hence we need to move to a
> self-scaling cache indexing mechanism.
> 
> I have selected rbtrees for indexing because they can have O(log n)
> search scalability, and insert and remove cost is not excessive,
> even on large trees. Hence we should be able to cache large numbers
> of buffers without incurring the excessive cache miss search
> penalties that the hash is imposing on us.
> 
> To ensure we still have parallel access to the cache, we need
> multiple trees. Rather than hashing the buffers by disk address to
> select a tree, it seems more sensible to separate trees by typical
> access patterns. Most operations use buffers from within a single AG
> at a time, so rather than searching lots of different lists,
> separate the buffer indexes out into per-AG rbtrees. This means that
> searches during metadata operation have a much higher chance of
> hitting cache resident nodes, and that updates of the tree are less
> likely to disturb trees being accessed on other CPUs doing
> independent operations.
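
For readers who want the shape of the data structure, a minimal sketch
of a per-AG rbtree lookup keyed by disk block number follows (purely
illustrative; the node and key field names are assumptions, not quoted
from the patch):

/*
 * Sketch of a buffer lookup in a per-AG rbtree keyed on the buffer's
 * disk block number. Real code must also match the buffer length and
 * handle insertion of a new buffer if nothing is found.
 */
static struct xfs_buf *
example_buf_find(
	struct xfs_perag	*pag,
	xfs_daddr_t		blkno)
{
	struct rb_node		*n;

	spin_lock(&pag->pag_buf_lock);
	n = pag->pag_buf_tree.rb_node;
	while (n) {
		struct xfs_buf	*bp = rb_entry(n, struct xfs_buf, b_rbnode);

		if (blkno < bp->b_bn)
			n = n->rb_left;
		else if (blkno > bp->b_bn)
			n = n->rb_right;
		else {
			atomic_inc(&bp->b_hold);
			spin_unlock(&pag->pag_buf_lock);
			return bp;
		}
	}
	spin_unlock(&pag->pag_buf_lock);
	return NULL;
}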

I didn't review this time as carefully as I did
when you originally posted this.  Some parts from
the original are now in separate patches.  But this
looks good to me.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c |  138 +++++++++++++++++++++----------------------
>  fs/xfs/linux-2.6/xfs_buf.h |    8 +--
>  fs/xfs/xfs_ag.h            |    4 +
>  fs/xfs/xfs_mount.c         |    2 +
>  4 files changed, 75 insertions(+), 77 deletions(-)

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 15/18] xfs; pack xfs_buf structure more tightly
  2010-09-14 10:56 ` [PATCH 15/18] xfs; pack xfs_buf structure more tightly Dave Chinner
  2010-09-14 16:30   ` Christoph Hellwig
@ 2010-09-15 18:01   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-15 18:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> pahole reports the struct xfs_buf has quite a few holes in it, so
> packing the structure better will reduce the size of it by 16 bytes.
> Also, move all the fields used in cache lookups into the first
> cacheline.
> 
> Before on x86_64:
> 
>         /* size: 320, cachelines: 5 */
> 	/* sum members: 298, holes: 6, sum holes: 22 */
> 
> After on x86_64:
> 
>         /* size: 304, cachelines: 5 */
> 	/* padding: 6 */
> 	/* last cacheline: 48 bytes */

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.h |   30 +++++++++++++++++++-----------
>  1 files changed, 19 insertions(+), 11 deletions(-)


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
  2010-09-14 10:56 ` [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
  2010-09-14 16:32   ` Christoph Hellwig
@ 2010-09-15 20:19   ` Alex Elder
  2010-09-16  0:28     ` Dave Chinner
  1 sibling, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-15 20:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Before we introduce per-buftarg LRU lists, split the shrinker
> implementation into per-buftarg shrinker callbacks. At the moment
> we wake all the xfsbufds to run the delayed write queues to free
> the dirty buffers and make their pages available for reclaim.
> However, with an LRU, we want to be able to free clean, unused
> buffers as well, so we need to separate the xfsbufd from the
> shrinker callbacks.

I have one comment/question embedded below.

Your new shrinker is better than the old one (and would
have been even if you didn't make them per-buftarg).
It doesn't initiate flushing when it's passed 0 for
nr_to_scan (though to be honest I'm not sure what
practical effect that will have).

In any case...

Reviewed-by: Alex Elder <aelder@sgi.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c |   89 ++++++++++++--------------------------------
>  fs/xfs/linux-2.6/xfs_buf.h |    4 +-
>  2 files changed, 27 insertions(+), 66 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index cce427d..3b54fee 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -337,7 +332,6 @@ _xfs_buf_lookup_pages(
>  					__func__, gfp_mask);
>  
>  			XFS_STATS_INC(xb_page_retries);
> -			xfsbufd_wakeup(NULL, 0, gfp_mask);

Why is it OK not to wake up the shrinker(s) here
now, when it was called for previously?

>  			congestion_wait(BLK_RW_ASYNC, HZ/50);
>  			goto retry;
>  		}

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 17/18] xfs: add a lru to the XFS buffer cache
  2010-09-14 10:56 ` [PATCH 17/18] xfs: add a lru to the XFS buffer cache Dave Chinner
  2010-09-14 23:16   ` Christoph Hellwig
@ 2010-09-15 21:28   ` Alex Elder
  1 sibling, 0 replies; 69+ messages in thread
From: Alex Elder @ 2010-09-15 21:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce a per-buftarg LRU for memory reclaim to operate on. This
> is the last piece we need to put in place so that we can fully
> control the buffer lifecycle. This allows XFS to be responsible for
> maintaining the working set of buffers under memory pressure instead
> of relying on the VM reclaim not to take pages we need out from
> underneath us.
> 
> The implementation is currently a bit naive - it does not rotate
> buffers on the LRU when they are accessed multiple times. Solving
> this problem is for a later patch series that re-introduces the
> buffer type specific reclaim reference counts to prioritise reclaim
> more effectively.

Two small comments below, otherwise looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c |   91 ++++++++++++++++++++++++++++++++++---------
>  fs/xfs/linux-2.6/xfs_buf.h |    5 ++
>  2 files changed, 77 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index 3b54fee..12b37c6 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -1446,27 +1466,29 @@ xfs_buf_iomove(
>   */
>  
>  /*
> - *	Wait for any bufs with callbacks that have been submitted but
> - *	have not yet returned... walk the hash list for the target.
> + * Wait for any bufs with callbacks that have been submitted but have not yet
> + * returned. These buffers will have an elevated hold count, so wait on those
> + * while freeing all the buffers only held by the LRU.
>   */
>  void
>  xfs_wait_buftarg(
>  	struct xfs_buftarg	*btp)
>  {
> -	struct xfs_perag	*pag;
> -	uint			i;
> -
> -	for (i = 0; i < btp->bt_mount->m_sb.sb_agcount; i++) {
> -		pag = xfs_perag_get(btp->bt_mount, i);
> -		spin_lock(&pag->pag_buf_lock);
> -		while (rb_first(&pag->pag_buf_tree)) {
> -			spin_unlock(&pag->pag_buf_lock);
> +	struct xfs_buf		*bp;

(Insert blank line here.)

> +restart:
> +	spin_lock(&btp->bt_lru_lock);
> +	while (!list_empty(&btp->bt_lru)) {
> +		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
> +		if (atomic_read(&bp->b_hold) > 1) {
> +			spin_unlock(&btp->bt_lru_lock);
>  			delay(100);
> -			spin_lock(&pag->pag_buf_lock);
> +			goto restart;
>  		}
> -		spin_unlock(&pag->pag_buf_lock);
> -		xfs_perag_put(pag);
> +		spin_unlock(&btp->bt_lru_lock);
> +		xfs_buf_rele(bp);
> +		spin_lock(&btp->bt_lru_lock);
>  	}
> +	spin_unlock(&btp->bt_lru_lock);
>  }
>  
>  int
> @@ -1477,15 +1499,44 @@ xfs_buftarg_shrink(
>  {
>  	struct xfs_buftarg	*btp = container_of(shrink,
>  					struct xfs_buftarg, bt_shrinker);
> -	if (nr_to_scan) {
> -		if (test_bit(XBT_FORCE_SLEEP, &btp->bt_flags))
> -			return -1;
> -		if (list_empty(&btp->bt_delwrite_queue))
> -			return -1;
> +	struct xfs_buf		*bp, *n;
> +
> +	if (!nr_to_scan)
> +		return btp->bt_lru_nr;
> +
> +	spin_lock(&btp->bt_lru_lock);
> +	if (test_and_set_bit(XBT_SHRINKER_ACTIVE, &btp->bt_flags)) {

Since test_and_set_bit() and clear_bit() are already atomic
you don't need to be under protection of bt_lru_lock to
manipulate the flags.  Get the spinlock after you've set
XBT_SHRINKER_ACTIVE (and clear it outside the spinlock
also).

> +		/* LRU walk already in progress */
> +		spin_unlock(&btp->bt_lru_lock);
> +		return -1;
> +	}
> +
> +	list_for_each_entry_safe(bp, n, &btp->bt_lru, b_lru) {
> +		if (nr_to_scan-- <= 0)
> +			break;

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
  2010-09-15 20:19   ` Alex Elder
@ 2010-09-16  0:28     ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-16  0:28 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Wed, Sep 15, 2010 at 03:19:27PM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:56 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Before we introduce per-buftarg LRU lists, split the shrinker
> > implementation into per-buftarg shrinker callbacks. At the moment
> > we wake all the xfsbufds to run the delayed write queues to free
> > the dirty buffers and make their pages available for reclaim.
> > However, with an LRU, we want to be able to free clean, unused
> > buffers as well, so we need to separate the xfsbufd from the
> > shrinker callbacks.
> 
> I have one comment/question embedded below.
> 
> Your new shrinker is better than the old one (and would
> have been even if you didn't make them per-buftarg).
> It doesn't initiate flushing when it's passed 0 for
> nr_to_scan (though to be honest I'm not sure what
> practical effect that will have).

shrinkers are a strange beast. When nr_to_scan is zero, it means
"tell me how many reclaimable objects you have" rather than "shrink
the cache". The calling code does some magic and calls the shrinker
again a great number of times with nr_to_scan == 128 until it
completes. If the shrinker does not want to be called again, then it
should return -1 instead of the number of reclaimable objects.
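
In code terms, the contract looks roughly like this (a schematic
shrinker for the 2.6.36-era API; the cache structure and reclaim helper
are made-up names, not XFS code):

/*
 * Schematic shrinker callback illustrating the convention described
 * above: nr_to_scan == 0 means "report how many reclaimable objects you
 * have"; otherwise scan up to nr_to_scan objects, and return -1 if you
 * don't want to be called again this round.
 */
static int
example_cache_shrink(
	struct shrinker		*shrink,
	int			nr_to_scan,
	gfp_t			gfp_mask)
{
	struct example_cache	*cache = container_of(shrink,
					struct example_cache, shrinker);

	if (!nr_to_scan)
		return cache->nr_cached;	/* count query only */

	if (!example_cache_try_reclaim(cache, nr_to_scan))
		return -1;			/* no progress possible, back off */

	return cache->nr_cached;
}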

> > ---
> >  fs/xfs/linux-2.6/xfs_buf.c |   89 ++++++++++++--------------------------------
> >  fs/xfs/linux-2.6/xfs_buf.h |    4 +-
> >  2 files changed, 27 insertions(+), 66 deletions(-)
> > 
> > diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> > index cce427d..3b54fee 100644
> > --- a/fs/xfs/linux-2.6/xfs_buf.c
> > +++ b/fs/xfs/linux-2.6/xfs_buf.c
> 
> . . .
> 
> > @@ -337,7 +332,6 @@ _xfs_buf_lookup_pages(
> >  					__func__, gfp_mask);
> >  
> >  			XFS_STATS_INC(xb_page_retries);
> > -			xfsbufd_wakeup(NULL, 0, gfp_mask);
> 
> Why is it OK not to wake up the shrinker(s) here
> now, when it was called for previously?

It's redundant. The shrinker loops will be called once per priority
level in reclaim (12 levels, IIRC, trying harder to free memory as
priority increases), so adding a 13th call after an allocation
failure does not really provide any extra benefit.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
  2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
                   ` (18 preceding siblings ...)
  2010-09-14 14:25 ` [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Christoph Hellwig
@ 2010-09-17 13:21 ` Alex Elder
  2010-09-21  2:02   ` Dave Chinner
  19 siblings, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-17 13:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, 2010-09-14 at 20:55 +1000, Dave Chinner wrote:
> This patchset has grown quite a bit - it started out as a "convert
> the buffer cache to rbtrees" patch, and has gotten bigger as I
> peeled the onion from one bottleneck to another.

I know you're going to re-submit this series.  I would
like you to split it into several smaller series if you
don't mind.  Some of these are simpler than others,
and there are some somewhat logical groupings (you
even described them here as two sets).  But beyond
that it would be nice to get at least some of them
committed before the full series is perfected.

To be constructive, here's a grouping based on what
seems to be a change of significance somehow.  I'm
not suggesting they all be separated, but I'm just
trying to identify the many things you're doing with
this series.  

[01/18] xfs: single thread inode cache shrinking.
[02/18] xfs: reduce the number of CIL lock round trips during commit

[05/18] xfs: convert inode cache lookups to use RCU locking
[06/18] xfs: convert pag_ici_lock to a spin lock

[07/18] xfs: don't use vfs writeback for pure metadata modifications

[08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate
[09/18] xfs: introduced uncached buffer read primitve
[10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf
[11/18] xfs: kill XBF_FS_MANAGED buffers
[12/18] xfs: use unhashed buffers for size checks
[13/18] xfs: remove buftarg hash for external devices

[03/18] xfs: remove debug assert for per-ag reference counting
[04/18] xfs: lockless per-ag lookups
[14/18] xfs: convert buffer cache hash to rbtree

[15/18] xfs; pack xfs_buf structure more tightly
[16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.

[17/18] xfs: add a lru to the XFS buffer cache
[18/18] xfs: stop using the page cache to back the buffer cache

Thanks.

					-Alex

> Performance numbers here are 8-way fs_mark create to 50M files, and
> 8-way rm -rf to remove the files created.
> 
> 			wall time	fs_mark rate
> 2.6.36-rc4:
> 	create:		13m10s		65k file/s
> 	unlink:		23m58s		N/A
> 
> The first set of patches are generic infrastructure changes that
> address pain points the rbtree based buffer cache introduces. I've
> put them first because they are simpler to review and have immediate
> impact on performance. These patches address lock contention as
> measured by the kernel lockstat infrastructure.
> 
> xfs: single thread inode cache shrinking.
> 	- prevents per-ag contention during cache shrinking
> 
> xfs: reduce the number of CIL lock round trips during commit
> 	- reduces lock traffic on the xc_cil_lock by two orders of
> 	  magnitude
> 
> xfs: remove debug assert for per-ag reference counting
> xfs: lockless per-ag lookups
> 	- hottest lock in the system with buffer cache rbtree path
> 	- converted to use RCU.
> 
> xfs: convert inode cache lookups to use RCU locking
> xfs: convert pag_ici_lock to a spin lock
> 	- addresses lookup vs reclaim contention on pag_ici_lock
> 	- converted to use RCU.
> 
> xfs: don't use vfs writeback for pure metadata modifications
> 	- inode writeback does not keep up with dirtying 100,000
> 	  inodes a second. Avoids the superblock dirty list where
> 	  possible by using the AIL as the age-order flusher.
> 
> Performance with these patches:
> 
> 2.6.36-rc4 + shrinker + CIL + RCU:
> 	create:		11m38s		80k files/s
> 	unlink:		14m29s		N/A
> 
> Create rate has improved by 20%, unlink time has almost halved. On
> large numbers of inodes, the unlink rate improves even more
> dramatically.
> 
> The buffer cache to rbtree series current stands at:
> 
> xfs: rename xfs_buf_get_nodaddr to be more appropriate
> xfs: introduced uncached buffer read primitve
> xfs: store xfs_mount in the buftarg instead of in the xfs_buf
> xfs: kill XBF_FS_MANAGED buffers
> xfs: use unhashed buffers for size checks
> xfs: remove buftarg hash for external devices
> 	- preparatory buffer cache API cleanup patches
> 
> xfs: convert buffer cache hash to rbtree
> 	- what it says ;)
> 	- includes changes based on Alex's review.
> 
> xfs; pack xfs_buf structure more tightly
> 	- memory usage reduction, means adding the LRU list head is
> 	  effectively memory usage neutral.
> 
> xfs: convert xfsbud shrinker to a per-buftarg shrinker.
> xfs: add a lru to the XFS buffer cache
> 	- Add an LRU for reclaim
> 
> xfs: stop using the page cache to back the buffer cache
> 	- kill all the page cache code
> 
> 2.6.36-rc4 + shrinker + CIL + RCU + rbtree:
> 	create:		 9m47s		95k files/s
> 	unlink:		14m16s		N/A
> 
> Create rate has improved by another 20%, unlink rate has improved
> marginally (noise, really).
> 
> There are two remaining parts to the buffer cache conversions:
> 
> 	1. work out how to efficiently support block size smaller
> 	than page size. The current code works, but uses a page per
> 	sub-page buffer.  A set of slab caches would be perfect for
> 	this use, but I'm not sure that we are allowed to use them
> 	for IO anymore. Christoph?
> 
> 	2. Connect up the buffer type specific reclaim priority
> 	reference counting and convert the LRU reclaim to a cursor
> 	based walk that simply drops reclaim reference counts and
> 	frees anything that has a zero reclaim reference.
> 
> Overall, I can swap the order of the two patch sets, and the
> incremental performance increases for create are pretty much
> identical. For unlink, the benefit comes from the shrinker
> modification. For those that care, the rbtree patch set in isolation
> results in a time of 4h38m to create 1 billion inodes on my 8p/4GB
> RAM test VM. I haven't run this test with the RCU and writeback
> modifications yet.
> 
> Moving on from this point is to start testing against Nick Piggin's
> VFS scalability tree, as the inode_lock and dcache_lock are now the
> performance limiting factors. That will, without doubt, bring new
> hotspots out in XFS so I'll be starting this cycle over again soon.
> 
> Overall diffstat at this point is:
> 
>  fs/xfs/linux-2.6/kmem.h        |    1 +
>  fs/xfs/linux-2.6/xfs_buf.c     |  588 ++++++++++++++--------------------------
>  fs/xfs/linux-2.6/xfs_buf.h     |   61 +++--
>  fs/xfs/linux-2.6/xfs_iops.c    |   18 +-
>  fs/xfs/linux-2.6/xfs_super.c   |   11 +-
>  fs/xfs/linux-2.6/xfs_sync.c    |   49 +++-
>  fs/xfs/linux-2.6/xfs_trace.h   |    2 +-
>  fs/xfs/quota/xfs_qm_syscalls.c |    4 +-
>  fs/xfs/xfs_ag.h                |    9 +-
>  fs/xfs/xfs_buf_item.c          |    3 +-
>  fs/xfs/xfs_fsops.c             |   11 +-
>  fs/xfs/xfs_iget.c              |   46 +++-
>  fs/xfs/xfs_inode.c             |   22 +-
>  fs/xfs/xfs_inode_item.c        |    9 -
>  fs/xfs/xfs_log.c               |    3 +-
>  fs/xfs/xfs_log_cil.c           |  116 +++++----
>  fs/xfs/xfs_log_recover.c       |   18 +-
>  fs/xfs/xfs_mount.c             |  126 ++++-----
>  fs/xfs/xfs_mount.h             |    2 +
>  fs/xfs/xfs_rtalloc.c           |   29 +-
>  fs/xfs/xfs_vnodeops.c          |    2 +-
>  21 files changed, 502 insertions(+), 628 deletions(-)
> 
> So it is improving performance, removing code and fixing
> longstanding bugs all at the same time. ;)
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
  2010-09-17 13:21 ` Alex Elder
@ 2010-09-21  2:02   ` Dave Chinner
  2010-09-21 16:23     ` Alex Elder
  0 siblings, 1 reply; 69+ messages in thread
From: Dave Chinner @ 2010-09-21  2:02 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Sep 17, 2010 at 08:21:40AM -0500, Alex Elder wrote:
> On Tue, 2010-09-14 at 20:55 +1000, Dave Chinner wrote:
> > This patchset has grown quite a bit - it started out as a "convert
> > the buffer cache to rbtrees" patch, and has gotten bigger as I
> > peeled the onion from one bottleneck to another.
> 
> I know you're going to re-submit this series.  I would
> like you to split it into several smaller series if you
> don't mind.  Some of these are simpler than others,
> and there are some somewhat logical groupings (you
> even described them here as two sets).  But beyond
> that it would be nice to get at least some of them
> committed before the full series is perfected.
> 
> To be constructive, here's a grouping based on what
> seems to be a change of significance somehow.  I'm
> not suggesting they all be separated, but I'm just
> trying to identify the many things you're doing with
> this series.  
> 
> [01/18] xfs: single thread inode cache shrinking.

This is separate, and I'm completely reworking this patch.

Fundamentally, single threading the shrinker is not robust enough
to prevent OOM situations because reclaim relies on the shrinkers to
throttle reclaim sufficiently that some memory is reclaimed while
they are running. Hence every shrinker call needs to do something,
otherwise the direct reclaim priority will wind up very quickly and
declare OOM instead of waiting for IO completion to clean and free
objects.

I've introduced per-ag reclaim walk locks to prevent multiple
shrinkers from walking the same AG, and this has prevented most of
the lock contention; instead, the shrinkers burn CPU walking.
Hence I've needed to add batched lookups (new patch)
and optimise the way we check whether an inode is reclaimable by
avoiding locking the inode first.

I've also added a scan cursor that tracks where the last reclaim
walk got up to, and restarts the next shrinker reclaim walk from
that inode. This means it is not walking over the same unreclaimable
inodes over and over again.

Then I've changed shrinker reclaim to be sort-of-blocking. That is, it
doesn't block on locks (because that would cause deadlocks), but it
does does SYNC_WAIT inode cluster IO when it finds a dirty inode.
This provides sufficient throttling for reclaim not to OOM.

To make matters even more complex, the inode cache shrinker was
being overloaded by the VM because the buffer cache lru shrinker was
not being asked to do enough work. As a result, the buffer cache
would grow to 1.5GB, and the VM would spend all its time trying to
shrink the inode cache! The cause of this was leaving stale
buffers on the LRU, so avoiding leaving them on the LRU has
prevented buffer cache windup on parallel unlink workloads.

The reclaim code is not as efficient as the "just do a full, single
threaded non-blocking reclaim pass" code, but it is still ~4x faster
than the existing code and does not randomly go OOM on parallel
unlink workloads.
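
For the record, the shape of what is being described is roughly this (a
sketch only - the lock, cursor, and batch-helper names are assumptions
made for illustration, not the new patch):

/*
 * Sketch of a per-AG reclaim walk as described above: only one shrinker
 * walks a given AG at a time (trylock), and a per-AG cursor remembers
 * where the previous walk finished so we don't rescan the same
 * unreclaimable inodes over and over. All names here are illustrative.
 */
static int
example_reclaim_ag(
	struct xfs_mount	*mp,
	struct xfs_perag	*pag,
	int			*nr_to_scan)
{
	xfs_agino_t		first = pag->pag_ici_reclaim_cursor;
	int			done = 0;

	if (!mutex_trylock(&pag->pag_ici_reclaim_lock))
		return 0;	/* another shrinker owns this AG right now */

	while (*nr_to_scan > 0 && !done) {
		/* batched radix tree gang lookup starting at 'first' */
		done = example_reclaim_batch(mp, pag, &first, nr_to_scan);
	}

	/* remember where we got to for the next shrinker call */
	pag->pag_ici_reclaim_cursor = done ? 0 : first;
	mutex_unlock(&pag->pag_ici_reclaim_lock);
	return done;
}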


> [02/18] xfs: reduce the number of CIL lock round trips during commit

Stand-alone fix.

> [05/18] xfs: convert inode cache lookups to use RCU locking
> [06/18] xfs: convert pag_ici_lock to a spin lock

So up to this point I now have:

d10d7d6 xfs: reduce the number of CIL lock round trips during commit
e9a1a2a xfs: remove debug assert for per-ag reference counting
0a055a3 xfs: lockless per-ag lookups
3403f75 xfs: convert inode cache lookups to use RCU locking
84c5b79 xfs: convert pag_ici_lock to a spin lock
033bc9b xfs: batch per-ag inode lookup walks (NEW)
c5bbc30 xfs: rework inode cache shrinking. (NEW)




> [07/18] xfs: don't use vfs writeback for pure metadata modifications

Standalone.

> [08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate
> [09/18] xfs: introduced uncached buffer read primitve
> [10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf
> [11/18] xfs: kill XBF_FS_MANAGED buffers
> [12/18] xfs: use unhashed buffers for size checks
> [13/18] xfs: remove buftarg hash for external devices

cleanups needed for rbtree conversion.

> [03/18] xfs: remove debug assert for per-ag reference counting
> [04/18] xfs: lockless per-ag lookups

Stand alone.

> [14/18] xfs: convert buffer cache hash to rbtree

Goes with the cleanups.

> [15/18] xfs; pack xfs_buf structure more tightly
> [16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker.
> [17/18] xfs: add a lru to the XFS buffer cache
> [18/18] xfs: stop using the page cache to back the buffer cache

All together, with the LRU code being reworked a bit w.r.t. stale
buffers and shrinker behaviour.

In reality, though, i don't think that separating them into separate
series make much sense. The order they are in right now is
bisectable and fairly logical....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
  2010-09-21  2:02   ` Dave Chinner
@ 2010-09-21 16:23     ` Alex Elder
  2010-09-21 22:34       ` Dave Chinner
  0 siblings, 1 reply; 69+ messages in thread
From: Alex Elder @ 2010-09-21 16:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: XFS Mailing List

On Tue, 2010-09-21 at 12:02 +1000, Dave Chinner wrote:
> On Fri, Sep 17, 2010 at 08:21:40AM -0500, Alex Elder wrote:
> > On Tue, 2010-09-14 at 20:55 +1000, Dave Chinner wrote:
> > > This patchset has grown quite a bit - it started out as a "convert
> > > the buffer cache to rbtrees" patch, and has gotten bigger as I
> > > peeled the onion from one bottleneck to another.
. . .

> 
> All together, with the LRU code being reworked a bit w.r.t. stale
> buffers and shrinker behaviour.
> 
> In reality, though, I don't think that separating them into separate
> series makes much sense. The order they are in right now is
> bisectable and fairly logical....

I have been thinking about this since sending it.  I think my
concern was not so much that they were all in one series.  It's
more about the fact that you are doing a number of non-trivial
changes, all together.  And as such my perception of the combined
risk of committing them all at once is higher.  So what I was
probably after was somehow being able to verify each chunk of
the series separately, spilling them out gradually rather
than all at once.

But in the end, I guess I agree with what you say.  If we could
get some parts--like those you say are standalone--committed
earlier (and then out for wider exposure sooner) that would be
good, but otherwise it's OK as a single series.  I'll look for
your next update, and will just wait for pull request(s) when
you feel they're ready.

					-Alex

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/18] xfs: metadata and buffer cache scalability improvements
  2010-09-21 16:23     ` Alex Elder
@ 2010-09-21 22:34       ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-21 22:34 UTC (permalink / raw)
  To: Alex Elder; +Cc: XFS Mailing List

On Tue, Sep 21, 2010 at 11:23:12AM -0500, Alex Elder wrote:
> On Tue, 2010-09-21 at 12:02 +1000, Dave Chinner wrote:
> > On Fri, Sep 17, 2010 at 08:21:40AM -0500, Alex Elder wrote:
> > > On Tue, 2010-09-14 at 20:55 +1000, Dave Chinner wrote:
> > > > This patchset has grown quite a bit - it started out as a "convert
> > > > the buffer cache to rbtrees" patch, and has gotten bigger as I
> > > > peeled the onion from one bottleneck to another.
> . . .
> 
> > 
> > All together, with the LRU code being reworked a bit w.r.t. stale
> > buffers and shrinker behaviour.
> > 
> > In reality, though, I don't think that separating them into separate
> > series makes much sense. The order they are in right now is
> > bisectable and fairly logical....
> 
> I have been thinking about this since sending it.  I think my
> concern was not so much that they were all in one series.  It's
> more about the fact that you are doing a number of non-trivial
> changes, all together.  And as such my perception of the combined
> risk of committing them all at once is higher.  So what I was
> probably after was somehow being able to verify each chunk of
> the series separately, spilling them out gradually rather
> than all at once.
> 
> But in the end, I guess I agree with what you say.  If we could
> get some parts--like those you say are standalone--committed
> earlier (and then out for wider exposure sooner) that would be
> good, but otherwise it's OK as a single series.  I'll look for
> your next update, and will just wait for pull request(s) when
> you feel they're ready.

Ok, that sounds reasonable. I can split out all the standalone/cleanup
stuff, and leave the functional changes to later. I'll
do a reorder later today.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications
  2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
  2010-09-14 14:54   ` Christoph Hellwig
  2010-09-14 22:12   ` Alex Elder
@ 2010-11-08 10:47   ` Christoph Hellwig
  2 siblings, 0 replies; 69+ messages in thread
From: Christoph Hellwig @ 2010-11-08 10:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

I finally tracked down the xfsqa 065 regression, and it's caused by this
patch.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit
  2010-09-27  1:47 [PATCH 0/18] xfs: metadata scalability V4 Dave Chinner
@ 2010-09-27  1:47 ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-27  1:47 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When committing a transaction, we do a CIL state lock round trip
on every single log vector we insert into the CIL. This is resulting
in the lock being as hot as the inode and dcache locks on 8-way
create workloads. Rework the insertion loops to bring the number
of lock round trips to one per transaction for log vectors, and one
more for the busy extents.

Also change the allocation of the log vector buffer not to zero it
as we copy over the entire allocated buffer anyway.

This patch also includes a structural cleanup to the CIL item
insertion provided by Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/xfs_log_cil.c |  232 +++++++++++++++++++++++++++-----------------------
 1 files changed, 127 insertions(+), 105 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7e206fc..23d6ceb 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -146,102 +146,6 @@ xlog_cil_init_post_recovery(
 }
 
 /*
- * Insert the log item into the CIL and calculate the difference in space
- * consumed by the item. Add the space to the checkpoint ticket and calculate
- * if the change requires additional log metadata. If it does, take that space
- * as well. Remove the amount of space we addded to the checkpoint ticket from
- * the current transaction ticket so that the accounting works out correctly.
- *
- * If this is the first time the item is being placed into the CIL in this
- * context, pin it so it can't be written to disk until the CIL is flushed to
- * the iclog and the iclog written to disk.
- */
-static void
-xlog_cil_insert(
-	struct log		*log,
-	struct xlog_ticket	*ticket,
-	struct xfs_log_item	*item,
-	struct xfs_log_vec	*lv)
-{
-	struct xfs_cil		*cil = log->l_cilp;
-	struct xfs_log_vec	*old = lv->lv_item->li_lv;
-	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
-	int			len;
-	int			diff_iovecs;
-	int			iclog_space;
-
-	if (old) {
-		/* existing lv on log item, space used is a delta */
-		ASSERT(!list_empty(&item->li_cil));
-		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
-
-		len = lv->lv_buf_len - old->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs - old->lv_niovecs;
-		kmem_free(old->lv_buf);
-		kmem_free(old);
-	} else {
-		/* new lv, must pin the log item */
-		ASSERT(!lv->lv_item->li_lv);
-		ASSERT(list_empty(&item->li_cil));
-
-		len = lv->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs;
-		IOP_PIN(lv->lv_item);
-
-	}
-	len += diff_iovecs * sizeof(xlog_op_header_t);
-
-	/* attach new log vector to log item */
-	lv->lv_item->li_lv = lv;
-
-	spin_lock(&cil->xc_cil_lock);
-	list_move_tail(&item->li_cil, &cil->xc_cil);
-	ctx->nvecs += diff_iovecs;
-
-	/*
-	 * If this is the first time the item is being committed to the CIL,
-	 * store the sequence number on the log item so we can tell
-	 * in future commits whether this is the first checkpoint the item is
-	 * being committed into.
-	 */
-	if (!item->li_seq)
-		item->li_seq = ctx->sequence;
-
-	/*
-	 * Now transfer enough transaction reservation to the context ticket
-	 * for the checkpoint. The context ticket is special - the unit
-	 * reservation has to grow as well as the current reservation as we
-	 * steal from tickets so we can correctly determine the space used
-	 * during the transaction commit.
-	 */
-	if (ctx->ticket->t_curr_res == 0) {
-		/* first commit in checkpoint, steal the header reservation */
-		ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len);
-		ctx->ticket->t_curr_res = ctx->ticket->t_unit_res;
-		ticket->t_curr_res -= ctx->ticket->t_unit_res;
-	}
-
-	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
-		int hdrs;
-
-		hdrs = (len + iclog_space - 1) / iclog_space;
-		/* need to take into account split region headers, too */
-		hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += hdrs;
-		ctx->ticket->t_curr_res += hdrs;
-		ticket->t_curr_res -= hdrs;
-		ASSERT(ticket->t_curr_res >= len);
-	}
-	ticket->t_curr_res -= len;
-	ctx->space_used += len;
-
-	spin_unlock(&cil->xc_cil_lock);
-}
-
-/*
  * Format log item into a flat buffers
  *
  * For delayed logging, we need to hold a formatted buffer containing all the
@@ -286,7 +190,7 @@ xlog_cil_format_items(
 			len += lv->lv_iovecp[index].i_len;
 
 		lv->lv_buf_len = len;
-		lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
+		lv->lv_buf = kmem_alloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
 		ptr = lv->lv_buf;
 
 		for (index = 0; index < lv->lv_niovecs; index++) {
@@ -300,21 +204,136 @@ xlog_cil_format_items(
 	}
 }
 
+/*
+ * Prepare the log item for insertion into the CIL. Calculate the difference in
+ * log space and vectors it will consume, and if it is a new item pin it as
+ * well.
+ */
+STATIC void
+xfs_cil_prepare_item(
+	struct log		*log,
+	struct xfs_log_vec	*lv,
+	int			*len,
+	int			*diff_iovecs)
+{
+	struct xfs_log_vec	*old = lv->lv_item->li_lv;
+
+	if (old) {
+		/* existing lv on log item, space used is a delta */
+		ASSERT(!list_empty(&lv->lv_item->li_cil));
+		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
+
+		*len += lv->lv_buf_len - old->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
+		kmem_free(old->lv_buf);
+		kmem_free(old);
+	} else {
+		/* new lv, must pin the log item */
+		ASSERT(!lv->lv_item->li_lv);
+		ASSERT(list_empty(&lv->lv_item->li_cil));
+
+		*len += lv->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs;
+		IOP_PIN(lv->lv_item);
+
+	}
+
+	/* attach new log vector to log item */
+	lv->lv_item->li_lv = lv;
+
+	/*
+	 * If this is the first time the item is being committed to the
+	 * CIL, store the sequence number on the log item so we can
+	 * tell in future commits whether this is the first checkpoint
+	 * the item is being committed into.
+	 */
+	if (!lv->lv_item->li_seq)
+		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
+}
+
+/*
+ * Insert the log items into the CIL and calculate the difference in space
+ * consumed by the item. Add the space to the checkpoint ticket and calculate
+ * if the change requires additional log metadata. If it does, take that space
+ * as well. Remove the amount of space we addded to the checkpoint ticket from
+ * the current transaction ticket so that the accounting works out correctly.
+ */
 static void
 xlog_cil_insert_items(
 	struct log		*log,
 	struct xfs_log_vec	*log_vector,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*start_lsn)
+	struct xlog_ticket	*ticket)
 {
-	struct xfs_log_vec *lv;
-
-	if (start_lsn)
-		*start_lsn = log->l_cilp->xc_ctx->sequence;
+	struct xfs_cil		*cil = log->l_cilp;
+	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
+	struct xfs_log_vec	*lv;
+	int			len = 0;
+	int			diff_iovecs = 0;
+	int			iclog_space;
 
 	ASSERT(log_vector);
+
+	/*
+	 * Do all the accounting aggregation and switching of log vectors
+	 * around in a separate loop to the insertion of items into the CIL.
+	 * Then we can do a separate loop to update the CIL within a single
+	 * lock/unlock pair. This reduces the number of round trips on the CIL
+	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
+	 * hold time for the transaction commit.
+	 *
+	 * If this is the first time the item is being placed into the CIL in
+	 * this context, pin it so it can't be written to disk until the CIL is
+	 * flushed to the iclog and the iclog written to disk.
+	 *
+	 * We can do this safely because the context can't checkpoint until we
+	 * are done so it doesn't matter exactly how we update the CIL.
+	 */
+	for (lv = log_vector; lv; lv = lv->lv_next)
+		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
+
+	/* account for space used by new iovec headers  */
+	len += diff_iovecs * sizeof(xlog_op_header_t);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* move the items to the tail of the CIL */
 	for (lv = log_vector; lv; lv = lv->lv_next)
-		xlog_cil_insert(log, ticket, lv->lv_item, lv);
+		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
+
+	ctx->nvecs += diff_iovecs;
+
+	/*
+	 * Now transfer enough transaction reservation to the context ticket
+	 * for the checkpoint. The context ticket is special - the unit
+	 * reservation has to grow as well as the current reservation as we
+	 * steal from tickets so we can correctly determine the space used
+	 * during the transaction commit.
+	 */
+	if (ctx->ticket->t_curr_res == 0) {
+		/* first commit in checkpoint, steal the header reservation */
+		ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len);
+		ctx->ticket->t_curr_res = ctx->ticket->t_unit_res;
+		ticket->t_curr_res -= ctx->ticket->t_unit_res;
+	}
+
+	/* do we need space for more log record headers? */
+	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
+	if (len > 0 && (ctx->space_used / iclog_space !=
+				(ctx->space_used + len) / iclog_space)) {
+		int hdrs;
+
+		hdrs = (len + iclog_space - 1) / iclog_space;
+		/* need to take into account split region headers, too */
+		hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
+		ctx->ticket->t_unit_res += hdrs;
+		ctx->ticket->t_curr_res += hdrs;
+		ticket->t_curr_res -= hdrs;
+		ASSERT(ticket->t_curr_res >= len);
+	}
+	ticket->t_curr_res -= len;
+	ctx->space_used += len;
+
+	spin_unlock(&cil->xc_cil_lock);
 }
 
 static void
@@ -638,7 +657,10 @@ xfs_log_commit_cil(
 
 	/* lock out background commit */
 	down_read(&log->l_cilp->xc_ctx_lock);
-	xlog_cil_insert_items(log, log_vector, tp->t_ticket, commit_lsn);
+	if (commit_lsn)
+		*commit_lsn = log->l_cilp->xc_ctx->sequence;
+
+	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
 
 	/* check we didn't blow the reservation */
 	if (tp->t_ticket->t_curr_res < 0)
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit
  2010-09-24 12:30 [PATCH 0/18] xfs: metadata scalability V3 Dave Chinner
@ 2010-09-24 12:31 ` Dave Chinner
  0 siblings, 0 replies; 69+ messages in thread
From: Dave Chinner @ 2010-09-24 12:31 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When committing a transaction, we do a CIL state lock round trip on
every single log vector we insert into the CIL. This results in the
lock being as hot as the inode and dcache locks on 8-way create
workloads. Rework the insertion loops to bring the number of lock
round trips down to one per transaction for the log vectors, and one
more for the busy extents.

Also change the allocation of the log vector buffer so that it is not
zeroed, as we copy over the entire allocated buffer anyway.

This patch also includes a structural cleanup to the CIL item
insertion provided by Christoph Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/xfs_log_cil.c |  232 +++++++++++++++++++++++++++-----------------------
 1 files changed, 127 insertions(+), 105 deletions(-)
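
For reference, the structural change is simple: do all the per-item
accounting and log vector switching in an unlocked pass, then take the
CIL lock exactly once to move the whole batch onto the CIL. Below is a
minimal userspace sketch of that pattern; the names (struct cil,
cil_insert_items and so on) are invented for illustration, and the real
code additionally pins new items and swaps log vectors in the unlocked
pass and does the ticket reservation accounting under the lock, so
treat this as a sketch rather than the XFS implementation.

#include <pthread.h>
#include <stdio.h>

struct item {
	struct item	*next;
	int		bytes;		/* space this item consumes */
};

struct cil {
	pthread_mutex_t	lock;
	struct item	*head;		/* committed items */
	int		space_used;
};

static void
cil_insert_items(struct cil *cil, struct item *items)
{
	struct item	*i;
	int		len = 0;

	/* pass 1: no lock held - aggregate the space the batch will use */
	for (i = items; i; i = i->next)
		len += i->bytes;

	/* pass 2: one lock/unlock pair publishes the whole batch */
	pthread_mutex_lock(&cil->lock);
	while (items) {
		struct item *next = items->next;

		/* prepend for brevity; the real code tail-moves on a list */
		items->next = cil->head;
		cil->head = items;
		items = next;
	}
	cil->space_used += len;
	pthread_mutex_unlock(&cil->lock);
}

int main(void)
{
	struct item	b = { NULL, 64 }, a = { &b, 128 };
	struct cil	cil = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

	cil_insert_items(&cil, &a);
	printf("space used: %d\n", cil.space_used);	/* prints 192 */
	return 0;
}

(Builds with "cc -pthread".) The log record header accounting in the
hunk below then operates on the aggregated length: if adding len bytes
pushes ctx->space_used across an iclog boundary, the context ticket
grows by hdrs = ceil(len / iclog_space) extra headers' worth of space,
i.e. hdrs * (l_iclog_hsize + sizeof(struct xlog_op_header)), all stolen
from the committing transaction's ticket.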

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7e206fc..23d6ceb 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -146,102 +146,6 @@ xlog_cil_init_post_recovery(
 }
 
 /*
- * Insert the log item into the CIL and calculate the difference in space
- * consumed by the item. Add the space to the checkpoint ticket and calculate
- * if the change requires additional log metadata. If it does, take that space
- * as well. Remove the amount of space we addded to the checkpoint ticket from
- * the current transaction ticket so that the accounting works out correctly.
- *
- * If this is the first time the item is being placed into the CIL in this
- * context, pin it so it can't be written to disk until the CIL is flushed to
- * the iclog and the iclog written to disk.
- */
-static void
-xlog_cil_insert(
-	struct log		*log,
-	struct xlog_ticket	*ticket,
-	struct xfs_log_item	*item,
-	struct xfs_log_vec	*lv)
-{
-	struct xfs_cil		*cil = log->l_cilp;
-	struct xfs_log_vec	*old = lv->lv_item->li_lv;
-	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
-	int			len;
-	int			diff_iovecs;
-	int			iclog_space;
-
-	if (old) {
-		/* existing lv on log item, space used is a delta */
-		ASSERT(!list_empty(&item->li_cil));
-		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
-
-		len = lv->lv_buf_len - old->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs - old->lv_niovecs;
-		kmem_free(old->lv_buf);
-		kmem_free(old);
-	} else {
-		/* new lv, must pin the log item */
-		ASSERT(!lv->lv_item->li_lv);
-		ASSERT(list_empty(&item->li_cil));
-
-		len = lv->lv_buf_len;
-		diff_iovecs = lv->lv_niovecs;
-		IOP_PIN(lv->lv_item);
-
-	}
-	len += diff_iovecs * sizeof(xlog_op_header_t);
-
-	/* attach new log vector to log item */
-	lv->lv_item->li_lv = lv;
-
-	spin_lock(&cil->xc_cil_lock);
-	list_move_tail(&item->li_cil, &cil->xc_cil);
-	ctx->nvecs += diff_iovecs;
-
-	/*
-	 * If this is the first time the item is being committed to the CIL,
-	 * store the sequence number on the log item so we can tell
-	 * in future commits whether this is the first checkpoint the item is
-	 * being committed into.
-	 */
-	if (!item->li_seq)
-		item->li_seq = ctx->sequence;
-
-	/*
-	 * Now transfer enough transaction reservation to the context ticket
-	 * for the checkpoint. The context ticket is special - the unit
-	 * reservation has to grow as well as the current reservation as we
-	 * steal from tickets so we can correctly determine the space used
-	 * during the transaction commit.
-	 */
-	if (ctx->ticket->t_curr_res == 0) {
-		/* first commit in checkpoint, steal the header reservation */
-		ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len);
-		ctx->ticket->t_curr_res = ctx->ticket->t_unit_res;
-		ticket->t_curr_res -= ctx->ticket->t_unit_res;
-	}
-
-	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
-		int hdrs;
-
-		hdrs = (len + iclog_space - 1) / iclog_space;
-		/* need to take into account split region headers, too */
-		hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += hdrs;
-		ctx->ticket->t_curr_res += hdrs;
-		ticket->t_curr_res -= hdrs;
-		ASSERT(ticket->t_curr_res >= len);
-	}
-	ticket->t_curr_res -= len;
-	ctx->space_used += len;
-
-	spin_unlock(&cil->xc_cil_lock);
-}
-
-/*
  * Format log item into a flat buffers
  *
  * For delayed logging, we need to hold a formatted buffer containing all the
@@ -286,7 +190,7 @@ xlog_cil_format_items(
 			len += lv->lv_iovecp[index].i_len;
 
 		lv->lv_buf_len = len;
-		lv->lv_buf = kmem_zalloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
+		lv->lv_buf = kmem_alloc(lv->lv_buf_len, KM_SLEEP|KM_NOFS);
 		ptr = lv->lv_buf;
 
 		for (index = 0; index < lv->lv_niovecs; index++) {
@@ -300,21 +204,136 @@ xlog_cil_format_items(
 	}
 }
 
+/*
+ * Prepare the log item for insertion into the CIL. Calculate the difference in
+ * log space and vectors it will consume, and if it is a new item pin it as
+ * well.
+ */
+STATIC void
+xfs_cil_prepare_item(
+	struct log		*log,
+	struct xfs_log_vec	*lv,
+	int			*len,
+	int			*diff_iovecs)
+{
+	struct xfs_log_vec	*old = lv->lv_item->li_lv;
+
+	if (old) {
+		/* existing lv on log item, space used is a delta */
+		ASSERT(!list_empty(&lv->lv_item->li_cil));
+		ASSERT(old->lv_buf && old->lv_buf_len && old->lv_niovecs);
+
+		*len += lv->lv_buf_len - old->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs - old->lv_niovecs;
+		kmem_free(old->lv_buf);
+		kmem_free(old);
+	} else {
+		/* new lv, must pin the log item */
+		ASSERT(!lv->lv_item->li_lv);
+		ASSERT(list_empty(&lv->lv_item->li_cil));
+
+		*len += lv->lv_buf_len;
+		*diff_iovecs += lv->lv_niovecs;
+		IOP_PIN(lv->lv_item);
+
+	}
+
+	/* attach new log vector to log item */
+	lv->lv_item->li_lv = lv;
+
+	/*
+	 * If this is the first time the item is being committed to the
+	 * CIL, store the sequence number on the log item so we can
+	 * tell in future commits whether this is the first checkpoint
+	 * the item is being committed into.
+	 */
+	if (!lv->lv_item->li_seq)
+		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
+}
+
+/*
+ * Insert the log items into the CIL and calculate the difference in space
+ * consumed by the item. Add the space to the checkpoint ticket and calculate
+ * if the change requires additional log metadata. If it does, take that space
+ * as well. Remove the amount of space we added to the checkpoint ticket from
+ * the current transaction ticket so that the accounting works out correctly.
+ */
 static void
 xlog_cil_insert_items(
 	struct log		*log,
 	struct xfs_log_vec	*log_vector,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*start_lsn)
+	struct xlog_ticket	*ticket)
 {
-	struct xfs_log_vec *lv;
-
-	if (start_lsn)
-		*start_lsn = log->l_cilp->xc_ctx->sequence;
+	struct xfs_cil		*cil = log->l_cilp;
+	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
+	struct xfs_log_vec	*lv;
+	int			len = 0;
+	int			diff_iovecs = 0;
+	int			iclog_space;
 
 	ASSERT(log_vector);
+
+	/*
+	 * Do all the accounting aggregation and switching of log vectors
+	 * around in a separate loop to the insertion of items into the CIL.
+	 * Then we can do a separate loop to update the CIL within a single
+	 * lock/unlock pair. This reduces the number of round trips on the CIL
+	 * lock from O(nr_logvectors) to O(1) and greatly reduces the overall
+	 * hold time for the transaction commit.
+	 *
+	 * If this is the first time the item is being placed into the CIL in
+	 * this context, pin it so it can't be written to disk until the CIL is
+	 * flushed to the iclog and the iclog written to disk.
+	 *
+	 * We can do this safely because the context can't checkpoint until we
+	 * are done so it doesn't matter exactly how we update the CIL.
+	 */
+	for (lv = log_vector; lv; lv = lv->lv_next)
+		xfs_cil_prepare_item(log, lv, &len, &diff_iovecs);
+
+	/* account for space used by new iovec headers  */
+	len += diff_iovecs * sizeof(xlog_op_header_t);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* move the items to the tail of the CIL */
 	for (lv = log_vector; lv; lv = lv->lv_next)
-		xlog_cil_insert(log, ticket, lv->lv_item, lv);
+		list_move_tail(&lv->lv_item->li_cil, &cil->xc_cil);
+
+	ctx->nvecs += diff_iovecs;
+
+	/*
+	 * Now transfer enough transaction reservation to the context ticket
+	 * for the checkpoint. The context ticket is special - the unit
+	 * reservation has to grow as well as the current reservation as we
+	 * steal from tickets so we can correctly determine the space used
+	 * during the transaction commit.
+	 */
+	if (ctx->ticket->t_curr_res == 0) {
+		/* first commit in checkpoint, steal the header reservation */
+		ASSERT(ticket->t_curr_res >= ctx->ticket->t_unit_res + len);
+		ctx->ticket->t_curr_res = ctx->ticket->t_unit_res;
+		ticket->t_curr_res -= ctx->ticket->t_unit_res;
+	}
+
+	/* do we need space for more log record headers? */
+	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
+	if (len > 0 && (ctx->space_used / iclog_space !=
+				(ctx->space_used + len) / iclog_space)) {
+		int hdrs;
+
+		hdrs = (len + iclog_space - 1) / iclog_space;
+		/* need to take into account split region headers, too */
+		hdrs *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
+		ctx->ticket->t_unit_res += hdrs;
+		ctx->ticket->t_curr_res += hdrs;
+		ticket->t_curr_res -= hdrs;
+		ASSERT(ticket->t_curr_res >= len);
+	}
+	ticket->t_curr_res -= len;
+	ctx->space_used += len;
+
+	spin_unlock(&cil->xc_cil_lock);
 }
 
 static void
@@ -638,7 +657,10 @@ xfs_log_commit_cil(
 
 	/* lock out background commit */
 	down_read(&log->l_cilp->xc_ctx_lock);
-	xlog_cil_insert_items(log, log_vector, tp->t_ticket, commit_lsn);
+	if (commit_lsn)
+		*commit_lsn = log->l_cilp->xc_ctx->sequence;
+
+	xlog_cil_insert_items(log, log_vector, tp->t_ticket);
 
 	/* check we didn't blow the reservation */
 	if (tp->t_ticket->t_curr_res < 0)
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2010-11-08 10:46 UTC | newest]

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-14 10:55 [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Dave Chinner
2010-09-14 10:56 ` [PATCH 01/18] xfs: single thread inode cache shrinking Dave Chinner
2010-09-14 18:48   ` Alex Elder
2010-09-14 22:48     ` Dave Chinner
2010-09-14 10:56 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
2010-09-14 14:48   ` Christoph Hellwig
2010-09-14 17:21   ` Alex Elder
2010-09-14 10:56 ` [PATCH 03/18] xfs: remove debug assert for per-ag reference counting Dave Chinner
2010-09-14 14:48   ` Christoph Hellwig
2010-09-14 17:22   ` Alex Elder
2010-09-14 10:56 ` [PATCH 04/18] xfs: lockless per-ag lookups Dave Chinner
2010-09-14 12:35   ` Dave Chinner
2010-09-14 14:50   ` Christoph Hellwig
2010-09-14 17:28   ` Alex Elder
2010-09-14 10:56 ` [PATCH 05/18] xfs: convert inode cache lookups to use RCU locking Dave Chinner
2010-09-14 16:27   ` Christoph Hellwig
2010-09-14 23:17     ` Dave Chinner
2010-09-14 21:23   ` Alex Elder
2010-09-14 23:42     ` Dave Chinner
2010-09-14 10:56 ` [PATCH 06/18] xfs: convert pag_ici_lock to a spin lock Dave Chinner
2010-09-14 21:26   ` Alex Elder
2010-09-14 10:56 ` [PATCH 07/18] xfs: don't use vfs writeback for pure metadata modifications Dave Chinner
2010-09-14 14:54   ` Christoph Hellwig
2010-09-15  0:14     ` Dave Chinner
2010-09-15  0:17       ` Christoph Hellwig
2010-09-14 22:12   ` Alex Elder
2010-09-15  0:28     ` Dave Chinner
2010-11-08 10:47   ` Christoph Hellwig
2010-09-14 10:56 ` [PATCH 08/18] xfs: rename xfs_buf_get_nodaddr to be more appropriate Dave Chinner
2010-09-14 14:56   ` Christoph Hellwig
2010-09-14 22:14   ` Alex Elder
2010-09-14 10:56 ` [PATCH 09/18] xfs: introduced uncached buffer read primitve Dave Chinner
2010-09-14 14:56   ` Christoph Hellwig
2010-09-14 22:16   ` Alex Elder
2010-09-14 10:56 ` [PATCH 10/18] xfs: store xfs_mount in the buftarg instead of in the xfs_buf Dave Chinner
2010-09-14 14:57   ` Christoph Hellwig
2010-09-14 22:21   ` Alex Elder
2010-09-14 10:56 ` [PATCH 11/18] xfs: kill XBF_FS_MANAGED buffers Dave Chinner
2010-09-14 14:59   ` Christoph Hellwig
2010-09-14 22:26   ` Alex Elder
2010-09-14 10:56 ` [PATCH 12/18] xfs: use unhashed buffers for size checks Dave Chinner
2010-09-14 15:00   ` Christoph Hellwig
2010-09-14 22:29   ` Alex Elder
2010-09-14 10:56 ` [PATCH 13/18] xfs: remove buftarg hash for external devices Dave Chinner
2010-09-14 22:29   ` Alex Elder
2010-09-14 10:56 ` [PATCH 14/18] xfs: convert buffer cache hash to rbtree Dave Chinner
2010-09-14 16:29   ` Christoph Hellwig
2010-09-15 17:46   ` Alex Elder
2010-09-14 10:56 ` [PATCH 15/18] xfs; pack xfs_buf structure more tightly Dave Chinner
2010-09-14 16:30   ` Christoph Hellwig
2010-09-15 18:01   ` Alex Elder
2010-09-14 10:56 ` [PATCH 16/18] xfs: convert xfsbud shrinker to a per-buftarg shrinker Dave Chinner
2010-09-14 16:32   ` Christoph Hellwig
2010-09-15 20:19   ` Alex Elder
2010-09-16  0:28     ` Dave Chinner
2010-09-14 10:56 ` [PATCH 17/18] xfs: add a lru to the XFS buffer cache Dave Chinner
2010-09-14 23:16   ` Christoph Hellwig
2010-09-15  0:05     ` Dave Chinner
2010-09-15 21:28   ` Alex Elder
2010-09-14 10:56 ` [PATCH 18/18] xfs: stop using the page cache to back the " Dave Chinner
2010-09-14 23:20   ` Christoph Hellwig
2010-09-15  0:06     ` Dave Chinner
2010-09-14 14:25 ` [PATCH 0/18] xfs: metadata and buffer cache scalability improvements Christoph Hellwig
2010-09-17 13:21 ` Alex Elder
2010-09-21  2:02   ` Dave Chinner
2010-09-21 16:23     ` Alex Elder
2010-09-21 22:34       ` Dave Chinner
2010-09-24 12:30 [PATCH 0/18] xfs: metadata scalability V3 Dave Chinner
2010-09-24 12:31 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
2010-09-27  1:47 [PATCH 0/18] xfs: metadata scalability V4 Dave Chinner
2010-09-27  1:47 ` [PATCH 02/18] xfs: reduce the number of CIL lock round trips during commit Dave Chinner
