* [PATCH 00/11] Sync and VFS scalability improvements
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

Hi folks,

This series of patches is against the current mmotm tree here:

http://git.cmpxchg.org/cgit/linux-mmotm.git/

It addresses several VFS scalability issues, the most pressing of which is lock
contention triggered by concurrent sync(2) calls.

The patches in the series are:

writeback: plug writeback at a high level

This patch greatly reduces writeback IOPS on XFS when writing lots of small
files. It improves performance by ~20-30% on XFS on fast devices by reducing
small file write IOPS by 95%, but doesn't seem to impact ext4 or btrfs
performance or IOPS in any noticeable way.
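
The change itself is tiny (patch 01 below); in essence it brackets the
inode writeback loop in writeback_sb_inodes() with a block plug, so the
small data IOs queued while walking the b_io list are merged and
dispatched together rather than immediately per mapping:

	struct blk_plug plug;

	blk_start_plug(&plug);
	while (!list_empty(&wb->b_io)) {
		/* write back the next inode's dirty pages as before */
	}
	blk_finish_plug(&plug);		/* dispatch the accumulated, merged IO */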

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

Roughly 5-10% of the spinlock contention on 16-way create workloads on XFS comes
from remove_inode_hash(), even though XFS doesn't use the inode hash and uses
hlist_add_fake() to avoid needing to insert inodes into the hash. We still
take the lock to remove the inode from the hash. This patch avoids taking the
lock on inode eviction, too.

inode: convert inode_sb_list_lock to per-sb
sync: serialise per-superblock sync operations
inode: rename i_wb_list to i_io_list
bdi: add a new writeback list for sync
writeback: periodically trim the writeback list

This series removes the global inode_sb_list_lock and all the contention points
related to sync(2). The global lock is first converted to a per-filesystem lock
to reduce the scope of contention, and a mutex is added to wait_sb_inodes() to
stop concurrent sync(2) operations from walking the inode list at the same time
while still guaranteeing sync(2) waits for all the IO it needs to.  It then adds
patches to track inodes under writeback for sync(2) in an optimal manner,
greatly reducing the overhead of sync(2) on large inode caches.

inode: convert per-sb inode list to a list_lru

This patch converts the per-sb list and lock to the per-node list_lru structures
to remove the global lock bottleneck for workloads that have heavy cache
insertion and removal concurrency. A 4-node NUMA machine saw a 3.5x speedup on
an inode cache intensive concurrent bulkstat operation (cycling 1.7 million
inodes/s through the XFS inode cache) as a result of this patch.

c8cb115 fs: Use RCU lookups for inode cache

Lockless inode hash traversals for ext4 and btrfs. Both see significant speedups
for directory traversal intensive workloads with this patch as it removes the
inode_hash_lock from cache lookups. The inode_hash_lock is still a limiting
factor for inode cache inserts and removals, but that's a much more complex
problem to solve.
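
For reference, a lockless lookup like this typically follows the usual RCU
pattern: walk the hash chain under rcu_read_lock(), then revalidate the
candidate under its i_lock before taking a reference. A rough sketch of the
shape of such a lookup, given the hash chain head for (sb, ino) - not the
actual patch, which appears later in the series:

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			/* being torn down - caller has to wait and retry */
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;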

8925a8d list_lru: don't need node lock in list_lru_count_node
4411917 list_lru: don't lock during add/del if unnecessary

Optimisations for the list_lru primitives. Because of the sheer number of calls
to these functions under heavy concurrent VFS workloads, these functions show up
quite hot in profiles. Hence making sure we don't take locks when we don't
really need to makes a measurable difference to the CPU consumption shown in the
profiles.
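
The idea is simply to avoid taking the per-node lock when there is nothing
for it to protect - e.g. something along these lines for the add path (a
sketch of the approach, not the exact patch):

	bool list_lru_add(struct list_lru *lru, struct list_head *item)
	{
		int nid = page_to_nid(virt_to_page(item));
		struct list_lru_node *nlru = &lru->node[nid];

		/* already on a list? nothing to do, and no lock needed */
		if (!list_empty(item))
			return false;

		spin_lock(&nlru->lock);
		if (list_empty(item)) {
			list_add_tail(item, &nlru->list);
			nlru->nr_items++;
			spin_unlock(&nlru->lock);
			return true;
		}
		spin_unlock(&nlru->lock);
		return false;
	}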


Performance Summary
-------------------

Concurrent sync:

Load 8 million XFS inodes into the cache - all clean - and run
100 concurrent sync calls using:

$ time (for i in `seq 0 1 100`; do sync & done; wait)

		inodes		   total sync time
				  real		 system
mmotm		8366826		146.080s	1481.698s
patched		8560697		  0.109s	   0.346s

System interactivity on mmotm is crap - it's completely CPU bound and takes
seconds to respond to input.

Run fsmark creating 10 million 4k files with 16 threads, then run the above 100
concurrent sync calls once 1.5 million files have been created.

		fsmark		sync		sync system time
mmotm		259s		502.794s	4893.977s
patched		204s		 62.423s	   3.224s

Note: the difference in fsmark performance on this workload is due to the
first patch in the series - the writeback plugging patch.

Inode cache modification intensive workloads:

Simple workloads:

	- 16-way fsmark to create 51.2 million empty files.
	- multithreaded bulkstat, one thread per AG
	- 16-way 'find /mnt/N -ctime 1' (directory + inode read)
	- 16-way unlink

Storage: 100TB sparse filesystem image with a 1MB extent size hint on XFS on
4x64GB SSD RAID 0 (i.e. thin-provisioned with 1MB allocation granularity):

XFS		create		bulkstat	lookup	unlink
mmotm		4m28s		2m42s		2m20s	6m46s
patched		4m22s		0m37s		1m59s	6m45s

create and unlink are no faster because the reduction in lock contention on the
inode lists simply translated into more contention on the XFS transaction
commit code (I have other patches to address that). Bulkstat scaled almost
linearly with the number of inode lists, and lookup improved significantly as
well.

For ext4, I didn't bother with unlinks because they are single threaded due to
the orphan list locking, so there's not much point in waiting for half an
hour to get the same result each time.

ext4		create		lookup
mmotm		7m35s		4m46s
patched		7m40s		2m01s

See the links for more detailed analysis including profiles:

http://oss.sgi.com/archives/xfs/2013-07/msg00084.html
http://oss.sgi.com/archives/xfs/2013-07/msg00110.html

Testing:

- xfstests on 1p, 2p, and 8p VMs, with both xfs and ext4.
- benchmarking using fsmark as per above with xfs, ext4 and btrfs.
- prolonged stress testing with fsstress, dbench and postmark

Comments, thoughts, testing and flames are all welcome....

Cheers,

Dave.

---
 fs/block_dev.c                   |  77 +++++++++------
 fs/drop_caches.c                 |  57 +++++++----
 fs/fs-writeback.c                | 163 ++++++++++++++++++++++++++-----
 fs/inode.c                       | 217 ++++++++++++++++++++++-------------------
 fs/internal.h                    |   1 -
 fs/notify/inode_mark.c           | 111 +++++++++------------
 fs/quota/dquot.c                 | 174 +++++++++++++++++++++------------
 fs/super.c                       |  11 ++-
 fs/xfs/xfs_iops.c                |   2 +
 include/linux/backing-dev.h      |   3 +
 include/linux/fs.h               |  16 ++-
 include/linux/fsnotify_backend.h |   2 +-
 mm/backing-dev.c                 |   7 +-
 mm/list_lru.c                    |  14 +--
 mm/page-writeback.c              |  14 +++
 15 files changed, 550 insertions(+), 319 deletions(-)



* [PATCH 01/11] writeback: plug writeback at a high level
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload with
fsmark on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so they aggregate nicely into large IOs. However, data writeback
IOs are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	------------	------	---------
unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%

If I use zero length files, this workload runs at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1,000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media, as the IO patterns for this workload are going to make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 68851ff..1d23d9a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -589,7 +589,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
 
+	blk_start_plug(&plug);
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
@@ -686,6 +688,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 				break;
 		}
 	}
+	blk_finish_plug(&plug);
 	return wrote;
 }
 
-- 
1.8.3.2



* [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this, add an inode opflag to allow remove_inode_hash() to
avoid taking the hash lock on inodes that have never actually been
hashed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_iops.c  | 2 ++
 include/linux/fs.h | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 96dda62..68a8264 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1168,8 +1168,10 @@ xfs_setup_inode(
 	inode->i_state = I_NEW;
 
 	inode_sb_list_add(inode);
+
 	/* make the inode look hashed for the writeback code */
 	hlist_add_fake(&inode->i_hash);
+	inode->i_opflags |= IOP_NOTHASHED;
 
 	inode->i_mode	= ip->i_d.di_mode;
 	set_nlink(inode, ip->i_d.di_nlink);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..51cf6ed 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -515,6 +515,7 @@ struct posix_acl;
 #define IOP_FASTPERM	0x0001
 #define IOP_LOOKUP	0x0002
 #define IOP_NOFOLLOW	0x0004
+#define IOP_NOTHASHED	0x0008	/* inode never hashed, avoid unhashing */
 
 /*
  * Keep mostly read-only and often accessed (especially for
@@ -2371,7 +2372,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode))
+	if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))
 		__remove_inode_hash(inode);
 }
 
-- 
1.8.3.2



* [PATCH 03/11] inode: convert inode_sb_list_lock to per-sb
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks from each other.
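
All the converted list walkers keep the same basic shape, just with the
per-sb lock: take s_inode_list_lock, pin the inode of interest with a
reference, drop the list lock to do the real work, and defer the iput()
until the list lock has been dropped again - the idiom the comments in
the hunks below keep referring to:

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* ... per-inode work without the list lock held ... */

		iput(old_inode);	/* can't iput under the list lock */
		old_inode = inode;
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(old_inode);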

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/block_dev.c                   | 12 ++++++------
 fs/drop_caches.c                 | 10 ++++++----
 fs/fs-writeback.c                | 12 ++++++------
 fs/inode.c                       | 28 +++++++++++++---------------
 fs/internal.h                    |  1 -
 fs/notify/inode_mark.c           | 20 ++++++++++----------
 fs/quota/dquot.c                 | 16 ++++++++--------
 fs/super.c                       |  3 ++-
 include/linux/fs.h               |  5 ++++-
 include/linux/fsnotify_backend.h |  2 +-
 10 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..c39c0f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1669,7 +1669,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&blockdev_superblock->s_inode_list_lock);
 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
@@ -1681,13 +1681,13 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&blockdev_superblock->s_inode_list_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1695,8 +1695,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&blockdev_superblock->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 9fd702f..f1be790 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -27,13 +27,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1d23d9a..ca66dc8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1217,7 +1217,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1237,14 +1237,14 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1254,9 +1254,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index e315c0a..cd10287 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,8 +27,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode_sb_list_lock protects:
- *   sb->s_inodes, inode->i_sb_list
+ * inode->i_sb->s_inode_list_lock protects:
+ *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
  * inode_hash_lock protects:
@@ -36,7 +36,7 @@
  *
  * Lock ordering:
  *
- * inode_sb_list_lock
+ * inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -44,7 +44,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode_sb_list_lock
+ *   inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -56,8 +56,6 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -432,18 +430,18 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&inode->i_sb->s_inode_list_lock);
 	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&inode->i_sb->s_inode_list_lock);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
 	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&inode->i_sb->s_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&inode->i_sb->s_inode_list_lock);
 	}
 }
 
@@ -600,7 +598,7 @@ void evict_inodes(struct super_block *sb)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
@@ -616,7 +614,7 @@ void evict_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -637,7 +635,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
@@ -660,7 +658,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -901,7 +899,7 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_sb_list_lock);
+	spin_lock_prefetch(&sb->s_inode_list_lock);
 
 	inode = new_inode_pseudo(sb);
 	if (inode)
diff --git a/fs/internal.h b/fs/internal.h
index 0b0db70..bdc0e86 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,6 @@ extern int open_check_o_direct(struct file *f);
 /*
  * inode.c
  */
-extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
 			    int nid);
 extern void inode_add_lru(struct inode *inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 74825be..fac139a 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -242,17 +242,17 @@ out:
 
 /**
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @list: list of inodes being unmounted (sb->s_inodes)
+ * @sb: superblock being unmounted.
  *
  * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop inode_sb_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
  */
-void fsnotify_unmount_inodes(struct list_head *list)
+void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *next_i, *need_iput = NULL;
 
-	spin_lock(&inode_sb_list_lock);
-	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
 		struct inode *need_iput_tmp;
 
 		/*
@@ -288,7 +288,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
+		if ((&next_i->i_sb_list != &sb->s_inodes) &&
 		    atomic_read(&next_i->i_count)) {
 			spin_lock(&next_i->i_lock);
 			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
@@ -299,11 +299,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_sb_list_lock here because we hold
+		 * We can safely drop s_inode_list_lock here because we hold
 		 * references on both inode and next_i.  Also no new inodes
 		 * will be added since the umount has begun.
 		 */
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -315,7 +315,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 017eaea..fe93be2 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -920,7 +920,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -932,15 +932,15 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock We cannot iput the inode now as we can be
+		 * s_inode_list_lock. We cannot iput the inode now as we can be
 		 * holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1021,7 +1021,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1035,7 +1035,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 09da975..d4d753e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -201,6 +201,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
+		spin_lock_init(&s->s_inode_list_lock);
 
 		if (list_lru_init(&s->s_dentry_lru))
 			goto err_out;
@@ -441,7 +442,7 @@ void generic_shutdown_super(struct super_block *sb)
 		sync_filesystem(sb);
 		sb->s_flags &= ~MS_ACTIVE;
 
-		fsnotify_unmount_inodes(&sb->s_inodes);
+		fsnotify_unmount_inodes(sb);
 
 		evict_inodes(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 51cf6ed..923b465 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1263,7 +1263,6 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
-	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
 	struct list_head __percpu *s_files;
@@ -1328,6 +1327,10 @@ struct super_block {
 	 */
 	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
+
+	/* s_inode_list_lock protects s_inodes */
+	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
+	struct list_head	s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 4b2ee8d..3d58028 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -427,7 +427,7 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un
 extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
 extern void fsnotify_get_mark(struct fsnotify_mark *mark);
 extern void fsnotify_put_mark(struct fsnotify_mark *mark);
-extern void fsnotify_unmount_inodes(struct list_head *list);
+extern void fsnotify_unmount_inodes(struct super_block *sb);
 
 /* put here because inotify does some weird stuff when destroying watches */
 extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
-- 
1.8.3.2



* [PATCH 04/11] sync: serialise per-superblock sync operations
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c  | 11 +++++++++++
 fs/super.c         |  1 +
 include/linux/fs.h |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ca66dc8..56272ec 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1207,6 +1207,15 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * Concurrent callers will block on the s_sync_lock rather than doing contending
+ * walks. The queueing maintains sync(2) required behaviour as all the IO that
+ * has been issued up to the time this function is entered is guaranteed to be
+ * completed by the time we have gained the lock and waited for all IO that is
+ * in progress regardless of the order callers are granted the lock.
+ */
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1217,6 +1226,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	mutex_lock(&sb->s_sync_lock);
 	spin_lock(&sb->s_inode_list_lock);
 
 	/*
@@ -1258,6 +1268,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	}
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
+	mutex_unlock(&sb->s_sync_lock);
 }
 
 /**
diff --git a/fs/super.c b/fs/super.c
index d4d753e..7f98fd6 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -200,6 +200,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_bdi = &default_backing_dev_info;
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
+		mutex_init(&s->s_sync_lock);
 		INIT_LIST_HEAD(&s->s_inodes);
 		spin_lock_init(&s->s_inode_list_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 923b465..971e8be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1321,6 +1321,8 @@ struct super_block {
 	/* Being remounted read-only */
 	int s_readonly_remount;
 
+	struct mutex		s_sync_lock;	/* sync serialisation lock */
+
 	/*
 	 * Keep the lru lists last in the structure so they always sit on their
 	 * own individual cachelines.
-- 
1.8.3.2



* [PATCH 05/11] inode: rename i_wb_list to i_io_list
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to add a new "under writeback" list to
the inode, or to call it an "under IO" list on the bdi, because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/block_dev.c     |  2 +-
 fs/fs-writeback.c  | 18 +++++++++---------
 fs/inode.c         |  6 +++---
 include/linux/fs.h |  2 +-
 mm/backing-dev.c   |  6 +++---
 5 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c39c0f3..bec0c26 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -68,7 +68,7 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 	if (inode->i_state & I_DIRTY) {
 		if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
 			wakeup_bdi = true;
-		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+		list_move(&inode->i_io_list, &dst->wb.b_dirty);
 	}
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&old->wb.list_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 56272ec..b7ac1c2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -77,7 +77,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
-	return list_entry(head, struct inode, i_wb_list);
+	return list_entry(head, struct inode, i_io_list);
 }
 
 /*
@@ -171,7 +171,7 @@ void inode_wb_list_del(struct inode *inode)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
+	list_del_init(&inode->i_io_list);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -194,7 +194,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_wb_list, &wb->b_dirty);
+	list_move(&inode->i_io_list, &wb->b_dirty);
 }
 
 /*
@@ -203,7 +203,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
-	list_move(&inode->i_wb_list, &wb->b_more_io);
+	list_move(&inode->i_io_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -254,7 +254,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_wb_list, &tmp);
+		list_move(&inode->i_io_list, &tmp);
 		moved++;
 	}
 
@@ -270,7 +270,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		list_for_each_prev_safe(pos, node, &tmp) {
 			inode = wb_inode(pos);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_wb_list, dispatch_queue);
+				list_move(&inode->i_io_list, dispatch_queue);
 		}
 	}
 out:
@@ -418,7 +418,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		redirty_tail(inode, wb);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	}
 }
 
@@ -528,7 +528,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * touch it. See comment above for explanation.
 	 */
 	if (!(inode->i_state & I_DIRTY))
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
 out:
@@ -1193,7 +1193,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			list_move(&inode->i_io_list, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 
 			if (wakeup_bdi)
diff --git a/fs/inode.c b/fs/inode.c
index cd10287..f8e0f2f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
  * inode->i_sb->s_inode_list_lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
- *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
+ *   bdi->wb.b_{dirty,io,more_io}, inode->i_io_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
  *
@@ -364,7 +364,7 @@ void inode_init_once(struct inode *inode)
 	memset(inode, 0, sizeof(*inode));
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
@@ -530,7 +530,7 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	if (!list_empty(&inode->i_wb_list))
+	if (!list_empty(&inode->i_io_list))
 		inode_wb_list_del(inode);
 
 	inode_sb_list_del(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 971e8be..b99eb39 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -576,7 +576,7 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
 	struct hlist_node	i_hash;
-	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e04454c..024d4ca 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_io, i_io_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
 		nr_more_io++;
 	spin_unlock(&wb->list_lock);
 
-- 
1.8.3.2



* [PATCH 06/11] bdi: add a new writeback list for sync
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the related locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that a rare sync(2) call will have some work to do
skipping clean inodes. However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.
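
In short, the lifecycle of the new b_writeback list is:

 - test_set_page_writeback(): the first page in a mapping to gain the
   PAGECACHE_TAG_WRITEBACK tag puts the inode on bdi->wb.b_writeback
   via bdi_mark_inode_writeback().

 - wait_sb_inodes() (and the periodic trim added in the next patch)
   lazily drops inodes that no longer have any pages tagged for
   writeback.

 - inode_wait_for_writeback() at eviction time removes the inode
   unconditionally via bdi_clear_inode_writeback().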

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c           | 110 +++++++++++++++++++++++++++++++++++++-------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   3 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 ++++++
 6 files changed, 114 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b7ac1c2..638f122 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -176,6 +176,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -330,13 +358,18 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1218,7 +1251,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct inode *old_inode = NULL;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1226,28 +1261,59 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages put under IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly. Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't have to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -1264,9 +1330,21 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+                /*
+                 * the inode has been written back now, so check whether we
+                 * still have pages under IO and move it back to the primary
+                 * list if necessary
+                 */
+                spin_lock_irq(&mapping->tree_lock);
+                spin_lock(&bdi->wb.list_lock);
+                if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+                        WARN_ON(list_empty(&inode->i_wb_list));
+                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+                spin_unlock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index f8e0f2f..de62d2b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -365,6 +365,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..45edd8a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -58,6 +58,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -125,6 +126,8 @@ void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b99eb39..5bb84b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -579,6 +579,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 024d4ca..b0cc55a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -419,6 +419,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..038c893 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2262,11 +2262,25 @@ int test_set_page_writeback(struct page *page)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are removed lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
1.8.3.2



* [PATCH 07/11] writeback: periodically trim the writeback list
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

Inodes are removed lazily from the bdi writeback list, so in the
absence of sync(2) work inodes will build up on the bdi writeback
list even though they are no longer under IO. Use the periodic
kupdate work check to remove inodes no longer under IO from the
writeback list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 638f122..7c9bbf0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -962,6 +962,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
 	return 0;
 }
 
+/*
+ * clean out writeback list for all inodes that don't have IO in progress
+ */
+static void wb_trim_writeback_list(struct bdi_writeback *wb)
+{
+	struct inode *inode;
+	struct inode *tmp;
+
+	spin_lock(&wb->list_lock);
+	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
+		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
+			list_del_init(&inode->i_wb_list);
+	}
+	spin_unlock(&wb->list_lock);
+
+}
+
 static long wb_check_old_data_flush(struct bdi_writeback *wb)
 {
 	unsigned long expired;
@@ -978,6 +995,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	if (time_before(jiffies, expired))
 		return 0;
 
+	wb_trim_writeback_list(wb);
+
 	wb->last_old_flush = jiffies;
 	nr_pages = get_nr_dirty_pages();
 
-- 
1.8.3.2



* [PATCH 08/11] inode: convert per-sb inode list to a list_lru
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

The per-superblock inode list and lock are a bottleneck for systems
that cycle inodes in and out of cache concurrently. Even made per-sb,
the single list lock is still a limiting factor.

Most of the additions to the sb inode list occur on the CPU that
allocated the inode, and most of the removals occur during evict()
calls as a result of memory reclaim. Both of these events are local
to the node that the inode belongs to, so the work maps naturally to the
per-node lists that the list_lru uses.

There are several places where the inode list is walked. These can
be converted easily to use list_lru_walk() to do their work on each
inode on the list.
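
The conversions below all follow the list_lru_walk() callback contract: the
callback is invoked with the relevant per-node list lock held, and if it
drops that lock to do real work (as these do, after pinning the inode with
a reference) it must retake the lock before returning and return LRU_SKIP
so the walker simply moves on to the next item. Roughly (a skeleton of the
pattern, not one of the actual callbacks):

	static enum lru_status
	inode_walk_cb(struct list_head *item, spinlock_t *lock, void *cb_arg)
	{
		struct inode *inode = container_of(item, struct inode, i_sb_list);

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			return LRU_SKIP;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(lock);		/* drop the list_lru node lock */

		/* ... per-inode work, iput() deferral as before ... */

		spin_lock(lock);		/* must be held again on return */
		return LRU_SKIP;
	}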

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/block_dev.c         |  75 ++++++++++++---------
 fs/drop_caches.c       |  57 +++++++++++-----
 fs/fs-writeback.c      |   4 +-
 fs/inode.c             | 134 +++++++++++++++++++------------------
 fs/notify/inode_mark.c | 111 +++++++++++++------------------
 fs/quota/dquot.c       | 174 ++++++++++++++++++++++++++++++++-----------------
 fs/super.c             |   9 ++-
 include/linux/fs.h     |   9 ++-
 8 files changed, 327 insertions(+), 246 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index bec0c26..b8ec2bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1665,38 +1665,55 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
-{
-	struct inode *inode, *old_inode = NULL;
+struct bdev_iter {
+	void (*func)(struct block_device *, void *);
+	void *arg;
+	struct inode *toput_inode;
+};
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+static enum lru_status
+bdev_iter_cb(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	struct bdev_iter *iter = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-		    mapping->nrpages == 0) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+	    inode->i_mapping->nrpages == 0) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		return LRU_SKIP;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		func(I_BDEV(inode), arg);
+	iput(iter->toput_inode);
+	iter->toput_inode = inode;
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-	iput(old_inode);
+	iter->func(I_BDEV(inode), iter->arg);
+
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+/*
+ * iterate_bdevs - run a callback across all block devices
+ */
+void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
+{
+	struct bdev_iter iter = {
+		.func = func,
+		.arg = arg,
+	};
+
+	list_lru_walk(&blockdev_superblock->s_inode_list, bdev_iter_cb, &iter,
+		      ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
+	iput(iter.toput_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index f1be790..048a7d7 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,29 +13,50 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static enum lru_status
+drop_pagecache_inode(struct list_head *item, spinlock_t *lock, void *cb_arg)
 {
-	struct inode *inode, *toput_inode = NULL;
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (inode->i_mapping->nrpages == 0)) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		return LRU_SKIP;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
+	iput(*toput_inode);
+	*toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	invalidate_mapping_pages(inode->i_mapping, 0, -1);
+
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+
+/*
+ * This is a best effort scan, so we don't need to be absolutely sure we hit all
+ * inodes on the superblock. Hence a single pass is sufficient to catch them
+ * all.
+ */
+static void drop_pagecache_sb(struct super_block *sb, void *unused)
+{
+	struct inode *toput_inode = NULL;
+
+	list_lru_walk(&sb->s_inode_list, drop_pagecache_inode, &toput_inode,
+		      ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7c9bbf0..f2d91de 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1337,9 +1337,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
+		 * wb.list_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
+		 * wb.list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
diff --git a/fs/inode.c b/fs/inode.c
index de62d2b..7eea591 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -27,8 +27,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
- *   inode->i_sb->s_inodes, inode->i_sb_list
+ * Inode list locks protect:
+ *   inode->i_sb->s_inode_list, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io}, inode->i_io_list
  * inode_hash_lock protects:
@@ -36,7 +36,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * Inode list lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -44,7 +44,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   Inode list lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -366,6 +366,7 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_sb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
@@ -431,19 +432,13 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	list_lru_add(&inode->i_sb->s_inode_list, &inode->i_sb_list);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	list_lru_del(&inode->i_sb->s_inode_list, &inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -585,6 +580,48 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
+static enum lru_status
+__evict_inodes_isolate(struct list_head *item, spinlock_t *lock, void *cb_arg,
+		       bool kill_dirty)
+{
+	struct list_head *dispose = cb_arg;
+	struct inode	*inode = container_of(item, struct inode, i_sb_list);
+
+	if (atomic_read(&inode->i_count))
+		return LRU_SKIP;
+
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
+
+	if ((inode->i_state & I_DIRTY) && !kill_dirty) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
+
+	inode->i_state |= I_FREEING;
+	inode_lru_list_del(inode);
+	list_add(&inode->i_lru, dispose);
+
+	list_del_init(&inode->i_sb_list);
+	spin_unlock(&inode->i_lock);
+	return LRU_REMOVED;
+}
+
+static enum lru_status
+evict_inodes_isolate(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return __evict_inodes_isolate(item, lock, cb_arg, true);
+}
+
+static enum lru_status
+invalidate_inodes_isolate(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return __evict_inodes_isolate(item, lock, cb_arg, false);
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -596,28 +633,15 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (atomic_read(&inode->i_count))
-			continue;
-
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	long freed;
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	do {
+		LIST_HEAD(dispose);
 
-	dispose_list(&dispose);
+		freed = list_lru_walk(&sb->s_inode_list, evict_inodes_isolate,
+				      &dispose, ULONG_MAX);
+		dispose_list(&dispose);
+	} while (freed > 0);
 }
 
 /**
@@ -632,38 +656,24 @@ void evict_inodes(struct super_block *sb)
  */
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
-	int busy = 0;
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
+	list_lru_walk_cb isolate;
+	long freed;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode->i_state & I_DIRTY && !kill_dirty) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
-		if (atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
+	isolate = kill_dirty ? evict_inodes_isolate : invalidate_inodes_isolate;
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	do {
+		LIST_HEAD(dispose);
 
-	dispose_list(&dispose);
+		freed = list_lru_walk(&sb->s_inode_list, isolate,
+				      &dispose, ULONG_MAX);
+		dispose_list(&dispose);
+	} while (freed > 0);
 
-	return busy;
+	/*
+	 * if we skipped any inodes because we couldn't isolate them, tell the
+	 * caller there are still active inodes.
+	 */
+	return !!list_lru_count(&sb->s_inode_list);
 }
 
 /*
@@ -866,7 +876,7 @@ EXPORT_SYMBOL(get_next_ino);
  *	@sb: superblock
  *
  *	Allocates a new inode for given superblock.
- *	Inode wont be chained in superblock s_inodes list
+ *	Inode won't be chained in the superblock s_inode_list
  *	This means :
  *	- fs can't be unmount
  *	- quotas, fsnotify, writeback can't work
@@ -900,8 +910,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index fac139a..2aef9e5 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -240,82 +240,59 @@ out:
 	return ret;
 }
 
-/**
- * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @sb: superblock being unmounted.
- *
- * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
- */
-void fsnotify_unmount_inodes(struct super_block *sb)
+static enum lru_status
+fsnotify_unmount_inode(struct list_head *item, spinlock_t *lock, void *cb_arg)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
+	/* New or being freed inodes cannot have any associated watches. */
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
 
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+	/* If i_count is zero, the inode cannot have any watches */
+	if (!atomic_read(&inode->i_count)) {
 		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
 
-		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != &sb->s_inodes) &&
-		    atomic_read(&next_i->i_count)) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-				__iget(next_i);
-				need_iput = next_i;
-			}
-			spin_unlock(&next_i->i_lock);
-		}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		/*
-		 * We can safely drop s_inode_list_lock here because we hold
-		 * references on both inode and next_i.  Also no new inodes
-		 * will be added since the umount has begun.
-		 */
-		spin_unlock(&sb->s_inode_list_lock);
+	iput(*toput_inode);
+	*toput_inode = inode;
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	fsnotify_inode_delete(inode);
 
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
 
-		fsnotify_inode_delete(inode);
+/**
+ * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
+ * @sb: superblock being unmounted.
+ *
+ * Called during unmount with sb->s_umount held exclusively, so the inode
+ * list will not grow and a single pass will catch all inodes.
+ */
+void fsnotify_unmount_inodes(struct super_block *sb)
+{
+	struct inode *toput_inode = NULL;
 
-		iput(inode);
+	list_lru_walk(&sb->s_inode_list, fsnotify_unmount_inode, &toput_inode,
+		      ULONG_MAX);
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	/* the list walk doesn't release the last inode it sees! */
+	iput(toput_inode);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index fe93be2..829d7c8 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -901,55 +901,79 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
-/* This routine is guarded by dqonoff_mutex mutex */
-static void add_dquot_ref(struct super_block *sb, int type)
+static enum lru_status
+add_dquot_ref_type(struct list_head *item, spinlock_t *lock, void *cb_arg,
+		int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    !atomic_read(&inode->i_writecount) ||
+	    !dqinit_needed(inode, type)) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-#ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
-#endif
-		iput(old_inode);
-		__dquot_initialize(inode, type);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
+		return LRU_SKIP;
 	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
+	if (unlikely(inode_get_rsv_space(inode) > 0))
 		quota_error(sb, "Writes happened before quota was turned on "
 			"thus quota information is probably inconsistent. "
 			"Please run quotacheck(8)");
-	}
 #endif
+
+	iput(*toput_inode);
+	*toput_inode = inode;
+
+	__dquot_initialize(inode, type);
+
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+static enum lru_status
+add_dquot_ref_usr(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return add_dquot_ref_type(item, lock, cb_arg, USRQUOTA);
+}
+
+static enum lru_status
+add_dquot_ref_grp(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return add_dquot_ref_type(item, lock, cb_arg, GRPQUOTA);
+}
+
+/* add_dquot_ref is protected by dqonoff_mutex */
+void add_dquot_ref(struct super_block *sb, int type)
+{
+	struct inode *toput_inode = NULL;
+	list_lru_walk_cb isolate;
+
+	switch (type) {
+	case USRQUOTA:
+		isolate = add_dquot_ref_usr;
+		break;
+	case GRPQUOTA:
+		isolate = add_dquot_ref_grp;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+	list_lru_walk(&sb->s_inode_list, isolate, &toput_inode, ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
+	iput(toput_inode);
 }
 
 /*
@@ -1015,34 +1039,62 @@ static void put_dquot_list(struct list_head *tofree_head)
 	}
 }
 
-static void remove_dquot_ref(struct super_block *sb, int type,
-		struct list_head *tofree_head)
+static enum lru_status
+remove_dquot_ref_type(struct list_head *item, spinlock_t *lock, void *cb_arg,
+		      int type)
 {
-	struct inode *inode;
-	int reserved = 0;
+	struct list_head *tofree_head = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 *  We have to scan also I_NEW inodes because they can already
-		 *  have quota pointer initialized. Luckily, we need to touch
-		 *  only quota pointers and these have separate locking
-		 *  (dqptr_sem).
-		 */
-		if (!IS_NOQUOTA(inode)) {
-			if (unlikely(inode_get_rsv_space(inode) > 0))
-				reserved = 1;
-			remove_inode_dquot_ref(inode, type, tofree_head);
-		}
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	/*
+	 *  We have to scan also I_NEW inodes because they can already
+	 *  have quota pointer initialized. Luckily, we need to touch
+	 *  only quota pointers and these have separate locking
+	 *  (dqptr_sem).
+	 */
+	if (!IS_NOQUOTA(inode)) {
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
-		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
+		if (unlikely(inode_get_rsv_space(inode) > 0)) {
+			printk_ratelimited(KERN_WARNING
+			"VFS (%s): Writes happened after quota"
 			" was disabled thus quota information is probably "
 			"inconsistent. Please run quotacheck(8).\n", sb->s_id);
-	}
+		}
 #endif
+		remove_inode_dquot_ref(inode, type, tofree_head);
+	}
+	return LRU_SKIP;
+}
+
+static enum lru_status
+remove_dquot_ref_usr(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return remove_dquot_ref_type(item, lock, cb_arg, USRQUOTA);
+}
+
+static enum lru_status
+remove_dquot_ref_grp(struct list_head *item, spinlock_t *lock, void *cb_arg)
+{
+	return remove_dquot_ref_type(item, lock, cb_arg, GRPQUOTA);
+}
+
+static void remove_dquot_ref(struct super_block *sb, int type,
+		struct list_head *tofree_head)
+{
+	list_lru_walk_cb isolate;
+
+	switch (type) {
+	case USRQUOTA:
+		isolate = remove_dquot_ref_usr;
+		break;
+	case GRPQUOTA:
+		isolate = remove_dquot_ref_grp;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+	list_lru_walk(&sb->s_inode_list, isolate, tofree_head, ULONG_MAX);
 }
 
 /* Gather all references from inodes and drop them */
diff --git a/fs/super.c b/fs/super.c
index 7f98fd6..97e2274 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -201,13 +201,13 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		mutex_init(&s->s_sync_lock);
-		INIT_LIST_HEAD(&s->s_inodes);
-		spin_lock_init(&s->s_inode_list_lock);
 
 		if (list_lru_init(&s->s_dentry_lru))
 			goto err_out;
 		if (list_lru_init(&s->s_inode_lru))
 			goto err_out_dentry_lru;
+		if (list_lru_init(&s->s_inode_list))
+			goto err_out_inode_lru;
 
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
@@ -249,6 +249,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 out:
 	return s;
 
+err_out_inode_lru:
+	list_lru_destroy(&s->s_inode_lru);
 err_out_dentry_lru:
 	list_lru_destroy(&s->s_dentry_lru);
 err_out:
@@ -332,6 +334,7 @@ void deactivate_locked_super(struct super_block *s)
 		unregister_shrinker(&s->s_shrink);
 		list_lru_destroy(&s->s_dentry_lru);
 		list_lru_destroy(&s->s_inode_lru);
+		list_lru_destroy(&s->s_inode_list);
 
 		put_filesystem(fs);
 		put_super(s);
@@ -450,7 +453,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (list_lru_count(&sb->s_inode_list)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bb84b1..28baf4d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1331,9 +1331,12 @@ struct super_block {
 	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/*
+	 * the inode list is not strictly used as an LRU, but uses the list_lru
+	 * construct to provide a scalable list implementation for adding,
+	 * removing and walking the inodes cached in memory.
+	 */
+	struct list_lru		s_inode_list ____cacheline_aligned_in_smp;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 09/11] fs: Use RCU lookups for inode cache
  2013-07-31  4:15 [PATCH 00/11] Sync and VFS scalability improvements Dave Chinner
                   ` (7 preceding siblings ...)
  2013-07-31  4:15 ` [PATCH 08/11] inode: convert per-sb inode list to a list_lru Dave Chinner
@ 2013-07-31  4:15 ` Dave Chinner
  2013-07-31  4:15 ` [PATCH 10/11] list_lru: don't need node lock in list_lru_count_node Dave Chinner
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Tested w/ ext4 and btrfs on concurrent fsmark/lookup/unlink
workloads only. It removes the inode hash lock from the inode lookup
paths, but does not solve the problem of the inode hash lock being a
bottleneck on the inode cache insert/remove paths.
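
As a condensed sketch of the new lookup shape (illustrative only; the
real code is find_inode()/find_inode_fast() in the diff below, and the
I_FREEING/I_WILL_FREE handling is omitted here for brevity): the hash
chain is walked under rcu_read_lock() and each candidate is revalidated
under its i_lock before a reference is taken.

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		spin_lock(&inode->i_lock);
		if (inode_unhashed(inode) || inode->i_sb != sb) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;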

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c | 76 +++++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 43 insertions(+), 33 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7eea591..810386e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -465,7 +465,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -481,7 +481,7 @@ void __remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -776,13 +776,18 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_sb != sb) {
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -792,13 +797,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -807,13 +819,19 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				     struct hlist_head *head,
+				     unsigned long ino, bool locked)
 {
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_ino != ino) {
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -823,13 +841,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -984,9 +1009,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode(sb, head, test, data, false);
 
 	if (inode) {
 		wait_on_inode(inode);
@@ -998,8 +1021,7 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		struct inode *old;
 
 		spin_lock(&inode_hash_lock);
-		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, head, test, data, true);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
@@ -1054,9 +1076,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	inode = find_inode_fast(sb, head, ino, false);
 	if (inode) {
 		wait_on_inode(inode);
 		return inode;
@@ -1067,8 +1087,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		struct inode *old;
 
 		spin_lock(&inode_hash_lock);
-		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
@@ -1110,14 +1129,15 @@ static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 	struct hlist_head *b = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, b, i_hash) {
-		if (inode->i_ino == ino && inode->i_sb == sb) {
-			spin_unlock(&inode_hash_lock);
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, b, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb &&
+		    !inode_unhashed(inode)) {
+			rcu_read_unlock();
 			return 0;
 		}
 	}
-	spin_unlock(&inode_hash_lock);
+	rcu_read_unlock();
 
 	return 1;
 }
@@ -1198,13 +1218,8 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
-	struct inode *inode;
-
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
 
-	return inode;
+	return find_inode(sb, head, test, data, false);
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
@@ -1249,10 +1264,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
 	struct hlist_head *head = inode_hashtable + hash(sb, ino);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
-
+	inode = find_inode_fast(sb, head, ino, false);
 	if (inode)
 		wait_on_inode(inode);
 	return inode;
@@ -1696,10 +1708,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_hash_lock);
 }
 
 static __initdata unsigned long ihash_entries;
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 10/11] list_lru: don't need node lock in list_lru_count_node
  2013-07-31  4:15 [PATCH 00/11] Sync and VFS scalability improvements Dave Chinner
                   ` (8 preceding siblings ...)
  2013-07-31  4:15 ` [PATCH 09/11] fs: Use RCU lookups for inode cache Dave Chinner
@ 2013-07-31  4:15 ` Dave Chinner
  2013-07-31  4:15 ` [PATCH 11/11] list_lru: don't lock during add/del if unnecessary Dave Chinner
  2013-07-31  6:48 ` [PATCH 00/11] Sync and VFS scalability improvements Sedat Dilek
  11 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

The overall count of objects on a node might be accurate at the
moment it is taken, but by the time it is returned to the caller it
is no longer perfectly accurate. Hence we don't really need to hold
the node lock to protect the read of the count. The count is a long,
so it can be read atomically on all platforms, and hence we don't
need the lock for the read at all. The cost of the lock is not
trivial, either, as it shows up in profiles on 16-way lookup
workloads like so:

-  15.44%  [kernel]  [k] __ticket_spin_trylock
      - 46.59% _raw_spin_lock
         + 69.40% list_lru_add
           17.65% list_lru_del
           5.70% list_lru_count_node

IOWs, while the LRU locking scales, it is still costly. The locking
doesn't provide any real advantage for counting, so just kill the
locking in list_lru_count_node().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/list_lru.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7246791..9aadb6c 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -51,15 +51,9 @@ EXPORT_SYMBOL_GPL(list_lru_del);
 unsigned long
 list_lru_count_node(struct list_lru *lru, int nid)
 {
-	unsigned long count = 0;
 	struct list_lru_node *nlru = &lru->node[nid];
-
-	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
-	count += nlru->nr_items;
-	spin_unlock(&nlru->lock);
-
-	return count;
+	return nlru->nr_items;
 }
 EXPORT_SYMBOL_GPL(list_lru_count_node);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 11/11] list_lru: don't lock during add/del if unnecessary
  2013-07-31  4:15 [PATCH 00/11] Sync and VFS scalability improvements Dave Chinner
                   ` (9 preceding siblings ...)
  2013-07-31  4:15 ` [PATCH 10/11] list_lru: don't need node lock in list_lru_count_node Dave Chinner
@ 2013-07-31  4:15 ` Dave Chinner
  2013-07-31  6:48 ` [PATCH 00/11] Sync and VFS scalability improvements Sedat Dilek
  11 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-07-31  4:15 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, akpm, davej, viro, jack, glommer

From: Dave Chinner <dchinner@redhat.com>

We only add an item to the LRU if it is empty, and we only remove it
if it is on the LRU. We can check these conditions without holding
the node lock and avoid grabbing the lock if they show that no list
operation is needed.

We can do this without concern because two concurrent callers to
list_lru_add/del() for the same object already race on the node
lock, and so the result of concurrent operations on the same object
has always been undefined.

We still do the check once we have the lock, so even if we are
racing and decide we need the node lock to do the operation, then
the behaviour under the lock will be unchanged.

This significantly reduces the per-node lock traffic on workloads
where we aren't lazy about adding and removing objects from the LRU
lists.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/list_lru.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 9aadb6c..5207ae2 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -15,6 +15,9 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
 
+	if (!list_empty(item))
+		return false;
+
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
@@ -34,6 +37,9 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
 
+	if (list_empty(item))
+		return false;
+
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Sync and VFS scalability improvements
  2013-07-31  4:15 [PATCH 00/11] Sync and VFS scalability improvements Dave Chinner
                   ` (10 preceding siblings ...)
  2013-07-31  4:15 ` [PATCH 11/11] list_lru: don't lock during add/del if unnecessary Dave Chinner
@ 2013-07-31  6:48 ` Sedat Dilek
  2013-08-01  6:19   ` Dave Chinner
  11 siblings, 1 reply; 32+ messages in thread
From: Sedat Dilek @ 2013-07-31  6:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed, Jul 31, 2013 at 6:15 AM, Dave Chinner <david@fromorbit.com> wrote:
> Hi folks,
>
> This series of patches is against the curent mmotm tree here:
>
> http://git.cmpxchg.org/cgit/linux-mmotm.git/
>

"Current" is not precise enough... above tree has git-tags.
I guess you mean "v3.11-rc1-mmotm-2013-07-18-16-40" ?

As I am not a linux-fs expert, I can try to test, preferably against a
Linux-next release.
Would you like me to test against "vanilla" mmotm, or is Linux-next
suitable in your eyes?

I have seen some typos... are you interested in such feedback?

- Sedat -

[1] http://git.cmpxchg.org/cgit/linux-mmotm.git/tag/?id=v3.11-rc1-mmotm-2013-07-18-16-40

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/11] writeback: plug writeback at a high level
  2013-07-31  4:15 ` [PATCH 01/11] writeback: plug writeback at a high level Dave Chinner
@ 2013-07-31 14:40   ` Jan Kara
  2013-08-01  5:48     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2013-07-31 14:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:40, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Doing writeback on lots of little files causes terrible IOPS storms
> because of the per-mapping writeback plugging we do. This
> essentially causes immediate dispatch of IO for each mapping,
> regardless of the context in which writeback is occurring.
> 
> IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> on XFS results in a huge number of IOPS being issued for data
> writes.  Metadata writes are sorted and plugged at a high level by
> XFS, so aggregate nicely into large IOs. However, data writeback IOs
> are dispatched in individual 4k IOs, even when the blocks of two
> consecutively written files are adjacent.
> 
> Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> metadata CRCs enabled.
> 
> Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> 
> Test:
> 
> $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
> /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> /mnt/scratch/6  -d  /mnt/scratch/7
> 
> Result:
> 
> 		wall	sys	create rate	Physical write IO
> 		time	CPU	(avg files/s)	 IOPS	Bandwidth
> 		-----	-----	------------	------	---------
> unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
> patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
> improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
> 
> If I use zero-length files, this workload runs at about 500 IOPS, so
> plugging drops the data IOs from roughly 25,500/s to 1000/s.
> 3 lines of code, 35% better throughput for 15% less CPU.
> 
> The benefits of plugging at this layer are likely to be higher for
> spinning media as the IO patterns for this workload are going make a
> much bigger difference on high IO latency devices.....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Just one question: Won't this cause a regression when files are, say, 2 MB
large? Then we generate maximum-sized requests for these files with
per-inode plugging anyway, and they will unnecessarily sit in the plug list
until the plug list gets full (that is, after 16 requests). Granted, it
shouldn't be too long, but with fast storage it may be measurable...

Now if we have a maximum-sized request in the plug list, maybe we could just
dispatch it right away, but that's another story.


							Honza
> ---
>  fs/fs-writeback.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 68851ff..1d23d9a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -589,7 +589,9 @@ static long writeback_sb_inodes(struct super_block *sb,
>  	unsigned long start_time = jiffies;
>  	long write_chunk;
>  	long wrote = 0;  /* count both pages and inodes */
> +	struct blk_plug plug;
>  
> +	blk_start_plug(&plug);
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
>  
> @@ -686,6 +688,7 @@ static long writeback_sb_inodes(struct super_block *sb,
>  				break;
>  		}
>  	}
> +	blk_finish_plug(&plug);
>  	return wrote;
>  }
>  
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2013-07-31  4:15 ` [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Dave Chinner
@ 2013-07-31 14:44   ` Jan Kara
  2013-08-01  8:12   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-07-31 14:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:41, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some filesystems don't use the VFS inode hash and fake the fact they
> are hashed so that all the writeback code works correctly. However,
> this means the evict() path still tries to remove the inode from the
> hash, meaning that the inode_hash_lock() needs to be taken
> unnecessarily. Hence under certain workloads the inode_hash_lock can
> be contended even if the inode is never actually hashed.
> 
> To avoid this, add an inode opflag to allow inode_hash_remove() to
> avoid taking the hash lock on inodes that have never actually been
> hashed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_iops.c  | 2 ++
>  include/linux/fs.h | 3 ++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 96dda62..68a8264 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1168,8 +1168,10 @@ xfs_setup_inode(
>  	inode->i_state = I_NEW;
>  
>  	inode_sb_list_add(inode);
> +
>  	/* make the inode look hashed for the writeback code */
>  	hlist_add_fake(&inode->i_hash);
> +	inode->i_opflags |= IOP_NOTHASHED;
>  
>  	inode->i_mode	= ip->i_d.di_mode;
>  	set_nlink(inode, ip->i_d.di_nlink);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b09ddc0..51cf6ed 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -515,6 +515,7 @@ struct posix_acl;
>  #define IOP_FASTPERM	0x0001
>  #define IOP_LOOKUP	0x0002
>  #define IOP_NOFOLLOW	0x0004
> +#define IOP_NOTHASHED	0x0008	/* inode never hashed, avoid unhashing */
>  
>  /*
>   * Keep mostly read-only and often accessed (especially for
> @@ -2371,7 +2372,7 @@ static inline void insert_inode_hash(struct inode *inode)
>  extern void __remove_inode_hash(struct inode *);
>  static inline void remove_inode_hash(struct inode *inode)
>  {
> -	if (!inode_unhashed(inode))
> +	if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))
  It's somewhat easier for me to read if the condition is written as:
	if (!(inode->i_opflags & IOP_NOTHASHED) && !inode_unhashed(inode))
  But whatever... The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

>  		__remove_inode_hash(inode);
>  }

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 03/11] inode: convert inode_sb_list_lock to per-sb
  2013-07-31  4:15 ` [PATCH 03/11] inode: convert inode_sb_list_lock to per-sb Dave Chinner
@ 2013-07-31 14:48   ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-07-31 14:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:42, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The process of reducing contention on per-superblock inode lists
> starts with moving the locking to match the per-superblock inode
> list. This takes the global lock out of the picture and reduces the
> contention problems to within a single filesystem. This doesn't get
> rid of contention as the locks still have global CPU scope, but it
> does isolate operations on different superblocks from each other.
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/block_dev.c                   | 12 ++++++------
>  fs/drop_caches.c                 | 10 ++++++----
>  fs/fs-writeback.c                | 12 ++++++------
>  fs/inode.c                       | 28 +++++++++++++---------------
>  fs/internal.h                    |  1 -
>  fs/notify/inode_mark.c           | 20 ++++++++++----------
>  fs/quota/dquot.c                 | 16 ++++++++--------
>  fs/super.c                       |  3 ++-
>  include/linux/fs.h               |  5 ++++-
>  include/linux/fsnotify_backend.h |  2 +-
>  10 files changed, 56 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c7bda5c..c39c0f3 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1669,7 +1669,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  {
>  	struct inode *inode, *old_inode = NULL;
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&blockdev_superblock->s_inode_list_lock);
>  	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>  		struct address_space *mapping = inode->i_mapping;
>  
> @@ -1681,13 +1681,13 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&blockdev_superblock->s_inode_list_lock);
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
>  		 * removed from s_inodes list while we dropped the
> -		 * inode_sb_list_lock.  We cannot iput the inode now as we can
> +		 * s_inode_list_lock  We cannot iput the inode now as we can
>  		 * be holding the last reference and we cannot iput it under
> -		 * inode_sb_list_lock. So we keep the reference and iput it
> +		 * s_inode_list_lock. So we keep the reference and iput it
>  		 * later.
>  		 */
>  		iput(old_inode);
> @@ -1695,8 +1695,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  
>  		func(I_BDEV(inode), arg);
>  
> -		spin_lock(&inode_sb_list_lock);
> +		spin_lock(&blockdev_superblock->s_inode_list_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&blockdev_superblock->s_inode_list_lock);
>  	iput(old_inode);
>  }
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index 9fd702f..f1be790 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>  {
>  	struct inode *inode, *toput_inode = NULL;
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>  		spin_lock(&inode->i_lock);
>  		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> @@ -27,13 +27,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
> +
>  		invalidate_mapping_pages(inode->i_mapping, 0, -1);
>  		iput(toput_inode);
>  		toput_inode = inode;
> -		spin_lock(&inode_sb_list_lock);
> +
> +		spin_lock(&sb->s_inode_list_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  	iput(toput_inode);
>  }
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1d23d9a..ca66dc8 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1217,7 +1217,7 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  
>  	/*
>  	 * Data integrity sync. Must wait for all pages under writeback,
> @@ -1237,14 +1237,14 @@ static void wait_sb_inodes(struct super_block *sb)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
>  		 * removed from s_inodes list while we dropped the
> -		 * inode_sb_list_lock.  We cannot iput the inode now as we can
> +		 * s_inode_list_lock.  We cannot iput the inode now as we can
>  		 * be holding the last reference and we cannot iput it under
> -		 * inode_sb_list_lock. So we keep the reference and iput it
> +		 * s_inode_list_lock. So we keep the reference and iput it
>  		 * later.
>  		 */
>  		iput(old_inode);
> @@ -1254,9 +1254,9 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&inode_sb_list_lock);
> +		spin_lock(&sb->s_inode_list_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  	iput(old_inode);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index e315c0a..cd10287 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -27,8 +27,8 @@
>   *   inode->i_state, inode->i_hash, __iget()
>   * Inode LRU list locks protect:
>   *   inode->i_sb->s_inode_lru, inode->i_lru
> - * inode_sb_list_lock protects:
> - *   sb->s_inodes, inode->i_sb_list
> + * inode->i_sb->s_inode_list_lock protects:
> + *   inode->i_sb->s_inodes, inode->i_sb_list
>   * bdi->wb.list_lock protects:
>   *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
>   * inode_hash_lock protects:
> @@ -36,7 +36,7 @@
>   *
>   * Lock ordering:
>   *
> - * inode_sb_list_lock
> + * inode->i_sb->s_inode_list_lock
>   *   inode->i_lock
>   *     Inode LRU list locks
>   *
> @@ -44,7 +44,7 @@
>   *   inode->i_lock
>   *
>   * inode_hash_lock
> - *   inode_sb_list_lock
> + *   inode->i_sb->s_inode_list_lock
>   *   inode->i_lock
>   *
>   * iunique_lock
> @@ -56,8 +56,6 @@ static unsigned int i_hash_shift __read_mostly;
>  static struct hlist_head *inode_hashtable __read_mostly;
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>  
> -__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
> -
>  /*
>   * Empty aops. Can be used for the cases where the user does not
>   * define any of the address_space operations.
> @@ -432,18 +430,18 @@ static void inode_lru_list_del(struct inode *inode)
>   */
>  void inode_sb_list_add(struct inode *inode)
>  {
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&inode->i_sb->s_inode_list_lock);
>  	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&inode->i_sb->s_inode_list_lock);
>  }
>  EXPORT_SYMBOL_GPL(inode_sb_list_add);
>  
>  static inline void inode_sb_list_del(struct inode *inode)
>  {
>  	if (!list_empty(&inode->i_sb_list)) {
> -		spin_lock(&inode_sb_list_lock);
> +		spin_lock(&inode->i_sb->s_inode_list_lock);
>  		list_del_init(&inode->i_sb_list);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&inode->i_sb->s_inode_list_lock);
>  	}
>  }
>  
> @@ -600,7 +598,7 @@ void evict_inodes(struct super_block *sb)
>  	struct inode *inode, *next;
>  	LIST_HEAD(dispose);
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>  		if (atomic_read(&inode->i_count))
>  			continue;
> @@ -616,7 +614,7 @@ void evict_inodes(struct super_block *sb)
>  		spin_unlock(&inode->i_lock);
>  		list_add(&inode->i_lru, &dispose);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  
>  	dispose_list(&dispose);
>  }
> @@ -637,7 +635,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  	struct inode *inode, *next;
>  	LIST_HEAD(dispose);
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>  		spin_lock(&inode->i_lock);
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> @@ -660,7 +658,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>  		spin_unlock(&inode->i_lock);
>  		list_add(&inode->i_lru, &dispose);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  
>  	dispose_list(&dispose);
>  
> @@ -901,7 +899,7 @@ struct inode *new_inode(struct super_block *sb)
>  {
>  	struct inode *inode;
>  
> -	spin_lock_prefetch(&inode_sb_list_lock);
> +	spin_lock_prefetch(&sb->s_inode_list_lock);
>  
>  	inode = new_inode_pseudo(sb);
>  	if (inode)
> diff --git a/fs/internal.h b/fs/internal.h
> index 0b0db70..bdc0e86 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -110,7 +110,6 @@ extern int open_check_o_direct(struct file *f);
>  /*
>   * inode.c
>   */
> -extern spinlock_t inode_sb_list_lock;
>  extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
>  			    int nid);
>  extern void inode_add_lru(struct inode *inode);
> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
> index 74825be..fac139a 100644
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -242,17 +242,17 @@ out:
>  
>  /**
>   * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
> - * @list: list of inodes being unmounted (sb->s_inodes)
> + * @sb: superblock being unmounted.
>   *
>   * Called during unmount with no locks held, so needs to be safe against
> - * concurrent modifiers. We temporarily drop inode_sb_list_lock and CAN block.
> + * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
>   */
> -void fsnotify_unmount_inodes(struct list_head *list)
> +void fsnotify_unmount_inodes(struct super_block *sb)
>  {
>  	struct inode *inode, *next_i, *need_iput = NULL;
>  
> -	spin_lock(&inode_sb_list_lock);
> -	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
>  		struct inode *need_iput_tmp;
>  
>  		/*
> @@ -288,7 +288,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
>  		spin_unlock(&inode->i_lock);
>  
>  		/* In case the dropping of a reference would nuke next_i. */
> -		if ((&next_i->i_sb_list != list) &&
> +		if ((&next_i->i_sb_list != &sb->s_inodes) &&
>  		    atomic_read(&next_i->i_count)) {
>  			spin_lock(&next_i->i_lock);
>  			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
> @@ -299,11 +299,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
>  		}
>  
>  		/*
> -		 * We can safely drop inode_sb_list_lock here because we hold
> +		 * We can safely drop s_inode_list_lock here because we hold
>  		 * references on both inode and next_i.  Also no new inodes
>  		 * will be added since the umount has begun.
>  		 */
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
>  
>  		if (need_iput_tmp)
>  			iput(need_iput_tmp);
> @@ -315,7 +315,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
>  
>  		iput(inode);
>  
> -		spin_lock(&inode_sb_list_lock);
> +		spin_lock(&sb->s_inode_list_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  }
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 017eaea..fe93be2 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
>  	int reserved = 0;
>  #endif
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>  		spin_lock(&inode->i_lock);
>  		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> @@ -920,7 +920,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&inode_sb_list_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
>  
>  #ifdef CONFIG_QUOTA_DEBUG
>  		if (unlikely(inode_get_rsv_space(inode) > 0))
> @@ -932,15 +932,15 @@ static void add_dquot_ref(struct super_block *sb, int type)
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
>  		 * removed from s_inodes list while we dropped the
> -		 * inode_sb_list_lock We cannot iput the inode now as we can be
> +		 * s_inode_list_lock. We cannot iput the inode now as we can be
>  		 * holding the last reference and we cannot iput it under
> -		 * inode_sb_list_lock. So we keep the reference and iput it
> +		 * s_inode_list_lock. So we keep the reference and iput it
>  		 * later.
>  		 */
>  		old_inode = inode;
> -		spin_lock(&inode_sb_list_lock);
> +		spin_lock(&sb->s_inode_list_lock);
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  	iput(old_inode);
>  
>  #ifdef CONFIG_QUOTA_DEBUG
> @@ -1021,7 +1021,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
>  	struct inode *inode;
>  	int reserved = 0;
>  
> -	spin_lock(&inode_sb_list_lock);
> +	spin_lock(&sb->s_inode_list_lock);
>  	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>  		/*
>  		 *  We have to scan also I_NEW inodes because they can already
> @@ -1035,7 +1035,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
>  			remove_inode_dquot_ref(inode, type, tofree_head);
>  		}
>  	}
> -	spin_unlock(&inode_sb_list_lock);
> +	spin_unlock(&sb->s_inode_list_lock);
>  #ifdef CONFIG_QUOTA_DEBUG
>  	if (reserved) {
>  		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
> diff --git a/fs/super.c b/fs/super.c
> index 09da975..d4d753e 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -201,6 +201,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  		INIT_HLIST_NODE(&s->s_instances);
>  		INIT_HLIST_BL_HEAD(&s->s_anon);
>  		INIT_LIST_HEAD(&s->s_inodes);
> +		spin_lock_init(&s->s_inode_list_lock);
>  
>  		if (list_lru_init(&s->s_dentry_lru))
>  			goto err_out;
> @@ -441,7 +442,7 @@ void generic_shutdown_super(struct super_block *sb)
>  		sync_filesystem(sb);
>  		sb->s_flags &= ~MS_ACTIVE;
>  
> -		fsnotify_unmount_inodes(&sb->s_inodes);
> +		fsnotify_unmount_inodes(sb);
>  
>  		evict_inodes(sb);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 51cf6ed..923b465 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1263,7 +1263,6 @@ struct super_block {
>  #endif
>  	const struct xattr_handler **s_xattr;
>  
> -	struct list_head	s_inodes;	/* all inodes */
>  	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
>  #ifdef CONFIG_SMP
>  	struct list_head __percpu *s_files;
> @@ -1328,6 +1327,10 @@ struct super_block {
>  	 */
>  	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
>  	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
> +
> +	/* s_inode_list_lock protects s_inodes */
> +	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
> +	struct list_head	s_inodes;	/* all inodes */
>  };
>  
>  extern struct timespec current_fs_time(struct super_block *sb);
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 4b2ee8d..3d58028 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -427,7 +427,7 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un
>  extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
>  extern void fsnotify_get_mark(struct fsnotify_mark *mark);
>  extern void fsnotify_put_mark(struct fsnotify_mark *mark);
> -extern void fsnotify_unmount_inodes(struct list_head *list);
> +extern void fsnotify_unmount_inodes(struct super_block *sb);
>  
>  /* put here because inotify does some weird stuff when destroying watches */
>  extern struct fsnotify_event *fsnotify_create_event(struct inode *to_tell, __u32 mask,
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 05/11] inode: rename i_wb_list to i_io_list
  2013-07-31  4:15 ` [PATCH 05/11] inode: rename i_wb_list to i_io_list Dave Chinner
@ 2013-07-31 14:51   ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-07-31 14:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:44, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There's a small consistency problem between the inode and writeback
> naming. Writeback calls the "for IO" inode queues b_io and
> b_more_io, but the inode calls these the "writeback list" or
> i_wb_list. This makes it hard to add a new "under writeback" list to
> the inode, or to call it an "under IO" list on the bdi, because either
> way we'll have writeback on IO and IO on writeback and it'll just be
> confusing. I'm getting confused just writing this!
> 
> So, rename the inode "for IO" list variable to i_io_list so we can
> add a new "writeback list" in a subsequent patch.
  Thanks! The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/block_dev.c     |  2 +-
>  fs/fs-writeback.c  | 18 +++++++++---------
>  fs/inode.c         |  6 +++---
>  include/linux/fs.h |  2 +-
>  mm/backing-dev.c   |  6 +++---
>  5 files changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c39c0f3..bec0c26 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -68,7 +68,7 @@ static void bdev_inode_switch_bdi(struct inode *inode,
>  	if (inode->i_state & I_DIRTY) {
>  		if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
>  			wakeup_bdi = true;
> -		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
> +		list_move(&inode->i_io_list, &dst->wb.b_dirty);
>  	}
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&old->wb.list_lock);
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 56272ec..b7ac1c2 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -77,7 +77,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
>  
>  static inline struct inode *wb_inode(struct list_head *head)
>  {
> -	return list_entry(head, struct inode, i_wb_list);
> +	return list_entry(head, struct inode, i_io_list);
>  }
>  
>  /*
> @@ -171,7 +171,7 @@ void inode_wb_list_del(struct inode *inode)
>  	struct backing_dev_info *bdi = inode_to_bdi(inode);
>  
>  	spin_lock(&bdi->wb.list_lock);
> -	list_del_init(&inode->i_wb_list);
> +	list_del_init(&inode->i_io_list);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -194,7 +194,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
>  		if (time_before(inode->dirtied_when, tail->dirtied_when))
>  			inode->dirtied_when = jiffies;
>  	}
> -	list_move(&inode->i_wb_list, &wb->b_dirty);
> +	list_move(&inode->i_io_list, &wb->b_dirty);
>  }
>  
>  /*
> @@ -203,7 +203,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
>  static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
>  {
>  	assert_spin_locked(&wb->list_lock);
> -	list_move(&inode->i_wb_list, &wb->b_more_io);
> +	list_move(&inode->i_io_list, &wb->b_more_io);
>  }
>  
>  static void inode_sync_complete(struct inode *inode)
> @@ -254,7 +254,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> -		list_move(&inode->i_wb_list, &tmp);
> +		list_move(&inode->i_io_list, &tmp);
>  		moved++;
>  	}
>  
> @@ -270,7 +270,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
>  		list_for_each_prev_safe(pos, node, &tmp) {
>  			inode = wb_inode(pos);
>  			if (inode->i_sb == sb)
> -				list_move(&inode->i_wb_list, dispatch_queue);
> +				list_move(&inode->i_io_list, dispatch_queue);
>  		}
>  	}
>  out:
> @@ -418,7 +418,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
>  		redirty_tail(inode, wb);
>  	} else {
>  		/* The inode is clean. Remove from writeback lists. */
> -		list_del_init(&inode->i_wb_list);
> +		list_del_init(&inode->i_io_list);
>  	}
>  }
>  
> @@ -528,7 +528,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	 * touch it. See comment above for explanation.
>  	 */
>  	if (!(inode->i_state & I_DIRTY))
> -		list_del_init(&inode->i_wb_list);
> +		list_del_init(&inode->i_io_list);
>  	spin_unlock(&wb->list_lock);
>  	inode_sync_complete(inode);
>  out:
> @@ -1193,7 +1193,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  			spin_unlock(&inode->i_lock);
>  			spin_lock(&bdi->wb.list_lock);
>  			inode->dirtied_when = jiffies;
> -			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +			list_move(&inode->i_io_list, &bdi->wb.b_dirty);
>  			spin_unlock(&bdi->wb.list_lock);
>  
>  			if (wakeup_bdi)
> diff --git a/fs/inode.c b/fs/inode.c
> index cd10287..f8e0f2f 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -30,7 +30,7 @@
>   * inode->i_sb->s_inode_list_lock protects:
>   *   inode->i_sb->s_inodes, inode->i_sb_list
>   * bdi->wb.list_lock protects:
> - *   bdi->wb.b_{dirty,io,more_io}, inode->i_wb_list
> + *   bdi->wb.b_{dirty,io,more_io}, inode->i_io_list
>   * inode_hash_lock protects:
>   *   inode_hashtable, inode->i_hash
>   *
> @@ -364,7 +364,7 @@ void inode_init_once(struct inode *inode)
>  	memset(inode, 0, sizeof(*inode));
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
> -	INIT_LIST_HEAD(&inode->i_wb_list);
> +	INIT_LIST_HEAD(&inode->i_io_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> @@ -530,7 +530,7 @@ static void evict(struct inode *inode)
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(!list_empty(&inode->i_lru));
>  
> -	if (!list_empty(&inode->i_wb_list))
> +	if (!list_empty(&inode->i_io_list))
>  		inode_wb_list_del(inode);
>  
>  	inode_sb_list_del(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 971e8be..b99eb39 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -576,7 +576,7 @@ struct inode {
>  	unsigned long		dirtied_when;	/* jiffies of first dirtying */
>  
>  	struct hlist_node	i_hash;
> -	struct list_head	i_wb_list;	/* backing dev IO list */
> +	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
>  	union {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e04454c..024d4ca 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>  
>  	nr_dirty = nr_io = nr_more_io = 0;
>  	spin_lock(&wb->list_lock);
> -	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
> +	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
>  		nr_dirty++;
> -	list_for_each_entry(inode, &wb->b_io, i_wb_list)
> +	list_for_each_entry(inode, &wb->b_io, i_io_list)
>  		nr_io++;
> -	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
> +	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
>  		nr_more_io++;
>  	spin_unlock(&wb->list_lock);
>  
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] bdi: add a new writeback list for sync
  2013-07-31  4:15 ` [PATCH 06/11] bdi: add a new writeback list for sync Dave Chinner
@ 2013-07-31 15:11   ` Jan Kara
  2013-08-01  5:59     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2013-07-31 15:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:45, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() currently does a walk of all inodes in the filesystem
> to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the related locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that a rare sync(2) call will have some work to do
> skipping clean inodes. However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
  The patch looks OK, just two minor comments below.

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/fs-writeback.c           | 110 +++++++++++++++++++++++++++++++++++++-------
>  fs/inode.c                  |   1 +
>  include/linux/backing-dev.h |   3 ++
>  include/linux/fs.h          |   1 +
>  mm/backing-dev.c            |   1 +
>  mm/page-writeback.c         |  14 ++++++
>  6 files changed, 114 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b7ac1c2..638f122 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -176,6 +176,34 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * mark an inode as under writeback on the given bdi
> + */
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		if (list_empty(&inode->i_wb_list))
> +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
> + * clear an inode as under writeback on the given bdi
> + */
> +static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +				      struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (!list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		list_del_init(&inode->i_wb_list);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
  Umm, are these list_empty() checks without lock really safe? Looking into
the code in more detail, it seems that mapping->tree_lock saves us from
races between insert & removal but it definitely deserves a comment (or maybe
even an assertion) that the function requires it.
bdi_clear_inode_writeback() is safe only because it is called only when the
inode is practically dead. Again, I think it deserves a comment...
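Something like this is what I have in mind - just a sketch, and it assumes
the caller really does always hold mapping->tree_lock the way the current
caller (test_set_page_writeback) does:

	/*
	 * Caller must hold mapping->tree_lock.  That is what makes the
	 * unlocked list_empty() check safe against a concurrent add or
	 * remove of the same inode.
	 */
	void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
				      struct inode *inode)
	{
		lockdep_assert_held(&inode->i_mapping->tree_lock);
		WARN_ON_ONCE(bdi != inode_to_bdi(inode));

		if (list_empty(&inode->i_wb_list)) {
			spin_lock(&bdi->wb.list_lock);
			if (list_empty(&inode->i_wb_list))
				list_add_tail(&inode->i_wb_list,
					      &bdi->wb.b_writeback);
			spin_unlock(&bdi->wb.list_lock);
		}
	}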

> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -330,13 +358,18 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>  }
>  
>  /*
> @@ -1218,7 +1251,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	struct inode *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1226,28 +1261,59 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages put under IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly. Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't have to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(&bdi->wb.list_lock);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
> @@ -1264,9 +1330,21 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +                /*
> +                 * the inode has been written back now, so check whether we
> +                 * still have pages under IO and move it back to the primary
> +                 * list if necessary
> +                 */
> +                spin_lock_irq(&mapping->tree_lock);
> +                spin_lock(&bdi->wb.list_lock);
> +                if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +                        WARN_ON(list_empty(&inode->i_wb_list));
> +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> +                } else
> +			list_del_init(&inode->i_wb_list);
> +                spin_unlock_irq(&mapping->tree_lock);
  Whitespace is damaged in the above hunk...

>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
> +	spin_unlock(&bdi->wb.list_lock);
>  	iput(old_inode);
>  	mutex_unlock(&sb->s_sync_lock);
>  }
> diff --git a/fs/inode.c b/fs/inode.c
> index f8e0f2f..de62d2b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -365,6 +365,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..45edd8a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -58,6 +58,7 @@ struct bdi_writeback {
>  	struct list_head b_dirty;	/* dirty inodes */
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
> +	struct list_head b_writeback;	/* inodes under writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -125,6 +126,8 @@ void bdi_writeback_workfn(struct work_struct *work);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
>  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> +			      struct inode *inode);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b99eb39..5bb84b1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -579,6 +579,7 @@ struct inode {
>  	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 024d4ca..b0cc55a 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -419,6 +419,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_writeback);
>  	spin_lock_init(&wb->list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 3f0c895..038c893 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2262,11 +2262,25 @@ int test_set_page_writeback(struct page *page)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> +
> +			/*
> +			 * we can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are removed lazily, so there is
> +			 * no equivalent in test_clear_page_writeback().
> +			 */
> +			if (!on_wblist && mapping->host)
> +				bdi_mark_inode_writeback(bdi, mapping->host);
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 04/11] sync: serialise per-superblock sync operations
  2013-07-31  4:15 ` [PATCH 04/11] sync: serialise per-superblock sync operations Dave Chinner
@ 2013-07-31 15:12   ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-07-31 15:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:43, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When competing sync(2) calls walk the same filesystem, they need to
> walk the list of inodes on the superblock to find all the inodes
> that we need to wait for IO completion on. However, when multiple
> wait_sb_inodes() calls do this at the same time, they contend on the
> the inode_sb_list_lock and the contention causes system wide
> slowdowns. In effect, concurrent sync(2) calls can take longer and
> burn more CPU than if they were serialised.
> 
> Stop the worst of the contention by adding a per-sb mutex to wrap
> around wait_sb_inodes() so that we only execute one sync(2) IO
> completion walk per superblock at a time and hence avoid
> contention being triggered by concurrent sync(2) calls.
  The patch looks OK. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/fs-writeback.c  | 11 +++++++++++
>  fs/super.c         |  1 +
>  include/linux/fs.h |  2 ++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ca66dc8..56272ec 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1207,6 +1207,15 @@ out_unlock_inode:
>  }
>  EXPORT_SYMBOL(__mark_inode_dirty);
>  
> +/*
> + * The @s_sync_lock is used to serialise concurrent sync operations
> + * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
> + * Concurrent callers will block on the s_sync_lock rather than doing contending
> + * walks. The queueing maintains sync(2) required behaviour as all the IO that
> + * has been issued up to the time this function is entered is guaranteed to be
> + * completed by the time we have gained the lock and waited for all IO that is
> + * in progress regardless of the order callers are granted the lock.
> + */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
>  	struct inode *inode, *old_inode = NULL;
> @@ -1217,6 +1226,7 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> +	mutex_lock(&sb->s_sync_lock);
>  	spin_lock(&sb->s_inode_list_lock);
>  
>  	/*
> @@ -1258,6 +1268,7 @@ static void wait_sb_inodes(struct super_block *sb)
>  	}
>  	spin_unlock(&sb->s_inode_list_lock);
>  	iput(old_inode);
> +	mutex_unlock(&sb->s_sync_lock);
>  }
>  
>  /**
> diff --git a/fs/super.c b/fs/super.c
> index d4d753e..7f98fd6 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -200,6 +200,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  		s->s_bdi = &default_backing_dev_info;
>  		INIT_HLIST_NODE(&s->s_instances);
>  		INIT_HLIST_BL_HEAD(&s->s_anon);
> +		mutex_init(&s->s_sync_lock);
>  		INIT_LIST_HEAD(&s->s_inodes);
>  		spin_lock_init(&s->s_inode_list_lock);
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 923b465..971e8be 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1321,6 +1321,8 @@ struct super_block {
>  	/* Being remounted read-only */
>  	int s_readonly_remount;
>  
> +	struct mutex		s_sync_lock;	/* sync serialisation lock */
> +
>  	/*
>  	 * Keep the lru lists last in the structure so they always sit on their
>  	 * own individual cachelines.
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] writeback: periodically trim the writeback list
  2013-07-31  4:15 ` [PATCH 07/11] writeback: periodically trim the writeback list Dave Chinner
@ 2013-07-31 15:15   ` Jan Kara
  2013-08-01  6:16     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2013-07-31 15:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed 31-07-13 14:15:46, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inodes are removed lazily from the bdi writeback list, so in the
> absence of sync(2) work inodes will build up on the bdi writeback
> list even though they are no longer under IO. Use the periodic
> kupdate work check to remove inodes no longer under IO from the
> writeback list.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/fs-writeback.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 638f122..7c9bbf0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -962,6 +962,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
>  	return 0;
>  }
>  
> +/*
> + * clean out writeback list for all inodes that don't have IO in progress
> + */
> +static void wb_trim_writeback_list(struct bdi_writeback *wb)
> +{
> +	struct inode *inode;
> +	struct inode *tmp;
> +
> +	spin_lock(&wb->list_lock);
> +	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
> +		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
> +			list_del_init(&inode->i_wb_list);
> +	}
   Oo, and here you manipulate i_wb_list without mapping->tree_lock so that
can race with the list_empty() check in bdi_mark_inode_writeback().

								Honza

> +	spin_unlock(&wb->list_lock);
> +
> +}
> +
>  static long wb_check_old_data_flush(struct bdi_writeback *wb)
>  {
>  	unsigned long expired;
> @@ -978,6 +995,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
>  	if (time_before(jiffies, expired))
>  		return 0;
>  
> +	wb_trim_writeback_list(wb);
> +
>  	wb->last_old_flush = jiffies;
>  	nr_pages = get_nr_dirty_pages();
>  
> -- 
> 1.8.3.2
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/11] writeback: plug writeback at a high level
  2013-07-31 14:40   ` Jan Kara
@ 2013-08-01  5:48     ` Dave Chinner
  2013-08-01  8:34       ` Jan Kara
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-08-01  5:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, glommer

On Wed, Jul 31, 2013 at 04:40:19PM +0200, Jan Kara wrote:
> On Wed 31-07-13 14:15:40, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Doing writeback on lots of little files causes terrible IOPS storms
> > because of the per-mapping writeback plugging we do. This
> > essentially causes immediate dispatch of IO for each mapping,
> > regardless of the context in which writeback is occurring.
> > 
> > IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> > on XFS results in a huge number of IOPS being issued for data
> > writes.  Metadata writes are sorted and plugged at a high level by
> > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > are dispatched in individual 4k IOs, even when the blocks of two
> > consecutively written files are adjacent.
> > 
> > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > metadata CRCs enabled.
> > 
> > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> > 
> > Test:
> > 
> > $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
> > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> > /mnt/scratch/6  -d  /mnt/scratch/7
> > 
> > Result:
> > 
> > 		wall	sys	create rate	Physical write IO
> > 		time	CPU	(avg files/s)	 IOPS	Bandwidth
> > 		-----	-----	------------	------	---------
> > unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
> > patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
> > improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
> > 
> > If I use zero length files, this workload runs at about 500 IOPS, so
> > plugging drops the data IOs from roughly 25,500/s to 1000/s.
> > 3 lines of code, 35% better throughput for 15% less CPU.
> > 
> > The benefits of plugging at this layer are likely to be higher for
> > spinning media as the IO patterns for this workload are going to make a
> > much bigger difference on high IO latency devices.....
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>   Just one question: Won't this cause a regression when files are say 2 MB
> large? Then we generate maximum sized requests for these files with
> per-inode plugging anyway and they will unnecessarily sit in the plug list
> until the plug list gets full (that is after 16 requests). Granted it
> shouldn't be too long but with fast storage it may be measurable...

Latency of IO dispatch only matters for the initial IOs being
queued. This, however, is not a latency sensitive IO path -
writeback is our bulk throughput IO engine, and in those cases low
latency dispatch is precisely what we don't want. We want to
optimise IO patterns for maximum *bandwidth*, not minimal latency.

The problem is that fast storage with immediate dispatch and deep
queues can keep ahead of IO dispatch, preventing throughput
optimisations like IO aggregation from being made because there is
never any IO queued to aggregate. That's why I'm seeing a couple of
orders of magnitude higher IOPS than I should. Sure, the hardware
can do that, but it's not the *most efficient* method of dispatching
background IO.

Allowing IOs a chance to aggregate in the scheduler for a short
while before dispatch allows existing bulk throughput optimisations
to be made to the IO stream, and as we can see, where a delayed
allocation filesystem is optimised for adjacent allocation
across sequentially written inodes, such opportunities for IO
aggregation make a big difference to performance.
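For reference, the shape of the change being discussed is just a plug
wrapped around the existing high-level writeback loop - a sketch only,
not the actual hunk (plugged_writeback_pass() is a made-up wrapper name;
queue_io(), writeback_sb_inodes() and __writeback_inodes_wb() are the
existing fs/fs-writeback.c helpers, and blk_plug needs <linux/blkdev.h>):

	static long plugged_writeback_pass(struct bdi_writeback *wb,
					   struct wb_writeback_work *work)
	{
		struct blk_plug plug;
		long progress;

		/*
		 * Plug across the whole pass so the small IOs issued for
		 * many different inodes get a chance to merge in the
		 * scheduler before dispatch, instead of being dispatched
		 * immediately per mapping.
		 */
		blk_start_plug(&plug);

		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, work);
		if (work->sb)
			progress = writeback_sb_inodes(work->sb, wb, work);
		else
			progress = __writeback_inodes_wb(wb, work);
		spin_unlock(&wb->list_lock);

		blk_finish_plug(&plug);
		return progress;
	}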

So, to test your 2MB IO case, I ran a fsmark test using 40,000
2MB files instead of 10 million 4k files.

		wall time	IOPS	BW
mmotm		170s		1000	350MB/s
patched		167s		1000	350MB/s

The IO profiles are near enough to be identical, and the wall time
is basically the same.


I just don't see any particular concern about larger IOs and initial
dispatch latency here from either a theoretical or an observed POV.
Indeed, I haven't seen a performance degradation as a result of this
patch in any of the testing I've done since I first posted it...

> Now if we have maximum sized request in the plug list, maybe we could just
> dispatch it right away but that's another story.

That, in itself is potentially an issue, too, as it prevents seek
minimisation optimisations from being made when we batch up multiple
IOs on the plug list...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 06/11] bdi: add a new writeback list for sync
  2013-07-31 15:11   ` Jan Kara
@ 2013-08-01  5:59     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-08-01  5:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, glommer

On Wed, Jul 31, 2013 at 05:11:14PM +0200, Jan Kara wrote:
> On Wed 31-07-13 14:15:45, Dave Chinner wrote:
> >  /*
> > + * mark an inode as under writeback on the given bdi
> > + */
> > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		if (list_empty(&inode->i_wb_list))
> > +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * clear an inode as under writeback on the given bdi
> > + */
> > +static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > +				      struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (!list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		list_del_init(&inode->i_wb_list);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
>   Umm, are these list_empty() checks without lock really safe?

I think they are.....

>   Looking into
> the code in more detail, it seems that mapping->tree_lock saves us from
> races between insert & removal but it definitely deserves a comment (or maybe
> even an assertion) that the function requires it.
> bdi_clear_inode_writeback() is safe only because it is called only when the
> inode is practically dead. Again, I think it deserves a comment...

Ok.

> > @@ -1264,9 +1330,21 @@ static void wait_sb_inodes(struct super_block *sb)
> >  
> >  		cond_resched();
> >  
> > -		spin_lock(&sb->s_inode_list_lock);
> > +                /*
> > +                 * the inode has been written back now, so check whether we
> > +                 * still have pages under IO and move it back to the primary
> > +                 * list if necessary
> > +                 */
> > +                spin_lock_irq(&mapping->tree_lock);
> > +                spin_lock(&bdi->wb.list_lock);
> > +                if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +                        WARN_ON(list_empty(&inode->i_wb_list));
> > +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +                } else
> > +			list_del_init(&inode->i_wb_list);
> > +                spin_unlock_irq(&mapping->tree_lock);
>   Whitespace is damaged in the above hunk...

Bizarre. I'll fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] writeback: periodically trim the writeback list
  2013-07-31 15:15   ` Jan Kara
@ 2013-08-01  6:16     ` Dave Chinner
  2013-08-01  9:03       ` Jan Kara
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-08-01  6:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, glommer

On Wed, Jul 31, 2013 at 05:15:42PM +0200, Jan Kara wrote:
> On Wed 31-07-13 14:15:46, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inodes are removed lazily from the bdi writeback list, so in the
> > absence of sync(2) work inodes will build up on the bdi writeback
> > list even though they are no longer under IO. Use the periodic
> > kupdate work check to remove inodes no longer under IO from the
> > writeback list.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/fs-writeback.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 638f122..7c9bbf0 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -962,6 +962,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * clean out writeback list for all inodes that don't have IO in progress
> > + */
> > +static void wb_trim_writeback_list(struct bdi_writeback *wb)
> > +{
> > +	struct inode *inode;
> > +	struct inode *tmp;
> > +
> > +	spin_lock(&wb->list_lock);
> > +	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
> > +		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
> > +			list_del_init(&inode->i_wb_list);
> > +	}
>    Oo, and here you manipulate i_wb_list without mapping->tree_lock so that
> can race with the list_empty() check in bdi_mark_inode_writeback().

I'm not sure it does - we only remove is the PAGECACHE_TAG_WRITEBACK
is not set, and only insert after we set the tag. Hence, if we are
walking the &wb->b_writeback list here and PAGECACHE_TAG_WRITEBACK
is set because there is an insert in progress then we can't race
with the insert because we won't be trying to delete it from the
list...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Sync and VFS scalability improvements
  2013-07-31  6:48 ` [PATCH 00/11] Sync and VFS scalability improvements Sedat Dilek
@ 2013-08-01  6:19   ` Dave Chinner
  2013-08-01  6:31     ` Sedat Dilek
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-08-01  6:19 UTC (permalink / raw)
  To: Sedat Dilek; +Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed, Jul 31, 2013 at 08:48:40AM +0200, Sedat Dilek wrote:
> On Wed, Jul 31, 2013 at 6:15 AM, Dave Chinner <david@fromorbit.com> wrote:
> > Hi folks,
> >
> > This series of patches is against the current mmotm tree here:
> >
> > http://git.cmpxchg.org/cgit/linux-mmotm.git/
> >
> 
> "Current" is not precise enough... above tree has git-tags.

It's most certainly precise.  "current" always means the master
branch at the time the series was posted. I.e. same definition as
"top of tree"....

> As I am not a linux-fs expert I can try to test, preferably against a
> Linux-next release.
> Would you like to test against "vanilla" mmotm or is Linux-next
> suitable in your eyes?

mmotm is based on the linux-next tree at the time the tree was made.

> I have seen some typos... are you interested in such feedback?

feel free to point them out.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 00/11] Sync and VFS scalability improvements
  2013-08-01  6:19   ` Dave Chinner
@ 2013-08-01  6:31     ` Sedat Dilek
  0 siblings, 0 replies; 32+ messages in thread
From: Sedat Dilek @ 2013-08-01  6:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Thu, Aug 1, 2013 at 8:19 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jul 31, 2013 at 08:48:40AM +0200, Sedat Dilek wrote:
>> On Wed, Jul 31, 2013 at 6:15 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > Hi folks,
>> >
> > > This series of patches is against the current mmotm tree here:
>> >
>> > http://git.cmpxchg.org/cgit/linux-mmotm.git/
>> >
>>
>> "Current" is not precise enough... above tree has git-tags.
>
> It's most certainly precise.  "current" always means the master
> branch at the time the series was posted. I.e. same definition as
> "top of tree"....
>

Today, I see in my inbox "mmotm 2013-07-31-16-52 uploaded" which means
"master" will change (or has changed).
Think of all the people who are on vacation and will read your email later.
People waste their time if they don't know what a patchset is based on.

Pointing to the git-tag does not hurt.

>> As I am not a linux-fs expert I can try to test, preferably against a
>> Linux-next release.
>> Would you like to test against "vanilla" mmotm or is Linux-next
>> suitable in your eyes?
>
> mmotm is based on the linux-next tree at the time the tree was made.
>

No, linux-next includes patches from Andrew's mmotm marked in his series file.
So, your patches can fit or not, that's why I asked.

[1] http://ozlabs.org/~akpm/mmotm/series
[2] http://ozlabs.org/~akpm/mmotm/mmotm-readme.txt

>> I have seen some typos... are you interested in such feedback?
>
> feel free to point them out.
>

Cannot promise anything. Dealing with some embedded stuff.

- Sedat -

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2013-07-31  4:15 ` [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Dave Chinner
  2013-07-31 14:44   ` Jan Kara
@ 2013-08-01  8:12   ` Christoph Hellwig
  2013-08-02  1:11     ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2013-08-01  8:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Wed, Jul 31, 2013 at 02:15:41PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some filesystems don't use the VFS inode hash and fake the fact they
> are hashed so that all the writeback code works correctly. However,
> this means the evict() path still tries to remove the inode from the
> hash, meaning that the inode_hash_lock() needs to be taken
> unnecessarily. Hence under certain workloads the inode_hash_lock can
> be contended even if the inode is never actually hashed.
> 
> To avoid this, add an inode opflag to allow inode_hash_remove() to
> avoid taking the hash lock on inodes have never actually been
> hashed.

Good idea, but I don't like how it's implemented.

First a formality: i_opflags really is for flags showing that inode
operations exist, not for additional bits.  Just use i_flags for it.

Second, this is a hack hacking around a hack.  We just mark the inode
hashed so that writeback doesn't ignore it, and now we need to work
around the fact that we don't really want to remove that fake-hashed
inode from the hashlist.

As the most simple version I'd suggest to just add an I_NEEDS_WRITEBACK
flag which gets set by __insert_inode_hash, and all the current users
of hlist_add_fake on i_hash, as well as the block devices that currently
have another special case in the writeback code.
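Roughly like this - a sketch only; the flag is hypothetical and whether it
lives in i_state or i_flags is a detail, while the hash insert body is the
existing one from fs/inode.c:

	void __insert_inode_hash(struct inode *inode, unsigned long hashval)
	{
		struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);

		spin_lock(&inode_hash_lock);
		spin_lock(&inode->i_lock);
		inode->i_state |= I_NEEDS_WRITEBACK;	/* hypothetical flag */
		hlist_add_head(&inode->i_hash, b);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
	}

__mark_inode_dirty() would then test I_NEEDS_WRITEBACK instead of
inode_unhashed(), and the hlist_add_fake() callers and block devices would
set the flag themselves instead of faking the hash.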

Thinking about it I'm not even sure I_NEEDS_WRITEBACK is needed; now
that we have I_NEW as a marker, a negative test for it should be
enough, but doing this might require some further research.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 08/11] inode: convert per-sb inode list to a list_lru
  2013-07-31  4:15 ` [PATCH 08/11] inode: convert per-sb inode list to a list_lru Dave Chinner
@ 2013-08-01  8:19   ` Christoph Hellwig
  2013-08-02  1:06     ` Dave Chinner
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2013-08-01  8:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

At this point the list_lru name really becomes confusing.  Given that
it's not really LRU specific maybe just rename it to pernode_list
or similar?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/11] writeback: plug writeback at a high level
  2013-08-01  5:48     ` Dave Chinner
@ 2013-08-01  8:34       ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-08-01  8:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, linux-kernel, akpm, davej, viro, glommer

On Thu 01-08-13 15:48:05, Dave Chinner wrote:
> On Wed, Jul 31, 2013 at 04:40:19PM +0200, Jan Kara wrote:
> > On Wed 31-07-13 14:15:40, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Doing writeback on lots of little files causes terrible IOPS storms
> > > because of the per-mapping writeback plugging we do. This
> > > essentially causes immediate dispatch of IO for each mapping,
> > > regardless of the context in which writeback is occurring.
> > > 
> > > IOWs, running a concurrent write-lots-of-small 4k files using fsmark
> > > on XFS results in a huge number of IOPS being issued for data
> > > writes.  Metadata writes are sorted and plugged at a high level by
> > > XFS, so aggregate nicely into large IOs. However, data writeback IOs
> > > are dispatched in individual 4k IOs, even when the blocks of two
> > > consecutively written files are adjacent.
> > > 
> > > Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
> > > metadata CRCs enabled.
> > > 
> > > Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
> > > 
> > > Test:
> > > 
> > > $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
> > > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
> > > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
> > > /mnt/scratch/6  -d  /mnt/scratch/7
> > > 
> > > Result:
> > > 
> > > 		wall	sys	create rate	Physical write IO
> > > 		time	CPU	(avg files/s)	 IOPS	Bandwidth
> > > 		-----	-----	------------	------	---------
> > > unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
> > > patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
> > > improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
> > > 
> > > If I use zero length files, this workload runs at about 500 IOPS, so
> > > plugging drops the data IOs from roughly 25,500/s to 1000/s.
> > > 3 lines of code, 35% better throughput for 15% less CPU.
> > > 
> > > The benefits of plugging at this layer are likely to be higher for
> > > spinning media as the IO patterns for this workload are going to make a
> > > much bigger difference on high IO latency devices.....
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >   Just one question: Won't this cause a regression when files are say 2 MB
> > large? Then we generate maximum sized requests for these files with
> > per-inode plugging anyway and they will unnecessarily sit in the plug list
> > until the plug list gets full (that is after 16 requests). Granted it
> > shouldn't be too long but with fast storage it may be measurable...
> 
> Latency of IO dispatch only matters for the initial IOs being
> queued. This, however, is not a latency sensitive IO path -
> writeback is our bulk throughput IO engine, and in those cases low
> latency dispatch is precisely what we don't want. We want to
> optimise IO patterns for maximum *bandwidth*, not minimal latency.
> 
> The problem is that fast storage with immediate dispatch and dep
> queues can keep ahead of IO dispatch, preventing throughput
> optimisations like IO aggregation from being made because there is
> never any IO queued to aggregate. That's why I'm seeing a couple of
> orders of magnitude higher IOPS than I should. Sure, the hardware
> can do that, but it's not the *most efficient* method of dispatching
> background IO.
> 
> Allowing IOs a chance to aggregate in the scheduler for a short
> while because dispatch allows existing bulk throughput optimisations
> to be made to the IO stream, and as we can see, where a delayed
> allocation filesystem is optimised for adjacent allocation
> across sequentially written inodes such oppportunites for IO
> aggregation make a big difference to performance.
  Yeah, I understand the reasoning but sometimes experimental results
differ from theory :)

> So, to test your 2MB IO case, I ran a fsmark test using 40,000
> 2MB files instead of 10 million 4k files.
> 
> 		wall time	IOPS	BW
> mmotm		170s		1000	350MB/s
> patched		167s		1000	350MB/s
> 
> The IO profiles are near enough to be identical, and the wall time
> is basically the same.
> 
> 
> I just don't see any particular concern about larger IOs and initial
> dispatch latency here from either a theoretical or an observed POV.
> Indeed, I haven't seen a performance degradation as a result of this
> patch in any of the testing I've done since I first posted it...
  Thanks for doing the test! So I'm fine with this patch. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

> > Now if we have maximum sized request in the plug list, maybe we could just
> > dispatch it right away but that's another story.
> 
> That, in itself is potentially an issue, too, as it prevents seek
> minimisation optimisations from being made when we batch up multiple
> IOs on the plug list...
  Good point.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/11] writeback: periodically trim the writeback list
  2013-08-01  6:16     ` Dave Chinner
@ 2013-08-01  9:03       ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2013-08-01  9:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel, linux-kernel, akpm, davej, viro, glommer

On Thu 01-08-13 16:16:18, Dave Chinner wrote:
> On Wed, Jul 31, 2013 at 05:15:42PM +0200, Jan Kara wrote:
> > On Wed 31-07-13 14:15:46, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Inodes are removed lazily from the bdi writeback list, so in the
> > > absence of sync(2) work inodes will build up on the bdi writeback
> > > list even though they are no longer under IO. Use the periodic
> > > kupdate work check to remove inodes no longer under IO from the
> > > writeback list.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/fs-writeback.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 638f122..7c9bbf0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -962,6 +962,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * clean out writeback list for all inodes that don't have IO in progress
> > > + */
> > > +static void wb_trim_writeback_list(struct bdi_writeback *wb)
> > > +{
> > > +	struct inode *inode;
> > > +	struct inode *tmp;
> > > +
> > > +	spin_lock(&wb->list_lock);
> > > +	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
> > > +		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
> > > +			list_del_init(&inode->i_wb_list);
> > > +	}
> >    Oo, and here you manipulate i_wb_list without mapping->tree_lock so that
> > can race with the list_empty() check in bdi_mark_inode_writeback().
> 
> I'm not sure it does - we only remove is the PAGECACHE_TAG_WRITEBACK
> is not set, and only insert after we set the tag. Hence, if we are
> walking the &wb->b_writeback list here and PAGECACHE_TAG_WRITEBACK
> is set because there is an insert in progress then we can't race
> with the insert because we won't be trying to delete it from the
> list...
  The following race seems to be possible:
        CPU1                                            CPU2
test_set_page_writeback()
  spin_lock_irqsave(&mapping->tree_lock, flags);
  ret = TestSetPageWriteback(page);
  if (!ret) {
    /* == false */
    on_wblist = mapping_tagged(mapping,
                               PAGECACHE_TAG_WRITEBACK);
                                                        wb_trim_writeback_list()
                                                          spin_lock(&wb->list_lock);
                                                          ...
                                                          if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)
    radix_tree_tag_set(&mapping->page_tree, ...)
    ...
    if (!on_wblist && mapping->host)
      bdi_mark_inode_writeback(bdi, mapping->host);
        if (list_empty(&inode->i_wb_list)) /* false */
                                                            list_del_init(&inode->i_wb_list);

And we end up with inode with PAGECACHE_TAG_WRITEBACK set but not on
i_wb_list. 

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 08/11] inode: convert per-sb inode list to a list_lru
  2013-08-01  8:19   ` Christoph Hellwig
@ 2013-08-02  1:06     ` Dave Chinner
  0 siblings, 0 replies; 32+ messages in thread
From: Dave Chinner @ 2013-08-02  1:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Thu, Aug 01, 2013 at 01:19:50AM -0700, Christoph Hellwig wrote:
> At this point the list_lru name really becomes confusing.  Given that
> it's not really LRU specific maybe just rename it to pernode_list
> or similar?

Well, such things were suggested when the list_lru was first
proposed. Indeed, it even lived in lib/ because I saw it as widely
useful. But akpm didn't like it because of the "structure with
internal locking" architecture and thought it was too purpose
specific for generic use and so it got moved out of lib/ into mm/
and here we are.

I don't care for how it's named, located or used - I just want to be
able to use the infrastructure it provides. As such renaming and
relocating it is not something I'm about to do in this specific
patchset....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2013-08-01  8:12   ` Christoph Hellwig
@ 2013-08-02  1:11     ` Dave Chinner
  2013-08-02 14:32       ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2013-08-02  1:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, linux-kernel, akpm, davej, viro, jack, glommer

On Thu, Aug 01, 2013 at 01:12:35AM -0700, Christoph Hellwig wrote:
> On Wed, Jul 31, 2013 at 02:15:41PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Some filesystems don't use the VFS inode hash and fake the fact they
> > are hashed so that all the writeback code works correctly. However,
> > this means the evict() path still tries to remove the inode from the
> > hash, meaning that the inode_hash_lock() needs to be taken
> > unnecessarily. Hence under certain workloads the inode_hash_lock can
> > be contended even if the inode is never actually hashed.
> > 
> > To avoid this, add an inode opflag to allow inode_hash_remove() to
> > avoid taking the hash lock on inodes have never actually been
> > hashed.
> 
> Good idea, but I don't like how it's implemented.
> 
> First a formality: i_opflags really is for flags showing that inode
> operations exist, not for additional bits.  Just use i_flags for it.
> 
> Second, this is a hack hacking around a hack.  We just mark the inode
> hashed so that writeback doesn't ignore it, and now we need to work
> around the fact that we don't really want to remove that fake-hashed
> inode from the hashlist.
> 
> As the most simple version I'd suggest to just add an I_NEEDS_WRITEBACK
> flag which gets set by __insert_inode_hash, and all the current users
> of hlist_add_fake on i_hash, as well as the block devices that currently
> have another special case in the writeback code.

But that doesn't fix the problem of taking the hash lock in evict()
when it is not necessary. If everything sets I_NEEDS_WRITEBACK, and
we still fake hashing the inode, how do we know that we don't
need to unhash it in evict()?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2013-08-02  1:11     ` Dave Chinner
@ 2013-08-02 14:32       ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2013-08-02 14:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, akpm, davej,
	viro, jack, glommer

On Fri, Aug 02, 2013 at 11:11:33AM +1000, Dave Chinner wrote:
> But that doesn't fix the problem of taking the hash lock in evict()
> when it is not necessary. If everything sets I_NEEDS_WRITEBACK, and
> we still fake hashing the inode, how do we know that we don't
> need to unhash it in evict()?

The idea is to keep evict as-is but make the flag that an inode needs
writeback entirely separate from the is hashed check.

I've attached a draft patch implementing my idea below.  It does survive
booting, and data copied to ext4 or xfs filesystems is persistent with
it, so it works for some definition of working.

Besides splitting and proper documentation, we probably also want to get
rid of the inode_unhashed check in generic_drop_inode and replace it
with an is_bad_inode, just special casing unhashed inodes in coda to
keep the behaviour for it the same.


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 68851ff..e36cad9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1147,22 +1147,12 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		inode->i_state |= flags;
 
 		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
+		 * If an inode isn't fully constructed yet or being torn
+		 * down let it on the list it is on currently.  For inodes
+		 * undergoing sync we will move it to the right list once
+		 * syncing it has been finished.
 		 */
-		if (inode->i_state & I_SYNC)
-			goto out_unlock_inode;
-
-		/*
-		 * Only add valid (hashed) inodes to the superblock's
-		 * dirty list.  Add blockdev inodes as well.
-		 */
-		if (!S_ISBLK(inode->i_mode)) {
-			if (inode_unhashed(inode))
-				goto out_unlock_inode;
-		}
-		if (inode->i_state & I_FREEING)
+		if (inode->i_state & (I_NEW | I_SYNC | I_FREEING))
 			goto out_unlock_inode;
 
 		/*
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 96dda62..4bd3be0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1168,8 +1168,6 @@ xfs_setup_inode(
 	inode->i_state = I_NEW;
 
 	inode_sb_list_add(inode);
-	/* make the inode look hashed for the writeback code */
-	hlist_add_fake(&inode->i_hash);
 
 	inode->i_mode	= ip->i_d.di_mode;
 	set_nlink(inode, ip->i_d.di_nlink);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 19922eb..c5fdf48 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1026,9 +1026,7 @@ STATIC int
 xfs_fs_drop_inode(
 	struct inode		*inode)
 {
-	struct xfs_inode	*ip = XFS_I(inode);
-
-	return generic_drop_inode(inode) || (ip->i_flags & XFS_IDONTCACHE);
+	return !inode->i_nlink || (XFS_I(inode)->i_flags & XFS_IDONTCACHE);
 }
 
 STATIC void

^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-08-02 14:32 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-31  4:15 [PATCH 00/11] Sync and VFS scalability improvements Dave Chinner
2013-07-31  4:15 ` [PATCH 01/11] writeback: plug writeback at a high level Dave Chinner
2013-07-31 14:40   ` Jan Kara
2013-08-01  5:48     ` Dave Chinner
2013-08-01  8:34       ` Jan Kara
2013-07-31  4:15 ` [PATCH 02/11] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Dave Chinner
2013-07-31 14:44   ` Jan Kara
2013-08-01  8:12   ` Christoph Hellwig
2013-08-02  1:11     ` Dave Chinner
2013-08-02 14:32       ` Christoph Hellwig
2013-07-31  4:15 ` [PATCH 03/11] inode: convert inode_sb_list_lock to per-sb Dave Chinner
2013-07-31 14:48   ` Jan Kara
2013-07-31  4:15 ` [PATCH 04/11] sync: serialise per-superblock sync operations Dave Chinner
2013-07-31 15:12   ` Jan Kara
2013-07-31  4:15 ` [PATCH 05/11] inode: rename i_wb_list to i_io_list Dave Chinner
2013-07-31 14:51   ` Jan Kara
2013-07-31  4:15 ` [PATCH 06/11] bdi: add a new writeback list for sync Dave Chinner
2013-07-31 15:11   ` Jan Kara
2013-08-01  5:59     ` Dave Chinner
2013-07-31  4:15 ` [PATCH 07/11] writeback: periodically trim the writeback list Dave Chinner
2013-07-31 15:15   ` Jan Kara
2013-08-01  6:16     ` Dave Chinner
2013-08-01  9:03       ` Jan Kara
2013-07-31  4:15 ` [PATCH 08/11] inode: convert per-sb inode list to a list_lru Dave Chinner
2013-08-01  8:19   ` Christoph Hellwig
2013-08-02  1:06     ` Dave Chinner
2013-07-31  4:15 ` [PATCH 09/11] fs: Use RCU lookups for inode cache Dave Chinner
2013-07-31  4:15 ` [PATCH 10/11] list_lru: don't need node lock in list_lru_count_node Dave Chinner
2013-07-31  4:15 ` [PATCH 11/11] list_lru: don't lock during add/del if unnecessary Dave Chinner
2013-07-31  6:48 ` [PATCH 00/11] Sync and VFS scalability improvements Sedat Dilek
2013-08-01  6:19   ` Dave Chinner
2013-08-01  6:31     ` Sedat Dilek
