linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] Sync and VFS scalability improvements
@ 2015-03-10 19:45 Josef Bacik
  2015-03-10 19:45 ` [PATCH 1/9] writeback: plug writeback at a high level Josef Bacik
                   ` (10 more replies)
  0 siblings, 11 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel

These are patches that Dave Chinner wrote two years ago that are still very much
needed today.  I recently ran into a problem where I had millions of inodes that
needed to be evicted at unmount time and it soft locked up the box and kept any
other file system from doing work.  These patches fix this problem by breaking
the global inode_sb_list_lock into a per-sb lock, and then converting the per-sb
inode list into a list_lru for better scalability.

I've also pulled forward Dave's sync scalability patches which still seem pretty
relevant.  I had to fix a couple of these to bring them forward but I touched
very little and I've preserved the authorship of everything.  I added the
Reviewed-by's that were there when the patches were originally submitted.  I've
run this through xfstests on btrfs and xfs and verified that everything seems to
be working.  If you are interested, the original submission can be found here:

http://lwn.net/Articles/561569/

Finally, the last patch is from me; it fixes the softlockup problems I was
seeing on unmount with a large number of inodes that needed to be evicted.
Thanks,

Josef



* [PATCH 1/9] writeback: plug writeback at a high level
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-10 19:45 ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload using fs_mark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	------------	------	---------
unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%

If I use zero length files, this workload runs at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media, as the IO patterns for this workload are going to make a
much bigger difference on high IO latency devices.....
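
For reference, the plugging idiom being applied here is the standard on-stack
blk_plug pattern from <linux/blkdev.h>; a rough sketch, where the loop stands
in for the real writeback_sb_inodes() work and more_inodes_to_write() /
write_one_inode() are made-up helpers:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* submit many small IOs; they queue up on the per-task plug list */
	while (more_inodes_to_write())
		write_one_inode();
	blk_finish_plug(&plug);	/* unplug: block layer can now merge and dispatch */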

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e907052..a9ff2b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -659,7 +659,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
 
+	blk_start_plug(&plug);
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
@@ -756,6 +758,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 				break;
 		}
 	}
+	blk_finish_plug(&plug);
 	return wrote;
 }
 
-- 
1.9.3



* [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
  2015-03-10 19:45 ` [PATCH 1/9] writeback: plug writeback at a high level Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-12  9:52   ` Al Viro
  2015-03-10 19:45 ` [PATCH 3/9] inode: convert inode_sb_list_lock to per-sb Josef Bacik
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this, add an inode opflag that allows remove_inode_hash() to
skip taking the hash lock on inodes that have never actually been
hashed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_iops.c  | 2 ++
 include/linux/fs.h | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index e53a903..5068629 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1250,8 +1250,10 @@ xfs_setup_inode(
 	inode->i_state = I_NEW;
 
 	inode_sb_list_add(inode);
+
 	/* make the inode look hashed for the writeback code */
 	hlist_add_fake(&inode->i_hash);
+	inode->i_opflags |= IOP_NOTHASHED;
 
 	inode->i_mode	= ip->i_d.di_mode;
 	set_nlink(inode, ip->i_d.di_nlink);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..1045132 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -546,6 +546,7 @@ struct posix_acl;
 #define IOP_FASTPERM	0x0001
 #define IOP_LOOKUP	0x0002
 #define IOP_NOFOLLOW	0x0004
+#define IOP_NOTHASHED	0x0008	/* inode never hashed, avoid unhashing */
 
 /*
  * Keep mostly read-only and often accessed (especially for
@@ -2528,7 +2529,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode))
+	if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))
 		__remove_inode_hash(inode);
 }
 
-- 
1.9.3



* [PATCH 3/9] inode: convert inode_sb_list_lock to per-sb
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
  2015-03-10 19:45 ` [PATCH 1/9] writeback: plug writeback at a high level Josef Bacik
  2015-03-10 19:45 ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-10 19:45 ` [PATCH 4/9] sync: serialise per-superblock sync operations Josef Bacik
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks from each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c                   | 12 ++++++------
 fs/drop_caches.c                 | 10 ++++++----
 fs/fs-writeback.c                | 12 ++++++------
 fs/inode.c                       | 28 +++++++++++++---------------
 fs/internal.h                    |  1 -
 fs/notify/inode_mark.c           | 20 ++++++++++----------
 fs/quota/dquot.c                 | 16 ++++++++--------
 fs/super.c                       |  3 ++-
 include/linux/fs.h               |  5 ++++-
 include/linux/fsnotify_backend.h |  4 ++--
 10 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..2eb2436 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1753,7 +1753,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&blockdev_superblock->s_inode_list_lock);
 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
@@ -1765,13 +1765,13 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&blockdev_superblock->s_inode_list_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1779,8 +1779,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&blockdev_superblock->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 5718cb9..d72d52b 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -27,13 +27,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a9ff2b7..9780f6c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1298,7 +1298,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1318,14 +1318,14 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1335,9 +1335,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..bd61843 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,8 +28,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode_sb_list_lock protects:
- *   sb->s_inodes, inode->i_sb_list
+ * inode->i_sb->s_inode_list_lock protects:
+ *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
  * inode_hash_lock protects:
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode_sb_list_lock
+ * inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode_sb_list_lock
+ *   inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -57,8 +57,6 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -424,18 +422,18 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&inode->i_sb->s_inode_list_lock);
 	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&inode->i_sb->s_inode_list_lock);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
 	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&inode->i_sb->s_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&inode->i_sb->s_inode_list_lock);
 	}
 }
 
@@ -592,7 +590,7 @@ void evict_inodes(struct super_block *sb)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
@@ -608,7 +606,7 @@ void evict_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -629,7 +627,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
@@ -652,7 +650,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -884,7 +882,7 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_sb_list_lock);
+	spin_lock_prefetch(&sb->s_inode_list_lock);
 
 	inode = new_inode_pseudo(sb);
 	if (inode)
diff --git a/fs/internal.h b/fs/internal.h
index 01dce1d..b6c350d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -111,7 +111,6 @@ extern int open_check_o_direct(struct file *f);
 /*
  * inode.c
  */
-extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 extern void inode_add_lru(struct inode *inode);
 
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 3daf513..a4e1a8f 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -163,17 +163,17 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 
 /**
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @list: list of inodes being unmounted (sb->s_inodes)
+ * @sb: superblock being unmounted.
  *
  * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop inode_sb_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
  */
-void fsnotify_unmount_inodes(struct list_head *list)
+void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *next_i, *need_iput = NULL;
 
-	spin_lock(&inode_sb_list_lock);
-	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
 		struct inode *need_iput_tmp;
 
 		/*
@@ -209,7 +209,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != list) {
+		while (&next_i->i_sb_list != &sb->s_inodes) {
 			spin_lock(&next_i->i_lock);
 			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
 						atomic_read(&next_i->i_count)) {
@@ -224,12 +224,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_sb_list_lock here because either
+		 * We can safely drop s_inode_list_lock here because either
 		 * we actually hold references on both inode and next_i or
 		 * end of list.  Also no new inodes will be added since the
 		 * umount has begun.
 		 */
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -241,7 +241,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 0ccd4ba..68c7ae3 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -920,7 +920,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -931,7 +931,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -943,15 +943,15 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock We cannot iput the inode now as we can be
+		 * s_inode_list_lock. We cannot iput the inode now as we can be
 		 * holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1019,7 +1019,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1035,7 +1035,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 2b7dc90..85d6a62 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -191,6 +191,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
+	spin_lock_init(&s->s_inode_list_lock);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -399,7 +400,7 @@ void generic_shutdown_super(struct super_block *sb)
 		sync_filesystem(sb);
 		sb->s_flags &= ~MS_ACTIVE;
 
-		fsnotify_unmount_inodes(&sb->s_inodes);
+		fsnotify_unmount_inodes(sb);
 
 		evict_inodes(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1045132..33a450b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1245,7 +1245,6 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
-	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	struct block_device	*s_bdev;
@@ -1316,6 +1315,10 @@ struct super_block {
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
 	int s_stack_depth;
+
+	/* s_inode_list_lock protects s_inodes */
+	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
+	struct list_head	s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 0f313f9..236cbc4 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -359,7 +359,7 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un
 extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
 extern void fsnotify_get_mark(struct fsnotify_mark *mark);
 extern void fsnotify_put_mark(struct fsnotify_mark *mark);
-extern void fsnotify_unmount_inodes(struct list_head *list);
+extern void fsnotify_unmount_inodes(struct super_block *sb);
 
 /* put here because inotify does some weird stuff when destroying watches */
 extern void fsnotify_init_event(struct fsnotify_event *event,
@@ -395,7 +395,7 @@ static inline u32 fsnotify_get_cookie(void)
 	return 0;
 }
 
-static inline void fsnotify_unmount_inodes(struct list_head *list)
+static inline void fsnotify_unmount_inodes(struct super_block *sb)
 {}
 
 #endif	/* CONFIG_FSNOTIFY */
-- 
1.9.3



* [PATCH 4/9] sync: serialise per-superblock sync operations
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (2 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 3/9] inode: convert inode_sb_list_lock to per-sb Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-10 19:45 ` [PATCH 5/9] inode: rename i_wb_list to i_io_list Josef Bacik
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on
the inode_sb_list_lock and the contention causes system-wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 11 +++++++++++
 fs/super.c         |  1 +
 include/linux/fs.h |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9780f6c..dcfe21c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1288,6 +1288,15 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * Concurrent callers will block on the s_sync_lock rather than doing contending
+ * walks. The queueing maintains sync(2) required behaviour as all the IO that
+ * has been issued up to the time this function is enter is guaranteed to be
+ * completed by the time we have gained the lock and waited for all IO that is
+ * in progress regardless of the order callers are granted the lock.
+ */
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1298,6 +1307,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	mutex_lock(&sb->s_sync_lock);
 	spin_lock(&sb->s_inode_list_lock);
 
 	/*
@@ -1339,6 +1349,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	}
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
+	mutex_unlock(&sb->s_sync_lock);
 }
 
 /**
diff --git a/fs/super.c b/fs/super.c
index 85d6a62..6a05d94 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -190,6 +190,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_flags = flags;
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
+	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
 	spin_lock_init(&s->s_inode_list_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33a450b..4418693 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1311,6 +1311,8 @@ struct super_block {
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 	struct rcu_head		rcu;
 
+	struct mutex		s_sync_lock;	/* sync serialisation lock */
+
 	/*
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
-- 
1.9.3



* [PATCH 5/9] inode: rename i_wb_list to i_io_list
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (3 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 4/9] sync: serialise per-superblock sync operations Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-10 19:45 ` [PATCH 6/9] bdi: add a new writeback list for sync Josef Bacik
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to add a new "under writeback" list to
the inode, or to call it an "under IO" list on the bdi, because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 20 ++++++++++----------
 fs/inode.c         |  6 +++---
 include/linux/fs.h |  2 +-
 mm/backing-dev.c   |  8 ++++----
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dcfe21c..cabebde 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -84,7 +84,7 @@ EXPORT_SYMBOL_GPL(inode_to_bdi);
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
-	return list_entry(head, struct inode, i_wb_list);
+	return list_entry(head, struct inode, i_io_list);
 }
 
 /*
@@ -193,7 +193,7 @@ void inode_wb_list_del(struct inode *inode)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
+	list_del_init(&inode->i_io_list);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -216,7 +216,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_wb_list, &wb->b_dirty);
+	list_move(&inode->i_io_list, &wb->b_dirty);
 }
 
 /*
@@ -225,7 +225,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
-	list_move(&inode->i_wb_list, &wb->b_more_io);
+	list_move(&inode->i_io_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -284,7 +284,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
-		list_move(&inode->i_wb_list, &tmp);
+		list_move(&inode->i_io_list, &tmp);
 		moved++;
 		if (flags & EXPIRE_DIRTY_ATIME)
 			set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
@@ -307,7 +307,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		list_for_each_prev_safe(pos, node, &tmp) {
 			inode = wb_inode(pos);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_wb_list, dispatch_queue);
+				list_move(&inode->i_io_list, dispatch_queue);
 		}
 	}
 out:
@@ -458,10 +458,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 */
 		redirty_tail(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
-		list_move(&inode->i_wb_list, &wb->b_dirty_time);
+		list_move(&inode->i_io_list, &wb->b_dirty_time);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	}
 }
 
@@ -598,7 +598,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * touch it. See comment above for explanation.
 	 */
 	if (!(inode->i_state & I_DIRTY_ALL))
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
 out:
@@ -1272,7 +1272,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_wb_list, dirtytime ?
+			list_move(&inode->i_io_list, dirtytime ?
 				  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 			trace_writeback_dirty_inode_enqueue(inode);
diff --git a/fs/inode.c b/fs/inode.c
index bd61843..e0f2a60 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -31,7 +31,7 @@
  * inode->i_sb->s_inode_list_lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
- *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
+ *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
  *
@@ -355,7 +355,7 @@ void inode_init_once(struct inode *inode)
 	memset(inode, 0, sizeof(*inode));
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
@@ -523,7 +523,7 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	if (!list_empty(&inode->i_wb_list))
+	if (!list_empty(&inode->i_io_list))
 		inode_wb_list_del(inode);
 
 	inode_sb_list_del(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4418693..64b16e0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -607,7 +607,7 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
 	struct hlist_node	i_hash;
-	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6dc4580..6e7a644 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -54,13 +54,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_io, i_io_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
 		nr_more_io++;
-	list_for_each_entry(inode, &wb->b_dirty_time, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
 		if (inode->i_state & I_DIRTY_TIME)
 			nr_dirty_time++;
 	spin_unlock(&wb->list_lock);
-- 
1.9.3



* [PATCH 6/9] bdi: add a new writeback list for sync
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (4 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 5/9] inode: rename i_wb_list to i_io_list Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-16 10:14   ` Jan Kara
  2015-03-10 19:45 ` [PATCH 7/9] writeback: periodically trim the writeback list Josef Bacik
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty ones to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the related locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that a rare sync(2) call will have some work to do
skipping clean inodes. However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.
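
The check that makes lazy removal safe is the per-mapping writeback tag; a
rough sketch of the idea, using the existing mapping_tagged() helper (not the
exact hunks below):

	/* no pages under IO any more: safe to drop from the sync list lazily */
	if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
		list_del_init(&inode->i_wb_list);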

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c           | 115 ++++++++++++++++++++++++++++++++++++++------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   3 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 ++++++
 6 files changed, 119 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cabebde..82b0f43 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -198,6 +198,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -371,13 +399,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * bd_inode's will have already had the bdev free'd so don't bother
+	 * doing the bdi_clear_inode_writeback.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct inode *old_inode = NULL;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages put under IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly. Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't have to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index e0f2a60..b961e5a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..12d8224 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 64b16e0..66284fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -610,6 +610,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6e7a644..df31958 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 45e187b..4f21498 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2359,11 +2359,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are removed lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
1.9.3



* [PATCH 7/9] writeback: periodically trim the writeback list
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (5 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 6/9] bdi: add a new writeback list for sync Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-16 10:16   ` Jan Kara
  2015-03-10 19:45 ` [PATCH 8/9] inode: convert per-sb inode list to a list_lru Josef Bacik
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Inodes are removed lazily from the bdi writeback list, so in the
absence of sync(2) work, inodes will build up on the bdi writeback
list even though they are no longer under IO. Use the periodic
kupdate work check to remove inodes no longer under IO from the
writeback list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 82b0f43..aa0de0f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1037,6 +1037,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
 	return 0;
 }
 
+/*
+ * clean out writeback list for all inodes that don't have IO in progress
+ */
+static void wb_trim_writeback_list(struct bdi_writeback *wb)
+{
+	struct inode *inode;
+	struct inode *tmp;
+
+	spin_lock(&wb->list_lock);
+	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
+		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
+			list_del_init(&inode->i_wb_list);
+	}
+	spin_unlock(&wb->list_lock);
+
+}
+
 static long wb_check_old_data_flush(struct bdi_writeback *wb)
 {
 	unsigned long expired;
@@ -1053,6 +1070,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	if (time_before(jiffies, expired))
 		return 0;
 
+	wb_trim_writeback_list(wb);
+
 	wb->last_old_flush = jiffies;
 	nr_pages = get_nr_dirty_pages();
 
-- 
1.9.3



* [PATCH 8/9] inode: convert per-sb inode list to a list_lru
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (6 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 7/9] writeback: periodically trim the writeback list Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-16 12:27   ` Jan Kara
  2015-03-10 19:45 ` [PATCH 9/9] inode: don't softlockup when evicting inodes Josef Bacik
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

The per-superblock inode list and lock is a bottleneck for systems
that cycle inodes in and out of cache concurrently. Even after being made
per-sb, the single list lock still has global CPU scope and is a limiting
factor.

Most of the additions to the sb inode list occur on the CPU that
allocated the inode, and most of the removals occur during evict()
calls as a result of memory reclaim. Both of these events are local
to the node that the inode belongs to, so it maps to the per-node
lists that the list_lru uses.

There are several places where the inode list is walked. These can
be converted easily to use list_lru_walk() to do their work on each
inode on the list.
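
For readers not already familiar with the list_lru walk interface these
conversions rely on, a minimal sketch of the callback shape (my_isolate and
my_cb_arg are placeholder names; the real callbacks are in the hunks below):

	static enum lru_status
	my_isolate(struct list_head *item, struct list_lru_one *lru,
		   spinlock_t *lock, void *cb_arg)
	{
		struct inode *inode = container_of(item, struct inode, i_sb_list);

		/*
		 * Inspect the inode here.  Return LRU_SKIP to leave it on the
		 * list, or call list_lru_isolate(lru, item) and return
		 * LRU_REMOVED to take it off.  Only drop *lock while holding
		 * a reference to the inode.
		 */
		return LRU_SKIP;
	}

	/* walk every inode currently on the per-sb list_lru */
	list_lru_walk(&sb->s_inode_list, my_isolate, my_cb_arg, ULONG_MAX);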

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/block_dev.c         |  76 ++++++++++++--------
 fs/drop_caches.c       |  58 ++++++++++-----
 fs/inode.c             | 136 ++++++++++++++++++-----------------
 fs/notify/inode_mark.c | 121 +++++++++++++-------------------
 fs/quota/dquot.c       | 187 ++++++++++++++++++++++++++++++++-----------------
 fs/super.c             |   8 ++-
 include/linux/fs.h     |   9 ++-
 7 files changed, 340 insertions(+), 255 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2eb2436..d23ce6f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
-{
-	struct inode *inode, *old_inode = NULL;
+struct bdev_iter {
+	void (*func)(struct block_device *, void *);
+	void *arg;
+	struct inode *toput_inode;
+};
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
-		struct address_space *mapping = inode->i_mapping;
+static enum lru_status
+bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
+	     spinlock_t *lock, void *cb_arg)
+{
+	struct bdev_iter *iter = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
-		    mapping->nrpages == 0) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
+	    inode->i_mapping->nrpages == 0) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		return LRU_SKIP;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		func(I_BDEV(inode), arg);
+	iput(iter->toput_inode);
+	iter->toput_inode = inode;
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
-	iput(old_inode);
+	iter->func(I_BDEV(inode), iter->arg);
+
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+/*
+ * iterate_bdevs - run a callback across all block devices
+ */
+void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
+{
+	struct bdev_iter iter = {
+		.func = func,
+		.arg = arg,
+	};
+
+	list_lru_walk(&blockdev_superblock->s_inode_list, bdev_iter_cb, &iter,
+		      ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
+	iput(iter.toput_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..ee381e1 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -13,29 +13,51 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
+static enum lru_status
+drop_pagecache_inode(struct list_head *item, struct list_lru_one *lru,
+		     spinlock_t *lock, void *cb_arg)
 {
-	struct inode *inode, *toput_inode = NULL;
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (inode->i_mapping->nrpages == 0)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    (inode->i_mapping->nrpages == 0)) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		return LRU_SKIP;
+	}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
+	iput(*toput_inode);
+	*toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	invalidate_mapping_pages(inode->i_mapping, 0, -1);
+
+	/*
+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
+	 * we have a reference to the current inode and so its next pointer is
+	 * guaranteed to be valid even though we dropped the list lock.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+
+/*
+ * This is a best effort scan, so we don't need to be absolutely sure we hit all
+ * inodes on the superblock. Hence a single pass is sufficient to catch them
+ * all.
+ */
+static void drop_pagecache_sb(struct super_block *sb, void *unused)
+{
+	struct inode *toput_inode = NULL;
+
+	list_lru_walk(&sb->s_inode_list, drop_pagecache_inode, &toput_inode,
+		      ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
 	iput(toput_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index b961e5a..17da8801 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,8 +28,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
- *   inode->i_sb->s_inodes, inode->i_sb_list
+ * Inode list locks protects:
+ *   inode->i_sb->s_inode_list, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
  * inode_hash_lock protects:
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * Inode list lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   Inode list lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_sb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
@@ -423,19 +424,13 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	list_lru_add(&inode->i_sb->s_inode_list, &inode->i_sb_list);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	list_lru_del(&inode->i_sb->s_inode_list, &inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -577,6 +572,50 @@ static void dispose_list(struct list_head *head)
 	}
 }
 
+static enum lru_status
+__evict_inodes_isolate(struct list_head *item, struct list_lru_one *lru,
+		       spinlock_t *lock, void *cb_arg, bool kill_dirty)
+{
+	struct list_head *dispose = cb_arg;
+	struct inode	*inode = container_of(item, struct inode, i_sb_list);
+
+	if (atomic_read(&inode->i_count))
+		return LRU_SKIP;
+
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
+
+	if ((inode->i_state & I_DIRTY) && !kill_dirty) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
+
+	inode->i_state |= I_FREEING;
+	inode_lru_list_del(inode);
+	list_add(&inode->i_lru, dispose);
+
+	list_lru_isolate(lru, item);
+	spin_unlock(&inode->i_lock);
+	return LRU_REMOVED;
+}
+
+static enum lru_status
+evict_inodes_isolate(struct list_head *item, struct list_lru_one *lru,
+		     spinlock_t *lock, void *cb_arg)
+{
+	return __evict_inodes_isolate(item, lru, lock, cb_arg, true);
+}
+
+static enum lru_status
+invalidate_inodes_isolate(struct list_head *item, struct list_lru_one *lru,
+			  spinlock_t *lock, void *cb_arg)
+{
+	return __evict_inodes_isolate(item, lru, lock, cb_arg, false);
+}
+
 /**
  * evict_inodes	- evict all evictable inodes for a superblock
  * @sb:		superblock to operate on
@@ -588,28 +627,15 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		if (atomic_read(&inode->i_count))
-			continue;
-
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+	long freed;
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	do {
+		LIST_HEAD(dispose);
 
-	dispose_list(&dispose);
+		freed = list_lru_walk(&sb->s_inode_list, evict_inodes_isolate,
+				      &dispose, ULONG_MAX);
+		dispose_list(&dispose);
+	} while (freed > 0);
 }
 
 /**
@@ -624,38 +650,24 @@ void evict_inodes(struct super_block *sb)
  */
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
-	int busy = 0;
-	struct inode *inode, *next;
-	LIST_HEAD(dispose);
+	list_lru_walk_cb isolate;
+	long freed;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		if (inode->i_state & I_DIRTY_ALL && !kill_dirty) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
-		if (atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			busy = 1;
-			continue;
-		}
+	isolate = kill_dirty ? evict_inodes_isolate : invalidate_inodes_isolate;
 
-		inode->i_state |= I_FREEING;
-		inode_lru_list_del(inode);
-		spin_unlock(&inode->i_lock);
-		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	do {
+		LIST_HEAD(dispose);
 
-	dispose_list(&dispose);
+		freed = list_lru_walk(&sb->s_inode_list, isolate,
+				      &dispose, ULONG_MAX);
+		dispose_list(&dispose);
+	} while (freed > 0);
 
-	return busy;
+	/*
+	 * if we skipped any inodes because we couldn't isolate them, tell the
+	 * caller there are still active inodes.
+	 */
+	return !!list_lru_count(&sb->s_inode_list);
 }
 
 /*
@@ -849,7 +861,7 @@ EXPORT_SYMBOL(get_next_ino);
  *	@sb: superblock
  *
  *	Allocates a new inode for given superblock.
- *	Inode wont be chained in superblock s_inodes list
+ *	Inode won't be chained into the superblock's s_inode_list
  *	This means :
  *	- fs can't be unmount
  *	- quotas, fsnotify, writeback can't work
@@ -883,8 +895,6 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index a4e1a8f..a0cdc66 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 	return ret;
 }
 
-/**
- * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @sb: superblock being unmounted.
- *
- * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
- */
-void fsnotify_unmount_inodes(struct super_block *sb)
-{
-	struct inode *inode, *next_i, *need_iput = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
-		struct inode *need_iput_tmp;
-
-		/*
-		 * We cannot __iget() an inode in state I_FREEING,
-		 * I_WILL_FREE, or I_NEW which is fine because by that point
-		 * the inode cannot have any associated watches.
-		 */
-		spin_lock(&inode->i_lock);
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
+static enum lru_status
+fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
+		       spinlock_t *lock, void *cb_arg)
+{
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
+
+	/* New or being freed inodes cannot have any associated watches. */
+	spin_lock(&inode->i_lock);
+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
 
-		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
-		 * unnecessarily violent and may in fact be illegal to do.
-		 */
-		if (!atomic_read(&inode->i_count)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-
-		need_iput_tmp = need_iput;
-		need_iput = NULL;
-
-		/* In case fsnotify_inode_delete() drops a reference. */
-		if (inode != need_iput_tmp)
-			__iget(inode);
-		else
-			need_iput_tmp = NULL;
+	/* If i_count is zero, the inode cannot have any watches */
+	if (!atomic_read(&inode->i_count)) {
 		spin_unlock(&inode->i_lock);
+		return LRU_SKIP;
+	}
 
-		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
-			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
-						atomic_read(&next_i->i_count)) {
-				__iget(next_i);
-				need_iput = next_i;
-				spin_unlock(&next_i->i_lock);
-				break;
-			}
-			spin_unlock(&next_i->i_lock);
-			next_i = list_entry(next_i->i_sb_list.next,
-						struct inode, i_sb_list);
-		}
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
-		/*
-		 * We can safely drop s_inode_list_lock here because either
-		 * we actually hold references on both inode and next_i or
-		 * end of list.  Also no new inodes will be added since the
-		 * umount has begun.
-		 */
-		spin_unlock(&sb->s_inode_list_lock);
+	iput(*toput_inode);
+	*toput_inode = inode;
 
-		if (need_iput_tmp)
-			iput(need_iput_tmp);
+	/* for each watch, send FS_UNMOUNT and then remove it */
+	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	fsnotify_inode_delete(inode);
 
-		/* for each watch, send FS_UNMOUNT and then remove it */
-		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
+	/*
+	 * Even though we dropped the list lock here, we can return LRU_SKIP:
+	 * we hold a reference to the current inode, so its next pointer is
+	 * still guaranteed to be valid.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
 
-		fsnotify_inode_delete(inode);
+/**
+ * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
+ * @sb: superblock being unmounted.
+ *
+ * Called during unmount with sb->s_umount held exclusively, so the inode
+ * list will not grow, and a single pass will catch all inodes.
+ */
+void fsnotify_unmount_inodes(struct super_block *sb)
+{
+	struct inode *toput_inode = NULL;
 
-		iput(inode);
+	list_lru_walk(&sb->s_inode_list, fsnotify_unmount_inode, &toput_inode,
+		      ULONG_MAX);
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	/* the list walk doesn't release the last inode it sees! */
+	iput(toput_inode);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 68c7ae3..20c97c7 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -912,55 +912,81 @@ static int dqinit_needed(struct inode *inode, int type)
 	return 0;
 }
 
-/* This routine is guarded by dqonoff_mutex mutex */
-static void add_dquot_ref(struct super_block *sb, int type)
+static enum lru_status
+add_dquot_ref_type(struct list_head *item, struct list_lru_one *lru,
+		   spinlock_t *lock, void *cb_arg, int type)
 {
-	struct inode *inode, *old_inode = NULL;
-#ifdef CONFIG_QUOTA_DEBUG
-	int reserved = 0;
-#endif
+	struct inode **toput_inode = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    !atomic_read(&inode->i_writecount) ||
-		    !dqinit_needed(inode, type)) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+	    !atomic_read(&inode->i_writecount) ||
+	    !dqinit_needed(inode, type)) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		return LRU_SKIP;
+	}
+
+	__iget(inode);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
-		if (unlikely(inode_get_rsv_space(inode) > 0))
-			reserved = 1;
+	if (unlikely(inode_get_rsv_space(inode) > 0))
+		quota_error(inode->i_sb, "Writes happened before quota was "
+			    "turned on thus quota information is probably "
+			    "inconsistent. Please run quotacheck(8)");
 #endif
-		iput(old_inode);
-		__dquot_initialize(inode, type);
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
-		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	iput(*toput_inode);
+	*toput_inode = inode;
 
-#ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
-		quota_error(sb, "Writes happened before quota was turned on "
-			"thus quota information is probably inconsistent. "
-			"Please run quotacheck(8)");
+	__dquot_initialize(inode, type);
+
+	/*
+	 * Even though we dropped the list lock here, we can return LRU_SKIP:
+	 * we hold a reference to the current inode, so its next pointer is
+	 * still guaranteed to be valid.
+	 */
+	spin_lock(lock);
+	return LRU_SKIP;
+}
+
+static enum lru_status
+add_dquot_ref_usr(struct list_head *item, struct list_lru_one *lru,
+		  spinlock_t *lock, void *cb_arg)
+{
+	return add_dquot_ref_type(item, lru, lock, cb_arg, USRQUOTA);
+}
+
+static enum lru_status
+add_dquot_ref_grp(struct list_head *item, struct list_lru_one *lru,
+		  spinlock_t *lock, void *cb_arg)
+{
+	return add_dquot_ref_type(item, lru, lock, cb_arg, GRPQUOTA);
+}
+
+/* add_dquot_ref is protected by dqonoff_mutex */
+void add_dquot_ref(struct super_block *sb, int type)
+{
+	struct inode *toput_inode = NULL;
+	list_lru_walk_cb isolate;
+
+	switch (type) {
+	case USRQUOTA:
+		isolate = add_dquot_ref_usr;
+		break;
+	case GRPQUOTA:
+		isolate = add_dquot_ref_grp;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
 	}
-#endif
+	list_lru_walk(&sb->s_inode_list, isolate, &toput_inode, ULONG_MAX);
+
+	/* the list walk doesn't release the last inode it sees! */
+	iput(toput_inode);
 }
 
 /*
@@ -1013,36 +1039,67 @@ static void put_dquot_list(struct list_head *tofree_head)
 	}
 }
 
-static void remove_dquot_ref(struct super_block *sb, int type,
-		struct list_head *tofree_head)
+static enum lru_status
+remove_dquot_ref_type(struct list_head *item, struct list_lru_one *lru,
+		      spinlock_t *lock, void *cb_arg, int type)
 {
-	struct inode *inode;
-	int reserved = 0;
+	struct list_head *tofree_head = cb_arg;
+	struct inode *inode = container_of(item, struct inode, i_sb_list);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		/*
-		 *  We have to scan also I_NEW inodes because they can already
-		 *  have quota pointer initialized. Luckily, we need to touch
-		 *  only quota pointers and these have separate locking
-		 *  (dq_data_lock).
-		 */
-		spin_lock(&dq_data_lock);
-		if (!IS_NOQUOTA(inode)) {
-			if (unlikely(inode_get_rsv_space(inode) > 0))
-				reserved = 1;
-			remove_inode_dquot_ref(inode, type, tofree_head);
-		}
-		spin_unlock(&dq_data_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	/*
+	 *  We also have to scan I_NEW inodes because they can already
+	 *  have the quota pointer initialized. Luckily, we need to touch
+	 *  only quota pointers and these have separate locking
+	 *  (dq_data_lock).
+	 */
+	spin_lock(&dq_data_lock);
+	if (!IS_NOQUOTA(inode)) {
 #ifdef CONFIG_QUOTA_DEBUG
-	if (reserved) {
-		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
+		if (unlikely(inode_get_rsv_space(inode) > 0)) {
+			printk_ratelimited(KERN_WARNING
+			"VFS (%s): Writes happened after quota"
 			" was disabled thus quota information is probably "
-			"inconsistent. Please run quotacheck(8).\n", sb->s_id);
-	}
+			"inconsistent. Please run quotacheck(8).\n",
+			inode->i_sb->s_id);
+		}
 #endif
+		remove_inode_dquot_ref(inode, type, tofree_head);
+	}
+	spin_unlock(&dq_data_lock);
+	return LRU_SKIP;
+}
+
+static enum lru_status
+remove_dquot_ref_usr(struct list_head *item, struct list_lru_one *lru,
+		     spinlock_t *lock, void *cb_arg)
+{
+	return remove_dquot_ref_type(item, lru, lock, cb_arg, USRQUOTA);
+}
+
+static enum lru_status
+remove_dquot_ref_grp(struct list_head *item, struct list_lru_one *lru,
+		     spinlock_t *lock, void *cb_arg)
+{
+	return remove_dquot_ref_type(item, lru, lock, cb_arg, GRPQUOTA);
+}
+
+static void remove_dquot_ref(struct super_block *sb, int type,
+		struct list_head *tofree_head)
+{
+	list_lru_walk_cb isolate;
+
+	switch (type) {
+	case USRQUOTA:
+		isolate = remove_dquot_ref_usr;
+		break;
+	case GRPQUOTA:
+		isolate = remove_dquot_ref_grp;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+	list_lru_walk(&sb->s_inode_list, isolate, tofree_head, ULONG_MAX);
 }
 
 /* Gather all references from inodes and drop them */
diff --git a/fs/super.c b/fs/super.c
index 6a05d94..721584a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -146,6 +146,7 @@ static void destroy_super(struct super_block *s)
 	int i;
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	list_lru_destroy(&s->s_inode_list);
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
 		percpu_counter_destroy(&s->s_writers.counter[i]);
 	security_sb_free(s);
@@ -191,13 +192,13 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
 		goto fail;
+	if (list_lru_init(&s->s_inode_list))
+		goto fail;
 
 	init_rwsem(&s->s_umount);
 	lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -294,6 +295,7 @@ void deactivate_locked_super(struct super_block *s)
 		 */
 		list_lru_destroy(&s->s_dentry_lru);
 		list_lru_destroy(&s->s_inode_lru);
+		list_lru_destroy(&s->s_inode_list);
 
 		put_filesystem(fs);
 		put_super(s);
@@ -413,7 +415,7 @@ void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (list_lru_count(&sb->s_inode_list)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 66284fe..964aba3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1319,9 +1319,12 @@ struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/*
+	 * The inode list is not strictly used as an LRU, but it uses the list_lru
+	 * construct to provide a scalable list implementation for adding,
+	 * removing and walking the inodes cached in memory.
+	 */
+	struct list_lru		s_inode_list ____cacheline_aligned_in_smp;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 9/9] inode: don't softlockup when evicting inodes
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (7 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 8/9] inode: convert per-sb inode list to a list_lru Josef Bacik
@ 2015-03-10 19:45 ` Josef Bacik
  2015-03-16 12:31   ` Jan Kara
  2015-03-16 11:39 ` [PATCH 0/9] Sync and VFS scalability improvements Jan Kara
  2015-03-25 11:18 ` Mel Gorman
  10 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-10 19:45 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel

On a box with a lot of RAM (148GB) I can make the box softlockup after running
an fs_mark job that creates hundreds of millions of empty files.  This is
because we never generate enough memory pressure to keep the number of inodes on
our unused list low, so when we go to unmount we have to evict ~100 million
inodes.  This makes one processor a very unhappy person, so add a cond_resched()
in dispose_list() and cond_resched_lock() in the eviction isolation function to
combat this.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 17da8801..2191a3ce 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -569,6 +569,7 @@ static void dispose_list(struct list_head *head)
 		list_del_init(&inode->i_lru);
 
 		evict(inode);
+		cond_resched();
 	}
 }
 
@@ -599,6 +600,13 @@ __evict_inodes_isolate(struct list_head *item, struct list_lru_one *lru,
 
 	list_lru_isolate(lru, item);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * We can have a ton of inodes to evict at unmount time. Check whether
+	 * we need to go to sleep for a bit so we don't trigger a soft lockup.
+	 */
+	if (cond_resched_lock(lock))
+		return LRU_REMOVED_RETRY;
 	return LRU_REMOVED;
 }
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2015-03-10 19:45 ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
@ 2015-03-12  9:52   ` Al Viro
  2015-03-12 12:18     ` [PATCH] inode: add hlist_fake to avoid the " Josef Bacik
                       ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Al Viro @ 2015-03-12  9:52 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, jack, linux-kernel, Dave Chinner

On Tue, Mar 10, 2015 at 03:45:17PM -0400, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some filesystems don't use the VFS inode hash and fake the fact they
> are hashed so that all the writeback code works correctly. However,
> this means the evict() path still tries to remove the inode from the
> hash, meaning that the inode_hash_lock() needs to be taken
> unnecessarily. Hence under certain workloads the inode_hash_lock can
> be contended even if the inode is never actually hashed.
> 
> To avoid this, add an inode opflag to allow inode_hash_remove() to
> avoid taking the hash lock on inodes that have never actually been
> hashed.

Why bother with flags, etc. when we could just do

static inline bool hlist_fake(struct hlist_node *h) 
{
	return h->pprev == &h->next;
}

> -	if (!inode_unhashed(inode))
> +	if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))

and turn this check into
	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))

instead?
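
To make the "fake hashed" trick concrete: hlist_add_fake() points pprev back
at the node's own next field instead of at a hash bucket, so a filesystem that
opts out of the inode hash still looks hashed to inode_unhashed() while
hlist_fake() can tell it never actually entered a bucket. A rough sketch of
such a filesystem's inode setup (example_fs_new_inode() is a made-up name,
not an existing function, so treat this as illustration only):

static void example_fs_new_inode(struct inode *inode)
{
	/*
	 * Looks hashed to inode_unhashed() (pprev != NULL) but never touches
	 * a hash bucket, so hlist_fake() returns true and, with the check
	 * above, evict() never takes inode_hash_lock for this inode.
	 */
	hlist_add_fake(&inode->i_hash);
}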

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict
  2015-03-12  9:52   ` Al Viro
@ 2015-03-12 12:18     ` Josef Bacik
  2015-03-12 12:20     ` [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2 Josef Bacik
  2015-03-12 12:24     ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
  2 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-12 12:18 UTC (permalink / raw)
  To: viro, linux-fsdevel, david, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock() needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this, add hlist_fake() to test whether the inode was only ever
fake-hashed, so we can avoid taking the hash lock on inodes that have never
actually been hashed.  Based on Dave Chinner's

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

with changes based on Al's suggestions.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/fs.h   | 2 +-
 include/linux/list.h | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..31da757 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2528,7 +2528,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode))
+	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
 		__remove_inode_hash(inode);
 }
 
diff --git a/include/linux/list.h b/include/linux/list.h
index feb773c..3e3e64a 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -672,6 +672,11 @@ static inline void hlist_add_fake(struct hlist_node *n)
 	n->pprev = &n->next;
 }
 
+static inline bool hlist_fake(struct hlist_node *h)
+{
+	return h->pprev == &h->next;
+}
+
 /*
  * Move a list from one list head to another. Fixup the pprev
  * reference of the first entry if it exists.
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2
  2015-03-12  9:52   ` Al Viro
  2015-03-12 12:18     ` [PATCH] inode: add hlist_fake to avoid the " Josef Bacik
@ 2015-03-12 12:20     ` Josef Bacik
  2015-03-14  7:00       ` Jan Kara
  2015-03-12 12:24     ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
  2 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-12 12:20 UTC (permalink / raw)
  To: viro, linux-fsdevel, david, jack, linux-kernel

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock() needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this, add hlist_fake() to test whether the inode was only ever
fake-hashed, so we can avoid taking the hash lock on inodes that have never
actually been hashed.  Based on Dave Chinner's

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

with changes based on Al's suggestions.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
V1->V2: Argh sorry, forgot to reset the author, my bad.

 include/linux/fs.h   | 2 +-
 include/linux/list.h | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..31da757 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2528,7 +2528,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode))
+	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
 		__remove_inode_hash(inode);
 }
 
diff --git a/include/linux/list.h b/include/linux/list.h
index feb773c..3e3e64a 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -672,6 +672,11 @@ static inline void hlist_add_fake(struct hlist_node *n)
 	n->pprev = &n->next;
 }
 
+static inline bool hlist_fake(struct hlist_node *h)
+{
+	return h->pprev == &h->next;
+}
+
 /*
  * Move a list from one list head to another. Fixup the pprev
  * reference of the first entry if it exists.
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict
  2015-03-12  9:52   ` Al Viro
  2015-03-12 12:18     ` [PATCH] inode: add hlist_fake to avoid the " Josef Bacik
  2015-03-12 12:20     ` [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2 Josef Bacik
@ 2015-03-12 12:24     ` Josef Bacik
  2 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-12 12:24 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, david, jack, linux-kernel, Dave Chinner

On 03/12/2015 05:52 AM, Al Viro wrote:
> On Tue, Mar 10, 2015 at 03:45:17PM -0400, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Some filesystems don't use the VFS inode hash and fake the fact they
>> are hashed so that all the writeback code works correctly. However,
>> this means the evict() path still tries to remove the inode from the
>> hash, meaning that the inode_hash_lock() needs to be taken
>> unnecessarily. Hence under certain workloads the inode_hash_lock can
>> be contended even if the inode is never actually hashed.
>>
>> To avoid this, add an inode opflag to allow inode_hash_remove() to
>> avoid taking the hash lock on inodes that have never actually been
>> hashed.
>
> Why bother with flags, etc. when we could just do
>
> static inline bool hlist_fake(struct hlist_node *h)
> {
> 	return h->pprev == &h->next;
> }
>
>> -	if (!inode_unhashed(inode))
>> +	if (!((inode->i_opflags & IOP_NOTHASHED) || inode_unhashed(inode)))
>
> and turn this check into
> 	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
>
> instead?
>

I made the change and sent the new patch; hopefully it worked, since sometimes
my laptop screws up git-send-email.  If everything looks good to you, Al,
you can pull from my tree if you don't want to scrape the patches from the list:

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
superblock-scaling

Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2
  2015-03-12 12:20     ` [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2 Josef Bacik
@ 2015-03-14  7:00       ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-14  7:00 UTC (permalink / raw)
  To: Josef Bacik; +Cc: viro, linux-fsdevel, david, jack, linux-kernel

On Thu 12-03-15 08:20:23, Josef Bacik wrote:
> Some filesystems don't use the VFS inode hash and fake the fact they
> are hashed so that all the writeback code works correctly. However,
> this means the evict() path still tries to remove the inode from the
> hash, meaning that the inode_hash_lock() needs to be taken
> unnecessarily. Hence under certain workloads the inode_hash_lock can
> be contended even if the inode is never actually hashed.
> 
> To avoid this, add hlist_fake() to test whether the inode was only ever
> fake-hashed, so we can avoid taking the hash lock on inodes that have never
> actually been hashed.  Based on Dave Chinner's
> 
> inode: add IOP_NOTHASHED to avoid inode hash lock in evict
> 
> with changes based on Al's suggestions.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> V1->V2: Argh sorry, forgot to reset the author, my bad.
> 
>  include/linux/fs.h   | 2 +-
>  include/linux/list.h | 5 +++++
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b4d71b5..31da757 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2528,7 +2528,7 @@ static inline void insert_inode_hash(struct inode *inode)
>  extern void __remove_inode_hash(struct inode *);
>  static inline void remove_inode_hash(struct inode *inode)
>  {
> -	if (!inode_unhashed(inode))
> +	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
>  		__remove_inode_hash(inode);
>  }
>  
> diff --git a/include/linux/list.h b/include/linux/list.h
> index feb773c..3e3e64a 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -672,6 +672,11 @@ static inline void hlist_add_fake(struct hlist_node *n)
>  	n->pprev = &n->next;
>  }
>  
> +static inline bool hlist_fake(struct hlist_node *h)
> +{
> +	return h->pprev == &h->next;
> +}
> +
>  /*
>   * Move a list from one list head to another. Fixup the pprev
>   * reference of the first entry if it exists.
> -- 
> 1.9.3
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/9] bdi: add a new writeback list for sync
  2015-03-10 19:45 ` [PATCH 6/9] bdi: add a new writeback list for sync Josef Bacik
@ 2015-03-16 10:14   ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-16 10:14 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

On Tue 10-03-15 15:45:21, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the related locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that a rare sync(2) call will have some work to do
> skipping clean inodes. However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
  The code looks mostly fine. Two comments below.
...
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * bd_inode's will have already had the bdev free'd so don't bother
> +	 * doing the bdi_clear_inode_writeback.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
  Umm, I don't get this even though I've read the comment several times.
Why don't we want to remove the bdev inode from the writeback list of
blockdev_super? Is the comment suggesting that a bdev inode cannot be on
the writeback list?

> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages put under IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly. Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't have to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
  Indenting looks damaged here...

>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
  Why do we put freeing inodes back on the b_writeback list? For I_FREEING and
I_WILL_FREE we could just as well delete them. For I_NEW we would start
waiting for them once I_NEW gets cleared, but I_NEW inodes under writeback
still look somewhat worrying (the flusher thread skips them, so the
semantics aren't clear there). Anyway, probably let's just keep the code as is
for I_NEW, but moving inodes back to b_writeback for I_FREEING | I_WILL_FREE
looks just dumb.
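
A minimal sketch of what the suggested handling could look like in the
wait_sb_inodes() loop (illustrative only and untested; it reuses the inode
and bdi locals from the hunk quoted above):

		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			/* Being torn down: nothing left to wait for, just drop it. */
			list_del_init(&inode->i_wb_list);
			spin_unlock(&inode->i_lock);
			cond_resched_lock(&bdi->wb.list_lock);
			continue;
		}
		if (inode->i_state & I_NEW) {
			/* Keep the current behaviour: park it back on b_writeback. */
			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
			spin_unlock(&inode->i_lock);
			cond_resched_lock(&bdi->wb.list_lock);
			continue;
		}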

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 7/9] writeback: periodically trim the writeback list
  2015-03-10 19:45 ` [PATCH 7/9] writeback: periodically trim the writeback list Josef Bacik
@ 2015-03-16 10:16   ` Jan Kara
  2015-03-16 11:43     ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-03-16 10:16 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

On Tue 10-03-15 15:45:22, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Inodes are removed lazily from the bdi writeback list, so in the
> absence of sync(2) work inodes will build up on the bdi writback
> list even though they are no longer under IO. Use the periodic
> kupdate work check to remove inodes no longer under IO from the
> writeback list.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 82b0f43..aa0de0f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1037,6 +1037,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
>  	return 0;
>  }
>  
> +/*
> + * clean out writeback list for all inodes that don't have IO in progress
> + */
> +static void wb_trim_writeback_list(struct bdi_writeback *wb)
> +{
> +	struct inode *inode;
> +	struct inode *tmp;
> +
> +	spin_lock(&wb->list_lock);
> +	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
> +		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
> +			list_del_init(&inode->i_wb_list);
> +	}
> +	spin_unlock(&wb->list_lock);
> +
> +}
> +
>  static long wb_check_old_data_flush(struct bdi_writeback *wb)
>  {
>  	unsigned long expired;
> @@ -1053,6 +1070,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
>  	if (time_before(jiffies, expired))
>  		return 0;
>  
> +	wb_trim_writeback_list(wb);
> +
>  	wb->last_old_flush = jiffies;
>  	nr_pages = get_nr_dirty_pages();
>  
> -- 
> 1.9.3
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/9] Sync and VFS scalability improvements
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (8 preceding siblings ...)
  2015-03-10 19:45 ` [PATCH 9/9] inode: don't softlockup when evicting inodes Josef Bacik
@ 2015-03-16 11:39 ` Jan Kara
  2015-03-25 11:18 ` Mel Gorman
  10 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-16 11:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel

On Tue 10-03-15 15:45:15, Josef Bacik wrote:
> These are patches that Dave Chinner wrote two years ago that are still very much
> needed today.  I recently ran into a problem where I had millions of inodes that
> needed to be evicted at unmount time and it soft locked up the box and kept any
> other file system from doing work.  These patches fix this problem by breaking
> the inode_sb_list_lock into per-sb, and then converting the per sb inode list
> into a list_lru for better scalability.
> 
> I've also pulled forward Dave's sync scalability patches which still seem pretty
> relevant.  I had to fix a couple of these to bring them forward but I touched
> very little and I've preserved the authorship of everything.  I added the
> Reviewed-by's that were there when the patches were originally submitted.  I've
> run this through xfstests on btrfs and xfs and verified that everything seems to
> be working.  If you are interested the original submission can be found here
> 
> http://lwn.net/Articles/561569/
> 
> Finally the last patch is from me and this fixes the softlockup problems I was
> seeing on unmount with a large amount of inodes that needed to be evicted.
   Just a general process note - when resending patches like this, you
should add your Signed-off-by. We know both you and Dave well so it's
mostly a cosmetic thing but it's better to do it anyway...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 7/9] writeback: periodically trim the writeback list
  2015-03-16 10:16   ` Jan Kara
@ 2015-03-16 11:43     ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-16 11:43 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

On Mon 16-03-15 11:16:08, Jan Kara wrote:
> On Tue 10-03-15 15:45:22, Josef Bacik wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Inodes are removed lazily from the bdi writeback list, so in the
> > absence of sync(2) work inodes will build up on the bdi writeback
> > list even though they are no longer under IO. Use the periodic
> > kupdate work check to remove inodes no longer under IO from the
> > writeback list.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>   Looks good. You can add:
> Reviewed-by: Jan Kara <jack@suse.cz>
  BTW, it would be nice to add some numbers here showing how much sync is
sped up by the last two patches. The change makes a lot of sense but
it would be nice as a sanity check...

								Honza

> > ---
> >  fs/fs-writeback.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 82b0f43..aa0de0f 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1037,6 +1037,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
> >  	return 0;
> >  }
> >  
> > +/*
> > + * clean out writeback list for all inodes that don't have IO in progress
> > + */
> > +static void wb_trim_writeback_list(struct bdi_writeback *wb)
> > +{
> > +	struct inode *inode;
> > +	struct inode *tmp;
> > +
> > +	spin_lock(&wb->list_lock);
> > +	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
> > +		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
> > +			list_del_init(&inode->i_wb_list);
> > +	}
> > +	spin_unlock(&wb->list_lock);
> > +
> > +}
> > +
> >  static long wb_check_old_data_flush(struct bdi_writeback *wb)
> >  {
> >  	unsigned long expired;
> > @@ -1053,6 +1070,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
> >  	if (time_before(jiffies, expired))
> >  		return 0;
> >  
> > +	wb_trim_writeback_list(wb);
> > +
> >  	wb->last_old_flush = jiffies;
> >  	nr_pages = get_nr_dirty_pages();
> >  
> > -- 
> > 1.9.3
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru
  2015-03-10 19:45 ` [PATCH 8/9] inode: convert per-sb inode list to a list_lru Josef Bacik
@ 2015-03-16 12:27   ` Jan Kara
  2015-03-16 15:34     ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-03-16 12:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

  Hello,

On Tue 10-03-15 15:45:23, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The per-superblock inode list and lock is a bottleneck for systems
> that cycle inodes in and out of cache concurrently. The global lock
> is a limiting factor.
> 
> Most of the additions to the sb inode list occur on the CPU that
> allocated the inode, and most of the removals occur during evict()
> calls as a result of memory reclaim. Both of these events are local
> to the node that the inode belongs to, so it maps to the per-node
> lists that the list_lru uses.
> 
> There are several places where the inode list is walked. These can
> be converted easily to use list_lru_walk() to do their work on each
> inode on the list.
  Err, after looking at the patch I'm not so sure list_lru is such a great
fit for the per-sb inode list. We don't use any of its LRU features, so arguably
it just makes things less readable, and there's some memcg code which I hope
isn't activated for the inode list - but how do I tell? We just use it as
general per-numa-node linked lists, so if we go down the path of
per-numa-node linked lists I'd prefer to have an abstraction just for that.
Also there are some lifetime issues I'll comment on in the code.

And do we really want per-numa-node lists? Why not per-cpu? Is it the
memory cost we are concerned about? Or is it the cost of iterating all CPUs
when wanting to traverse all inodes? The changelog doesn't tell. The
traversal is used on unmount (inode eviction, fsnotify mark eviction),
quota on/off (adding / removing dquot pointers to / from inodes),
drop_caches, sync (iterate all block devices). Sync is IMO the only one
where CPU overhead might be a concern but we could create a better method
for iterating all block devices. After all creating / removing block
devices could happily use just ordinary lists. So CPU overhead shouldn't be
a huge problem. Memory overhead is (list_head + spinlock) per-sb-per-cpu.
That doesn't sound too terrible either.

When speaking about this with Andi Kleen, he suggested that we may not want
per-cpu lists but rather a list per batch of CPUs (say 8; the actual number
would be tuned during boot based on the number of CPUs actually present).
That may be an option as well.
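
For concreteness, the plain per-cpu (or per-batch-of-CPUs) alternative being
discussed could look roughly like the sketch below. The names are invented
and the removal side is glossed over (an inode would have to remember which
list it was put on), so this is only meant to show the shape of the thing:

struct sb_inode_list {
	spinlock_t		lock;
	struct list_head	head;
} ____cacheline_aligned_in_smp;

/* in struct super_block: struct sb_inode_list __percpu *s_inodes_pcpu; */

static void sb_inode_list_add(struct sb_inode_list __percpu *lists,
			      struct inode *inode)
{
	struct sb_inode_list *l = get_cpu_ptr(lists);

	spin_lock(&l->lock);
	list_add(&inode->i_sb_list, &l->head);
	spin_unlock(&l->lock);
	put_cpu_ptr(lists);
}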

I wouldn't like to block this series just for this patch since all the
others are fine (modulo some minor comments). So could you repost the
series without this patch so that it can get queued for the coming merge
window? It should be easy to rebase the patch 9/9 to not need this patch.

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 2eb2436..d23ce6f 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>  }
>  EXPORT_SYMBOL(__invalidate_device);
>  
> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
> -{
> -	struct inode *inode, *old_inode = NULL;
> +struct bdev_iter {
> +	void (*func)(struct block_device *, void *);
> +	void *arg;
> +	struct inode *toput_inode;
> +};
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> -		struct address_space *mapping = inode->i_mapping;
> +static enum lru_status
> +bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
> +	     spinlock_t *lock, void *cb_arg)
> +{
> +	struct bdev_iter *iter = cb_arg;
> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>  
> -		spin_lock(&inode->i_lock);
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> -		    mapping->nrpages == 0) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -		__iget(inode);
> +	spin_lock(&inode->i_lock);
> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> +	    inode->i_mapping->nrpages == 0) {
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> +		return LRU_SKIP;
> +	}
> +	__iget(inode);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(lock);
>  
> -		func(I_BDEV(inode), arg);
> +	iput(iter->toput_inode);
> +	iter->toput_inode = inode;
>  
> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	}
> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
> -	iput(old_inode);
> +	iter->func(I_BDEV(inode), iter->arg);
> +
> +	/*
> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
> +	 * we have a reference to the current inode and so it's next pointer is
> +	 * guaranteed to be valid even though we dropped the list lock.
> +	 */
> +	spin_lock(lock);
> +	return LRU_SKIP;
> +}
  So I actually think doing the iteration like this is buggy with the list_lru
code. When we pin an inode by grabbing i_count, we are only sure the inode
won't go away under us. However, there's nothing that protects against list
modifications. So this method is only safe when combined with
list_for_each[_entry](). Specifically, it does *not* work when combined with
list_for_each_safe(), which __list_lru_walk_one() uses, because the cached
next pointer may become invalid once we drop the lock. And I'd really hate
to try to make these two work together - that would seem really fragile,
since subtle changes to how list_lru iteration works could break inode
iteration.

This is true for all the iterations over the inodes that are playing with
i_count. It's not just about this particular case of blockdev iteration.
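
Spelling the difference out (a sketch only - pin_inode() and do_blocking()
are stand-ins for the __iget()-plus-deferred-iput dance, not real helpers,
and the real list_lru walk iterates bare list_heads rather than inodes):

static bool pin_inode(struct inode *inode);
static void do_blocking(struct inode *inode);

/*
 * Old-style walk: safe, because 'inode' is pinned and its ->next is only
 * read after the list lock has been retaken.
 */
static void walk_with_restarted_next(struct list_head *head, spinlock_t *lock)
{
	struct inode *inode;

	spin_lock(lock);
	list_for_each_entry(inode, head, i_sb_list) {
		if (!pin_inode(inode))
			continue;
		spin_unlock(lock);
		do_blocking(inode);
		spin_lock(lock);	/* ->next is read only after this */
	}
	spin_unlock(lock);
}

/*
 * What list_lru_walk() effectively does: 'next' is sampled before the
 * callback runs, so if the callback drops 'lock' (as bdev_iter_cb() above
 * does) another CPU can evict and free *next behind our back.
 */
static void walk_with_cached_next(struct list_head *head, spinlock_t *lock)
{
	struct inode *inode, *next;

	spin_lock(lock);
	list_for_each_entry_safe(inode, next, head, i_sb_list) {
		if (!pin_inode(inode))
			continue;
		spin_unlock(lock);
		do_blocking(inode);
		spin_lock(lock);	/* 'next' may already point to freed memory */
	}
	spin_unlock(lock);
}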

> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
> index a4e1a8f..a0cdc66 100644
> --- a/fs/notify/inode_mark.c
> +++ b/fs/notify/inode_mark.c
> @@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>  	return ret;
>  }
>  
> -/**
> - * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
> - * @sb: superblock being unmounted.
> - *
> - * Called during unmount with no locks held, so needs to be safe against
> - * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
> - */
> -void fsnotify_unmount_inodes(struct super_block *sb)
> -{
> -	struct inode *inode, *next_i, *need_iput = NULL;
> -
> -	spin_lock(&sb->s_inode_list_lock);
> -	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
> -		struct inode *need_iput_tmp;
> -
> -		/*
> -		 * We cannot __iget() an inode in state I_FREEING,
> -		 * I_WILL_FREE, or I_NEW which is fine because by that point
> -		 * the inode cannot have any associated watches.
> -		 */
> -		spin_lock(&inode->i_lock);
> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> +static enum lru_status
> +fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
> +		       spinlock_t *lock, void *cb_arg)
> + {
> +	struct inode **toput_inode = cb_arg;
> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
> +
> +	/* New or being freed inodes cannot have any associated watches. */
> +	spin_lock(&inode->i_lock);
> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +		spin_unlock(&inode->i_lock);
> +		return LRU_SKIP;
> +	}
>  
> -		/*
> -		 * If i_count is zero, the inode cannot have any watches and
> -		 * doing an __iget/iput with MS_ACTIVE clear would actually
> -		 * evict all inodes with zero i_count from icache which is
> -		 * unnecessarily violent and may in fact be illegal to do.
> -		 */
> -		if (!atomic_read(&inode->i_count)) {
> -			spin_unlock(&inode->i_lock);
> -			continue;
> -		}
> -
> -		need_iput_tmp = need_iput;
> -		need_iput = NULL;
> -
> -		/* In case fsnotify_inode_delete() drops a reference. */
> -		if (inode != need_iput_tmp)
> -			__iget(inode);
> -		else
> -			need_iput_tmp = NULL;
> +	/* If i_count is zero, the inode cannot have any watches */
> +	if (!atomic_read(&inode->i_count)) {
>  		spin_unlock(&inode->i_lock);
> +		return LRU_SKIP;
> +	}
>  
> -		/* In case the dropping of a reference would nuke next_i. */
> -		while (&next_i->i_sb_list != &sb->s_inodes) {
> -			spin_lock(&next_i->i_lock);
> -			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
> -						atomic_read(&next_i->i_count)) {
> -				__iget(next_i);
> -				need_iput = next_i;
> -				spin_unlock(&next_i->i_lock);
> -				break;
> -			}
> -			spin_unlock(&next_i->i_lock);
> -			next_i = list_entry(next_i->i_sb_list.next,
> -						struct inode, i_sb_list);
> -		}
> +	__iget(inode);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(lock);
>  
> -		/*
> -		 * We can safely drop s_inode_list_lock here because either
> -		 * we actually hold references on both inode and next_i or
> -		 * end of list.  Also no new inodes will be added since the
> -		 * umount has begun.
> -		 */
> -		spin_unlock(&sb->s_inode_list_lock);
> +	iput(*toput_inode);
> +	*toput_inode = inode;
>  
> -		if (need_iput_tmp)
> -			iput(need_iput_tmp);
> +	/* for each watch, send FS_UNMOUNT and then remove it */
> +	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> +	fsnotify_inode_delete(inode);
>  
> -		/* for each watch, send FS_UNMOUNT and then remove it */
> -		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> +	/*
> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
> +	 * we have a reference to the current inode and so it's next pointer is
> +	 * guaranteed to be valid even though we dropped the list lock.
> +	 */
> +	spin_lock(lock);
> +	return LRU_SKIP;
> +}
  So in the original code of fsnotify_unmount_inodes() you can see the
hoops you have to jump through when iterating the inode list with
list_for_each_entry_safe(). Deleting that code without deeper thought
wasn't a good idea ;).

As a side note, it should be possible to convert this code to use just
list_for_each() and avoid all that crap, but it will require me to delve into
fsnotify magic again to verify it's correct. I guess I'll leave that until
after your patches are queued so that I don't make life harder for you.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 9/9] inode: don't softlockup when evicting inodes
  2015-03-10 19:45 ` [PATCH 9/9] inode: don't softlockup when evicting inodes Josef Bacik
@ 2015-03-16 12:31   ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-16 12:31 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel

On Tue 10-03-15 15:45:24, Josef Bacik wrote:
> On a box with a lot of RAM (148GB) I can make the box softlockup after running
> an fs_mark job that creates hundreds of millions of empty files.  This is
> because we never generate enough memory pressure to keep the number of inodes on
> our unused list low, so when we go to unmount we have to evict ~100 million
> inodes.  This makes one processor a very unhappy person, so add a cond_resched()
> in dispose_list() and cond_resched_lock() in the eviction isolation function to
> combat this.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
  The patch looks good. I even have a vague memory that we did something
like this in our SLES kernels but I'm not sure where the patch got lost.
Anyway it should be possible to rebase this so that it doesn't need the
list_lru conversion...

								Honza
> ---
>  fs/inode.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 17da8801..2191a3ce 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -569,6 +569,7 @@ static void dispose_list(struct list_head *head)
>  		list_del_init(&inode->i_lru);
>  
>  		evict(inode);
> +		cond_resched();
>  	}
>  }
>  
> @@ -599,6 +600,13 @@ __evict_inodes_isolate(struct list_head *item, struct list_lru_one *lru,
>  
>  	list_lru_isolate(lru, item);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * We can have a ton of inodes to evict at unmount time, check to see if
> +	 * we need to go to sleep for a bit so we don't livelock.
> +	 */
> +	if (cond_resched_lock(lock))
> +		return LRU_REMOVED_RETRY;
>  	return LRU_REMOVED;
>  }
> -- 
> 1.9.3
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru
  2015-03-16 12:27   ` Jan Kara
@ 2015-03-16 15:34     ` Josef Bacik
  2015-03-16 15:48       ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-16 15:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, david, viro, linux-kernel, Dave Chinner

On 03/16/2015 08:27 AM, Jan Kara wrote:
>    Hello,
>
> On Tue 10-03-15 15:45:23, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> The per-superblock inode list and lock is a bottleneck for systems
>> that cycle inodes in and out of cache concurrently. The global lock
>> is a limiting factor.
>>
>> Most of the additions to the sb inode list occur on the CPU that
>> allocated the inode, and most of the removals occur during evict()
>> calls as a result of memory reclaim. Both of these events are local
>> to the node that the inode belongs to, so it maps to the per-node
>> lists that the list_lru uses.
>>
>> There are several places where the inode list is walked. These can
>> be converted easily to use list_lru_walk() to do their work on each
>> inode on the list.
>    Err, after looking at the patch I'm not so sure list_lru is such a great
> fit for inode sb list. We don't use any of its LRU features so arguably
> that just makes things less readable, there's some memcg code which I hope
> isn't activated for the inode list but how do I tell? We just use it as
> general per-numa-node linked lists so if we go down the path of
> per-numa-node linked list I'd prefer to have an abstraction just for that.
> Also there are some lifetime issues I'll comment on in the code.
>
> And do we really want per-numa-node lists? Why not per-cpu? Is it the
> memory cost we are concerned about? Or is it the cost of iterating all CPUs
> when wanting to traverse all inodes? The changelog doesn't tell. The
> traversal is used on unmount (inode eviction, fsnotify mark eviction),
> quota on/off (adding / removing dquot pointers to / from inodes),
> drop_caches, sync (iterate all block devices). Sync is IMO the only one
> where CPU overhead might be a concern but we could create a better method
> for iterating all block devices. After all creating / removing block
> devices could happily use just ordinary lists. So CPU overhead shouldn't be
> a huge problem. Memory overhead is (list_head + spinlock) per-sb-per-cpu.
> That doesn't sound too terrible either.

Yeah sorry, I had it in my head that the list was already per-cpu, but
you're right, it's per-numa.
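
For concreteness, something like the layout below is what a per-cpu (or
batched) sb inode list might look like -- made-up names, completely
untested, just to make the (list_head + spinlock) per-sb-per-cpu cost
concrete; cpu_shift is a nod to the batch-of-CPUs idea you mention below:

/* one list per CPU, or per small batch of CPUs */
struct sb_inode_list {
	spinlock_t		lock;
	struct list_head	head;
} ____cacheline_aligned_in_smp;

struct sb_inode_lists {
	struct sb_inode_list	*lists;		/* nr_lists entries */
	unsigned int		nr_lists;	/* ~num_possible_cpus() >> cpu_shift */
	unsigned int		cpu_shift;	/* 0 = per-cpu, 3 = 8 CPUs per list */
};

static inline struct sb_inode_list *
sb_inode_list_pick(struct sb_inode_lists *sbl)
{
	/* raw_ is fine here: any list will do, this is only a locality hint */
	unsigned int idx = raw_smp_processor_id() >> sbl->cpu_shift;

	return &sbl->lists[idx % sbl->nr_lists];
}

static void sb_inode_list_add(struct sb_inode_lists *sbl, struct inode *inode)
{
	struct sb_inode_list *l = sb_inode_list_pick(sbl);

	/*
	 * The inode would also need to remember which list it is on so that
	 * removal and per-list iteration can find it again; that (made-up)
	 * field is omitted here.
	 */
	spin_lock(&l->lock);
	list_add(&inode->i_sb_list, &l->head);
	spin_unlock(&l->lock);
}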

>
> When speaking about this with Andi Kleen, he was suggesting that we maybe
> want to not have a per-cpu lists but a list per a batch of CPUs (say 8, the
> actual number would be tuned during boot based on the number of actual
> CPUs). That may be an option as well.
>
> I wouldn't like to block this series just for this patch since all the
> others are fine (modulo some minor comments). So could you repost the
> series without this patch so that it can get queued for the coming merge
> window? It should be easy to rebase the patch 9/9 to not need this patch.
>

Yeah I can do that.

>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 2eb2436..d23ce6f 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>>   }
>>   EXPORT_SYMBOL(__invalidate_device);
>>
>> -void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>> -{
>> -	struct inode *inode, *old_inode = NULL;
>> +struct bdev_iter {
>> +	void (*func)(struct block_device *, void *);
>> +	void *arg;
>> +	struct inode *toput_inode;
>> +};
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> -		struct address_space *mapping = inode->i_mapping;
>> +static enum lru_status
>> +bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
>> +	     spinlock_t *lock, void *cb_arg)
>> +{
>> +	struct bdev_iter *iter = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>>
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> -		    mapping->nrpages == 0) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -		__iget(inode);
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
>> +	    inode->i_mapping->nrpages == 0) {
>>   		spin_unlock(&inode->i_lock);
>> -		spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -		/*
>> -		 * We hold a reference to 'inode' so it couldn't have been
>> -		 * removed from s_inodes list while we dropped the
>> -		 * s_inode_list_lock  We cannot iput the inode now as we can
>> -		 * be holding the last reference and we cannot iput it under
>> -		 * s_inode_list_lock. So we keep the reference and iput it
>> -		 * later.
>> -		 */
>> -		iput(old_inode);
>> -		old_inode = inode;
>> +		return LRU_SKIP;
>> +	}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		func(I_BDEV(inode), arg);
>> +	iput(iter->toput_inode);
>> +	iter->toput_inode = inode;
>>
>> -		spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	}
>> -	spin_unlock(&blockdev_superblock->s_inode_list_lock);
>> -	iput(old_inode);
>> +	iter->func(I_BDEV(inode), iter->arg);
>> +
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so it's next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>    So I actually think doing the iteration like this is buggy with list_lru
> code. When we pin inode by grabbing i_count, we are only sure the inode
> won't go away under us. However there's nothing which protects from list
> modifications. So this method is only safe to be combined with
> list_for_each[_entry](). Specifically it does *not* work when combined with
> list_for_each_safe() which __list_lru_walk_one() uses because the next
> pointer that is cached may become invalid once we drop the lock. And I'd
> really hate to try to make these two work together - that would seem really
> fragile since subtle changes to how lru_list iteration work could break
> inode iteration.
>
> This is true for all the iterations over the inodes that are playing with
> i_count. It's not just about this particular case of blockdev iteration.
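
To spell out the hazard with a minimal, illustrative sketch --
examine_entry() is a made-up callback that may sleep and is called with
the lock dropped:

static void walk_unsafe(struct list_head *head, spinlock_t *lock,
			void (*examine_entry)(struct list_head *))
{
	struct list_head *pos, *n;

	spin_lock(lock);
	list_for_each_safe(pos, n, head) {	/* 'n' is cached up front */
		spin_unlock(lock);
		examine_entry(pos);		/* may sleep; 'n' can be freed here */
		spin_lock(lock);
		/* the next iteration follows the now-stale 'n' */
	}
	spin_unlock(lock);
}

With plain list_for_each(), the next pointer is re-read from 'pos' on
every iteration, so as long as 'pos' itself is pinned and stays on the
list, resuming the walk after re-taking the lock is safe.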
>
>> diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
>> index a4e1a8f..a0cdc66 100644
>> --- a/fs/notify/inode_mark.c
>> +++ b/fs/notify/inode_mark.c
>> @@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
>>   	return ret;
>>   }
>>
>> -/**
>> - * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
>> - * @sb: superblock being unmounted.
>> - *
>> - * Called during unmount with no locks held, so needs to be safe against
>> - * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
>> - */
>> -void fsnotify_unmount_inodes(struct super_block *sb)
>> -{
>> -	struct inode *inode, *next_i, *need_iput = NULL;
>> -
>> -	spin_lock(&sb->s_inode_list_lock);
>> -	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
>> -		struct inode *need_iput_tmp;
>> -
>> -		/*
>> -		 * We cannot __iget() an inode in state I_FREEING,
>> -		 * I_WILL_FREE, or I_NEW which is fine because by that point
>> -		 * the inode cannot have any associated watches.
>> -		 */
>> -		spin_lock(&inode->i_lock);
>> -		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> +static enum lru_status
>> +fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
>> +		       spinlock_t *lock, void *cb_arg)
>> + {
>> +	struct inode **toput_inode = cb_arg;
>> +	struct inode *inode = container_of(item, struct inode, i_sb_list);
>> +
>> +	/* New or being freed inodes cannot have any associated watches. */
>> +	spin_lock(&inode->i_lock);
>> +	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
>> +		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/*
>> -		 * If i_count is zero, the inode cannot have any watches and
>> -		 * doing an __iget/iput with MS_ACTIVE clear would actually
>> -		 * evict all inodes with zero i_count from icache which is
>> -		 * unnecessarily violent and may in fact be illegal to do.
>> -		 */
>> -		if (!atomic_read(&inode->i_count)) {
>> -			spin_unlock(&inode->i_lock);
>> -			continue;
>> -		}
>> -
>> -		need_iput_tmp = need_iput;
>> -		need_iput = NULL;
>> -
>> -		/* In case fsnotify_inode_delete() drops a reference. */
>> -		if (inode != need_iput_tmp)
>> -			__iget(inode);
>> -		else
>> -			need_iput_tmp = NULL;
>> +	/* If i_count is zero, the inode cannot have any watches */
>> +	if (!atomic_read(&inode->i_count)) {
>>   		spin_unlock(&inode->i_lock);
>> +		return LRU_SKIP;
>> +	}
>>
>> -		/* In case the dropping of a reference would nuke next_i. */
>> -		while (&next_i->i_sb_list != &sb->s_inodes) {
>> -			spin_lock(&next_i->i_lock);
>> -			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>> -						atomic_read(&next_i->i_count)) {
>> -				__iget(next_i);
>> -				need_iput = next_i;
>> -				spin_unlock(&next_i->i_lock);
>> -				break;
>> -			}
>> -			spin_unlock(&next_i->i_lock);
>> -			next_i = list_entry(next_i->i_sb_list.next,
>> -						struct inode, i_sb_list);
>> -		}
>> +	__iget(inode);
>> +	spin_unlock(&inode->i_lock);
>> +	spin_unlock(lock);
>>
>> -		/*
>> -		 * We can safely drop s_inode_list_lock here because either
>> -		 * we actually hold references on both inode and next_i or
>> -		 * end of list.  Also no new inodes will be added since the
>> -		 * umount has begun.
>> -		 */
>> -		spin_unlock(&sb->s_inode_list_lock);
>> +	iput(*toput_inode);
>> +	*toput_inode = inode;
>>
>> -		if (need_iput_tmp)
>> -			iput(need_iput_tmp);
>> +	/* for each watch, send FS_UNMOUNT and then remove it */
>> +	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	fsnotify_inode_delete(inode);
>>
>> -		/* for each watch, send FS_UNMOUNT and then remove it */
>> -		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
>> +	/*
>> +	 * Even though we have dropped the lock here, we can return LRU_SKIP as
>> +	 * we have a reference to the current inode and so it's next pointer is
>> +	 * guaranteed to be valid even though we dropped the list lock.
>> +	 */
>> +	spin_lock(lock);
>> +	return LRU_SKIP;
>> +}
>    So in the original code of fsnotify_unmount_inodes() you can see the
> hoops you have to jump through when having to iterate inode list with
> list_for_each_entry_safe(). Deleting that code without deeper thought
> wasn't a good idea ;).
>
> As a side note, it should be possible to convert this code to use just
> list_for_each() and avoid all that crap, but it will require me to delve into
> fsnotify magic again to verify it's correct. I guess I'll leave that until after
> your patches are queued so that I don't make life harder for you.
>

Well these two cases are just bugs; we are supposed to return LRU_RETRY
every time we drop the lru list lock so that we loop back to the beginning.
But I agree with your other points. I'll just drop this one,
find/create a batched per-cpu list to use instead, and do that separately
so we can still get the meat of this patchset in.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/9] inode: convert per-sb inode list to a list_lru
  2015-03-16 15:34     ` Josef Bacik
@ 2015-03-16 15:48       ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-03-16 15:48 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, linux-fsdevel, david, viro, linux-kernel, Dave Chinner

On Mon 16-03-15 11:34:41, Josef Bacik wrote:
...
> >>diff --git a/fs/block_dev.c b/fs/block_dev.c
> >>index 2eb2436..d23ce6f 100644
> >>--- a/fs/block_dev.c
> >>+++ b/fs/block_dev.c
> >>@@ -1749,38 +1749,56 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> >>  }
> >>  EXPORT_SYMBOL(__invalidate_device);
> >>
> >>-void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
> >>-{
> >>-	struct inode *inode, *old_inode = NULL;
> >>+struct bdev_iter {
> >>+	void (*func)(struct block_device *, void *);
> >>+	void *arg;
> >>+	struct inode *toput_inode;
> >>+};
> >>
> >>-	spin_lock(&blockdev_superblock->s_inode_list_lock);
> >>-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> >>-		struct address_space *mapping = inode->i_mapping;
> >>+static enum lru_status
> >>+bdev_iter_cb(struct list_head *item, struct list_lru_one *lru,
> >>+	     spinlock_t *lock, void *cb_arg)
> >>+{
> >>+	struct bdev_iter *iter = cb_arg;
> >>+	struct inode *inode = container_of(item, struct inode, i_sb_list);
> >>
> >>-		spin_lock(&inode->i_lock);
> >>-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> >>-		    mapping->nrpages == 0) {
> >>-			spin_unlock(&inode->i_lock);
> >>-			continue;
> >>-		}
> >>-		__iget(inode);
> >>+	spin_lock(&inode->i_lock);
> >>+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
> >>+	    inode->i_mapping->nrpages == 0) {
> >>  		spin_unlock(&inode->i_lock);
> >>-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
> >>-		/*
> >>-		 * We hold a reference to 'inode' so it couldn't have been
> >>-		 * removed from s_inodes list while we dropped the
> >>-		 * s_inode_list_lock  We cannot iput the inode now as we can
> >>-		 * be holding the last reference and we cannot iput it under
> >>-		 * s_inode_list_lock. So we keep the reference and iput it
> >>-		 * later.
> >>-		 */
> >>-		iput(old_inode);
> >>-		old_inode = inode;
> >>+		return LRU_SKIP;
> >>+	}
> >>+	__iget(inode);
> >>+	spin_unlock(&inode->i_lock);
> >>+	spin_unlock(lock);
> >>
> >>-		func(I_BDEV(inode), arg);
> >>+	iput(iter->toput_inode);
> >>+	iter->toput_inode = inode;
> >>
> >>-		spin_lock(&blockdev_superblock->s_inode_list_lock);
> >>-	}
> >>-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
> >>-	iput(old_inode);
> >>+	iter->func(I_BDEV(inode), iter->arg);
> >>+
> >>+	/*
> >>+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
> >>+	 * we have a reference to the current inode and so it's next pointer is
> >>+	 * guaranteed to be valid even though we dropped the list lock.
> >>+	 */
> >>+	spin_lock(lock);
> >>+	return LRU_SKIP;
> >>+}
> >   So I actually think doing the iteration like this is buggy with list_lru
> >code. When we pin inode by grabbing i_count, we are only sure the inode
> >won't go away under us. However there's nothing which protects from list
> >modifications. So this method is only safe to be combined with
> >list_for_each[_entry](). Specifically it does *not* work when combined with
> >list_for_each_safe() which __list_lru_walk_one() uses because the next
> >pointer that is cached may become invalid once we drop the lock. And I'd
> >really hate to try to make these two work together - that would seem really
> >fragile since subtle changes to how lru_list iteration work could break
> >inode iteration.
> >
> >This is true for all the iterations over the inodes that are playing with
> >i_count. It's not just about this particular case of blockdev iteration.
> >
> >>diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
> >>index a4e1a8f..a0cdc66 100644
> >>--- a/fs/notify/inode_mark.c
> >>+++ b/fs/notify/inode_mark.c
> >>@@ -161,87 +161,60 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
> >>  	return ret;
> >>  }
> >>
> >>-/**
> >>- * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
> >>- * @sb: superblock being unmounted.
> >>- *
> >>- * Called during unmount with no locks held, so needs to be safe against
> >>- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
> >>- */
> >>-void fsnotify_unmount_inodes(struct super_block *sb)
> >>-{
> >>-	struct inode *inode, *next_i, *need_iput = NULL;
> >>-
> >>-	spin_lock(&sb->s_inode_list_lock);
> >>-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
> >>-		struct inode *need_iput_tmp;
> >>-
> >>-		/*
> >>-		 * We cannot __iget() an inode in state I_FREEING,
> >>-		 * I_WILL_FREE, or I_NEW which is fine because by that point
> >>-		 * the inode cannot have any associated watches.
> >>-		 */
> >>-		spin_lock(&inode->i_lock);
> >>-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> >>-			spin_unlock(&inode->i_lock);
> >>-			continue;
> >>-		}
> >>+static enum lru_status
> >>+fsnotify_unmount_inode(struct list_head *item, struct list_lru_one *lru,
> >>+		       spinlock_t *lock, void *cb_arg)
> >>+ {
> >>+	struct inode **toput_inode = cb_arg;
> >>+	struct inode *inode = container_of(item, struct inode, i_sb_list);
> >>+
> >>+	/* New or being freed inodes cannot have any associated watches. */
> >>+	spin_lock(&inode->i_lock);
> >>+	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> >>+		spin_unlock(&inode->i_lock);
> >>+		return LRU_SKIP;
> >>+	}
> >>
> >>-		/*
> >>-		 * If i_count is zero, the inode cannot have any watches and
> >>-		 * doing an __iget/iput with MS_ACTIVE clear would actually
> >>-		 * evict all inodes with zero i_count from icache which is
> >>-		 * unnecessarily violent and may in fact be illegal to do.
> >>-		 */
> >>-		if (!atomic_read(&inode->i_count)) {
> >>-			spin_unlock(&inode->i_lock);
> >>-			continue;
> >>-		}
> >>-
> >>-		need_iput_tmp = need_iput;
> >>-		need_iput = NULL;
> >>-
> >>-		/* In case fsnotify_inode_delete() drops a reference. */
> >>-		if (inode != need_iput_tmp)
> >>-			__iget(inode);
> >>-		else
> >>-			need_iput_tmp = NULL;
> >>+	/* If i_count is zero, the inode cannot have any watches */
> >>+	if (!atomic_read(&inode->i_count)) {
> >>  		spin_unlock(&inode->i_lock);
> >>+		return LRU_SKIP;
> >>+	}
> >>
> >>-		/* In case the dropping of a reference would nuke next_i. */
> >>-		while (&next_i->i_sb_list != &sb->s_inodes) {
> >>-			spin_lock(&next_i->i_lock);
> >>-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
> >>-						atomic_read(&next_i->i_count)) {
> >>-				__iget(next_i);
> >>-				need_iput = next_i;
> >>-				spin_unlock(&next_i->i_lock);
> >>-				break;
> >>-			}
> >>-			spin_unlock(&next_i->i_lock);
> >>-			next_i = list_entry(next_i->i_sb_list.next,
> >>-						struct inode, i_sb_list);
> >>-		}
> >>+	__iget(inode);
> >>+	spin_unlock(&inode->i_lock);
> >>+	spin_unlock(lock);
> >>
> >>-		/*
> >>-		 * We can safely drop s_inode_list_lock here because either
> >>-		 * we actually hold references on both inode and next_i or
> >>-		 * end of list.  Also no new inodes will be added since the
> >>-		 * umount has begun.
> >>-		 */
> >>-		spin_unlock(&sb->s_inode_list_lock);
> >>+	iput(*toput_inode);
> >>+	*toput_inode = inode;
> >>
> >>-		if (need_iput_tmp)
> >>-			iput(need_iput_tmp);
> >>+	/* for each watch, send FS_UNMOUNT and then remove it */
> >>+	fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> >>+	fsnotify_inode_delete(inode);
> >>
> >>-		/* for each watch, send FS_UNMOUNT and then remove it */
> >>-		fsnotify(inode, FS_UNMOUNT, inode, FSNOTIFY_EVENT_INODE, NULL, 0);
> >>+	/*
> >>+	 * Even though we have dropped the lock here, we can return LRU_SKIP as
> >>+	 * we have a reference to the current inode and so it's next pointer is
> >>+	 * guaranteed to be valid even though we dropped the list lock.
> >>+	 */
> >>+	spin_lock(lock);
> >>+	return LRU_SKIP;
> >>+}
> >   So in the original code of fsnotify_unmount_inodes() you can see the
> >hoops you have to jump through when having to iterate inode list with
> >list_for_each_entry_safe(). Deleting that code without deeper thought
> >wasn't a good idea ;).
> >
> >As a side note, it should be possible to convert this code to use just
> >list_for_each() and avoid all that crap, but it will require me to delve into
> >fsnotify magic again to verify it's correct. I guess I'll leave that until after
> >your patches are queued so that I don't make life harder for you.
> >
> 
> Well these two cases are just bugs; we are supposed to return
> LRU_RETRY every time we drop the lru list lock so that we loop back to
> the beginning. But I agree with your other points. I'll just drop
> this one, find/create a batched per-cpu list to use instead, and
> do that separately so we can still get the meat of this patchset in.
  I don't think you can just return LRU_RETRY. That way e.g. iterating
all bdevs to wait for IO would become an infinite loop. The trouble with
inode iteration is that we need to iterate all of them, possibly sleeping
on each inode, but we don't remove the inode from the list (at least not in
all callers), so forward progress isn't guaranteed by that.
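
To illustrate, here is a simplified model of the walk's control flow (not
the actual __list_lru_walk_one() code, just the shape of it):

static unsigned long
walk_model(struct list_head *head, spinlock_t *lock,
	   enum lru_status (*isolate)(struct list_head *item, spinlock_t *lock))
{
	struct list_head *item, *n;
	unsigned long isolated = 0;

	spin_lock(lock);
restart:
	list_for_each_safe(item, n, head) {
		switch (isolate(item, lock)) {
		case LRU_REMOVED:
			isolated++;
			break;
		case LRU_RETRY:
			/*
			 * Start over from the head of the list.  If the
			 * callback only pins inodes and never removes them,
			 * the walk keeps revisiting the same entries and
			 * never terminates.
			 */
			goto restart;
		default:	/* LRU_SKIP, LRU_ROTATE, ... */
			break;
		}
	}
	spin_unlock(lock);
	return isolated;
}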

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/9] Sync and VFS scalability improvements
  2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
                   ` (9 preceding siblings ...)
  2015-03-16 11:39 ` [PATCH 0/9] Sync and VFS scalability improvements Jan Kara
@ 2015-03-25 11:18 ` Mel Gorman
  10 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2015-03-25 11:18 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel

On Tue, Mar 10, 2015 at 03:45:15PM -0400, Josef Bacik wrote:
> These are patches that Dave Chinner wrote two years ago that are still very much
> needed today.  I recently ran into a problem where I had millions of inodes that
> needed to be evicted at unmount time and it soft locked up the box and kept any
> other file system from doing work.  These patches fix this problem by breaking
> the inode_sb_list_lock into per-sb, and then converting the per sb inode list
> into a list_lru for better scalability.
> 

I'm not reviewing this series; I just heard it mentioned at LSF/MM and again
on a DAX thread and am taking a quick glance out of curiosity. FWIW, I saw
the lockup-on-umount warning every time I ran Dave's xfsrepair workload when
unmounting after running fsmark. I didn't report it because Dave mentioned
at Vault that it was a known problem and someone was looking at it already.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-03-25 11:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-10 19:45 [PATCH 0/9] Sync and VFS scalability improvements Josef Bacik
2015-03-10 19:45 ` [PATCH 1/9] writeback: plug writeback at a high level Josef Bacik
2015-03-10 19:45 ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
2015-03-12  9:52   ` Al Viro
2015-03-12 12:18     ` [PATCH] inode: add hlist_fake to avoid the " Josef Bacik
2015-03-12 12:20     ` [PATCH] inode: add hlist_fake to avoid the inode hash lock in evict V2 Josef Bacik
2015-03-14  7:00       ` Jan Kara
2015-03-12 12:24     ` [PATCH 2/9] inode: add IOP_NOTHASHED to avoid inode hash lock in evict Josef Bacik
2015-03-10 19:45 ` [PATCH 3/9] inode: convert inode_sb_list_lock to per-sb Josef Bacik
2015-03-10 19:45 ` [PATCH 4/9] sync: serialise per-superblock sync operations Josef Bacik
2015-03-10 19:45 ` [PATCH 5/9] inode: rename i_wb_list to i_io_list Josef Bacik
2015-03-10 19:45 ` [PATCH 6/9] bdi: add a new writeback list for sync Josef Bacik
2015-03-16 10:14   ` Jan Kara
2015-03-10 19:45 ` [PATCH 7/9] writeback: periodically trim the writeback list Josef Bacik
2015-03-16 10:16   ` Jan Kara
2015-03-16 11:43     ` Jan Kara
2015-03-10 19:45 ` [PATCH 8/9] inode: convert per-sb inode list to a list_lru Josef Bacik
2015-03-16 12:27   ` Jan Kara
2015-03-16 15:34     ` Josef Bacik
2015-03-16 15:48       ` Jan Kara
2015-03-10 19:45 ` [PATCH 9/9] inode: don't softlockup when evicting inodes Josef Bacik
2015-03-16 12:31   ` Jan Kara
2015-03-16 11:39 ` [PATCH 0/9] Sync and VFS scalability improvements Jan Kara
2015-03-25 11:18 ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).