linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
@ 2015-07-22 21:15 Oleg Nesterov
  2015-07-22 21:15 ` [PATCH 1/4] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:15 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

On top of the "[PATCH 0/4] sb_write: lockdep fixes/cleanups" series.
Now that it has been reviewed (thanks Jan!), let me send the actual
conversion.

Patches 1-2 add the simple percpu_rw_semaphore changes; these do
not conflict with the pending rcu_sync changes.

3/4 is really ugly, but please see the changelog; it is a
temporary kludge to avoid problems with other percpu-rwsem
changes routed via another tree.

4/4 looks simple and straightforward after the previous series.

Testing. Well, so far I only verified that ioctl(FIFREEZE) +
ioctl(FITHAW) seems to work "as expected" on my testing machine
with ext3, so this probably needs more testing. I will try to do
that later. And after that we can hopefully remove the "trylock"
hack in __sb_start_write(); this series doesn't remove it.
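For reference, the FIFREEZE/FITHAW smoke test mentioned above can be driven by a small helper along these lines. FIFREEZE and FITHAW are the real ioctl numbers from <linux/fs.h>; the helper itself is hypothetical, needs root, and needs a filesystem that supports freezing:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <unistd.h>

/* freeze_thaw: freeze and immediately thaw the filesystem mounted
 * at `mnt`.  Returns 0 on success, -1 on any failure (no permission,
 * bad path, or a filesystem that does not support freezing). */
int freeze_thaw(const char *mnt)
{
	int fd = open(mnt, O_RDONLY);
	if (fd < 0)
		return -1;

	if (ioctl(fd, FIFREEZE, 0) < 0) {
		close(fd);
		return -1;
	}

	int ret = ioctl(fd, FITHAW, 0) < 0 ? -1 : 0;
	close(fd);
	return ret;
}
```

Running it in a loop against a scratch mount is one way to stress the freeze_super()/thaw_super() paths this series touches.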

But I will be travelling till the end of next week, and I'm not
sure I will have internet access. So let me apologize in advance
if (most probably) I won't be able to reply until I return.

Please review.

Oleg.

 fs/super.c                    |  134 +++++++++++++++--------------------------
 include/linux/fs.h            |   22 +++----
 include/linux/percpu-rwsem.h  |   20 ++++++
 kernel/locking/percpu-rwsem.c |   13 ++++
 4 files changed, 89 insertions(+), 100 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/4] percpu-rwsem: introduce percpu_down_read_trylock()
  2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-07-22 21:15 ` Oleg Nesterov
  2015-07-22 21:15 ` [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:15 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Add percpu_down_read_trylock(); it will have a user soon.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h  |    1 +
 kernel/locking/percpu-rwsem.c |   13 +++++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3e58226..3ebf982 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -17,6 +17,7 @@ struct percpu_rw_semaphore {
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
 extern void percpu_up_read(struct percpu_rw_semaphore *);
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 2c54c64..4bc2127 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -73,6 +73,19 @@ void percpu_down_read(struct percpu_rw_semaphore *brw)
 	__up_read(&brw->rw_sem);
 }
 
+int percpu_down_read_trylock(struct percpu_rw_semaphore *brw)
+{
+	if (unlikely(!update_fast_ctr(brw, +1))) {
+		if (!__down_read_trylock(&brw->rw_sem))
+			return 0;
+		atomic_inc(&brw->slow_read_ctr);
+		__up_read(&brw->rw_sem);
+	}
+
+	rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 1, _RET_IP_);
+	return 1;
+}
+
 void percpu_up_read(struct percpu_rw_semaphore *brw)
 {
 	rwsem_release(&brw->rw_sem.dep_map, 1, _RET_IP_);
-- 
1.5.5.1
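The fast/slow split above can be sketched as a userspace model. All names below are hypothetical stand-ins, not the kernel API: the writer-pending flag replaces the real fast-path gating, and the slow path mirrors the structure of percpu_down_read_trylock() — take the regular rwsem briefly, account the reader in slow_read_ctr so the writer can wait for it, and drop the rwsem again:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model -- not the kernel's data structure. */
struct model_percpu_rwsem {
	atomic_int writer_active;  /* stands in for rw_sem held for write */
	atomic_int writer_pending; /* fast path disabled while nonzero    */
	atomic_int fast_ctr;       /* stands in for the per-CPU counters  */
	atomic_int slow_read_ctr;
};

/* Fast path: only usable while no writer is pending. */
static bool update_fast_ctr(struct model_percpu_rwsem *brw, int val)
{
	if (atomic_load(&brw->writer_pending))
		return false;                   /* fall back to slow path */
	atomic_fetch_add(&brw->fast_ctr, val);
	return true;
}

/* Slow path "trylock" on the regular rwsem. */
static bool down_read_try_slow(struct model_percpu_rwsem *brw)
{
	return !atomic_load(&brw->writer_active);
}

/* Mirrors percpu_down_read_trylock(): fast path first; otherwise try
 * the rwsem, bump slow_read_ctr, and conceptually drop the rwsem. */
int model_down_read_trylock(struct model_percpu_rwsem *brw)
{
	if (!update_fast_ctr(brw, +1)) {
		if (!down_read_try_slow(brw))
			return 0;
		atomic_fetch_add(&brw->slow_read_ctr, 1);
	}
	return 1;
}
```

The return convention matches the patch: 1 on success, 0 if the writer currently holds the semaphore.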



* [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-07-22 21:15 ` [PATCH 1/4] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
@ 2015-07-22 21:15 ` Oleg Nesterov
  2015-07-31 10:20   ` Peter Zijlstra
  2015-07-22 21:15 ` [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:15 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Add percpu_rwsem_release() and percpu_rwsem_acquire() for users
which need to return to userspace with the percpu-rwsem lock held
and/or pass its ownership to another thread.

TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
the non-inline percpu_rwsem_clear_owner().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3ebf982..06af654 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -33,4 +33,23 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
 })
 
+
+#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
+
+static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
+					bool read, unsigned long ip)
+{
+	lock_release(&sem->rw_sem.dep_map, 1, ip);
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	if (!read)
+		sem->rw_sem.owner = NULL;
+#endif
+}
+
+static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
+					bool read, unsigned long ip)
+{
+	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
+}
+
 #endif
-- 
1.5.5.1
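A toy userspace model of the intended usage (hypothetical names throughout; the lockdep dep_map bookkeeping is reduced to a plain owner field). The point is that release/acquire only move the *ownership annotation*; the lock itself stays held across the handoff, exactly what FIFREEZE needs when it returns to userspace with the write locks taken:

```c
#include <stdbool.h>

struct model_sem {
	bool write_locked;  /* the real lock state: survives the handoff   */
	int  owner;         /* stands in for the lockdep dep_map ownership */
};

void model_down_write(struct model_sem *s, int task)
{
	s->write_locked = true;
	s->owner = task;
}

/* percpu_rwsem_release() analogue: drop the lockdep claim (and, for
 * writers, rw_sem.owner) without touching the lock itself. */
void model_release(struct model_sem *s)
{
	s->owner = 0;
}

/* percpu_rwsem_acquire() analogue: a (possibly different) task
 * re-asserts ownership of the still-held lock before unlocking. */
void model_acquire(struct model_sem *s, int task)
{
	s->owner = task;
}

void model_up_write(struct model_sem *s)
{
	s->write_locked = false;
	s->owner = 0;
}
```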



* [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work()
  2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-07-22 21:15 ` [PATCH 1/4] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
  2015-07-22 21:15 ` [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
@ 2015-07-22 21:15 ` Oleg Nesterov
  2015-07-28  8:36   ` Jan Kara
  2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-08-07 19:55 ` [PATCH 0/4] " Oleg Nesterov
  4 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:15 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.

We do not even need this change right now; percpu_free_rwsem()
is fine in atomic context. But we are going to change that: it
will might_sleep() after we merge the rcu_sync patches.

And even after that we do not really need destroy_super_work();
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.

So this is just a temporary kludge which helps us avoid
conflicts with the changes which will (hopefully) be routed via
the rcu tree.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c         |   23 +++++++++++++++++++----
 include/linux/fs.h |    3 ++-
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b4db3ee..886fddf 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -135,6 +135,24 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	return total_objects;
 }
 
+static void destroy_super_work(struct work_struct *work)
+{
+	struct super_block *s = container_of(work, struct super_block,
+							destroy_work);
+	int i;
+
+	for (i = 0; i < SB_FREEZE_LEVELS; i++)
+		percpu_counter_destroy(&s->s_writers.counter[i]);
+	kfree(s);
+}
+
+static void destroy_super_rcu(struct rcu_head *head)
+{
+	struct super_block *s = container_of(head, struct super_block, rcu);
+	INIT_WORK(&s->destroy_work, destroy_super_work);
+	schedule_work(&s->destroy_work);
+}
+
 /**
  *	destroy_super	-	frees a superblock
  *	@s: superblock to free
@@ -143,16 +161,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
  */
 static void destroy_super(struct super_block *s)
 {
-	int i;
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
-	for (i = 0; i < SB_FREEZE_LEVELS; i++)
-		percpu_counter_destroy(&s->s_writers.counter[i]);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
 	kfree(s->s_options);
-	kfree_rcu(s, rcu);
+	call_rcu(&s->rcu, destroy_super_rcu);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 78ac768..6addccc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/workqueue.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1346,7 +1347,7 @@ struct super_block {
 	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 	struct rcu_head		rcu;
-
+	struct work_struct	destroy_work;
 	/*
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
-- 
1.5.5.1
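The call_rcu() -> schedule_work() bounce introduced here can be sketched in userspace (all names below are hypothetical stand-ins). The only point being modeled is that the RCU-callback stage runs in a context that must not sleep, so it merely queues a work item; the workqueue stage then runs the destructor that will be allowed to sleep once percpu_free_rwsem() becomes sleeping:

```c
#include <stddef.h>

typedef void (*work_fn)(void *);

struct work_item { work_fn fn; void *arg; };

static struct work_item pending_work;
static int work_queued;

/* schedule_work() stand-in: just record the item for later. */
static void model_schedule_work(work_fn fn, void *arg)
{
	pending_work = (struct work_item){ fn, arg };
	work_queued = 1;
}

static int destructor_ran;

/* destroy_super_work() analogue: the part that may sleep.
 * In the patch this is the percpu_counter_destroy() loop + kfree(). */
static void model_destroy_work(void *arg)
{
	(void)arg;
	destructor_ran = 1;
}

/* destroy_super_rcu() analogue: atomic context, must not sleep,
 * so it only hands the heavy lifting to the workqueue. */
static void model_destroy_rcu(void *arg)
{
	model_schedule_work(model_destroy_work, arg);
}

/* Drive one "grace period": run the RCU callback, then the work. */
static void model_run(void)
{
	model_destroy_rcu(NULL);
	if (work_queued) {
		work_queued = 0;
		pending_work.fn(pending_work.arg);
	}
}
```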



* [PATCH 4/4] change sb_writers to use percpu_rw_semaphore
  2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (2 preceding siblings ...)
  2015-07-22 21:15 ` [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
@ 2015-07-22 21:15 ` Oleg Nesterov
  2015-07-22 21:34   ` Oleg Nesterov
                     ` (2 more replies)
  2015-08-07 19:55 ` [PATCH 0/4] " Oleg Nesterov
  4 siblings, 3 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:15 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.

This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.

This change tries to address the following problems:

	- Firstly, __sb_start_write() looks simply buggy. It does
	  __sb_end_write() if it sees ->frozen, but if it migrates
	  to another CPU before percpu_counter_dec(), sb_wait_write()
	  can wrongly succeed if there is another task which holds
	  the same "semaphore": sb_wait_write() can miss the result
	  of the previous percpu_counter_inc() but see the result
	  of this percpu_counter_dec().

	- As Dave Hansen reports, it is suboptimal. The trivial
	  microbenchmark that writes to a tmpfs file in a loop runs
	  12% faster if we change this code to rely on RCU and kill
	  the memory barriers.

	- This code doesn't look simple. It would be better to rely
	  on the generic locking code.

	  According to Dave, this change adds the same performance
	  improvement.
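The interleaving described in the first bullet can be replayed deterministically in a userspace sketch, with the per-CPU counters as a two-slot array and percpu_counter_sum() as a loop that reads the slots in order (the helper name is hypothetical):

```c
/* T1 holds write access (inc'd on CPU 1).  T2 enters
 * __sb_start_write() after ->frozen is set, increments on CPU 0,
 * migrates, and does the compensating __sb_end_write() decrement on
 * CPU 1.  The freezer's sum misses T2's increment but sees T2's
 * decrement -- and thus cancels T1's increment by mistake. */
long replay_lost_writer(void)
{
	long ctr[2] = { 0, 0 };  /* per-CPU counter slots */
	long sum = 0;

	ctr[1]++;        /* T1: __sb_start_write(), inc on CPU 1       */
	sum += ctr[0];   /* freezer: sum reads CPU 0 slot -> 0         */
	ctr[0]++;        /* T2: inc on CPU 0 (missed by the sum above) */
	ctr[1]--;        /* T2: migrates, __sb_end_write() dec on CPU 1 */
	sum += ctr[1];   /* freezer: sum reads CPU 1 slot -> 0         */

	return sum;      /* 0: sb_wait_write() "succeeds" under T1     */
}
```

A percpu_rw_semaphore cannot lose the reader this way, which is exactly why the conversion also fixes the bug.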

Perhaps we should also clean up the usage of ->frozen. It would be
better to set/clear (say) SB_FREEZE_WRITE with the corresponding
write-lock held. Currently freeze_super() has to set SB_FREEZE_WRITE
before sb_wait_write(SB_FREEZE_WRITE) to avoid the race with itself;
we can add another state. The "From now on, no new normal writers
can start" comment removed by this patch was not really correct.

Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:

	- This will be "fixed" by the rcu_sync changes we are going
	  to merge. After that freeze_super()->percpu_down_write()
	  will use synchronize_sched(), and thaw_super() won't use
	  synchronize() at all.

	  This doesn't need any changes in fs/super.c.

	- Once we merge the rcu_sync changes, we can also change super.c
	  so that all sb_writers->rw_sem's will share the single ->rss
	  in struct sb_writers; then freeze_super() will need only one
	  synchronize_sched().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c         |  113 ++++++++++++++--------------------------------------
 include/linux/fs.h |   19 +++------
 2 files changed, 36 insertions(+), 96 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 886fddf..29b811b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -142,7 +142,7 @@ static void destroy_super_work(struct work_struct *work)
 	int i;
 
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
-		percpu_counter_destroy(&s->s_writers.counter[i]);
+		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
 	kfree(s);
 }
 
@@ -193,13 +193,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		goto fail;
 
 	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
-		if (percpu_counter_init(&s->s_writers.counter[i], 0,
-					GFP_KERNEL) < 0)
+		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
+					sb_writers_name[i],
+					&type->s_writers_key[i]))
 			goto fail;
-		lockdep_init_map(&s->s_writers.lock_map[i], sb_writers_name[i],
-				 &type->s_writers_key[i], 0);
 	}
-	init_waitqueue_head(&s->s_writers.wait);
 	init_waitqueue_head(&s->s_writers.wait_unfrozen);
 	s->s_bdi = &noop_backing_dev_info;
 	s->s_flags = flags;
@@ -1161,47 +1159,10 @@ out:
  */
 void __sb_end_write(struct super_block *sb, int level)
 {
-	percpu_counter_dec(&sb->s_writers.counter[level-1]);
-	/*
-	 * Make sure s_writers are updated before we wake up waiters in
-	 * freeze_super().
-	 */
-	smp_mb();
-	if (waitqueue_active(&sb->s_writers.wait))
-		wake_up(&sb->s_writers.wait);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
+	percpu_up_read(sb->s_writers.rw_sem + level-1);
 }
 EXPORT_SYMBOL(__sb_end_write);
 
-static int do_sb_start_write(struct super_block *sb, int level, bool wait,
-				unsigned long ip)
-{
-	if (wait)
-		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
-retry:
-	if (unlikely(sb->s_writers.frozen >= level)) {
-		if (!wait)
-			return 0;
-		wait_event(sb->s_writers.wait_unfrozen,
-			   sb->s_writers.frozen < level);
-	}
-
-	percpu_counter_inc(&sb->s_writers.counter[level-1]);
-	/*
-	 * Make sure counter is updated before we check for frozen.
-	 * freeze_super() first sets frozen and then checks the counter.
-	 */
-	smp_mb();
-	if (unlikely(sb->s_writers.frozen >= level)) {
-		__sb_end_write(sb, level);
-		goto retry;
-	}
-
-	if (!wait)
-		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
-	return 1;
-}
-
 /*
  * This is an internal function, please use sb_start_{write,pagefault,intwrite}
  * instead.
@@ -1209,7 +1170,7 @@ retry:
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
 	bool force_trylock = false;
-	int ret;
+	int ret = 1;
 
 #ifdef CONFIG_LOCKDEP
 	/*
@@ -1225,13 +1186,17 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
 		int i;
 
 		for (i = 0; i < level - 1; i++)
-			if (lock_is_held(&sb->s_writers.lock_map[i])) {
+			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
 				force_trylock = true;
 				break;
 			}
 	}
 #endif
-	ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
+	if (wait && !force_trylock)
+		percpu_down_read(sb->s_writers.rw_sem + level-1);
+	else
+		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
+
 	WARN_ON(force_trylock & !ret);
 	return ret;
 }
@@ -1249,25 +1214,7 @@ EXPORT_SYMBOL(__sb_start_write);
  */
 static void sb_wait_write(struct super_block *sb, int level)
 {
-	s64 writers;
-
-	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
-
-	do {
-		DEFINE_WAIT(wait);
-		/*
-		 * We use a barrier in prepare_to_wait() to separate setting
-		 * of frozen and checking of the counter
-		 */
-		prepare_to_wait(&sb->s_writers.wait, &wait,
-				TASK_UNINTERRUPTIBLE);
-
-		writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
-		if (writers)
-			schedule();
-
-		finish_wait(&sb->s_writers.wait, &wait);
-	} while (writers);
+	percpu_down_write(sb->s_writers.rw_sem + level-1);
 }
 
 static void sb_freeze_release(struct super_block *sb)
@@ -1275,7 +1222,7 @@ static void sb_freeze_release(struct super_block *sb)
 	int level;
 	/* Avoid the warning from lockdep_sys_exit() */
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
 }
 
 static void sb_freeze_acquire(struct super_block *sb)
@@ -1283,7 +1230,15 @@ static void sb_freeze_acquire(struct super_block *sb)
 	int level;
 
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		rwsem_acquire(sb->s_writers.lock_map + level, 0, 1, _THIS_IP_);
+		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+static void sb_freeze_unlock(struct super_block *sb)
+{
+	int level;
+
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		percpu_up_write(sb->s_writers.rw_sem + level);
 }
 
 /**
@@ -1342,20 +1297,14 @@ int freeze_super(struct super_block *sb)
 		return 0;
 	}
 
-	/* From now on, no new normal writers can start */
 	sb->s_writers.frozen = SB_FREEZE_WRITE;
-	smp_wmb();
-
 	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
 	up_write(&sb->s_umount);
-
 	sb_wait_write(sb, SB_FREEZE_WRITE);
+	down_write(&sb->s_umount);
 
 	/* Now we go and block page faults... */
-	down_write(&sb->s_umount);
 	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
-	smp_wmb();
-
 	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
 
 	/* All writers are done so after syncing there won't be dirty data */
@@ -1363,7 +1312,6 @@ int freeze_super(struct super_block *sb)
 
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
-	smp_wmb();
 	sb_wait_write(sb, SB_FREEZE_FS);
 
 	if (sb->s_op->freeze_fs) {
@@ -1372,9 +1320,8 @@ int freeze_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
 			sb->s_writers.frozen = SB_UNFROZEN;
-			smp_wmb();
+			sb_freeze_unlock(sb);
 			wake_up(&sb->s_writers.wait_unfrozen);
-			sb_freeze_release(sb);
 			deactivate_locked_super(sb);
 			return ret;
 		}
@@ -1406,8 +1353,10 @@ int thaw_super(struct super_block *sb)
 		return -EINVAL;
 	}
 
-	if (sb->s_flags & MS_RDONLY)
+	if (sb->s_flags & MS_RDONLY) {
+		sb->s_writers.frozen = SB_UNFROZEN;
 		goto out;
+	}
 
 	sb_freeze_acquire(sb);
 
@@ -1422,13 +1371,11 @@ int thaw_super(struct super_block *sb)
 		}
 	}
 
-	sb_freeze_release(sb);
-out:
 	sb->s_writers.frozen = SB_UNFROZEN;
-	smp_wmb();
+	sb_freeze_unlock(sb);
+out:
 	wake_up(&sb->s_writers.wait_unfrozen);
 	deactivate_locked_super(sb);
-
 	return 0;
 }
 EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6addccc..fb2cb4a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1,7 +1,6 @@
 #ifndef _LINUX_FS_H
 #define _LINUX_FS_H
 
-
 #include <linux/linkage.h>
 #include <linux/wait.h>
 #include <linux/kdev_t.h>
@@ -31,6 +30,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1247,16 +1247,9 @@ enum {
 #define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
 
 struct sb_writers {
-	/* Counters for counting writers at each level */
-	struct percpu_counter	counter[SB_FREEZE_LEVELS];
-	wait_queue_head_t	wait;		/* queue for waiting for
-						   writers / faults to finish */
-	int			frozen;		/* Is sb frozen? */
-	wait_queue_head_t	wait_unfrozen;	/* queue for waiting for
-						   sb to be thawed */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map	lock_map[SB_FREEZE_LEVELS];
-#endif
+	int				frozen;		/* Is sb frozen? */
+	wait_queue_head_t		wait_unfrozen;	/* for get_super_thawed() */
+	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };
 
 struct super_block {
@@ -1364,9 +1357,9 @@ void __sb_end_write(struct super_block *sb, int level);
 int __sb_start_write(struct super_block *sb, int level, bool wait);
 
 #define __sb_acquire_write(sb, lev)	\
-	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
+	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
 #define __sb_release_write(sb, lev)	\
-	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
 
 /**
  * sb_end_write - drop write access to a superblock
-- 
1.5.5.1
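As a rough sketch of the locking shape after this patch (hypothetical names; the per-level percpu_rw_semaphores are reduced to plain flags): sb_wait_write() becomes a per-level write lock taken in SB_FREEZE_WRITE -> SB_FREEZE_PAGEFAULT -> SB_FREEZE_FS order, sb_freeze_unlock() drops all levels, and a writer at some level gets in only while that level is not write-locked:

```c
enum { MODEL_FREEZE_LEVELS = 3 };

struct model_sb {
	int write_locked[MODEL_FREEZE_LEVELS]; /* 1: held for write */
};

/* freeze_super() analogue: take every level for write, in order. */
void model_freeze(struct model_sb *sb)
{
	for (int level = 0; level < MODEL_FREEZE_LEVELS; level++)
		sb->write_locked[level] = 1;   /* sb_wait_write(sb, level+1) */
}

/* sb_freeze_unlock() analogue: drop all levels. */
void model_thaw(struct model_sb *sb)
{
	for (int level = 0; level < MODEL_FREEZE_LEVELS; level++)
		sb->write_locked[level] = 0;   /* percpu_up_write() per level */
}

/* __sb_start_write() analogue: a writer at `level` (1-based, like
 * SB_FREEZE_WRITE) succeeds only if that level is not frozen. */
int model_start_write(const struct model_sb *sb, int level)
{
	return !sb->write_locked[level - 1];
}
```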



* Re: [PATCH 4/4] change sb_writers to use percpu_rw_semaphore
  2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-07-22 21:34   ` Oleg Nesterov
  2015-07-28  8:34   ` Jan Kara
  2015-08-07 19:54   ` Oleg Nesterov
  2 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-07-22 21:34 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Sorry for the noise, but let me say this just in case...

On 07/22, Oleg Nesterov wrote:
>
> Perhaps we should also cleanup the usage of ->frozen. It would be
> better to set/clear (say) SB_FREEZE_WRITE with the corresponding
> write-lock held. Currently freeze_super() has to set SB_FREEZE_WRITE
> before sb_wait_write(SB_FREEZE_WRITE) to avoid the race with itself,

"Currently" means "after this change". Before this change we obviously
need to increment ->frozen before sb_wait_write().

> The "From now on, no new normal writers
> can start" comment removed by this patch was not really correct.

Yes, it was confusing even without this change.

Oleg.



* Re: [PATCH 4/4] change sb_writers to use percpu_rw_semaphore
  2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-07-22 21:34   ` Oleg Nesterov
@ 2015-07-28  8:34   ` Jan Kara
  2015-08-03 17:30     ` Oleg Nesterov
  2015-08-07 19:54   ` Oleg Nesterov
  2 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2015-07-28  8:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Wed 22-07-15 23:15:41, Oleg Nesterov wrote:
> We can remove everything from struct sb_writers except frozen
> and add the array of percpu_rw_semaphore's instead.
> 
> This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
> it for get_super_thawed(). We will probably remove it later.

Yeah, that would be a nice cleanup.
 
> This change tries to address the following problems:
> 
> 	- Firstly, __sb_start_write() looks simply buggy. It does
> 	  __sb_end_write() if it sees ->frozen, but if it migrates
> 	  to another CPU before percpu_counter_dec(), sb_wait_write()
> 	  can wrongly succeed if there is another task which holds
> 	  the same "semaphore": sb_wait_write() can miss the result
> 	  of the previous percpu_counter_inc() but see the result
> 	  of this percpu_counter_dec().
> 
> 	- As Dave Hansen reports, it is suboptimal. The trivial
> 	  microbenchmark that writes to a tmpfs file in a loop runs
> 	  12% faster if we change this code to rely on RCU and kill
> 	  the memory barriers.
> 
> 	- This code doesn't look simple. It would be better to rely
> 	  on the generic locking code.
> 
> 	  According to Dave, this change adds the same performance
> 	  improvement.
> 
> Perhaps we should also cleanup the usage of ->frozen. It would be
> better to set/clear (say) SB_FREEZE_WRITE with the corresponding
> write-lock held. Currently freeze_super() has to set SB_FREEZE_WRITE
> before sb_wait_write(SB_FREEZE_WRITE) to avoid the race with itself,
> we can add another state. The "From now on, no new normal writers
> can start" comment removed by this patch was not really correct.

The patch looks good, just one question: Why wasn't the above comment
really correct? Do you mean it wouldn't be correct after your changes? I
agree with that.

Also, when you'd like to "clean up the usage of ->frozen", you have to be
careful not only about races with freeze_super() itself but also about races
with remount (that's one of the reasons why we use s_umount for protecting
modifications of ->frozen). So I'm not sure how much we can actually
improve the code readability...

Anyway, you can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> Note: with this change both freeze_super() and thaw_super() will do
> synchronize_sched_expedited() 3 times. This is just ugly. But:
> 
> 	- This will be "fixed" by the rcu_sync changes we are going
> 	  to merge. After that freeze_super()->percpu_down_write()
> 	  will use synchronize_sched(), and thaw_super() won't use
> 	  synchronize() at all.
> 
> 	  This doesn't need any changes in fs/super.c.
> 
> 	- Once we merge the rcu_sync changes, we can also change super.c
> 	  so that all sb_writers->rw_sem's will share the single ->rss
> 	  in struct sb_writers; then freeze_super() will need only one
> 	  synchronize_sched().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/super.c         |  113 ++++++++++++++--------------------------------------
>  include/linux/fs.h |   19 +++------
>  2 files changed, 36 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 886fddf..29b811b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -142,7 +142,7 @@ static void destroy_super_work(struct work_struct *work)
>  	int i;
>  
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> -		percpu_counter_destroy(&s->s_writers.counter[i]);
> +		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
>  	kfree(s);
>  }
>  
> @@ -193,13 +193,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  		goto fail;
>  
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
> -		if (percpu_counter_init(&s->s_writers.counter[i], 0,
> -					GFP_KERNEL) < 0)
> +		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
> +					sb_writers_name[i],
> +					&type->s_writers_key[i]))
>  			goto fail;
> -		lockdep_init_map(&s->s_writers.lock_map[i], sb_writers_name[i],
> -				 &type->s_writers_key[i], 0);
>  	}
> -	init_waitqueue_head(&s->s_writers.wait);
>  	init_waitqueue_head(&s->s_writers.wait_unfrozen);
>  	s->s_bdi = &noop_backing_dev_info;
>  	s->s_flags = flags;
> @@ -1161,47 +1159,10 @@ out:
>   */
>  void __sb_end_write(struct super_block *sb, int level)
>  {
> -	percpu_counter_dec(&sb->s_writers.counter[level-1]);
> -	/*
> -	 * Make sure s_writers are updated before we wake up waiters in
> -	 * freeze_super().
> -	 */
> -	smp_mb();
> -	if (waitqueue_active(&sb->s_writers.wait))
> -		wake_up(&sb->s_writers.wait);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
> +	percpu_up_read(sb->s_writers.rw_sem + level-1);
>  }
>  EXPORT_SYMBOL(__sb_end_write);
>  
> -static int do_sb_start_write(struct super_block *sb, int level, bool wait,
> -				unsigned long ip)
> -{
> -	if (wait)
> -		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
> -retry:
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> -		if (!wait)
> -			return 0;
> -		wait_event(sb->s_writers.wait_unfrozen,
> -			   sb->s_writers.frozen < level);
> -	}
> -
> -	percpu_counter_inc(&sb->s_writers.counter[level-1]);
> -	/*
> -	 * Make sure counter is updated before we check for frozen.
> -	 * freeze_super() first sets frozen and then checks the counter.
> -	 */
> -	smp_mb();
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> -		__sb_end_write(sb, level);
> -		goto retry;
> -	}
> -
> -	if (!wait)
> -		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
> -	return 1;
> -}
> -
>  /*
>   * This is an internal function, please use sb_start_{write,pagefault,intwrite}
>   * instead.
> @@ -1209,7 +1170,7 @@ retry:
>  int __sb_start_write(struct super_block *sb, int level, bool wait)
>  {
>  	bool force_trylock = false;
> -	int ret;
> +	int ret = 1;
>  
>  #ifdef CONFIG_LOCKDEP
>  	/*
> @@ -1225,13 +1186,17 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
>  		int i;
>  
>  		for (i = 0; i < level - 1; i++)
> -			if (lock_is_held(&sb->s_writers.lock_map[i])) {
> +			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
>  				force_trylock = true;
>  				break;
>  			}
>  	}
>  #endif
> -	ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
> +	if (wait && !force_trylock)
> +		percpu_down_read(sb->s_writers.rw_sem + level-1);
> +	else
> +		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
> +
>  	WARN_ON(force_trylock & !ret);
>  	return ret;
>  }
> @@ -1249,25 +1214,7 @@ EXPORT_SYMBOL(__sb_start_write);
>   */
>  static void sb_wait_write(struct super_block *sb, int level)
>  {
> -	s64 writers;
> -
> -	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> -
> -	do {
> -		DEFINE_WAIT(wait);
> -		/*
> -		 * We use a barrier in prepare_to_wait() to separate setting
> -		 * of frozen and checking of the counter
> -		 */
> -		prepare_to_wait(&sb->s_writers.wait, &wait,
> -				TASK_UNINTERRUPTIBLE);
> -
> -		writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
> -		if (writers)
> -			schedule();
> -
> -		finish_wait(&sb->s_writers.wait, &wait);
> -	} while (writers);
> +	percpu_down_write(sb->s_writers.rw_sem + level-1);
>  }
>  
>  static void sb_freeze_release(struct super_block *sb)
> @@ -1275,7 +1222,7 @@ static void sb_freeze_release(struct super_block *sb)
>  	int level;
>  	/* Avoid the warning from lockdep_sys_exit() */
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> -		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
> +		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
>  }
>  
>  static void sb_freeze_acquire(struct super_block *sb)
> @@ -1283,7 +1230,15 @@ static void sb_freeze_acquire(struct super_block *sb)
>  	int level;
>  
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> -		rwsem_acquire(sb->s_writers.lock_map + level, 0, 1, _THIS_IP_);
> +		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> +}
> +
> +static void sb_freeze_unlock(struct super_block *sb)
> +{
> +	int level;
> +
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		percpu_up_write(sb->s_writers.rw_sem + level);
>  }
>  
>  /**
> @@ -1342,20 +1297,14 @@ int freeze_super(struct super_block *sb)
>  		return 0;
>  	}
>  
> -	/* From now on, no new normal writers can start */
>  	sb->s_writers.frozen = SB_FREEZE_WRITE;
> -	smp_wmb();
> -
>  	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
>  	up_write(&sb->s_umount);
> -
>  	sb_wait_write(sb, SB_FREEZE_WRITE);
> +	down_write(&sb->s_umount);
>  
>  	/* Now we go and block page faults... */
> -	down_write(&sb->s_umount);
>  	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> -	smp_wmb();
> -
>  	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
>  
>  	/* All writers are done so after syncing there won't be dirty data */
> @@ -1363,7 +1312,6 @@ int freeze_super(struct super_block *sb)
>  
>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
> -	smp_wmb();
>  	sb_wait_write(sb, SB_FREEZE_FS);
>  
>  	if (sb->s_op->freeze_fs) {
> @@ -1372,9 +1320,8 @@ int freeze_super(struct super_block *sb)
>  			printk(KERN_ERR
>  				"VFS:Filesystem freeze failed\n");
>  			sb->s_writers.frozen = SB_UNFROZEN;
> -			smp_wmb();
> +			sb_freeze_unlock(sb);
>  			wake_up(&sb->s_writers.wait_unfrozen);
> -			sb_freeze_release(sb);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -1406,8 +1353,10 @@ int thaw_super(struct super_block *sb)
>  		return -EINVAL;
>  	}
>  
> -	if (sb->s_flags & MS_RDONLY)
> +	if (sb->s_flags & MS_RDONLY) {
> +		sb->s_writers.frozen = SB_UNFROZEN;
>  		goto out;
> +	}
>  
>  	sb_freeze_acquire(sb);
>  
> @@ -1422,13 +1371,11 @@ int thaw_super(struct super_block *sb)
>  		}
>  	}
>  
> -	sb_freeze_release(sb);
> -out:
>  	sb->s_writers.frozen = SB_UNFROZEN;
> -	smp_wmb();
> +	sb_freeze_unlock(sb);
> +out:
>  	wake_up(&sb->s_writers.wait_unfrozen);
>  	deactivate_locked_super(sb);
> -
>  	return 0;
>  }
>  EXPORT_SYMBOL(thaw_super);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6addccc..fb2cb4a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1,7 +1,6 @@
>  #ifndef _LINUX_FS_H
>  #define _LINUX_FS_H
>  
> -
>  #include <linux/linkage.h>
>  #include <linux/wait.h>
>  #include <linux/kdev_t.h>
> @@ -31,6 +30,7 @@
>  #include <linux/percpu-rwsem.h>
>  #include <linux/blk_types.h>
>  #include <linux/workqueue.h>
> +#include <linux/percpu-rwsem.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -1247,16 +1247,9 @@ enum {
>  #define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
>  
>  struct sb_writers {
> -	/* Counters for counting writers at each level */
> -	struct percpu_counter	counter[SB_FREEZE_LEVELS];
> -	wait_queue_head_t	wait;		/* queue for waiting for
> -						   writers / faults to finish */
> -	int			frozen;		/* Is sb frozen? */
> -	wait_queue_head_t	wait_unfrozen;	/* queue for waiting for
> -						   sb to be thawed */
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -	struct lockdep_map	lock_map[SB_FREEZE_LEVELS];
> -#endif
> +	int				frozen;		/* Is sb frozen? */
> +	wait_queue_head_t		wait_unfrozen;	/* for get_super_thawed() */
> +	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
>  };
>  
>  struct super_block {
> @@ -1364,9 +1357,9 @@ void __sb_end_write(struct super_block *sb, int level);
>  int __sb_start_write(struct super_block *sb, int level, bool wait);
>  
>  #define __sb_acquire_write(sb, lev)	\
> -	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
> +	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
>  #define __sb_release_write(sb, lev)	\
> -	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
> +	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
>  
>  /**
>   * sb_end_write - drop write access to a superblock
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work()
  2015-07-22 21:15 ` [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
@ 2015-07-28  8:36   ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2015-07-28  8:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Wed 22-07-15 23:15:38, Oleg Nesterov wrote:
> Of course, this patch is ugly as hell. It will be (partially)
> reverted later. We add it to ensure that other WIP changes in
> percpu_rw_semaphore won't break fs/super.c.
> 
> We do not even need this change right now, percpu_free_rwsem()
> is fine in atomic context. But we are going to change this, it
> will be might_sleep() after we merge the rcu_sync() patches.
> 
> And even after that we do not really need destroy_super_work(),
> we will kill it in any case. Instead, destroy_super_rcu() should
> just check that rss->cb_state == CB_IDLE and do call_rcu() again
> in the (very unlikely) case this is not true.
> 
> So this is just the temporary kludge which helps us to avoid the
> conflicts with the changes which will be (hopefully) routed via
> rcu tree.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

OK, looks acceptable to me. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> ---
>  fs/super.c         |   23 +++++++++++++++++++----
>  include/linux/fs.h |    3 ++-
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b4db3ee..886fddf 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -135,6 +135,24 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	return total_objects;
>  }
>  
> +static void destroy_super_work(struct work_struct *work)
> +{
> +	struct super_block *s = container_of(work, struct super_block,
> +							destroy_work);
> +	int i;
> +
> +	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> +		percpu_counter_destroy(&s->s_writers.counter[i]);
> +	kfree(s);
> +}
> +
> +static void destroy_super_rcu(struct rcu_head *head)
> +{
> +	struct super_block *s = container_of(head, struct super_block, rcu);
> +	INIT_WORK(&s->destroy_work, destroy_super_work);
> +	schedule_work(&s->destroy_work);
> +}
> +
>  /**
>   *	destroy_super	-	frees a superblock
>   *	@s: superblock to free
> @@ -143,16 +161,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>   */
>  static void destroy_super(struct super_block *s)
>  {
> -	int i;
>  	list_lru_destroy(&s->s_dentry_lru);
>  	list_lru_destroy(&s->s_inode_lru);
> -	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> -		percpu_counter_destroy(&s->s_writers.counter[i]);
>  	security_sb_free(s);
>  	WARN_ON(!list_empty(&s->s_mounts));
>  	kfree(s->s_subtype);
>  	kfree(s->s_options);
> -	kfree_rcu(s, rcu);
> +	call_rcu(&s->rcu, destroy_super_rcu);
>  }
>  
>  /**
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 78ac768..6addccc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -30,6 +30,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/percpu-rwsem.h>
>  #include <linux/blk_types.h>
> +#include <linux/workqueue.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -1346,7 +1347,7 @@ struct super_block {
>  	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
>  	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
>  	struct rcu_head		rcu;
> -
> +	struct work_struct	destroy_work;
>  	/*
>  	 * Indicates how deep in a filesystem stack this SB is
>  	 */
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  2015-07-22 21:15 ` [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
@ 2015-07-31 10:20   ` Peter Zijlstra
  2015-08-03 15:40     ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2015-07-31 10:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	linux-fsdevel, linux-kernel

On Wed, Jul 22, 2015 at 11:15:35PM +0200, Oleg Nesterov wrote:
> Add percpu_rwsem_release() and percpu_rwsem_acquire() for the users
> which need to return to userspace with percpu-rwsem lock held and/or
> pass the ownership to another thread.
> 
> TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
> either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
> the non-inline percpu_rwsem_clear_owner().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  include/linux/percpu-rwsem.h |   19 +++++++++++++++++++
>  1 files changed, 19 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 3ebf982..06af654 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -33,4 +33,23 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
>  	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
>  })
>  
> +
> +#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
> +
> +static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
> +					bool read, unsigned long ip)
> +{
> +	lock_release(&sem->rw_sem.dep_map, 1, ip);
> +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> +	if (!read)
> +		sem->rw_sem.owner = NULL;
> +#endif
> +}
> +
> +static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
> +					bool read, unsigned long ip)
> +{
> +	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
> +}

This is of course entirely vile.. Can't we open code that in the freezer
code? Having helpers here might give some people the impression that
this is a sane thing to do.

Also, when you do that in the freezer, put a big honking comment on it.



* Re: [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  2015-07-31 10:20   ` Peter Zijlstra
@ 2015-08-03 15:40     ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-08-03 15:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	linux-fsdevel, linux-kernel

On 07/31, Peter Zijlstra wrote:
>
> On Wed, Jul 22, 2015 at 11:15:35PM +0200, Oleg Nesterov wrote:
> > +
> > +static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
> > +					bool read, unsigned long ip)
> > +{
> > +	lock_release(&sem->rw_sem.dep_map, 1, ip);
> > +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> > +	if (!read)
> > +		sem->rw_sem.owner = NULL;
> > +#endif
> > +}
> > +
> > +static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
> > +					bool read, unsigned long ip)
> > +{
> > +	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
> > +}
>
> This is of course entirely vile.. Can't we open code that in the freezer
> code?

Move these helpers somewhere in fs/super.c ? I don't think we should.
All members in struct percpu_rw_semaphore are private, nobody outside
of percpu-rwsem.[ch] should touch ->rw_sem at least.

> Having helpers here might give some people the impression that
> this is a sane thing to do.

Yes, but this doesn't differ from other lockdep release/acquire helpers,
they should be used with care.

BTW, we also need these helpers to add the multi-writer support, but this
is almost off-topic right now.

> Also, when you do that in the freezer, put a big honking comment on it.

OK, I'll update the comment(s).

Oleg.



* Re: [PATCH 4/4] change sb_writers to use percpu_rw_semaphore
  2015-07-28  8:34   ` Jan Kara
@ 2015-08-03 17:30     ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-08-03 17:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

Hi Jan,

Thanks for your review and sorry for delay, I was on vacation.

On 07/28, Jan Kara wrote:
>
> On Wed 22-07-15 23:15:41, Oleg Nesterov wrote:
> >
> > Perhaps we should also cleanup the usage of ->frozen. It would be
> > better to set/clear (say) SB_FREEZE_WRITE with the corresponding
> > write-lock held. Currently freeze_super() has to set SB_FREEZE_WRITE
> > before sb_wait_write(SB_FREEZE_WRITE) to avoid the race with itself,
> > we can add another state. The "From now on, no new normal writers
> > can start" removed by this patch was not really correct.
>
> The patch looks good, just one question: Why wasn't the above comment
> really correct?

It is not that I think it was wrong, just not 100% accurate even before
this change. "s_writers.frozen = SB_FREEZE_WRITE" itself can't guarantee
that "no new normal writers can start". We do not know when other CPUs
will see the result of this STORE.

> Do you mean it wouldn't be correct after your changes? I
> agree with that.

Yes, yes, this was the actual reason to remove this comment. Sorry for
the confusion.

> Also when you'd like to "cleanup the usage of ->frozen", you have to be
> careful not only about races with freeze_super() itself but also about races
> with remount (that's one of the reasons why we use s_umount for protecting
> modifications of ->frozen). So I'm not sure how much we can actually
> improve on code readability...

Yes, me too. Probably I should simply remove this (confusing) part of the
changelog.

> Anyway, you can add:
>
> Reviewed-by: Jan Kara <jack@suse.com>

Thanks!

OK. Now I'll try to actually test this all. Hopefully this week.

Oleg.



* Re: [PATCH 4/4] change sb_writers to use percpu_rw_semaphore
  2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-07-22 21:34   ` Oleg Nesterov
  2015-07-28  8:34   ` Jan Kara
@ 2015-08-07 19:54   ` Oleg Nesterov
  2 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2015-08-07 19:54 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

On 07/22, Oleg Nesterov wrote:
>
> +static void sb_freeze_unlock(struct super_block *sb)
> +{
> +	int level;
> +
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		percpu_up_write(sb->s_writers.rw_sem + level);
>  }

OK, this is not exactly right, see the fix below.

Otherwise seems to work, but see another email I'll send in reply to 0/4.

Oleg.

--- a/fs/super.c
+++ b/fs/super.c
@@ -1237,7 +1237,7 @@ static void sb_freeze_unlock(struct super_block *sb)
 {
 	int level;
 
-	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+	for (level = SB_FREEZE_LEVELS; --level >= 0; )
 		percpu_up_write(sb->s_writers.rw_sem + level);
 }
 



* Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
  2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (3 preceding siblings ...)
  2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-08-07 19:55 ` Oleg Nesterov
  2015-08-10 14:59   ` Jan Kara
  4 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2015-08-07 19:55 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Jan, Dave, please help.

I'll try to re-check/re-test, but so far I think that this and the
previous series are correct. However 3/4 from the previous series
(attached at the end just in case) uncovers (I think) some problems
in xfs locking.

What should I do now?

On 07/22, Oleg Nesterov wrote:
>
> Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> ioctl(FITHAW) seems to work "as expected" on my testing machine
> with ext3. So probably this needs more testing.

Finally I got the testing machine and ran xfstests, I did

	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000

	losetup --find --show TEST.img
	losetup --find --show SCRATCH.img

	mkfs.xfs -f /dev/loop0
	mkfs.xfs -f /dev/loop1

	mkdir -p TEST SCRATCH

	mount /dev/loop0 TEST
	mount /dev/loop1 SCRATCH

	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
	./check `grep -il freeze tests/*/???`

several times and every time the result looks like below:

	FSTYP         -- xfs (non-debug)
	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

	generic/068 59s ... 61s
	generic/085 23s ... 26s
	generic/280 2s ... 3s
	generic/311 169s ... 167s
	xfs/011 21s ... 20s
	xfs/119 32s ... 21s
	xfs/297 455s ... 301s
	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
	Passed all 7 tests

But with CONFIG_LOCKDEP=y generic/068 triggers the warning:

	[   66.092831] ======================================================
	[   66.099726] [ INFO: possible circular locking dependency detected ]
	[   66.106719] 4.1.0+ #2 Not tainted
	[   66.110414] -------------------------------------------------------
	[   66.117405] fsstress/2210 is trying to acquire lock:
	[   66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
	[   66.131637]
		       but task is already holding lock:
	[   66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.148022]
		       which lock already depends on the new lock.

	[   66.157141]
		       the existing dependency chain (in reverse order) is:
	[   66.165490]
		       -> #4 (&xfs_dir_ilock_class){++++..}:
	[   66.170974]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.177596]        [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
	[   66.184605]        [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
	[   66.191645]        [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
	[   66.199638]        [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
	[   66.206950]        [<ffffffff81279411>] notify_change+0x271/0x3a0
	[   66.213762]        [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
	[   66.221251]        [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
	[   66.227576]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.234878]
		       -> #3 (sb_internal#2){++++++}:
	[   66.239698]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.246316]        [<ffffffff81824256>] down_write+0x36/0x70
	[   66.252644]        [<ffffffff81100979>] percpu_down_write+0x39/0x110
	[   66.259760]        [<ffffffff8125901d>] freeze_super+0xbd/0x190
	[   66.266369]        [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
	[   66.273082]        [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
	[   66.279311]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.286610]
		       -> #2 (sb_pagefaults#2){++++++}:
	[   66.291623]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.298237]        [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
	[   66.305144]        [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
	[   66.312140]        [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
	[   66.319245]        [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
	[   66.327533]        [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
	[   66.334433]        [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
	[   66.341533]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.348542]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.355172]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.361493]
		       -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
	[   66.367659]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.374275]        [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
	[   66.381185]        [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
	[   66.388212]        [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
	[   66.395817]        [<ffffffff81200a9e>] __do_fault+0x4e/0x100
	[   66.402239]        [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
	[   66.409340]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.416344]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.422959]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.429283]
		       -> #0 (&mm->mmap_sem){++++++}:
	[   66.434093]        [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.441194]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.447808]        [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.454235]        [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.460373]        [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.469233]        [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.476449]        [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.483954]        [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.490473]        [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.497091]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.504381]
		       other info that might help us debug this:

	[   66.513316] Chain exists of:
			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class

	[   66.522619]  Possible unsafe locking scenario:

	[   66.529222]        CPU0                    CPU1
	[   66.534275]        ----                    ----
	[   66.539327]   lock(&xfs_dir_ilock_class);
	[   66.543823]                                lock(sb_internal#2);
	[   66.550465]                                lock(&xfs_dir_ilock_class);
	[   66.557767]   lock(&mm->mmap_sem);
	[   66.561580]
			*** DEADLOCK ***

	[   66.568186] 2 locks held by fsstress/2210:
	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.593457]
		       stack backtrace:
	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
	[   66.641249] Call Trace:
	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76

The chain reported by lockdep is not exactly the same every time,
but similar.

Once again, I'll recheck, but the patch below still looks correct
to me, and I think that it is actually a fix.

Before this patch freeze_super() calls sync_filesystem() and
s_op->freeze_fs(sb) without ->s_writers locks (as seen by
lockdep) and this is wrong.

After this patch lockdep knows about ->s_writers locks held and this
is what we want, but apparently this way lockdep can notice the new
dependencies.

Oleg.

------------------------------------------------------------------------------
Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()

Move the "fool the lockdep" code from sb_wait_write() into the new
simple helper, sb_lockdep_release(), called once before return from
freeze_super().

This is preparation, but imo this makes sense in any case. This way
we do not hide from lockdep the "write" locks we hold when we call
s_op->freeze_fs(sb).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..e7ea9f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
 {
 	s64 writers;
 
-	/*
-	 * We just cycle-through lockdep here so that it does not complain
-	 * about returning with lock to userspace
-	 */
 	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
 
 	do {
 		DEFINE_WAIT(wait);
-
 		/*
 		 * We use a barrier in prepare_to_wait() to separate setting
 		 * of frozen and checking of the counter
@@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
 	} while (writers);
 }
 
+static void sb_freeze_release(struct super_block *sb)
+{
+	int level;
+	/* Avoid the warning from lockdep_sys_exit() */
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
 			sb->s_writers.frozen = SB_UNFROZEN;
 			smp_wmb();
 			wake_up(&sb->s_writers.wait_unfrozen);
+			sb_freeze_release(sb);
 			deactivate_locked_super(sb);
 			return ret;
 		}
@@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
-- 
1.5.5.1




* Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
  2015-08-07 19:55 ` [PATCH 0/4] " Oleg Nesterov
@ 2015-08-10 14:59   ` Jan Kara
  2015-08-10 22:41     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2015-08-10 14:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

  Hello,

On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> I'll try to re-check/re-test, but so far I think that this and the
> previous series are correct. However 3/4 from the previous series
> (attached at the end just in case) uncovers (I think) some problems
> in xfs locking.
> 
> What should I do now?
> 
> On 07/22, Oleg Nesterov wrote:
> >
> > Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> > ioctl(FITHAW) seems to work "as expected" on my testing machine
> > with ext3. So probably this needs more testing.
> 
> Finally I got the testing machine and ran xfstests, I did
> 
> 	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
> 	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000
> 
> 	losetup --find --show TEST.img
> 	losetup --find --show SCRATCH.img
> 
> 	mkfs.xfs -f /dev/loop0
> 	mkfs.xfs -f /dev/loop1
> 
> 	mkdir -p TEST SCRATCH
> 
> 	mount /dev/loop0 TEST
> 	mount /dev/loop1 SCRATCH
> 
> 	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
> 	./check `grep -il freeze tests/*/???`
> 
> several times and every time the result looks like below:
> 
> 	FSTYP         -- xfs (non-debug)
> 	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
> 	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
> 	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH
> 
> 	generic/068 59s ... 61s
> 	generic/085 23s ... 26s
> 	generic/280 2s ... 3s
> 	generic/311 169s ... 167s
> 	xfs/011 21s ... 20s
> 	xfs/119 32s ... 21s
> 	xfs/297 455s ... 301s
> 	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
> 	Passed all 7 tests

Hum, I had a look at the lockdep report below and it's one of the
peculiarities of the freeze protection. For the record let me repeat the full
argument for why I don't think there's a possibility for a real deadlock.
Feel free to skip to the end if you get bored.  

One would like to construct the lock chain as:

CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
process Y		process X, thread 0	process X, thread 1

			get ILOCK for dir
gets freeze protection
starts transaction in xfs_setattr_nonsize
waits to get ILOCK on 'dir'
						get mmap_sem for X
			wait for mmap_sem for process X
			  in filldir()
						wait for freeze protection in
						  xfs_page_mkwrite

and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
finish its freeze-protected section. But this cannot happen. The reason is
that we block writers level-by-level and thus while there are writers at
level X, we do not block writers at level X+1. So in this particular case
freeze_super() will block waiting for CPU0 to finish its freeze protected
section while CPU2 is free to continue.

In general we have a chain like

freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
                ^                                          |
                \------------------------------------------/

But since ILOCK is always acquired with freeze protection at L0 and we can
block at L1 only after there are no writers at L0, this loop can never
happen.

Note that if we use the property of freezing that a lock at level X+1 cannot
block while we hold a lock at level X, we can as well simplify the dependency
graph and track in it only the lowest level of freeze lock that is
currently acquired (since the levels above it cannot block and do not in
any way influence blocking of other processes either and thus are
irrelevant for the purpose of deadlock detection). Then the dependency
graph we'd get would be:

freeze L0 -> ILOCK -> mmap_sem -> freeze L1

and we have a nice acyclic graph we like to see... So probably we have to
hack the lockdep instrumentation some more and just not tell lockdep
about freeze locks at higher levels if we already hold a lock at a lower
level. Thoughts?

								Honza

> 	[   66.504381]
> 		       other info that might help us debug this:
> 
> 	[   66.513316] Chain exists of:
> 			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class
> 
> 	[   66.522619]  Possible unsafe locking scenario:
> 
> 	[   66.529222]        CPU0                    CPU1
> 	[   66.534275]        ----                    ----
> 	[   66.539327]   lock(&xfs_dir_ilock_class);
> 	[   66.543823]                                lock(sb_internal#2);
> 	[   66.550465]                                lock(&xfs_dir_ilock_class);
> 	[   66.557767]   lock(&mm->mmap_sem);
> 	[   66.561580]
> 			*** DEADLOCK ***
> 
> 	[   66.568186] 2 locks held by fsstress/2210:
> 	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
> 	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.593457]
> 		       stack backtrace:
> 	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
> 	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> 	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
> 	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
> 	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
> 	[   66.641249] Call Trace:
> 	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
> 	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
> 	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> 	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
> 	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
> 	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
> 	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
> 	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> 	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> 	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
> 	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> 	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> 	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
> 	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> 	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
> 	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 
> The chain reported by lockdep is not exactly the same every time,
> but similar.
> 
> Once again, I'll recheck, but the patch below still looks correct
> to me, and I think that it is actually a fix.
> 
> Before this patch freeze_super() calls sync_filesystem() and
> s_op->freeze_fs(sb) without ->s_writers locks (as seen by
> lockdep) and this is wrong.
> 
> After this patch lockdep knows about the ->s_writers locks we hold,
> which is what we want, but this apparently lets lockdep notice new
> dependencies.
> 
> Oleg.
> 
> ------------------------------------------------------------------------------
> Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()
> 
> Move the "fool the lockdep" code from sb_wait_write() into the new
> simple helper, sb_lockdep_release(), called once before return from
> freeze_super().
> 
> This is preparation, but imo this makes sense in any case. This way
> we do not hide from lockdep the "write" locks we hold when we call
> s_op->freeze_fs(sb).
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/super.c |   16 ++++++++++------
>  1 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d0fdd49..e7ea9f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	s64 writers;
>  
> -	/*
> -	 * We just cycle-through lockdep here so that it does not complain
> -	 * about returning with lock to userspace
> -	 */
>  	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
>  
>  	do {
>  		DEFINE_WAIT(wait);
> -
>  		/*
>  		 * We use a barrier in prepare_to_wait() to separate setting
>  		 * of frozen and checking of the counter
> @@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
>  	} while (writers);
>  }
>  
> +static void sb_freeze_release(struct super_block *sb)
> +{
> +	int level;
> +	/* Avoid the warning from lockdep_sys_exit() */
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
> @@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
>  			sb->s_writers.frozen = SB_UNFROZEN;
>  			smp_wmb();
>  			wake_up(&sb->s_writers.wait_unfrozen);
> +			sb_freeze_release(sb);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> -- 
> 1.5.5.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
  2015-08-10 14:59   ` Jan Kara
@ 2015-08-10 22:41     ` Dave Chinner
  2015-08-11 13:16       ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2015-08-10 22:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Oleg Nesterov, Al Viro, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> > I'll try to re-check/re-test, but so far I think that this and the
> > previous series are correct. However 3/4 from the previous series
> > (attached at the end just in case) uncovers (I think) some problems
> > in xfs locking.
....
> 
> Hum, I had a look at the lockdep report below and it's one of the
> peculiarities of the freeze protection. For record let me repeat the full
> argument for why I don't think there's a possibility for a real deadlock.
> Feel free to skip to the end if you get bored.  
> 
> One would like to construct the lock chain as:
> 
> CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
> process Y		process X, thread 0	process X, thread 1
> 
> 			get ILOCK for dir
> gets freeze protection
> starts transaction in xfs_setattr_nonsize
> waits to get ILOCK on 'dir'
> 						get mmap_sem for X
> 			wait for mmap_sem for process X
> 			  in filldir()
> 						wait for freeze protection in
> 						  xfs_page_mkwrite
> 
> and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> finish its freeze-protected section. But this cannot happen. The reason is
> that we block writers level-by-level and thus while there are writers at
> level X, we do not block writers at level X+1. So in this particular case
> freeze_super() will block waiting for CPU0 to finish its freeze protected
> section while CPU2 is free to continue.
> 
> In general we have a chain like
> 
> freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
>                 A                                          |
>                 \------------------------------------------/
> 
> But since ILOCK is always acquired with freeze protection at L0 and we can
> block at L1 only after there are no writers at L0, this loop can never
> happen.
> 
> Note that if we use the property of freezing that lock at level X+1 cannot
> block when we hold lock at level X, we can as well simplify the dependency
> graph and track in it only the lowest level of freeze lock that is
> currently acquired (since the levels above it cannot block and do not in
> any way influence blocking of other processes either and thus are
> irrelevant for the purpose of deadlock detection). Then the dependency
> graph we'd get would be:
> 
> freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> 
> and we have a nice acyclic graph we like to see... So probably we have to
> hack the lockdep instrumentation some more and just don't tell lockdep
> about freeze locks at higher levels if we already hold a lock at lower
> level. Thoughts?

The XFS directory ilock->filldir->might_fault locking path has been
generating false positives in quite a lot of places because of
things we do on one side of the mmap_sem in filesystem paths vs
things we do on the other side of the mmap_sem in the page fault
path.

I'm getting sick of these stupid lockdep false positives. I think I
need to rework the XFS readdir locking patch I wrote a while back:

http://oss.sgi.com/archives/xfs/2014-03/msg00146.html

At the time it wasn't clear what the best way forward was. Right now
I think the method I originally used (IOLOCK for directory data and
hence readdir, ILOCK for everything else) will be sufficient; if we
need to do anything smarter that can be addressed further down the
road.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
  2015-08-10 22:41     ` Dave Chinner
@ 2015-08-11 13:16       ` Oleg Nesterov
  2015-08-11 13:29         ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2015-08-11 13:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Al Viro, Dave Hansen, Paul E. McKenney, Peter Zijlstra,
	linux-fsdevel, linux-kernel

On 08/11, Dave Chinner wrote:
>
> On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> >
> > One would like to construct the lock chain as:
> >
> > CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
> > process Y		process X, thread 0	process X, thread 1
> >
> > 			get ILOCK for dir
> > gets freeze protection
> > starts transaction in xfs_setattr_nonsize
> > waits to get ILOCK on 'dir'
> > 						get mmap_sem for X
> > 			wait for mmap_sem for process X
> > 			  in filldir()
> > 						wait for freeze protection in
> > 						  xfs_page_mkwrite
> >
> > and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> > finish its freeze-protected section. But this cannot happen. The reason is
> > that we block writers level-by-level and thus while there are writers at
> > level X, we do not block writers at level X+1. So in this particular case
> > freeze_super() will block waiting for CPU0 to finish its freeze protected
> > section while CPU2 is free to continue.
> >
> > In general we have a chain like
> >
> > freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
> >                 A                                          |
> >                 \------------------------------------------/
> >
> > But since ILOCK is always acquired with freeze protection at L0 and we can
> > block at L1 only after there are no writers at L0, this loop can never
> > happen.
> >
> > Note that if we use the property of freezing that lock at level X+1 cannot
> > block when we hold lock at level X, we can as well simplify the dependency
> > graph and track in it only the lowest level of freeze lock that is
> > currently acquired (since the levels above it cannot block and do not in
> > any way influence blocking of other processes either and thus are
> > irrelevant for the purpose of deadlock detection). Then the dependency
> > graph we'd get would be:
> >
> > freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> >
> > and we have a nice acyclic graph we like to see... So probably we have to
> > hack the lockdep instrumentation some more and just don't tell lockdep
> > about freeze locks at higher levels if we already hold a lock at lower
> > level. Thoughts?
>
> The XFS directory ilock->filldir->might_fault locking path has been
> generating false positives in quite a lot of places because of
> things we do on one side of the mmap_sem in filesystem paths vs
> things we do on the other side of the mmap_sem in the page fault
> path.

OK. Dave, Jan, thanks a lot.

I was also confused because I didn't know that the "Chain exists of" part
of print_circular_bug() only prints the _partial_ chain, and I have
to admit that I do not even understand which part it actually shows...

I'll drop

	move rwsem_release() from sb_wait_write() to freeze_super()
	change thaw_super() to re-acquire s_writers.lock_map

from the previous series and resend everything. Let's change sb_writers to
use percpu_rw_semaphore first, then try to improve the lockdep annotations.

See the interdiff below. With this change I have

	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
	./check `grep -il freeze tests/*/???`

	...

	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
	Passed all 7 tests

anything else I should test?

Oleg.

This needs a comment in sb_wait_write() to explain that this is not what
we want.

--- a/fs/super.c
+++ b/fs/super.c
@@ -1215,27 +1215,15 @@ EXPORT_SYMBOL(__sb_start_write);
 static void sb_wait_write(struct super_block *sb, int level)
 {
 	percpu_down_write(sb->s_writers.rw_sem + level-1);
+	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
 }
 
-static void sb_freeze_release(struct super_block *sb)
-{
-	int level;
-	/* Avoid the warning from lockdep_sys_exit() */
-	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-static void sb_freeze_acquire(struct super_block *sb)
+static void sb_freeze_unlock(struct super_block *sb)
 {
 	int level;
 
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
 		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-static void sb_freeze_unlock(struct super_block *sb)
-{
-	int level;
 
 	for (level = SB_FREEZE_LEVELS; --level >= 0; )
 		percpu_up_write(sb->s_writers.rw_sem + level);
@@ -1331,7 +1319,6 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1358,14 +1345,11 @@ int thaw_super(struct super_block *sb)
 		goto out;
 	}
 
-	sb_freeze_acquire(sb);
-
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
-			sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}



* Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
  2015-08-11 13:16       ` Oleg Nesterov
@ 2015-08-11 13:29         ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2015-08-11 13:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Dave Chinner, Jan Kara, Al Viro, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 15:16:26, Oleg Nesterov wrote:
> On 08/11, Dave Chinner wrote:
> >
> > On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> > >
> > > One would like to construct the lock chain as:
> > >
> > > CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
> > > process Y		process X, thread 0	process X, thread 1
> > >
> > > 			get ILOCK for dir
> > > gets freeze protection
> > > starts transaction in xfs_setattr_nonsize
> > > waits to get ILOCK on 'dir'
> > > 						get mmap_sem for X
> > > 			wait for mmap_sem for process X
> > > 			  in filldir()
> > > 						wait for freeze protection in
> > > 						  xfs_page_mkwrite
> > >
> > > and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> > > finish its freeze-protected section. But this cannot happen. The reason is
> > > that we block writers level-by-level and thus while there are writers at
> > > level X, we do not block writers at level X+1. So in this particular case
> > > freeze_super() will block waiting for CPU0 to finish its freeze protected
> > > section while CPU2 is free to continue.
> > >
> > > In general we have a chain like
> > >
> > > freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
> > >                 A                                          |
> > >                 \------------------------------------------/
> > >
> > > But since ILOCK is always acquired with freeze protection at L0 and we can
> > > block at L1 only after there are no writers at L0, this loop can never
> > > happen.
> > >
> > > Note that if we use the property of freezing that lock at level X+1 cannot
> > > block when we hold lock at level X, we can as well simplify the dependency
> > > graph and track in it only the lowest level of freeze lock that is
> > > currently acquired (since the levels above it cannot block and do not in
> > > any way influence blocking of other processes either and thus are
> > > irrelevant for the purpose of deadlock detection). Then the dependency
> > > graph we'd get would be:
> > >
> > > freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> > >
> > > and we have a nice acyclic graph we like to see... So probably we have to
> > > hack the lockdep instrumentation some more and just don't tell lockdep
> > > about freeze locks at higher levels if we already hold a lock at lower
> > > level. Thoughts?
> >
> > The XFS directory ilock->filldir->might_fault locking path has been
> > generating false positives in quite a lot of places because of
> > things we do on one side of the mmap_sem in filesystem paths vs
> things we do on the other side of the mmap_sem in the page fault
> > path.
> 
> OK. Dave, Jan, thanks a lot.
> 
> I was also confused because I didn't know that the "Chain exists of" part
> of print_circular_bug() only prints the _partial_ chain, and I have
> to admit that I do not even understand which part it actually shows...
> 
> I'll drop
> 
> 	move rwsem_release() from sb_wait_write() to freeze_super()
> 	change thaw_super() to re-acquire s_writers.lock_map
> 
> from the previous series and resend everything. Let's change sb_writers to
> use percpu_rw_semaphore first, then try to improve the lockdep annotations.

Yeah, that sounds like a good plan.

> See the interdiff below. With this change I have
> 
> 	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
> 	./check `grep -il freeze tests/*/???`
> 
> 	...
> 
> 	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
> 	Passed all 7 tests
> 
> anything else I should test?

The diff looks good and if these tests pass without a warning then we can
be reasonably confident things are fine.

> this needs a comment in sb_wait_write() to explain that this is not what
> we want.

Yup, would be nice.

								Honza

> 
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1215,27 +1215,15 @@ EXPORT_SYMBOL(__sb_start_write);
>  static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	percpu_down_write(sb->s_writers.rw_sem + level-1);
> +	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
>  }
>  
> -static void sb_freeze_release(struct super_block *sb)
> -{
> -	int level;
> -	/* Avoid the warning from lockdep_sys_exit() */
> -	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> -		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> -}
> -
> -static void sb_freeze_acquire(struct super_block *sb)
> +static void sb_freeze_unlock(struct super_block *sb)
>  {
>  	int level;
>  
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
>  		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> -}
> -
> -static void sb_freeze_unlock(struct super_block *sb)
> -{
> -	int level;
>  
>  	for (level = SB_FREEZE_LEVELS; --level >= 0; )
>  		percpu_up_write(sb->s_writers.rw_sem + level);
> @@ -1331,7 +1319,6 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> -	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> @@ -1358,14 +1345,11 @@ int thaw_super(struct super_block *sb)
>  		goto out;
>  	}
>  
> -	sb_freeze_acquire(sb);
> -
>  	if (sb->s_op->unfreeze_fs) {
>  		error = sb->s_op->unfreeze_fs(sb);
>  		if (error) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem thaw failed\n");
> -			sb_freeze_release(sb);
>  			up_write(&sb->s_umount);
>  			return error;
>  		}
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2015-08-11 13:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-22 21:15 [PATCH 0/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
2015-07-22 21:15 ` [PATCH 1/4] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
2015-07-22 21:15 ` [PATCH 2/4] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
2015-07-31 10:20   ` Peter Zijlstra
2015-08-03 15:40     ` Oleg Nesterov
2015-07-22 21:15 ` [PATCH 3/4] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
2015-07-28  8:36   ` Jan Kara
2015-07-22 21:15 ` [PATCH 4/4] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
2015-07-22 21:34   ` Oleg Nesterov
2015-07-28  8:34   ` Jan Kara
2015-08-03 17:30     ` Oleg Nesterov
2015-08-07 19:54   ` Oleg Nesterov
2015-08-07 19:55 ` [PATCH 0/4] " Oleg Nesterov
2015-08-10 14:59   ` Jan Kara
2015-08-10 22:41     ` Dave Chinner
2015-08-11 13:16       ` Oleg Nesterov
2015-08-11 13:29         ` Jan Kara
