linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore
@ 2015-08-11 17:03 Oleg Nesterov
  2015-08-11 17:03 ` [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers Oleg Nesterov
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:03 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Jan, please consider this series. I didn't dare to preserve your acks,
I hope you can ack v2 too.

The only essential change is that I dropped the lockdep improvements
as we discussed. This means that 8/8 was changed a bit, and I decided
to add the new documentation patch, see 3/8.

6/8 is new too, I forgot to do this in V1.

To remind, I tested this with the following script:

        dd if=/dev/zero of=TEST.img bs=1MiB count=4000
        dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000

        losetup --find --show TEST.img
        losetup --find --show SCRATCH.img

        mkfs.xfs -f /dev/loop0
        mkfs.xfs -f /dev/loop1

        mkdir -p TEST SCRATCH

        mount /dev/loop0 TEST
        mount /dev/loop1 SCRATCH

        TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
        ./check `grep -il freeze tests/*/???`

seems to work:

	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
        Passed all 7 tests

and I see nothing interesting in dmesg.

If I replace mkfs.xfs above with mkfs.btrfs or mkfs.ext4 everything
looks fine too:

	FSTYP         -- btrfs
	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.2.0-rc6+
	MKFS_OPTIONS  -- /dev/loop1
	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

	generic/068 60s ... 53s
	generic/085 24s ... 16s
	generic/280 2s ... [not run] disk quotas not supported by this filesystem type: btrfs
	generic/311 165s ... 136s
	xfs/011 20s ... [not run] not suitable for this filesystem type: btrfs
	xfs/119 22s ... [not run] not suitable for this filesystem type: btrfs
	xfs/297 270s ... [not run] not suitable for this filesystem type: btrfs
	Ran: generic/068 generic/085 generic/311
	Not run: generic/280 xfs/011 xfs/119 xfs/297
	Passed all 3 tests

and	

	FSTYP         -- ext4
	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.2.0-rc6+
	MKFS_OPTIONS  -- /dev/loop1
	MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

	generic/068 53s ... 52s
	generic/085 16s ... 8s
	generic/280 2s ... 4s
	generic/311 136s ... 173s
	xfs/011 20s ... [not run] not suitable for this filesystem type: ext4
	xfs/119 22s ... [not run] not suitable for this filesystem type: ext4
	xfs/297 270s ... [not run] not suitable for this filesystem type: ext4
	Ran: generic/068 generic/085 generic/280 generic/311
	Not run: xfs/011 xfs/119 xfs/297
	Passed all 4 tests

Oleg.

 arch/Kconfig                  |    1 -
 fs/btrfs/transaction.c        |    8 +--
 fs/super.c                    |  167 ++++++++++++++++++-----------------------
 fs/xfs/xfs_aops.c             |    6 +-
 include/linux/fs.h            |   23 +++---
 include/linux/percpu-rwsem.h  |   20 +++++
 init/Kconfig                  |    1 -
 kernel/locking/Makefile       |    3 +-
 kernel/locking/percpu-rwsem.c |   13 +++
 lib/Kconfig                   |    3 -
 10 files changed, 122 insertions(+), 123 deletions(-)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-08-11 17:03 ` Oleg Nesterov
  2015-08-13  9:45   ` Jan Kara
  2015-08-11 17:04 ` [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write() Oleg Nesterov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:03 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Preparation to hide the sb->s_writers internals from xfs and btrfs.
Add 2 trivial define's they can use rather than play with ->s_writers
directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/btrfs/transaction.c |    8 ++------
 fs/xfs/xfs_aops.c      |    6 ++----
 include/linux/fs.h     |    5 +++++
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5628e25..6dca4e9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1620,9 +1620,7 @@ static void do_async_commit(struct work_struct *work)
 	 * Tell lockdep about it.
 	 */
 	if (ac->newtrans->type & __TRANS_FREEZABLE)
-		rwsem_acquire_read(
-		     &ac->root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
-		     0, 1, _THIS_IP_);
+		__sb_acquire_write(ac->root->fs_info->sb, SB_FREEZE_FS);
 
 	current->journal_info = ac->newtrans;
 
@@ -1661,9 +1659,7 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
 	 * async commit thread will be the one to unlock it.
 	 */
 	if (ac->newtrans->type & __TRANS_FREEZABLE)
-		rwsem_release(
-			&root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
-			1, _THIS_IP_);
+		__sb_release_write(root->fs_info->sb, SB_FREEZE_FS);
 
 	schedule_work(&ac->work);
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index a56960d..8034c78 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -119,8 +119,7 @@ xfs_setfilesize_trans_alloc(
 	 * We may pass freeze protection with a transaction.  So tell lockdep
 	 * we released it.
 	 */
-	rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
-		      1, _THIS_IP_);
+	__sb_release_write(ioend->io_inode->i_sb, SB_FREEZE_FS);
 	/*
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
@@ -171,8 +170,7 @@ xfs_setfilesize_ioend(
 	 * Similarly for freeze protection.
 	 */
 	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
-	rwsem_acquire_read(&VFS_I(ip)->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
-			   0, 1, _THIS_IP_);
+	__sb_acquire_write(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35ec87e..78ac768 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1362,6 +1362,11 @@ extern struct timespec current_fs_time(struct super_block *sb);
 void __sb_end_write(struct super_block *sb, int level);
 int __sb_start_write(struct super_block *sb, int level, bool wait);
 
+#define __sb_acquire_write(sb, lev)	\
+	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
+#define __sb_release_write(sb, lev)	\
+	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
+
 /**
  * sb_end_write - drop write access to a superblock
  * @sb: the super we wrote to
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write()
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-08-11 17:03 ` [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-13 10:02   ` Jan Kara
  2015-08-11 17:04 ` [PATCH v2 3/8] document rwsem_release() in sb_wait_write() Oleg Nesterov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

1. wait_event(frozen < level) without rwsem_acquire_read() is just
   wrong from lockdep perspective. If we are going to deadlock
   because the caller is buggy, lockdep detect this problem.

2. __sb_start_write() can race with thaw_super() + freeze_super(),
   and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.

3. The "tell lockdep we are doing trylock" hack doesn't look nice.

   I think this is correct, but this logic should be more explicit.
   Yes, the recursive read_lock() is fine if we hold the lock on a
   higher level. But we do not need to fool lockdep. If we can not
   deadlock in this case then try-lock must not fail and we can use
   use wait == F throughout this code.

Note: as Dave Chinner explains, the "trylock" hack and the fat comment
can be probably removed. But this needs a separate change and it will
be trivial: just kill __sb_start_write() and rename do_sb_start_write()
back to __sb_start_write().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   73 ++++++++++++++++++++++++++++++++---------------------------
 1 files changed, 40 insertions(+), 33 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 928c20f..d0fdd49 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1158,38 +1158,11 @@ void __sb_end_write(struct super_block *sb, int level)
 }
 EXPORT_SYMBOL(__sb_end_write);
 
-#ifdef CONFIG_LOCKDEP
-/*
- * We want lockdep to tell us about possible deadlocks with freezing but
- * it's it bit tricky to properly instrument it. Getting a freeze protection
- * works as getting a read lock but there are subtle problems. XFS for example
- * gets freeze protection on internal level twice in some cases, which is OK
- * only because we already hold a freeze protection also on higher level. Due
- * to these cases we have to tell lockdep we are doing trylock when we
- * already hold a freeze protection for a higher freeze level.
- */
-static void acquire_freeze_lock(struct super_block *sb, int level, bool trylock,
+static int do_sb_start_write(struct super_block *sb, int level, bool wait,
 				unsigned long ip)
 {
-	int i;
-
-	if (!trylock) {
-		for (i = 0; i < level - 1; i++)
-			if (lock_is_held(&sb->s_writers.lock_map[i])) {
-				trylock = true;
-				break;
-			}
-	}
-	rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, trylock, ip);
-}
-#endif
-
-/*
- * This is an internal function, please use sb_start_{write,pagefault,intwrite}
- * instead.
- */
-int __sb_start_write(struct super_block *sb, int level, bool wait)
-{
+	if (wait)
+		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
 retry:
 	if (unlikely(sb->s_writers.frozen >= level)) {
 		if (!wait)
@@ -1198,9 +1171,6 @@ retry:
 			   sb->s_writers.frozen < level);
 	}
 
-#ifdef CONFIG_LOCKDEP
-	acquire_freeze_lock(sb, level, !wait, _RET_IP_);
-#endif
 	percpu_counter_inc(&sb->s_writers.counter[level-1]);
 	/*
 	 * Make sure counter is updated before we check for frozen.
@@ -1211,8 +1181,45 @@ retry:
 		__sb_end_write(sb, level);
 		goto retry;
 	}
+
+	if (!wait)
+		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
 	return 1;
 }
+
+/*
+ * This is an internal function, please use sb_start_{write,pagefault,intwrite}
+ * instead.
+ */
+int __sb_start_write(struct super_block *sb, int level, bool wait)
+{
+	bool force_trylock = false;
+	int ret;
+
+#ifdef CONFIG_LOCKDEP
+	/*
+	 * We want lockdep to tell us about possible deadlocks with freezing
+	 * but it's it bit tricky to properly instrument it. Getting a freeze
+	 * protection works as getting a read lock but there are subtle
+	 * problems. XFS for example gets freeze protection on internal level
+	 * twice in some cases, which is OK only because we already hold a
+	 * freeze protection also on higher level. Due to these cases we have
+	 * to use wait == F (trylock mode) which must not fail.
+	 */
+	if (wait) {
+		int i;
+
+		for (i = 0; i < level - 1; i++)
+			if (lock_is_held(&sb->s_writers.lock_map[i])) {
+				force_trylock = true;
+				break;
+			}
+	}
+#endif
+	ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
+	WARN_ON(force_trylock & !ret);
+	return ret;
+}
 EXPORT_SYMBOL(__sb_start_write);
 
 /**
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/8] document rwsem_release() in sb_wait_write()
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-08-11 17:03 ` [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers Oleg Nesterov
  2015-08-11 17:04 ` [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write() Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-13 10:22   ` Jan Kara
  2015-08-11 17:04 ` [PATCH v2 4/8] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Not only we need to avoid the warning from lockdep_sys_exit(), the
caller of freeze_super() can never release this lock. Another thread
can do this, so there is another reason for rwsem_release().

Plus the comment should explain why we have to fool lockdep.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..89b58fb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,11 +1236,17 @@ static void sb_wait_write(struct super_block *sb, int level)
 {
 	s64 writers;
 
+	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
 	/*
-	 * We just cycle-through lockdep here so that it does not complain
-	 * about returning with lock to userspace
+	 * We are going to return to userspace and forget about this lock, the
+	 * ownership goes to the caller of thaw_super() which does unlock.
+	 *
+	 * FIXME: we should do this before return from freeze_super() after we
+	 * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
+	 * should re-acquire these locks before s_op->unfreeze_fs(sb). However
+	 * this leads to lockdep false-positives, so currently we do the early
+	 * release right after acquire.
 	 */
-	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
 	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
 
 	do {
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/8] percpu-rwsem: introduce percpu_down_read_trylock()
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (2 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 3/8] document rwsem_release() in sb_wait_write() Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-11 17:04 ` [PATCH v2 5/8] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Add percpu_down_read_trylock(), it will have the user soon.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h  |    1 +
 kernel/locking/percpu-rwsem.c |   13 +++++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3e58226..3ebf982 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -17,6 +17,7 @@ struct percpu_rw_semaphore {
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern int  percpu_down_read_trylock(struct percpu_rw_semaphore *);
 extern void percpu_up_read(struct percpu_rw_semaphore *);
 
 extern void percpu_down_write(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index 2c54c64..4bc2127 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -73,6 +73,19 @@ void percpu_down_read(struct percpu_rw_semaphore *brw)
 	__up_read(&brw->rw_sem);
 }
 
+int percpu_down_read_trylock(struct percpu_rw_semaphore *brw)
+{
+	if (unlikely(!update_fast_ctr(brw, +1))) {
+		if (!__down_read_trylock(&brw->rw_sem))
+			return 0;
+		atomic_inc(&brw->slow_read_ctr);
+		__up_read(&brw->rw_sem);
+	}
+
+	rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 1, _RET_IP_);
+	return 1;
+}
+
 void percpu_up_read(struct percpu_rw_semaphore *brw)
 {
 	rwsem_release(&brw->rw_sem.dep_map, 1, _RET_IP_);
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/8] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (3 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 4/8] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-11 17:04 ` [PATCH v2 6/8] percpu-rwsem: kill CONFIG_PERCPU_RWSEM Oleg Nesterov
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Add percpu_rwsem_release() and percpu_rwsem_acquire() for the users
which need to return to userspace with percpu-rwsem lock held and/or
pass the ownership to another thread.

TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
the non-inline percpu_rwsem_clear_owner().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/percpu-rwsem.h |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 3ebf982..06af654 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -33,4 +33,23 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *);
 	__percpu_init_rwsem(brw, #brw, &rwsem_key);		\
 })
 
+
+#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem)
+
+static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
+					bool read, unsigned long ip)
+{
+	lock_release(&sem->rw_sem.dep_map, 1, ip);
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+	if (!read)
+		sem->rw_sem.owner = NULL;
+#endif
+}
+
+static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
+					bool read, unsigned long ip)
+{
+	lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
+}
+
 #endif
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 6/8] percpu-rwsem: kill CONFIG_PERCPU_RWSEM
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (4 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 5/8] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-11 17:04 ` [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional
user of percpu_rw_semaphore.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/Kconfig            |    1 -
 init/Kconfig            |    1 -
 kernel/locking/Makefile |    3 +--
 lib/Kconfig             |    3 ---
 4 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb..94d6471 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -87,7 +87,6 @@ config KPROBES_ON_FTRACE
 
 config UPROBES
 	def_bool n
-	select PERCPU_RWSEM
 	help
 	  Uprobes is the user-space counterpart to kprobes: they
 	  enable instrumentation applications (such as 'perf probe')
diff --git a/init/Kconfig b/init/Kconfig
index b9b824b..dc24dec 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -938,7 +938,6 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
-	select PERCPU_RWSEM
 	help
 	  This option adds support for grouping sets of processes together, for
 	  use with process control subsystems such as Cpusets, CFS, memory
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index de7a416..a05a29f 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -1,5 +1,5 @@
 
-obj-y += mutex.o semaphore.o rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
@@ -24,6 +24,5 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock.o
 obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
 obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
 obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
-obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
 obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
diff --git a/lib/Kconfig b/lib/Kconfig
index 601965a..597aa66 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -53,9 +53,6 @@ config GENERIC_IO
 config STMP_DEVICE
 	bool
 
-config PERCPU_RWSEM
-	bool
-
 config ARCH_USE_CMPXCHG_LOCKREF
 	bool
 
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work()
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (5 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 6/8] percpu-rwsem: kill CONFIG_PERCPU_RWSEM Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-13 10:35   ` Jan Kara
  2015-08-11 17:04 ` [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
  2015-08-12 13:11 ` [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths Oleg Nesterov
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

Of course, this patch is ugly as hell. It will be (partially)
reverted later. We add it to ensure that other WIP changes in
percpu_rw_semaphore won't break fs/super.c.

We do not even need this change right now, percpu_free_rwsem()
is fine in atomic context. But we are going to change this, it
will be might_sleep() after we merge the rcu_sync() patches.

And even after that we do not really need destroy_super_work(),
we will kill it in any case. Instead, destroy_super_rcu() should
just check that rss->cb_state == CB_IDLE and do call_rcu() again
in the (very unlikely) case this is not true.

So this is just the temporary kludge which helps us to avoid the
conflicts with the changes which will be (hopefully) routed via
rcu tree.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c         |   23 +++++++++++++++++++----
 include/linux/fs.h |    3 ++-
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 89b58fb..75436e2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -135,6 +135,24 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	return total_objects;
 }
 
+static void destroy_super_work(struct work_struct *work)
+{
+	struct super_block *s = container_of(work, struct super_block,
+							destroy_work);
+	int i;
+
+	for (i = 0; i < SB_FREEZE_LEVELS; i++)
+		percpu_counter_destroy(&s->s_writers.counter[i]);
+	kfree(s);
+}
+
+static void destroy_super_rcu(struct rcu_head *head)
+{
+	struct super_block *s = container_of(head, struct super_block, rcu);
+	INIT_WORK(&s->destroy_work, destroy_super_work);
+	schedule_work(&s->destroy_work);
+}
+
 /**
  *	destroy_super	-	frees a superblock
  *	@s: superblock to free
@@ -143,16 +161,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
  */
 static void destroy_super(struct super_block *s)
 {
-	int i;
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
-	for (i = 0; i < SB_FREEZE_LEVELS; i++)
-		percpu_counter_destroy(&s->s_writers.counter[i]);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
 	kfree(s->s_options);
-	kfree_rcu(s, rcu);
+	call_rcu(&s->rcu, destroy_super_rcu);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 78ac768..6addccc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/workqueue.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1346,7 +1347,7 @@ struct super_block {
 	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 	struct rcu_head		rcu;
-
+	struct work_struct	destroy_work;
 	/*
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (6 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
@ 2015-08-11 17:04 ` Oleg Nesterov
  2015-08-13 10:48   ` Jan Kara
  2015-08-12 13:11 ` [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths Oleg Nesterov
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-11 17:04 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

We can remove everything from struct sb_writers except frozen
and add the array of percpu_rw_semaphore's instead.

This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
it for get_super_thawed(). We will probably remove it later.

This change tries to address the following problems:

	- Firstly, __sb_start_write() looks simply buggy. It does
	  __sb_end_write() if it sees ->frozen, but if it migrates
	  to another CPU before percpu_counter_dec(), sb_wait_write()
	  can wrongly succeed if there is another task which holds
	  the same "semaphore": sb_wait_write() can miss the result
	  of the previous percpu_counter_inc() but see the result
	  of this percpu_counter_dec().

	- As Dave Hansen reports, it is suboptimal. The trivial
	  microbenchmark that writes to a tmpfs file in a loop runs
	  12% faster if we change this code to rely on RCU and kill
	  the memory barriers.

	- This code doesn't look simple. It would be better to rely
	  on the generic locking code.

	  According to Dave, this change adds the same performance
	  improvement.

Note: with this change both freeze_super() and thaw_super() will do
synchronize_sched_expedited() 3 times. This is just ugly. But:

	- This will be "fixed" by the rcu_sync changes we are going
	  to merge. After that freeze_super()->percpu_down_write()
	  will use synchronize_sched(), and thaw_super() won't use
	  synchronize() at all.

	  This doesn't need any changes in fs/super.c.

	- Once we merge rcu_sync changes, we can also change super.c
	  so that all wb_write->rw_sem's will share the single ->rss
	  in struct sb_writes, then freeze_super() will need only one
	  synchronize_sched().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c         |  107 ++++++++++++++--------------------------------------
 include/linux/fs.h |   19 +++------
 2 files changed, 35 insertions(+), 91 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 75436e2..8762997 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -142,7 +142,7 @@ static void destroy_super_work(struct work_struct *work)
 	int i;
 
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
-		percpu_counter_destroy(&s->s_writers.counter[i]);
+		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
 	kfree(s);
 }
 
@@ -193,13 +193,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		goto fail;
 
 	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
-		if (percpu_counter_init(&s->s_writers.counter[i], 0,
-					GFP_KERNEL) < 0)
+		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
+					sb_writers_name[i],
+					&type->s_writers_key[i]))
 			goto fail;
-		lockdep_init_map(&s->s_writers.lock_map[i], sb_writers_name[i],
-				 &type->s_writers_key[i], 0);
 	}
-	init_waitqueue_head(&s->s_writers.wait);
 	init_waitqueue_head(&s->s_writers.wait_unfrozen);
 	s->s_bdi = &noop_backing_dev_info;
 	s->s_flags = flags;
@@ -1161,47 +1159,10 @@ out:
  */
 void __sb_end_write(struct super_block *sb, int level)
 {
-	percpu_counter_dec(&sb->s_writers.counter[level-1]);
-	/*
-	 * Make sure s_writers are updated before we wake up waiters in
-	 * freeze_super().
-	 */
-	smp_mb();
-	if (waitqueue_active(&sb->s_writers.wait))
-		wake_up(&sb->s_writers.wait);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
+	percpu_up_read(sb->s_writers.rw_sem + level-1);
 }
 EXPORT_SYMBOL(__sb_end_write);
 
-static int do_sb_start_write(struct super_block *sb, int level, bool wait,
-				unsigned long ip)
-{
-	if (wait)
-		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
-retry:
-	if (unlikely(sb->s_writers.frozen >= level)) {
-		if (!wait)
-			return 0;
-		wait_event(sb->s_writers.wait_unfrozen,
-			   sb->s_writers.frozen < level);
-	}
-
-	percpu_counter_inc(&sb->s_writers.counter[level-1]);
-	/*
-	 * Make sure counter is updated before we check for frozen.
-	 * freeze_super() first sets frozen and then checks the counter.
-	 */
-	smp_mb();
-	if (unlikely(sb->s_writers.frozen >= level)) {
-		__sb_end_write(sb, level);
-		goto retry;
-	}
-
-	if (!wait)
-		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
-	return 1;
-}
-
 /*
  * This is an internal function, please use sb_start_{write,pagefault,intwrite}
  * instead.
@@ -1209,7 +1170,7 @@ retry:
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
 	bool force_trylock = false;
-	int ret;
+	int ret = 1;
 
 #ifdef CONFIG_LOCKDEP
 	/*
@@ -1225,13 +1186,17 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
 		int i;
 
 		for (i = 0; i < level - 1; i++)
-			if (lock_is_held(&sb->s_writers.lock_map[i])) {
+			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
 				force_trylock = true;
 				break;
 			}
 	}
 #endif
-	ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
+	if (wait && !force_trylock)
+		percpu_down_read(sb->s_writers.rw_sem + level-1);
+	else
+		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
+
 	WARN_ON(force_trylock & !ret);
 	return ret;
 }
@@ -1249,9 +1214,7 @@ EXPORT_SYMBOL(__sb_start_write);
  */
 static void sb_wait_write(struct super_block *sb, int level)
 {
-	s64 writers;
-
-	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
+	percpu_down_write(sb->s_writers.rw_sem + level-1);
 	/*
 	 * We are going to return to userspace and forget about this lock, the
 	 * ownership goes to the caller of thaw_super() which does unlock.
@@ -1262,24 +1225,18 @@ static void sb_wait_write(struct super_block *sb, int level)
 	 * this leads to lockdep false-positives, so currently we do the early
 	 * release right after acquire.
 	 */
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
-
-	do {
-		DEFINE_WAIT(wait);
+	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
+}
 
-		/*
-		 * We use a barrier in prepare_to_wait() to separate setting
-		 * of frozen and checking of the counter
-		 */
-		prepare_to_wait(&sb->s_writers.wait, &wait,
-				TASK_UNINTERRUPTIBLE);
+static void sb_freeze_unlock(struct super_block *sb)
+{
+	int level;
 
-		writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
-		if (writers)
-			schedule();
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
 
-		finish_wait(&sb->s_writers.wait, &wait);
-	} while (writers);
+	for (level = SB_FREEZE_LEVELS; --level >= 0; )
+		percpu_up_write(sb->s_writers.rw_sem + level);
 }
 
 /**
@@ -1338,20 +1295,14 @@ int freeze_super(struct super_block *sb)
 		return 0;
 	}
 
-	/* From now on, no new normal writers can start */
 	sb->s_writers.frozen = SB_FREEZE_WRITE;
-	smp_wmb();
-
 	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
 	up_write(&sb->s_umount);
-
 	sb_wait_write(sb, SB_FREEZE_WRITE);
+	down_write(&sb->s_umount);
 
 	/* Now we go and block page faults... */
-	down_write(&sb->s_umount);
 	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
-	smp_wmb();
-
 	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
 
 	/* All writers are done so after syncing there won't be dirty data */
@@ -1359,7 +1310,6 @@ int freeze_super(struct super_block *sb)
 
 	/* Now wait for internal filesystem counter */
 	sb->s_writers.frozen = SB_FREEZE_FS;
-	smp_wmb();
 	sb_wait_write(sb, SB_FREEZE_FS);
 
 	if (sb->s_op->freeze_fs) {
@@ -1368,7 +1318,7 @@ int freeze_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
 			sb->s_writers.frozen = SB_UNFROZEN;
-			smp_wmb();
+			sb_freeze_unlock(sb);
 			wake_up(&sb->s_writers.wait_unfrozen);
 			deactivate_locked_super(sb);
 			return ret;
@@ -1400,8 +1350,10 @@ int thaw_super(struct super_block *sb)
 		return -EINVAL;
 	}
 
-	if (sb->s_flags & MS_RDONLY)
+	if (sb->s_flags & MS_RDONLY) {
+		sb->s_writers.frozen = SB_UNFROZEN;
 		goto out;
+	}
 
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
@@ -1413,12 +1365,11 @@ int thaw_super(struct super_block *sb)
 		}
 	}
 
-out:
 	sb->s_writers.frozen = SB_UNFROZEN;
-	smp_wmb();
+	sb_freeze_unlock(sb);
+out:
 	wake_up(&sb->s_writers.wait_unfrozen);
 	deactivate_locked_super(sb);
-
 	return 0;
 }
 EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6addccc..fb2cb4a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1,7 +1,6 @@
 #ifndef _LINUX_FS_H
 #define _LINUX_FS_H
 
-
 #include <linux/linkage.h>
 #include <linux/wait.h>
 #include <linux/kdev_t.h>
@@ -31,6 +30,7 @@
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1247,16 +1247,9 @@ enum {
 #define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
 
 struct sb_writers {
-	/* Counters for counting writers at each level */
-	struct percpu_counter	counter[SB_FREEZE_LEVELS];
-	wait_queue_head_t	wait;		/* queue for waiting for
-						   writers / faults to finish */
-	int			frozen;		/* Is sb frozen? */
-	wait_queue_head_t	wait_unfrozen;	/* queue for waiting for
-						   sb to be thawed */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map	lock_map[SB_FREEZE_LEVELS];
-#endif
+	int				frozen;		/* Is sb frozen? */
+	wait_queue_head_t		wait_unfrozen;	/* for get_super_thawed() */
+	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
 };
 
 struct super_block {
@@ -1364,9 +1357,9 @@ void __sb_end_write(struct super_block *sb, int level);
 int __sb_start_write(struct super_block *sb, int level, bool wait);
 
 #define __sb_acquire_write(sb, lev)	\
-	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
+	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
 #define __sb_release_write(sb, lev)	\
-	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
+	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
 
 /**
  * sb_end_write - drop write access to a superblock
-- 
1.5.5.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths
  2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
                   ` (7 preceding siblings ...)
  2015-08-11 17:04 ` [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-08-12 13:11 ` Oleg Nesterov
  2015-08-13 11:01   ` Jan Kara
  8 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-12 13:11 UTC (permalink / raw)
  To: Al Viro, Dave Chinner, Dave Hansen, Jan Kara
  Cc: Paul E. McKenney, Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/11, Oleg Nesterov wrote:
>
> The only essential change is that I dropped the lockdep improvements
> as we discussed. This means that 8/8 was changed a bit, and I decided
> to add the new documentation patch, see 3/8.

Update: The recent

	[PATCH 0/2] xfs: kill lockdep false positives from readdir

changes from Dave fixed the problems ILOCK false-positives. So we can
add the additional patch which (modulo comments) just turns v2 back into
v1.

Dave, Jan, you seem to agree with these patches. How should we route
this all?

-------------------------------------------------------------------------------
Subject: [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths

sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
false-positives. Now that xfs was fixed by Dave we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem
locks held; we add two trivial helpers for that, sb_freeze_release()
and sb_freeze_acquire().

While at it, kill the outdated part of the comment above sb_wait_write.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   41 ++++++++++++++++++++++++++---------------
 1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 8762997..91c9756 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1208,32 +1208,39 @@ EXPORT_SYMBOL(__sb_start_write);
  * @level: type of writers we wait for (normal vs page fault)
  *
  * This function waits until there are no writers of given type to given file
- * system. Caller of this function should make sure there can be no new writers
- * of type @level before calling this function. Otherwise this function can
- * livelock.
+ * system.
  */
 static void sb_wait_write(struct super_block *sb, int level)
 {
 	percpu_down_write(sb->s_writers.rw_sem + level-1);
-	/*
-	 * We are going to return to userspace and forget about this lock, the
-	 * ownership goes to the caller of thaw_super() which does unlock.
-	 *
-	 * FIXME: we should do this before return from freeze_super() after we
-	 * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
-	 * should re-acquire these locks before s_op->unfreeze_fs(sb). However
-	 * this leads to lockdep false-positives, so currently we do the early
-	 * release right after acquire.
-	 */
-	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
 }
 
-static void sb_freeze_unlock(struct super_block *sb)
+/*
+ * We are going to return to userspace and forget about these locks, the
+ * ownership goes to the caller of thaw_super() which does unlock().
+ */
+static void sb_freeze_release(struct super_block *sb)
+{
+	int level;
+
+	for (level = SB_FREEZE_LEVELS; --level >= 0; )
+		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+/*
+ * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
+ */
+static void sb_freeze_acquire(struct super_block *sb)
 {
 	int level;
 
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
 		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+static void sb_freeze_unlock(struct super_block *sb)
+{
+	int level;
 
 	for (level = SB_FREEZE_LEVELS; --level >= 0; )
 		percpu_up_write(sb->s_writers.rw_sem + level);
@@ -1329,6 +1336,7 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1355,11 +1363,14 @@ int thaw_super(struct super_block *sb)
 		goto out;
 	}
 
+	sb_freeze_acquire(sb);
+
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
+			sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-11 17:03 ` [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers Oleg Nesterov
@ 2015-08-13  9:45   ` Jan Kara
  2015-08-13  9:56     ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13  9:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 19:03:58, Oleg Nesterov wrote:
> Preparation to hide the sb->s_writers internals from xfs and btrfs.
> Add 2 trivial define's they can use rather than play with ->s_writers
> directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> ---
>  fs/btrfs/transaction.c |    8 ++------
>  fs/xfs/xfs_aops.c      |    6 ++----
>  include/linux/fs.h     |    5 +++++
>  3 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 5628e25..6dca4e9 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1620,9 +1620,7 @@ static void do_async_commit(struct work_struct *work)
>  	 * Tell lockdep about it.
>  	 */
>  	if (ac->newtrans->type & __TRANS_FREEZABLE)
> -		rwsem_acquire_read(
> -		     &ac->root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
> -		     0, 1, _THIS_IP_);
> +		__sb_acquire_write(ac->root->fs_info->sb, SB_FREEZE_FS);
>  
>  	current->journal_info = ac->newtrans;
>  
> @@ -1661,9 +1659,7 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
>  	 * async commit thread will be the one to unlock it.
>  	 */
>  	if (ac->newtrans->type & __TRANS_FREEZABLE)
> -		rwsem_release(
> -			&root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
> -			1, _THIS_IP_);
> +		__sb_release_write(root->fs_info->sb, SB_FREEZE_FS);
>  
>  	schedule_work(&ac->work);
>  
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index a56960d..8034c78 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -119,8 +119,7 @@ xfs_setfilesize_trans_alloc(
>  	 * We may pass freeze protection with a transaction.  So tell lockdep
>  	 * we released it.
>  	 */
> -	rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> -		      1, _THIS_IP_);
> +	__sb_release_write(ioend->io_inode->i_sb, SB_FREEZE_FS);
>  	/*
>  	 * We hand off the transaction to the completion thread now, so
>  	 * clear the flag here.
> @@ -171,8 +170,7 @@ xfs_setfilesize_ioend(
>  	 * Similarly for freeze protection.
>  	 */
>  	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> -	rwsem_acquire_read(&VFS_I(ip)->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> -			   0, 1, _THIS_IP_);
> +	__sb_acquire_write(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 35ec87e..78ac768 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1362,6 +1362,11 @@ extern struct timespec current_fs_time(struct super_block *sb);
>  void __sb_end_write(struct super_block *sb, int level);
>  int __sb_start_write(struct super_block *sb, int level, bool wait);
>  
> +#define __sb_acquire_write(sb, lev)	\
> +	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
> +#define __sb_release_write(sb, lev)	\
> +	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
> +
>  /**
>   * sb_end_write - drop write access to a superblock
>   * @sb: the super we wrote to
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-13  9:45   ` Jan Kara
@ 2015-08-13  9:56     ` Jan Kara
  2015-08-13 13:17       ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13  9:56 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Thu 13-08-15 11:45:52, Jan Kara wrote:
> On Tue 11-08-15 19:03:58, Oleg Nesterov wrote:
> > Preparation to hide the sb->s_writers internals from xfs and btrfs.
> > Add 2 trivial define's they can use rather than play with ->s_writers
> > directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
> > 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> Looks good. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.com>

One comment when looking at other patches - I'd prefer somewhat better name
than __sb_acquire_write(). It doesn't tell that it's only a trylock
acquisition. Maybe something like

__sb_writers_acquire_nowait()

and then

__sb_writers_release()?

								Honza
> > ---
> >  fs/btrfs/transaction.c |    8 ++------
> >  fs/xfs/xfs_aops.c      |    6 ++----
> >  include/linux/fs.h     |    5 +++++
> >  3 files changed, 9 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index 5628e25..6dca4e9 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -1620,9 +1620,7 @@ static void do_async_commit(struct work_struct *work)
> >  	 * Tell lockdep about it.
> >  	 */
> >  	if (ac->newtrans->type & __TRANS_FREEZABLE)
> > -		rwsem_acquire_read(
> > -		     &ac->root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
> > -		     0, 1, _THIS_IP_);
> > +		__sb_acquire_write(ac->root->fs_info->sb, SB_FREEZE_FS);
> >  
> >  	current->journal_info = ac->newtrans;
> >  
> > @@ -1661,9 +1659,7 @@ int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
> >  	 * async commit thread will be the one to unlock it.
> >  	 */
> >  	if (ac->newtrans->type & __TRANS_FREEZABLE)
> > -		rwsem_release(
> > -			&root->fs_info->sb->s_writers.lock_map[SB_FREEZE_FS-1],
> > -			1, _THIS_IP_);
> > +		__sb_release_write(root->fs_info->sb, SB_FREEZE_FS);
> >  
> >  	schedule_work(&ac->work);
> >  
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index a56960d..8034c78 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -119,8 +119,7 @@ xfs_setfilesize_trans_alloc(
> >  	 * We may pass freeze protection with a transaction.  So tell lockdep
> >  	 * we released it.
> >  	 */
> > -	rwsem_release(&ioend->io_inode->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> > -		      1, _THIS_IP_);
> > +	__sb_release_write(ioend->io_inode->i_sb, SB_FREEZE_FS);
> >  	/*
> >  	 * We hand off the transaction to the completion thread now, so
> >  	 * clear the flag here.
> > @@ -171,8 +170,7 @@ xfs_setfilesize_ioend(
> >  	 * Similarly for freeze protection.
> >  	 */
> >  	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> > -	rwsem_acquire_read(&VFS_I(ip)->i_sb->s_writers.lock_map[SB_FREEZE_FS-1],
> > -			   0, 1, _THIS_IP_);
> > +	__sb_acquire_write(VFS_I(ip)->i_sb, SB_FREEZE_FS);
> >  
> >  	return xfs_setfilesize(ip, tp, ioend->io_offset, ioend->io_size);
> >  }
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 35ec87e..78ac768 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1362,6 +1362,11 @@ extern struct timespec current_fs_time(struct super_block *sb);
> >  void __sb_end_write(struct super_block *sb, int level);
> >  int __sb_start_write(struct super_block *sb, int level, bool wait);
> >  
> > +#define __sb_acquire_write(sb, lev)	\
> > +	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
> > +#define __sb_release_write(sb, lev)	\
> > +	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
> > +
> >  /**
> >   * sb_end_write - drop write access to a superblock
> >   * @sb: the super we wrote to
> > -- 
> > 1.5.5.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write()
  2015-08-11 17:04 ` [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write() Oleg Nesterov
@ 2015-08-13 10:02   ` Jan Kara
  2015-08-13 13:22     ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13 10:02 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 19:04:01, Oleg Nesterov wrote:
> 1. wait_event(frozen < level) without rwsem_acquire_read() is just
>    wrong from lockdep perspective. If we are going to deadlock
>    because the caller is buggy, lockdep detect this problem.
> 
> 2. __sb_start_write() can race with thaw_super() + freeze_super(),
>    and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.
> 
> 3. The "tell lockdep we are doing trylock" hack doesn't look nice.
> 
>    I think this is correct, but this logic should be more explicit.
>    Yes, the recursive read_lock() is fine if we hold the lock on a
>    higher level. But we do not need to fool lockdep. If we can not
>    deadlock in this case then try-lock must not fail and we can use
>    use wait == F throughout this code.
> 
> Note: as Dave Chinner explains, the "trylock" hack and the fat comment
> can be probably removed. But this needs a separate change and it will
> be trivial: just kill __sb_start_write() and rename do_sb_start_write()
> back to __sb_start_write().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Just a nit below...

> +	if (wait)
> +		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);

If we provided also __sb_writers_acquire() helper (in addition to _nowait)
variant, we could use these helpers in __sb_start_write() /
__sb_end_write() as well which would look better to me when we already have
them.

Once this is updated, feel free to add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/8] document rwsem_release() in sb_wait_write()
  2015-08-11 17:04 ` [PATCH v2 3/8] document rwsem_release() in sb_wait_write() Oleg Nesterov
@ 2015-08-13 10:22   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2015-08-13 10:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 19:04:04, Oleg Nesterov wrote:
> Not only we need to avoid the warning from lockdep_sys_exit(), the
> caller of freeze_super() can never release this lock. Another thread
> can do this, so there is another reason for rwsem_release().
> 
> Plus the comment should explain why we have to fool lockdep.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> ---
>  fs/super.c |   12 +++++++++---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d0fdd49..89b58fb 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1236,11 +1236,17 @@ static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	s64 writers;
>  
> +	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
>  	/*
> -	 * We just cycle-through lockdep here so that it does not complain
> -	 * about returning with lock to userspace
> +	 * We are going to return to userspace and forget about this lock, the
> +	 * ownership goes to the caller of thaw_super() which does unlock.
> +	 *
> +	 * FIXME: we should do this before return from freeze_super() after we
> +	 * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
> +	 * should re-acquire these locks before s_op->unfreeze_fs(sb). However
> +	 * this leads to lockdep false-positives, so currently we do the early
> +	 * release right after acquire.
>  	 */
> -	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
>  	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
>  
>  	do {
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work()
  2015-08-11 17:04 ` [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
@ 2015-08-13 10:35   ` Jan Kara
  2015-08-13 13:36     ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13 10:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 19:04:16, Oleg Nesterov wrote:
> Of course, this patch is ugly as hell. It will be (partially)
> reverted later. We add it to ensure that other WIP changes in
> percpu_rw_semaphore won't break fs/super.c.
> 
> We do not even need this change right now, percpu_free_rwsem()
> is fine in atomic context. But we are going to change this, it
> will be might_sleep() after we merge the rcu_sync() patches.
> 
> And even after that we do not really need destroy_super_work(),
> we will kill it in any case. Instead, destroy_super_rcu() should
> just check that rss->cb_state == CB_IDLE and do call_rcu() again
> in the (very unlikely) case this is not true.
> 
> So this is just the temporary kludge which helps us to avoid the
> conflicts with the changes which will be (hopefully) routed via
> rcu tree.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Looking into this again, it would seem somewhat cleaner to me to move the
destruction to deactivate_locked_super() instead. We already do this for
list_lru_destroy() because that can sleep as well and so I'd prefer to keep
these two things together. You have to be somewhat careful with the failure
path in alloc_super() but that's easily doable...

								Honza

> ---
>  fs/super.c         |   23 +++++++++++++++++++----
>  include/linux/fs.h |    3 ++-
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 89b58fb..75436e2 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -135,6 +135,24 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>  	return total_objects;
>  }
>  
> +static void destroy_super_work(struct work_struct *work)
> +{
> +	struct super_block *s = container_of(work, struct super_block,
> +							destroy_work);
> +	int i;
> +
> +	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> +		percpu_counter_destroy(&s->s_writers.counter[i]);
> +	kfree(s);
> +}
> +
> +static void destroy_super_rcu(struct rcu_head *head)
> +{
> +	struct super_block *s = container_of(head, struct super_block, rcu);
> +	INIT_WORK(&s->destroy_work, destroy_super_work);
> +	schedule_work(&s->destroy_work);
> +}
> +
>  /**
>   *	destroy_super	-	frees a superblock
>   *	@s: superblock to free
> @@ -143,16 +161,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
>   */
>  static void destroy_super(struct super_block *s)
>  {
> -	int i;
>  	list_lru_destroy(&s->s_dentry_lru);
>  	list_lru_destroy(&s->s_inode_lru);
> -	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> -		percpu_counter_destroy(&s->s_writers.counter[i]);
>  	security_sb_free(s);
>  	WARN_ON(!list_empty(&s->s_mounts));
>  	kfree(s->s_subtype);
>  	kfree(s->s_options);
> -	kfree_rcu(s, rcu);
> +	call_rcu(&s->rcu, destroy_super_rcu);
>  }
>  
>  /**
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 78ac768..6addccc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -30,6 +30,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/percpu-rwsem.h>
>  #include <linux/blk_types.h>
> +#include <linux/workqueue.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -1346,7 +1347,7 @@ struct super_block {
>  	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
>  	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
>  	struct rcu_head		rcu;
> -
> +	struct work_struct	destroy_work;
>  	/*
>  	 * Indicates how deep in a filesystem stack this SB is
>  	 */
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore
  2015-08-11 17:04 ` [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
@ 2015-08-13 10:48   ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2015-08-13 10:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Tue 11-08-15 19:04:19, Oleg Nesterov wrote:
> We can remove everything from struct sb_writers except frozen
> and add the array of percpu_rw_semaphore's instead.
> 
> This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
> it for get_super_thawed(). We will probably remove it later.
> 
> This change tries to address the following problems:
> 
> 	- Firstly, __sb_start_write() looks simply buggy. It does
> 	  __sb_end_write() if it sees ->frozen, but if it migrates
> 	  to another CPU before percpu_counter_dec(), sb_wait_write()
> 	  can wrongly succeed if there is another task which holds
> 	  the same "semaphore": sb_wait_write() can miss the result
> 	  of the previous percpu_counter_inc() but see the result
> 	  of this percpu_counter_dec().
> 
> 	- As Dave Hansen reports, it is suboptimal. The trivial
> 	  microbenchmark that writes to a tmpfs file in a loop runs
> 	  12% faster if we change this code to rely on RCU and kill
> 	  the memory barriers.
> 
> 	- This code doesn't look simple. It would be better to rely
> 	  on the generic locking code.
> 
> 	  According to Dave, this change adds the same performance
> 	  improvement.
> 
> Note: with this change both freeze_super() and thaw_super() will do
> synchronize_sched_expedited() 3 times. This is just ugly. But:
> 
> 	- This will be "fixed" by the rcu_sync changes we are going
> 	  to merge. After that freeze_super()->percpu_down_write()
> 	  will use synchronize_sched(), and thaw_super() won't use
> 	  synchronize() at all.
> 
> 	  This doesn't need any changes in fs/super.c.
> 
> 	- Once we merge rcu_sync changes, we can also change super.c
> 	  so that all wb_write->rw_sem's will share the single ->rss
> 	  in struct sb_writes, then freeze_super() will need only one
> 	  synchronize_sched().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> ---
>  fs/super.c         |  107 ++++++++++++++--------------------------------------
>  include/linux/fs.h |   19 +++------
>  2 files changed, 35 insertions(+), 91 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 75436e2..8762997 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -142,7 +142,7 @@ static void destroy_super_work(struct work_struct *work)
>  	int i;
>  
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++)
> -		percpu_counter_destroy(&s->s_writers.counter[i]);
> +		percpu_free_rwsem(&s->s_writers.rw_sem[i]);
>  	kfree(s);
>  }
>  
> @@ -193,13 +193,11 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  		goto fail;
>  
>  	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
> -		if (percpu_counter_init(&s->s_writers.counter[i], 0,
> -					GFP_KERNEL) < 0)
> +		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
> +					sb_writers_name[i],
> +					&type->s_writers_key[i]))
>  			goto fail;
> -		lockdep_init_map(&s->s_writers.lock_map[i], sb_writers_name[i],
> -				 &type->s_writers_key[i], 0);
>  	}
> -	init_waitqueue_head(&s->s_writers.wait);
>  	init_waitqueue_head(&s->s_writers.wait_unfrozen);
>  	s->s_bdi = &noop_backing_dev_info;
>  	s->s_flags = flags;
> @@ -1161,47 +1159,10 @@ out:
>   */
>  void __sb_end_write(struct super_block *sb, int level)
>  {
> -	percpu_counter_dec(&sb->s_writers.counter[level-1]);
> -	/*
> -	 * Make sure s_writers are updated before we wake up waiters in
> -	 * freeze_super().
> -	 */
> -	smp_mb();
> -	if (waitqueue_active(&sb->s_writers.wait))
> -		wake_up(&sb->s_writers.wait);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
> +	percpu_up_read(sb->s_writers.rw_sem + level-1);
>  }
>  EXPORT_SYMBOL(__sb_end_write);
>  
> -static int do_sb_start_write(struct super_block *sb, int level, bool wait,
> -				unsigned long ip)
> -{
> -	if (wait)
> -		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
> -retry:
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> -		if (!wait)
> -			return 0;
> -		wait_event(sb->s_writers.wait_unfrozen,
> -			   sb->s_writers.frozen < level);
> -	}
> -
> -	percpu_counter_inc(&sb->s_writers.counter[level-1]);
> -	/*
> -	 * Make sure counter is updated before we check for frozen.
> -	 * freeze_super() first sets frozen and then checks the counter.
> -	 */
> -	smp_mb();
> -	if (unlikely(sb->s_writers.frozen >= level)) {
> -		__sb_end_write(sb, level);
> -		goto retry;
> -	}
> -
> -	if (!wait)
> -		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 1, ip);
> -	return 1;
> -}
> -
>  /*
>   * This is an internal function, please use sb_start_{write,pagefault,intwrite}
>   * instead.
> @@ -1209,7 +1170,7 @@ retry:
>  int __sb_start_write(struct super_block *sb, int level, bool wait)
>  {
>  	bool force_trylock = false;
> -	int ret;
> +	int ret = 1;
>  
>  #ifdef CONFIG_LOCKDEP
>  	/*
> @@ -1225,13 +1186,17 @@ int __sb_start_write(struct super_block *sb, int level, bool wait)
>  		int i;
>  
>  		for (i = 0; i < level - 1; i++)
> -			if (lock_is_held(&sb->s_writers.lock_map[i])) {
> +			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
>  				force_trylock = true;
>  				break;
>  			}
>  	}
>  #endif
> -	ret = do_sb_start_write(sb, level, wait && !force_trylock, _RET_IP_);
> +	if (wait && !force_trylock)
> +		percpu_down_read(sb->s_writers.rw_sem + level-1);
> +	else
> +		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
> +
>  	WARN_ON(force_trylock & !ret);
>  	return ret;
>  }
> @@ -1249,9 +1214,7 @@ EXPORT_SYMBOL(__sb_start_write);
>   */
>  static void sb_wait_write(struct super_block *sb, int level)
>  {
> -	s64 writers;
> -
> -	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> +	percpu_down_write(sb->s_writers.rw_sem + level-1);
>  	/*
>  	 * We are going to return to userspace and forget about this lock, the
>  	 * ownership goes to the caller of thaw_super() which does unlock.
> @@ -1262,24 +1225,18 @@ static void sb_wait_write(struct super_block *sb, int level)
>  	 * this leads to lockdep false-positives, so currently we do the early
>  	 * release right after acquire.
>  	 */
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
> -
> -	do {
> -		DEFINE_WAIT(wait);
> +	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
> +}
>  
> -		/*
> -		 * We use a barrier in prepare_to_wait() to separate setting
> -		 * of frozen and checking of the counter
> -		 */
> -		prepare_to_wait(&sb->s_writers.wait, &wait,
> -				TASK_UNINTERRUPTIBLE);
> +static void sb_freeze_unlock(struct super_block *sb)
> +{
> +	int level;
>  
> -		writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
> -		if (writers)
> -			schedule();
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
>  
> -		finish_wait(&sb->s_writers.wait, &wait);
> -	} while (writers);
> +	for (level = SB_FREEZE_LEVELS; --level >= 0; )
> +		percpu_up_write(sb->s_writers.rw_sem + level);
>  }
>  
>  /**
> @@ -1338,20 +1295,14 @@ int freeze_super(struct super_block *sb)
>  		return 0;
>  	}
>  
> -	/* From now on, no new normal writers can start */
>  	sb->s_writers.frozen = SB_FREEZE_WRITE;
> -	smp_wmb();
> -
>  	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
>  	up_write(&sb->s_umount);
> -
>  	sb_wait_write(sb, SB_FREEZE_WRITE);
> +	down_write(&sb->s_umount);
>  
>  	/* Now we go and block page faults... */
> -	down_write(&sb->s_umount);
>  	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> -	smp_wmb();
> -
>  	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
>  
>  	/* All writers are done so after syncing there won't be dirty data */
> @@ -1359,7 +1310,6 @@ int freeze_super(struct super_block *sb)
>  
>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
> -	smp_wmb();
>  	sb_wait_write(sb, SB_FREEZE_FS);
>  
>  	if (sb->s_op->freeze_fs) {
> @@ -1368,7 +1318,7 @@ int freeze_super(struct super_block *sb)
>  			printk(KERN_ERR
>  				"VFS:Filesystem freeze failed\n");
>  			sb->s_writers.frozen = SB_UNFROZEN;
> -			smp_wmb();
> +			sb_freeze_unlock(sb);
>  			wake_up(&sb->s_writers.wait_unfrozen);
>  			deactivate_locked_super(sb);
>  			return ret;
> @@ -1400,8 +1350,10 @@ int thaw_super(struct super_block *sb)
>  		return -EINVAL;
>  	}
>  
> -	if (sb->s_flags & MS_RDONLY)
> +	if (sb->s_flags & MS_RDONLY) {
> +		sb->s_writers.frozen = SB_UNFROZEN;
>  		goto out;
> +	}
>  
>  	if (sb->s_op->unfreeze_fs) {
>  		error = sb->s_op->unfreeze_fs(sb);
> @@ -1413,12 +1365,11 @@ int thaw_super(struct super_block *sb)
>  		}
>  	}
>  
> -out:
>  	sb->s_writers.frozen = SB_UNFROZEN;
> -	smp_wmb();
> +	sb_freeze_unlock(sb);
> +out:
>  	wake_up(&sb->s_writers.wait_unfrozen);
>  	deactivate_locked_super(sb);
> -
>  	return 0;
>  }
>  EXPORT_SYMBOL(thaw_super);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6addccc..fb2cb4a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1,7 +1,6 @@
>  #ifndef _LINUX_FS_H
>  #define _LINUX_FS_H
>  
> -
>  #include <linux/linkage.h>
>  #include <linux/wait.h>
>  #include <linux/kdev_t.h>
> @@ -31,6 +30,7 @@
>  #include <linux/percpu-rwsem.h>
>  #include <linux/blk_types.h>
>  #include <linux/workqueue.h>
> +#include <linux/percpu-rwsem.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -1247,16 +1247,9 @@ enum {
>  #define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
>  
>  struct sb_writers {
> -	/* Counters for counting writers at each level */
> -	struct percpu_counter	counter[SB_FREEZE_LEVELS];
> -	wait_queue_head_t	wait;		/* queue for waiting for
> -						   writers / faults to finish */
> -	int			frozen;		/* Is sb frozen? */
> -	wait_queue_head_t	wait_unfrozen;	/* queue for waiting for
> -						   sb to be thawed */
> -#ifdef CONFIG_DEBUG_LOCK_ALLOC
> -	struct lockdep_map	lock_map[SB_FREEZE_LEVELS];
> -#endif
> +	int				frozen;		/* Is sb frozen? */
> +	wait_queue_head_t		wait_unfrozen;	/* for get_super_thawed() */
> +	struct percpu_rw_semaphore	rw_sem[SB_FREEZE_LEVELS];
>  };
>  
>  struct super_block {
> @@ -1364,9 +1357,9 @@ void __sb_end_write(struct super_block *sb, int level);
>  int __sb_start_write(struct super_block *sb, int level, bool wait);
>  
>  #define __sb_acquire_write(sb, lev)	\
> -	rwsem_acquire_read(&(sb)->s_writers.lock_map[(lev)-1], 0, 1, _THIS_IP_)
> +	percpu_rwsem_acquire(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
>  #define __sb_release_write(sb, lev)	\
> -	rwsem_release(&(sb)->s_writers.lock_map[(lev)-1], 1, _THIS_IP_)
> +	percpu_rwsem_release(&(sb)->s_writers.rw_sem[(lev)-1], 1, _THIS_IP_)
>  
>  /**
>   * sb_end_write - drop write access to a superblock
> -- 
> 1.5.5.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths
  2015-08-12 13:11 ` [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths Oleg Nesterov
@ 2015-08-13 11:01   ` Jan Kara
  2015-08-13 13:58     ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13 11:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Al Viro, Dave Chinner, Dave Hansen, Jan Kara, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Wed 12-08-15 15:11:38, Oleg Nesterov wrote:
> On 08/11, Oleg Nesterov wrote:
> >
> > The only essential change is that I dropped the lockdep improvements
> > as we discussed. This means that 8/8 was changed a bit, and I decided
> > to add the new documentation patch, see 3/8.
> 
> Update: The recent
> 
> 	[PATCH 0/2] xfs: kill lockdep false positives from readdir
> 
> changes from Dave fixed the problems ILOCK false-positives. So we can
> add the additional patch which (modulo comments) just turns v2 back into
> v1.
> 
> Dave, Jan, you seem to agree with these patches. How should we route
> this all?

Regarding the routing, ideally Al Viro should take these as a VFS
maintainer.
 
> -------------------------------------------------------------------------------
> Subject: [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths
> 
> sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
> false-positives. Now that xfs was fixed by Dave we can remove it and
> change freeze_super() and thaw_super() to run with s_writers.rw_sem
> locks held; we add two trivial helpers for that, sb_freeze_release()
> and sb_freeze_acquire().
> 
> While at it, kill the outdated part of the comment above sb_wait_write.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

The patch looks good. Just one nit:

> +	for (level = SB_FREEZE_LEVELS; --level >= 0; )
> +		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);

It is more common (and to me more readable) to have the loop written as:

for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)

I agree what you do is shorter but IMHO it's just an unnecessary
obfuscation :)

Otherwise feel free to add:

Reviewed-by: Jan Kara <jack@suse.com>

								Honza

> +}
> +
> +/*
> + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> + */
> +static void sb_freeze_acquire(struct super_block *sb)
>  {
>  	int level;
>  
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
>  		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> +}
> +
> +static void sb_freeze_unlock(struct super_block *sb)
> +{
> +	int level;
>  
>  	for (level = SB_FREEZE_LEVELS; --level >= 0; )
>  		percpu_up_write(sb->s_writers.rw_sem + level);
> @@ -1329,6 +1336,7 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> @@ -1355,11 +1363,14 @@ int thaw_super(struct super_block *sb)
>  		goto out;
>  	}
>  
> +	sb_freeze_acquire(sb);
> +
>  	if (sb->s_op->unfreeze_fs) {
>  		error = sb->s_op->unfreeze_fs(sb);
>  		if (error) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem thaw failed\n");
> +			sb_freeze_release(sb);
>  			up_write(&sb->s_umount);
>  			return error;
>  		}
> -- 
> 1.5.5.1
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-13  9:56     ` Jan Kara
@ 2015-08-13 13:17       ` Oleg Nesterov
  2015-08-13 13:32         ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 13:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Thu 13-08-15 11:45:52, Jan Kara wrote:
> > On Tue 11-08-15 19:03:58, Oleg Nesterov wrote:
> > > Preparation to hide the sb->s_writers internals from xfs and btrfs.
> > > Add 2 trivial define's they can use rather than play with ->s_writers
> > > directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
> > >
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >
> > Looks good. You can add:
> >
> > Reviewed-by: Jan Kara <jack@suse.com>
>
> One comment when looking at other patches - I'd prefer somewhat better name
> than __sb_acquire_write().

Yes, __sb_acquire_write() doesn't look nice and I agree with any
naming.

> It doesn't tell that it's only a trylock
> acquisition. Maybe something like

But it is not actually "trylock"... This lock was already locked but
not by us. __sb_release_write + __sb_acquire_write is used to transfer
the ownership,

> __sb_writers_acquire_nowait()
>
> and then
>
> __sb_writers_release()?

so I agree with any naming, I'll update this patch... but perhaps
__sb_writers_acquire() without "_nowait" make more sense?

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write()
  2015-08-13 10:02   ` Jan Kara
@ 2015-08-13 13:22     ` Oleg Nesterov
  2015-08-13 13:29       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 13:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Tue 11-08-15 19:04:01, Oleg Nesterov wrote:
> > 1. wait_event(frozen < level) without rwsem_acquire_read() is just
> >    wrong from lockdep perspective. If we are going to deadlock
> >    because the caller is buggy, lockdep detect this problem.
> >
> > 2. __sb_start_write() can race with thaw_super() + freeze_super(),
> >    and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.
> >
> > 3. The "tell lockdep we are doing trylock" hack doesn't look nice.
> >
> >    I think this is correct, but this logic should be more explicit.
> >    Yes, the recursive read_lock() is fine if we hold the lock on a
> >    higher level. But we do not need to fool lockdep. If we can not
> >    deadlock in this case then try-lock must not fail and we can use
> >    use wait == F throughout this code.
> >
> > Note: as Dave Chinner explains, the "trylock" hack and the fat comment
> > can be probably removed. But this needs a separate change and it will
> > be trivial: just kill __sb_start_write() and rename do_sb_start_write()
> > back to __sb_start_write().
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> Just a nit below...
>
> > +	if (wait)
> > +		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
>
> If we provided also __sb_writers_acquire() helper (in addition to _nowait)
> variant, we could use these helpers in __sb_start_write() /
> __sb_end_write() as well which would look better to me when we already have
> them.

Why? This code goes away after 8/8.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write()
  2015-08-13 13:22     ` Oleg Nesterov
@ 2015-08-13 13:29       ` Jan Kara
  0 siblings, 0 replies; 26+ messages in thread
From: Jan Kara @ 2015-08-13 13:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jan Kara, Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Thu 13-08-15 15:22:23, Oleg Nesterov wrote:
> On 08/13, Jan Kara wrote:
> >
> > On Tue 11-08-15 19:04:01, Oleg Nesterov wrote:
> > > 1. wait_event(frozen < level) without rwsem_acquire_read() is just
> > >    wrong from lockdep perspective. If we are going to deadlock
> > >    because the caller is buggy, lockdep detect this problem.
> > >
> > > 2. __sb_start_write() can race with thaw_super() + freeze_super(),
> > >    and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.
> > >
> > > 3. The "tell lockdep we are doing trylock" hack doesn't look nice.
> > >
> > >    I think this is correct, but this logic should be more explicit.
> > >    Yes, the recursive read_lock() is fine if we hold the lock on a
> > >    higher level. But we do not need to fool lockdep. If we can not
> > >    deadlock in this case then try-lock must not fail and we can use
> > >    use wait == F throughout this code.
> > >
> > > Note: as Dave Chinner explains, the "trylock" hack and the fat comment
> > > can be probably removed. But this needs a separate change and it will
> > > be trivial: just kill __sb_start_write() and rename do_sb_start_write()
> > > back to __sb_start_write().
> > >
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >
> > Just a nit below...
> >
> > > +	if (wait)
> > > +		rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, 0, ip);
> >
> > If we provided also __sb_writers_acquire() helper (in addition to _nowait)
> > variant, we could use these helpers in __sb_start_write() /
> > __sb_end_write() as well which would look better to me when we already have
> > them.
> 
> Why? This code goes away after 8/8.

Right, OK. Objection retracted :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-13 13:17       ` Oleg Nesterov
@ 2015-08-13 13:32         ` Jan Kara
  2015-08-13 13:37           ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13 13:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jan Kara, Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Thu 13-08-15 15:17:44, Oleg Nesterov wrote:
> On 08/13, Jan Kara wrote:
> >
> > On Thu 13-08-15 11:45:52, Jan Kara wrote:
> > > On Tue 11-08-15 19:03:58, Oleg Nesterov wrote:
> > > > Preparation to hide the sb->s_writers internals from xfs and btrfs.
> > > > Add 2 trivial define's they can use rather than play with ->s_writers
> > > > directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
> > > >
> > > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > >
> > > Looks good. You can add:
> > >
> > > Reviewed-by: Jan Kara <jack@suse.com>
> >
> > One comment when looking at other patches - I'd prefer somewhat better name
> > than __sb_acquire_write().
> 
> Yes, __sb_acquire_write() doesn't look nice and I agree with any
> naming.
> 
> > It doesn't tell that it's only a trylock
> > acquisition. Maybe something like
> 
> But it is not actually "trylock"... This lock was already locked but
> not by us. __sb_release_write + __sb_acquire_write is used to transfer
> the ownership,
>
> > __sb_writers_acquire_nowait()
> >
> > and then
> >
> > __sb_writers_release()?
> 
> so I agree with any naming, I'll update this patch... but perhaps
> __sb_writers_acquire() without "_nowait" make more sense?

OK, drop _nowait - maybe:

__sb_writers_acquired()

(note the additional 'd' at the end) to suggest that we already hold the lock
and only want to tell lockdep about it?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work()
  2015-08-13 10:35   ` Jan Kara
@ 2015-08-13 13:36     ` Oleg Nesterov
  2015-08-13 14:09       ` Jan Kara
  0 siblings, 1 reply; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 13:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Tue 11-08-15 19:04:16, Oleg Nesterov wrote:
> >
> > So this is just the temporary kludge which helps us to avoid the
> > conflicts with the changes which will be (hopefully) routed via
> > rcu tree.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> Looking into this again, it would seem somewhat cleaner to me to move the
> destruction to deactivate_locked_super() instead.

Heh ;) You know, I was looking at deactivate_locked_super(). However, I
simply do not understand this code enough, I failed to verify it would
be safe to destroy s_writers there.

And. Please note destroy_super() in alloc_super() error path, so this
needs a bit more changes in any case.

Can't we live with this hack for now? To remind, it will be reverted
(at least partially) in any case. Yes, yes, it is very ugly and the
changelog documents this fact. But it looks simple and safe. To me
it would be better to make the conversion first, then cleanup this
horror after another discussion.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers
  2015-08-13 13:32         ` Jan Kara
@ 2015-08-13 13:37           ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 13:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Thu 13-08-15 15:17:44, Oleg Nesterov wrote:
> >
> > so I agree with any naming, I'll update this patch... but perhaps
> > __sb_writers_acquire() without "_nowait" make more sense?
>
> OK, drop _nowait - maybe:
>
> __sb_writers_acquired()
>
> (note the additional 'd' at the end) to suggest that we already hold the lock
> and only want to tell lockdep about it?

Agreed! Will do in v3.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths
  2015-08-13 11:01   ` Jan Kara
@ 2015-08-13 13:58     ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 13:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Wed 12-08-15 15:11:38, Oleg Nesterov wrote:
> > On 08/11, Oleg Nesterov wrote:
> > >
> > > The only essential change is that I dropped the lockdep improvements
> > > as we discussed. This means that 8/8 was changed a bit, and I decided
> > > to add the new documentation patch, see 3/8.
> >
> > Update: The recent
> >
> > 	[PATCH 0/2] xfs: kill lockdep false positives from readdir
> >
> > changes from Dave fixed the problems ILOCK false-positives. So we can
> > add the additional patch which (modulo comments) just turns v2 back into
> > v1.
> >
> > Dave, Jan, you seem to agree with these patches. How should we route
> > this all?
>
> Regarding the routing, ideally Al Viro should take these as a VFS
> maintainer.

OK. I'll send v3.

But to remind, this particular patch depends on Dave's fixes, so I will
send it later.

And I forgot to mention that I have another patch which removes the
trylock hack from __sb_start_write() as Dave suggested, it passed the
tests. But again, I'd really like to send it separately so that it can
be reverted in (unlikely) case something else does recursive read_lock().

> > Subject: [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths
> >
> > sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
> > false-positives. Now that xfs was fixed by Dave we can remove it and
> > change freeze_super() and thaw_super() to run with s_writers.rw_sem
> > locks held; we add two trivial helpers for that, sb_freeze_release()
> > and sb_freeze_acquire().
> >
> > While at it, kill the outdated part of the comment above sb_wait_write.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>
> The patch looks good. Just one nit:
>
> > +	for (level = SB_FREEZE_LEVELS; --level >= 0; )
> > +		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
>
> It is more common (and to me more readable) to have the loop written as:
>
> for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
>
> I agree what you do is shorter but IMHO it's just an unnecessary
> obfuscation :)

Agreed, will fix.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work()
  2015-08-13 13:36     ` Oleg Nesterov
@ 2015-08-13 14:09       ` Jan Kara
  2015-08-13 15:20         ` Oleg Nesterov
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2015-08-13 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jan Kara, Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On Thu 13-08-15 15:36:16, Oleg Nesterov wrote:
> On 08/13, Jan Kara wrote:
> >
> > On Tue 11-08-15 19:04:16, Oleg Nesterov wrote:
> > >
> > > So this is just the temporary kludge which helps us to avoid the
> > > conflicts with the changes which will be (hopefully) routed via
> > > rcu tree.
> > >
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >
> > Looking into this again, it would seem somewhat cleaner to me to move the
> > destruction to deactivate_locked_super() instead.
> 
> Heh ;) You know, I was looking at deactivate_locked_super(). However, I
> simply do not understand this code enough, I failed to verify it would
> be safe to destroy s_writers there.

Yes, it will be safe. After ->kill_sb() callback the filesystem is dead.
There can be someone still holding reference to superblock but these are
just users inspecting the structure definitely not caring about freeze
protection.

> And. Please note destroy_super() in alloc_super() error path, so this
> needs a bit more changes in any case.

Yes. But you can sleep in alloc_super() so that would be easy enough.

> Can't we live with this hack for now? To remind, it will be reverted
> (at least partially) in any case. Yes, yes, it is very ugly and the
> changelog documents this fact. But it looks simple and safe. To me
> it would be better to make the conversion first, then cleanup this
> horror after another discussion.

All I care about is that long-term, all handling from destroy_super() that
needs to sleep ends up in one place. So if you promise you'll make this
happen I can live with the workqueue solution for now (but you have to
convince also Al as a maintainer ;).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work()
  2015-08-13 14:09       ` Jan Kara
@ 2015-08-13 15:20         ` Oleg Nesterov
  0 siblings, 0 replies; 26+ messages in thread
From: Oleg Nesterov @ 2015-08-13 15:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Dave Chinner, Dave Hansen, Paul E. McKenney,
	Peter Zijlstra, linux-fsdevel, linux-kernel

On 08/13, Jan Kara wrote:
>
> On Thu 13-08-15 15:36:16, Oleg Nesterov wrote:
> > On 08/13, Jan Kara wrote:
> > >
> > > Looking into this again, it would seem somewhat cleaner to me to move the
> > > destruction to deactivate_locked_super() instead.
> >
> > Heh ;) You know, I was looking at deactivate_locked_super(). However, I
> > simply do not understand this code enough, I failed to verify it would
> > be safe to destroy s_writers there.
>
> Yes, it will be safe. After ->kill_sb() callback the filesystem is dead.
> There can be someone still holding reference to superblock but these are
> just users inspecting the structure definitely not caring about freeze
> protection.

OK, thanks.

> > And. Please note destroy_super() in alloc_super() error path, so this
> > needs a bit more changes in any case.
>
> Yes. But you can sleep in alloc_super() so that would be easy enough.

Yes, yes, I didn't mean this is a problem.

> > Can't we live with this hack for now? To remind, it will be reverted
> > (at least partially) in any case. Yes, yes, it is very ugly and the
> > changelog documents this fact. But it looks simple and safe. To me
> > it would be better to make the conversion first, then cleanup this
> > horror after another discussion.
>
> All I care about is that long-term, all handling from destroy_super() that
> needs to sleep ends up in one place. So if you promise you'll make this
> happen I can live with the workqueue solution for now

I certainly promise I will try to do something in any case ;)

But let me repeat another reason why I think we should do this later.
The necessary changes depend on other work-in-progress rcu_sync changes
in percpu_rw_semaphore.

Now that you confirm that we should not worry about sb_writers after
deactivate_locked_super(), the cleanup looks even simpler than I
thought initially:

	1. We do not even need to destroy the counters in
	   deactivate_locked_super(). It should only stop the
	   (potentially) pending rcu-callback(s).

	2. Just revert this patch altogether.

> (but you have to
> convince also Al as a maintainer ;).

Perhaps he won't notice how ugly this change is? If you won't tell him.

Oleg.


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2015-08-13 15:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-11 17:03 [PATCH v2 0/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
2015-08-11 17:03 ` [PATCH v2 1/8] introduce __sb_{acquire,release}_write() helpers Oleg Nesterov
2015-08-13  9:45   ` Jan Kara
2015-08-13  9:56     ` Jan Kara
2015-08-13 13:17       ` Oleg Nesterov
2015-08-13 13:32         ` Jan Kara
2015-08-13 13:37           ` Oleg Nesterov
2015-08-11 17:04 ` [PATCH v2 2/8] fix the broken lockdep logic in __sb_start_write() Oleg Nesterov
2015-08-13 10:02   ` Jan Kara
2015-08-13 13:22     ` Oleg Nesterov
2015-08-13 13:29       ` Jan Kara
2015-08-11 17:04 ` [PATCH v2 3/8] document rwsem_release() in sb_wait_write() Oleg Nesterov
2015-08-13 10:22   ` Jan Kara
2015-08-11 17:04 ` [PATCH v2 4/8] percpu-rwsem: introduce percpu_down_read_trylock() Oleg Nesterov
2015-08-11 17:04 ` [PATCH v2 5/8] percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() Oleg Nesterov
2015-08-11 17:04 ` [PATCH v2 6/8] percpu-rwsem: kill CONFIG_PERCPU_RWSEM Oleg Nesterov
2015-08-11 17:04 ` [PATCH v2 7/8] shift percpu_counter_destroy() into destroy_super_work() Oleg Nesterov
2015-08-13 10:35   ` Jan Kara
2015-08-13 13:36     ` Oleg Nesterov
2015-08-13 14:09       ` Jan Kara
2015-08-13 15:20         ` Oleg Nesterov
2015-08-11 17:04 ` [PATCH v2 8/8] change sb_writers to use percpu_rw_semaphore Oleg Nesterov
2015-08-13 10:48   ` Jan Kara
2015-08-12 13:11 ` [PATCH v2 9/8] don't fool lockdep in freeze_super() and thaw_super() paths Oleg Nesterov
2015-08-13 11:01   ` Jan Kara
2015-08-13 13:58     ` Oleg Nesterov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).