Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore

From: Oleg Nesterov <oleg@redhat.com>
To: Al Viro <viro@zeniv.linux.org.uk>,
	Dave Chinner <david@fromorbit.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Jan Kara <jack@suse.cz>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
Date: Fri, 7 Aug 2015 21:55:52 +0200	[thread overview]
Message-ID: <20150807195552.GB28529@redhat.com> (raw)
In-Reply-To: <20150722211513.GA19986@redhat.com>

Jan, Dave, please help.

I'll try to re-check/re-test, but so far I think that this and the
previous series are correct. However 3/4 from the previous series
(attached at the end just in case) uncovers (I think) some problems
in xfs locking.

What should I do now?

On 07/22, Oleg Nesterov wrote:
>
> Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> ioctl(FITHAW) seems to wors "as expected" on my testing machine
> with ext3. So probably this needs more testing.

Finally I got the testing machine and ran xfstests, I did

	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000

	losetup --find --show TEST.img
	losetup --find --show SCRATCH.img

	mkfs.xfs -f /dev/loop0
	mkfs.xfs -f /dev/loop1

	mkdir -p TEST SCRATCH

	mount /dev/loop0 TEST
	mount /dev/loop1 SCRATCH

	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
	./check `grep -il freeze tests/*/???`

several times and every time the result looks like below:

	FSTYP         -- xfs (non-debug)
	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

	generic/068 59s ... 61s
	generic/085 23s ... 26s
	generic/280 2s ... 3s
	generic/311 169s ... 167s
	xfs/011 21s ... 20s
	xfs/119 32s ... 21s
	xfs/297 455s ... 301s
	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
	Passed all 7 tests

But with CONFIG_LOCKDEP=y generic/068 triggers the warning:

	[   66.092831] ======================================================
	[   66.099726] [ INFO: possible circular locking dependency detected ]
	[   66.106719] 4.1.0+ #2 Not tainted
	[   66.110414] -------------------------------------------------------
	[   66.117405] fsstress/2210 is trying to acquire lock:
	[   66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
	[   66.131637]
		       but task is already holding lock:
	[   66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.148022]
		       which lock already depends on the new lock.

	[   66.157141]
		       the existing dependency chain (in reverse order) is:
	[   66.165490]
		       -> #4 (&xfs_dir_ilock_class){++++..}:
	[   66.170974]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.177596]        [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
	[   66.184605]        [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
	[   66.191645]        [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
	[   66.199638]        [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
	[   66.206950]        [<ffffffff81279411>] notify_change+0x271/0x3a0
	[   66.213762]        [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
	[   66.221251]        [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
	[   66.227576]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.234878]
		       -> #3 (sb_internal#2){++++++}:
	[   66.239698]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.246316]        [<ffffffff81824256>] down_write+0x36/0x70
	[   66.252644]        [<ffffffff81100979>] percpu_down_write+0x39/0x110
	[   66.259760]        [<ffffffff8125901d>] freeze_super+0xbd/0x190
	[   66.266369]        [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
	[   66.273082]        [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
	[   66.279311]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.286610]
		       -> #2 (sb_pagefaults#2){++++++}:
	[   66.291623]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.298237]        [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
	[   66.305144]        [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
	[   66.312140]        [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
	[   66.319245]        [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
	[   66.327533]        [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
	[   66.334433]        [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
	[   66.341533]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.348542]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.355172]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.361493]
		       -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
	[   66.367659]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.374275]        [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
	[   66.381185]        [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
	[   66.388212]        [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
	[   66.395817]        [<ffffffff81200a9e>] __do_fault+0x4e/0x100
	[   66.402239]        [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
	[   66.409340]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.416344]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.422959]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.429283]
		       -> #0 (&mm->mmap_sem){++++++}:
	[   66.434093]        [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.441194]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.447808]        [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.454235]        [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.460373]        [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.469233]        [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.476449]        [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.483954]        [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.490473]        [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.497091]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.504381]
		       other info that might help us debug this:

	[   66.513316] Chain exists of:
			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class

	[   66.522619]  Possible unsafe locking scenario:

	[   66.529222]        CPU0                    CPU1
	[   66.534275]        ----                    ----
	[   66.539327]   lock(&xfs_dir_ilock_class);
	[   66.543823]                                lock(sb_internal#2);
	[   66.550465]                                lock(&xfs_dir_ilock_class);
	[   66.557767]   lock(&mm->mmap_sem);
	[   66.561580]
			*** DEADLOCK ***

	[   66.568186] 2 locks held by fsstress/2210:
	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.593457]
		       stack backtrace:
	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
	[   66.641249] Call Trace:
	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76

The chain reported by lockdep is not exactly the same every time,
but similar.

Once again, I'll recheck. but the patch below still looks correct
to me, and I think that it is actually a fix.

Before this patch freeze_super() calls sync_filesystem() and
s_op->freeze_fs(sb) without ->s_writers locks (as it seen by
lockdep) and this is wrong.

After this patch lockdep knows about ->s_writers locks held and this
is what we want, but apparently this way lockdep can notice the new
dependencies.

Oleg.

------------------------------------------------------------------------------
Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()

Move the "fool the lockdep" code from sb_wait_write() into the new
simple helper, sb_lockdep_release(), called once before return from
freeze_super().

This is preparation, but imo this makes sense in any case. This way
we do not hide from lockdep the "write" locks we hold when we call
s_op->freeze_fs(sb).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..e7ea9f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
 {
 	s64 writers;
 
-	/*
-	 * We just cycle-through lockdep here so that it does not complain
-	 * about returning with lock to userspace
-	 */
 	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
 
 	do {
 		DEFINE_WAIT(wait);
-
 		/*
 		 * We use a barrier in prepare_to_wait() to separate setting
 		 * of frozen and checking of the counter
@@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
 	} while (writers);
 }
 
+static void sb_freeze_release(struct super_block *sb)
+{
+	int level;
+	/* Avoid the warning from lockdep_sys_exit() */
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
 			sb->s_writers.frozen = SB_UNFROZEN;
 			smp_wmb();
 			wake_up(&sb->s_writers.wait_unfrozen);
+			sb_freeze_release(sb);
 			deactivate_locked_super(sb);
 			return ret;
 		}
@@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
-- 
1.5.5.1