All of lore.kernel.org
 help / color / mirror / Atom feed
From: Oleg Nesterov <oleg@redhat.com>
To: Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: __sb_start_write() && force_trylock hack
Date: Tue, 18 Aug 2015 16:49:00 +0200	[thread overview]
Message-ID: <20150818144900.GA26478@redhat.com> (raw)

Jan, Dave, perhaps you can take a look...

On 08/14, Oleg Nesterov wrote:
>
> Plus another patch which removes the "trylock"
> hack in __sb_start_write().

I meant the patch we already discussed (attached at the end). And yes,
previously I reported it passed the tests. However, I only ran the same
'grep -il freeze tests/*/???' tests. When I tried to run all tests, I
got the new reports from lockdep.

	[ 2098.200818] [ INFO: possible recursive locking detected ]
	[ 2098.206845] 4.2.0-rc6+ #27 Not tainted
	[ 2098.211026] ---------------------------------------------
	[ 2098.217050] fsstress/50971 is trying to acquire lock:
	[ 2098.222679]  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.231886]
		       but task is already holding lock:
	[ 2098.238394]  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.247578]
		       other info that might help us debug this:
	[ 2098.254852]  Possible unsafe locking scenario:

	[ 2098.261458]        CPU0
	[ 2098.264185]        ----
	[ 2098.266913]   lock(sb_internal#2);
	[ 2098.270740]   lock(sb_internal#2);
	[ 2098.274566]
			*** DEADLOCK ***

	[ 2098.281171]  May be due to missing lock nesting notation

	[ 2098.288744] 4 locks held by fsstress/50971:
	[ 2098.293408]  #0:  (sb_writers#13){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.303085]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffff8125685f>] filename_create+0x7f/0x170
	[ 2098.314038]  #2:  (sb_internal#2){++++++}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.323711]  #3:  (&type->s_umount_key#54){++++++}, at: [<ffffffffa05a638c>] xfs_flush_inodes+0x1c/0x40 [xfs]
	[ 2098.334898]
		       stack backtrace:
	[ 2098.339762] CPU: 3 PID: 50971 Comm: fsstress Not tainted 4.2.0-rc6+ #27
	[ 2098.347143] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[ 2098.358303]  0000000000000000 00000000e70ee864 ffff880c05a2b9c8 ffffffff817ee692
	[ 2098.366603]  0000000000000000 ffffffff826f8030 ffff880c05a2bab8 ffffffff810f45be
	[ 2098.374900]  0000000000000000 ffff880c05a2bb20 0000000000000000 0000000000000000
	[ 2098.383197] Call Trace:
	[ 2098.385930]  [<ffffffff817ee692>] dump_stack+0x45/0x57
	[ 2098.391667]  [<ffffffff810f45be>] __lock_acquire+0x1e9e/0x2040
	[ 2098.398177]  [<ffffffff810f310d>] ? __lock_acquire+0x9ed/0x2040
	[ 2098.404787]  [<ffffffff811d4702>] ? pagevec_lookup_entries+0x22/0x30
	[ 2098.411879]  [<ffffffff811d5142>] ? truncate_inode_pages_range+0x2b2/0x7e0
	[ 2098.419551]  [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 2098.425566]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
	[ 2098.432079]  [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
	[ 2098.438396]  [<ffffffff81248d32>] ? __sb_start_write+0x32/0x40
	[ 2098.444905]  [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 2098.451244]  [<ffffffffa05a7f84>] xfs_trans_alloc+0x24/0x40 [xfs]
	[ 2098.458076]  [<ffffffffa059e872>] xfs_inactive_truncate+0x32/0x110 [xfs]
	[ 2098.465580]  [<ffffffffa059f662>] xfs_inactive+0x102/0x120 [xfs]
	[ 2098.472308]  [<ffffffffa05a475b>] xfs_fs_evict_inode+0x7b/0xc0 [xfs]
	[ 2098.479401]  [<ffffffff812642ab>] evict+0xab/0x170
	[ 2098.484748]  [<ffffffff81264568>] iput+0x1a8/0x230
	[ 2098.490100]  [<ffffffff8127701c>] sync_inodes_sb+0x14c/0x1d0
	[ 2098.496439]  [<ffffffffa05a6398>] xfs_flush_inodes+0x28/0x40 [xfs]
	[ 2098.503361]  [<ffffffffa059e213>] xfs_create+0x5c3/0x6d0 [xfs]
	[ 2098.509871]  [<ffffffff8135e6bd>] ? avc_has_perm+0x2d/0x280
	[ 2098.516105]  [<ffffffffa059a33e>] xfs_generic_create+0x1de/0x2b0 [xfs]
	[ 2098.523412]  [<ffffffffa059a444>] xfs_vn_mknod+0x14/0x20 [xfs]
	[ 2098.529923]  [<ffffffff81251db5>] vfs_mknod+0xe5/0x160
	[ 2098.535657]  [<ffffffff81257619>] SyS_mknod+0x1a9/0x200
	[ 2098.541490]  [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76

	[ 2151.962337] run fstests generic/084 at 2015-08-17 10:39:08

So it seems that xfs still does recursive read_lock().

Then another one after I rebooted and restarted the test:

	[ 1835.388225] ======================================================
	[ 1835.395118] [ INFO: possible circular locking dependency detected ]
	[ 1835.402110] 4.2.0-rc6+ #27 Not tainted
	[ 1835.406288] -------------------------------------------------------
	[ 1835.413278] renameat2/42316 is trying to acquire lock:
	[ 1835.419008]  (sb_internal){++++..}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.427982]
		       but task is already holding lock:
	[ 1835.434490]  (jbd2_handle){+.+...}, at: [<ffffffff81328ac0>] start_this_handle+0x1a0/0x600
	[ 1835.443753]
		       which lock already depends on the new lock.

	[ 1835.452874]
		       the existing dependency chain (in reverse order) is:
	[ 1835.461223]
		       -> #1 (jbd2_handle){+.+...}:
	[ 1835.465832]        [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 1835.472449]        [<ffffffff81328b1b>] start_this_handle+0x1fb/0x600
	[ 1835.479644]        [<ffffffff813291ce>] jbd2__journal_start+0xae/0x1b0
	[ 1835.486936]        [<ffffffff8130fa35>] __ext4_journal_start_sb+0x75/0x110
	[ 1835.494617]        [<ffffffff812e1c7a>] ext4_evict_inode+0x1aa/0x4c0
	[ 1835.501716]        [<ffffffff812642ab>] evict+0xab/0x170
	[ 1835.507651]        [<ffffffff81264568>] iput+0x1a8/0x230
	[ 1835.513585]        [<ffffffff81256d91>] do_unlinkat+0x1d1/0x2a0
	[ 1835.520200]        [<ffffffff81257896>] SyS_unlink+0x16/0x20
	[ 1835.526520]        [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76
	[ 1835.534301]
		       -> #0 (sb_internal){++++..}:
	[ 1835.538907]        [<ffffffff810f441f>] __lock_acquire+0x1cff/0x2040
	[ 1835.546004]        [<ffffffff810f542e>] lock_acquire+0xbe/0x150
	[ 1835.552617]        [<ffffffff810ede91>] percpu_down_read+0x51/0xa0
	[ 1835.559521]        [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.566424]        [<ffffffff812e1c03>] ext4_evict_inode+0x133/0x4c0
	[ 1835.573521]        [<ffffffff812642ab>] evict+0xab/0x170
	[ 1835.579454]        [<ffffffff81264568>] iput+0x1a8/0x230
	[ 1835.585386]        [<ffffffff812eb61e>] ext4_rename+0x14e/0x8a0
	[ 1835.591999]        [<ffffffff812ebd8d>] ext4_rename2+0x1d/0x30
	[ 1835.598513]        [<ffffffff812524c2>] vfs_rename+0x552/0x880
	[ 1835.605029]        [<ffffffff81258417>] SyS_renameat2+0x537/0x590
	[ 1835.611836]        [<ffffffff817f702e>] entry_SYSCALL_64_fastpath+0x12/0x76
	[ 1835.619612]
		       other info that might help us debug this:
	[ 1835.628540]  Possible unsafe locking scenario:

	[ 1835.635143]        CPU0                    CPU1
	[ 1835.640193]        ----                    ----
	[ 1835.645245]   lock(jbd2_handle);
	[ 1835.648861]                                lock(sb_internal);
	[ 1835.655287]                                lock(jbd2_handle);
	[ 1835.661714]   lock(sb_internal);
	[ 1835.665330]
			*** DEADLOCK ***

	[ 1835.671932] 4 locks held by renameat2/42316:
	[ 1835.676691]  #0:  (sb_writers#8){++++.+}, at: [<ffffffff81248d32>] __sb_start_write+0x32/0x40
	[ 1835.686263]  #1:  (&type->i_mutex_dir_key#3/1){+.+.+.}, at: [<ffffffff81250e04>] lock_rename+0xe4/0x110
	[ 1835.696815]  #2:  (&type->i_mutex_dir_key#3){+.+.+.}, at: [<ffffffff81252306>] vfs_rename+0x396/0x880
	[ 1835.707165]  #3:  (jbd2_handle){+.+...}, at: [<ffffffff81328ac0>] start_this_handle+0x1a0/0x600
	...
	[ 1836.741417] run fstests generic/079 at 2015-08-17 09:48:19

This is the same thing. This lock inversion is safe but only because we
hold the lock on the higher layer.

Looks like, we can't remove this force_trylock hack. Probably the comment
should be updated...

This is 4.2.0-rc6 + Dave's xfs ILOCK fixes + "[PATCH v3 0/8] change sb_writers
to use percpu_rw_semaphore" I sent to Al.

Oleg.

-------------------------------------------------------------------------------
Subject: [PATCH] kill the "force_trylock" hack in __sb_start_write()

As Dave explains, the comment in __sb_start_write() is not right, xfs
no longer does read_lock() twice at the same level:

	when we duplicate a transaction for a rolling commit, we do it
	before committing the current transaction is committed. I think
	that used to take a second freeze reference, which only existed
	until the first transaction was committed. We do things a bit
	differently now - we hold a state flag on the transaction to
	indicate it needs to release the freeze reference when it is
	freed and we pass it to the new transaction so that the first
	transaction commit doesn't release it.

So we can always use percpu_down_read() if wait == T.

TODO: export percpu_down_read/etc and turn __sb_start_write() and
__sb_end_write() into the trivial inlines in include/linux/fs.h.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   30 +++---------------------------
 1 files changed, 3 insertions(+), 27 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 1170dec..0b06e36 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1169,36 +1169,12 @@ EXPORT_SYMBOL(__sb_end_write);
  */
 int __sb_start_write(struct super_block *sb, int level, bool wait)
 {
-	bool force_trylock = false;
-	int ret = 1;
-
-#ifdef CONFIG_LOCKDEP
-	/*
-	 * We want lockdep to tell us about possible deadlocks with freezing
-	 * but it's it bit tricky to properly instrument it. Getting a freeze
-	 * protection works as getting a read lock but there are subtle
-	 * problems. XFS for example gets freeze protection on internal level
-	 * twice in some cases, which is OK only because we already hold a
-	 * freeze protection also on higher level. Due to these cases we have
-	 * to use wait == F (trylock mode) which must not fail.
-	 */
 	if (wait) {
-		int i;
-
-		for (i = 0; i < level - 1; i++)
-			if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
-				force_trylock = true;
-				break;
-			}
-	}
-#endif
-	if (wait && !force_trylock)
 		percpu_down_read(sb->s_writers.rw_sem + level-1);
-	else
-		ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
+		return 1;
+	}

-	WARN_ON(force_trylock & !ret);
-	return ret;
+	return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
 }
 EXPORT_SYMBOL(__sb_start_write);

--
1.5.5.1



             reply	other threads:[~2015-08-18 14:51 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-08-18 14:49 Oleg Nesterov [this message]
2015-08-18 15:18 ` __sb_start_write() && force_trylock hack Oleg Nesterov
2015-08-19  3:25   ` Dave Chinner
2015-08-19  3:11 ` Dave Chinner
2015-08-19 15:00   ` Oleg Nesterov
2015-08-19 23:26     ` Dave Chinner
2015-08-21 18:30       ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150818144900.GA26478@redhat.com \
    --to=oleg@redhat.com \
    --cc=david@fromorbit.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.