From: Jan Kara <jack@suse.cz>
To: David Howells <dhowells@redhat.com>
Cc: Jan Kara <jack@suse.cz>, "Darrick J. Wong" <djwong@kernel.org>,
	Christoph Hellwig <hch@lst.de>,
	jlayton@kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: The new invalidate_lock seems to cause a potential deadlock with fscache
Date: Wed, 22 Sep 2021 13:04:20 +0200
Message-ID: <20210922110420.GA21576@quack2.suse.cz>
In-Reply-To: <3439799.1632261329@warthog.procyon.org.uk>

Hi David!

On Tue 21-09-21 22:55:29, David Howells wrote:
> It seems the new mapping invalidate_lock causes a potential deadlock with
> fscache (see attached trace), though the system didn't actually deadlock.
> 
> It's quite possible that it's actually a false positive, since chain #1
> below is holding locks in two different filesystems.
> 
> Note that this was with my fscache-iter-2 branch, but the I/O paths in use are
> mostly upstream and not much affected by that.
> 
> This was found whilst running xfstests over afs with a cache, and I'd reached
> generic/346 when it tripped, so it seems to happen under very specific
> circumstances.  Rerunning generic/346 by itself does reproduce the problem.
> 
> I'm wondering if I'm going to need to offload netfs_rreq_do_write_to_cache(),
> which initiates the write to the cache, to a worker thread.

Indeed, the culprit for the lockdep splat seems to be that netfs_readpage()
ends up calling sb_start_write(), which ranks above invalidate_lock.  I'd
say that calling vfs_iocb_iter_write() from ->readpage() definitely
violates the "standard" locking wisdom we have :) After some thought I'd
even say there are theoretical scenarios where this could actually
deadlock.
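
Stripped of detail, the fault-time read path is effectively doing the
following (just a sketch of the dependency, not your actual code;
get_cache_file() is a made-up stand-in for looking up the cachefiles
backing file):

#include <linux/fs.h>

static int sketch_readpage(struct file *file, struct page *page)
{
	struct file *cache_file = get_cache_file(file);	/* made up */

	/*
	 * We get here from filemap_fault(), i.e. with mmap_lock and
	 * mapping->invalidate_lock already held (shared).
	 */

	/* ... read the data from the server into @page ... */

	/*
	 * Writing the data to the cache takes SB_FREEZE_WRITE
	 * protection on the *cache* filesystem:
	 */
	sb_start_write(file_inode(cache_file)->i_sb);
	/* ... vfs_iocb_iter_write() on the cache file ... */
	sb_end_write(file_inode(cache_file)->i_sb);
	return 0;
}

Here is one scenario where that ordering bites: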

We have filesystems F1 & F2, where F1 is the network fs and F2 is the cache
fs.

Thread 1	Thread 2		Thread 3	Thread 4
		write (F2)
		  sb_start_write() (F2)
		  prepares write
		  copy_from_user()
					freeze_super (F2)
					  - blocks waiting for Thread 2
fault (F1)
  grabs mmap_lock
							munmap()
							  grab mmap_lock exclusively
							    - blocks on Thread 1
  filemap_fault()
    netfs_readpage()
      sb_start_write() (F2)
        - blocks waiting for Thread 3


		Thread 2 continues
		    fault()
		      tries to grab mmap_lock
			- blocks waiting for Thread 4 (readers must
			  queue behind the pending exclusive locker). RIP.

The core of the problem is that SB_FREEZE_WRITE protection must never be
acquired while holding mmap_lock. That's why we have the separate
SB_FREEZE_PAGEFAULT freeze level after all.
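
For comparison, fault-side code that needs freeze protection uses the
pagefault level, which is fine to take while holding mmap_lock.  A
condensed sketch of the usual pattern (compare filemap_page_mkwrite()):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static vm_fault_t sketch_page_mkwrite(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);

	/*
	 * SB_FREEZE_PAGEFAULT, not SB_FREEZE_WRITE: freeze_super()
	 * stops writers first and page faults only after all writers
	 * have finished, so this is safe to take under mmap_lock.
	 */
	sb_start_pagefault(inode->i_sb);
	lock_page(vmf->page);
	/* ... dirty the page ... */
	sb_end_pagefault(inode->i_sb);
	return VM_FAULT_LOCKED;
}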

And I think we could come up with other deadlocks as well. Simply
depending on write(2) being safe to call while already holding mmap_lock
and a page lock is IMHO very dangerous and will not fly.
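
FWIW, the offload to a worker thread you mention above would presumably
follow the usual workqueue pattern, so that the cache write runs in a
context holding no mmap_lock or page locks.  A rough sketch (the structure
and function names are made up, and refcounting of the request is elided):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct netfs_read_request;		/* from the netfs lib */

struct cache_write_work {
	struct work_struct work;
	struct netfs_read_request *rreq;
};

static void cache_write_worker(struct work_struct *work)
{
	struct cache_write_work *cw =
		container_of(work, struct cache_write_work, work);

	/*
	 * Worker context: no mmap_lock or page locks held, so taking
	 * SB_FREEZE_WRITE on the cache fs here is unproblematic.
	 */
	do_write_to_cache(cw->rreq);	/* made up */
	kfree(cw);
}

static void schedule_cache_write(struct netfs_read_request *rreq)
{
	struct cache_write_work *cw;

	cw = kmalloc(sizeof(*cw), GFP_NOFS);	/* pages are locked here */
	if (!cw)
		return;		/* writing to the cache is best effort */
	cw->rreq = rreq;
	INIT_WORK(&cw->work, cache_write_worker);
	queue_work(system_unbound_wq, &cw->work);
}

You'd still have to make sure the pages (or a copy of the data) stay
around for the worker, of course.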

								Honza


> WARNING: possible circular locking dependency detected
> 5.15.0-rc1-build2+ #292 Not tainted
> ------------------------------------------------------
> holetest/65517 is trying to acquire lock:
> ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5
> 
> but task is already holding lock:
> ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #2 (&mm->mmap_lock#2){++++}-{3:3}:
>        validate_chain+0x3c4/0x4a8
>        __lock_acquire+0x89d/0x949
>        lock_acquire+0x2dc/0x34b
>        __might_fault+0x87/0xb1
>        strncpy_from_user+0x25/0x18c
>        removexattr+0x7c/0xe5
>        __do_sys_fremovexattr+0x73/0x96
>        do_syscall_64+0x67/0x7a
>        entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> -> #1 (sb_writers#10){.+.+}-{0:0}:
>        validate_chain+0x3c4/0x4a8
>        __lock_acquire+0x89d/0x949
>        lock_acquire+0x2dc/0x34b
>        cachefiles_write+0x2b3/0x4bb
>        netfs_rreq_do_write_to_cache+0x3b5/0x432
>        netfs_readpage+0x2de/0x39d
>        filemap_read_page+0x51/0x94
>        filemap_get_pages+0x26f/0x413
>        filemap_read+0x182/0x427
>        new_sync_read+0xf0/0x161
>        vfs_read+0x118/0x16e
>        ksys_read+0xb8/0x12e
>        do_syscall_64+0x67/0x7a
>        entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> -> #0 (mapping.invalidate_lock#3){.+.+}-{3:3}:
>        check_noncircular+0xe4/0x129
>        check_prev_add+0x16b/0x3a4
>        validate_chain+0x3c4/0x4a8
>        __lock_acquire+0x89d/0x949
>        lock_acquire+0x2dc/0x34b
>        down_read+0x40/0x4a
>        filemap_fault+0x276/0x7a5
>        __do_fault+0x96/0xbf
>        do_fault+0x262/0x35a
>        __handle_mm_fault+0x171/0x1b5
>        handle_mm_fault+0x12a/0x233
>        do_user_addr_fault+0x3d2/0x59c
>        exc_page_fault+0x85/0xa5
>        asm_exc_page_fault+0x1e/0x30
> 
> other info that might help us debug this:
> 
> Chain exists of:
>   mapping.invalidate_lock#3 --> sb_writers#10 --> &mm->mmap_lock#2
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&mm->mmap_lock#2);
>                                lock(sb_writers#10);
>                                lock(&mm->mmap_lock#2);
>   lock(mapping.invalidate_lock#3);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by holetest/65517:
>  #0: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
> 
> stack backtrace:
> CPU: 0 PID: 65517 Comm: holetest Not tainted 5.15.0-rc1-build2+ #292
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> Call Trace:
>  dump_stack_lvl+0x45/0x59
>  check_noncircular+0xe4/0x129
>  ? print_circular_bug+0x207/0x207
>  ? validate_chain+0x461/0x4a8
>  ? add_chain_block+0x88/0xd9
>  ? hlist_add_head_rcu+0x49/0x53
>  check_prev_add+0x16b/0x3a4
>  validate_chain+0x3c4/0x4a8
>  ? check_prev_add+0x3a4/0x3a4
>  ? mark_lock+0xa5/0x1c6
>  __lock_acquire+0x89d/0x949
>  lock_acquire+0x2dc/0x34b
>  ? filemap_fault+0x276/0x7a5
>  ? rcu_read_unlock+0x59/0x59
>  ? add_to_page_cache_lru+0x13c/0x13c
>  ? lock_is_held_type+0x7b/0xd3
>  down_read+0x40/0x4a
>  ? filemap_fault+0x276/0x7a5
>  filemap_fault+0x276/0x7a5
>  ? pagecache_get_page+0x2dd/0x2dd
>  ? __lock_acquire+0x8bc/0x949
>  ? pte_offset_kernel.isra.0+0x6d/0xc3
>  __do_fault+0x96/0xbf
>  ? do_fault+0x124/0x35a
>  do_fault+0x262/0x35a
>  ? handle_pte_fault+0x1c1/0x20d
>  __handle_mm_fault+0x171/0x1b5
>  ? handle_pte_fault+0x20d/0x20d
>  ? __lock_release+0x151/0x254
>  ? mark_held_locks+0x1f/0x78
>  ? rcu_read_unlock+0x3a/0x59
>  handle_mm_fault+0x12a/0x233
>  do_user_addr_fault+0x3d2/0x59c
>  ? pgtable_bad+0x70/0x70
>  ? rcu_read_lock_bh_held+0xab/0xab
>  exc_page_fault+0x85/0xa5
>  ? asm_exc_page_fault+0x8/0x30
>  asm_exc_page_fault+0x1e/0x30
> RIP: 0033:0x40192f
> Code: ff 48 89 c3 48 8b 05 50 28 00 00 48 85 ed 7e 23 31 d2 4b 8d 0c 2f eb 0a 0f 1f 00 48 8b 05 39 28 00 00 48 0f af c2 48 83 c2 01 <48> 89 1c 01 48 39 d5 7f e8 8b 0d f2 27 00 00 31 c0 85 c9 74 0e 8b
> RSP: 002b:00007f9931867eb0 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 00007f9931868700 RCX: 00007f993206ac00
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ffc13e06ee0
> RBP: 0000000000000100 R08: 0000000000000000 R09: 00007f9931868700
> R10: 00007f99318689d0 R11: 0000000000000202 R12: 00007ffc13e06ee0
> R13: 0000000000000c00 R14: 00007ffc13e06e00 R15: 00007f993206a000
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
