All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Biggers <ebiggers@kernel.org>
To: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jaegeuk Kim <jaegeuk@kernel.org>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Suleiman Souhlal <suleiman@google.com>,
	linux-fscrypt@vger.kernel.org, stable@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [stable] ext4 fscrypt_get_encryption_info() circular locking dependency
Date: Fri, 11 Dec 2020 10:03:26 -0800	[thread overview]
Message-ID: <X9O0brQ7junfZTfI@sol.localdomain> (raw)
In-Reply-To: <20201211040807.GF1667627@google.com>

On Fri, Dec 11, 2020 at 01:08:07PM +0900, Sergey Senozhatsky wrote:
> On (20/12/10 19:48), Eric Biggers wrote:
> > > 
> > > [  133.454836] Chain exists of:
> > >                  jbd2_handle --> fscrypt_init_mutex --> fs_reclaim
> > > 
> > > [  133.454840]  Possible unsafe locking scenario:
> > > 
> > > [  133.454841]        CPU0                    CPU1
> > > [  133.454843]        ----                    ----
> > > [  133.454844]   lock(fs_reclaim);
> > > [  133.454846]                                lock(fscrypt_init_mutex);
> > > [  133.454848]                                lock(fs_reclaim);
> > > [  133.454850]   lock(jbd2_handle);
> > > [  133.454851] 
> > 
> > This actually got fixed by the patch series
> > https://lkml.kernel.org/linux-fscrypt/20200913083620.170627-1-ebiggers@kernel.org/
> > which went into 5.10.  The more recent patch to remove ext4_dir_open() isn't
> > related.
> > 
> > It's a hard patch series to backport.  Backporting it to 5.4 would be somewhat
> > feasible, while 4.19 would be very difficult as there have been a lot of other
> > fscrypt commits which would heavily conflict with cherry-picks.
> > 
> > How interested are you in having this fixed?  Did you encounter an actual
> > deadlock or just the lockdep report?
> 
> Difficult to say. On one hand 'yes' I see lockups on my devices (4.19
> kernel); I can't tell at the moment what's the root cause. So on the
> other hand 'no' I can't say that it's because of ext4_dir_open().
> 
> What I saw so far involved ext4, kswapd, khugepaged and lots of other things.
> 
> [ 1598.655901] INFO: task khugepaged:66 blocked for more than 122 seconds.
> [ 1598.655914] Call Trace:
> [ 1598.655920]  __schedule+0x506/0x1240
> [ 1598.655924]  ? kvm_zap_rmapp+0x52/0x69
> [ 1598.655927]  schedule+0x3f/0x78
> [ 1598.655929]  __rwsem_down_read_failed_common+0x186/0x201
> [ 1598.655933]  call_rwsem_down_read_failed+0x14/0x30
> [ 1598.655936]  down_read+0x2e/0x45
> [ 1598.655939]  rmap_walk_file+0x73/0x1ce
> [ 1598.655941]  page_referenced+0x10d/0x154
> [ 1598.655948]  shrink_active_list+0x1d4/0x475
> 
> [..]
> 
> [ 1598.655986] INFO: task kswapd0:79 blocked for more than 122 seconds.
> [ 1598.655993] Call Trace:
> [ 1598.655995]  __schedule+0x506/0x1240
> [ 1598.655998]  schedule+0x3f/0x78
> [ 1598.656000]  __rwsem_down_read_failed_common+0x186/0x201
> [ 1598.656003]  call_rwsem_down_read_failed+0x14/0x30
> [ 1598.656006]  down_read+0x2e/0x45
> [ 1598.656008]  rmap_walk_file+0x73/0x1ce
> [ 1598.656010]  page_referenced+0x10d/0x154
> [ 1598.656015]  shrink_active_list+0x1d4/0x475
> 
> [..]
> 
> [ 1598.658233]  __rwsem_down_read_failed_common+0x186/0x201
> [ 1598.658235]  call_rwsem_down_read_failed+0x14/0x30
> [ 1598.658238]  down_read+0x2e/0x45
> [ 1598.658240]  rmap_walk_file+0x73/0x1ce
> [ 1598.658242]  page_referenced+0x10d/0x154
> [ 1598.658247]  shrink_active_list+0x1d4/0x475
> [ 1598.658250]  shrink_node+0x27e/0x661
> [ 1598.658254]  try_to_free_pages+0x425/0x7ec
> [ 1598.658258]  __alloc_pages_nodemask+0x80b/0x1514
> [ 1598.658279]  __do_page_cache_readahead+0xd4/0x1a9
> [ 1598.658282]  filemap_fault+0x346/0x573
> [ 1598.658287]  ext4_filemap_fault+0x31/0x44

Could you provide some more information about what is causing these actual
lockups for you?  Are there more stack traces?

I'd be surprised if it's related to the fscrypt-related lockdep reports you're
getting, as the fscrypt bug seemed unlikely to cause deadlocks in practice, as
most of the time fscrypt_get_encryption_info() does *not* do or wait for a
GFP_KERNEL allocation.  It's only in certain causes (generally, when things are
being initialized as opposed to already running) that it could.

See e.g. how the lockdep reports assume GFP_KERNEL done under
fscrypt_init_mutex, but that only happens the first time an encrypted directory
is accessed after boot.  So that path can't cause a deadlock after that.

This was also a 5 years old bug, so it's unclear why it would suddenly be
causing problems just now...

Maybe something changed that exposed the bug more.  I don't know what it would
be, though.

As I said, the fix would be difficult to backport.  It required a redesign of
how encrypted files get created, as there were lots of different ways in which
fscrypt_get_encryption_info() could be called during a filesystem transaction.
There's a nontrivial risk of regressions by backporting it.  (In fact I already
found and fixed two regressions in it upstream...)

So it would be helpful to know if this is important enough to you that you would
be willing to accept a risk of regressions above that of a typical stable
backport in order to get the re-design that fixes this issue backported.

- Eric

  parent reply	other threads:[~2020-12-11 19:41 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-11  3:36 [stable] ext4 fscrypt_get_encryption_info() circular locking dependency Sergey Senozhatsky
2020-12-11  3:48 ` Eric Biggers
2020-12-11  4:08   ` Sergey Senozhatsky
2020-12-11  4:34     ` Sergey Senozhatsky
2020-12-11 18:03     ` Eric Biggers [this message]
2021-01-13 11:34       ` Sergey Senozhatsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=X9O0brQ7junfZTfI@sol.localdomain \
    --to=ebiggers@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=jaegeuk@kernel.org \
    --cc=linux-fscrypt@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sergey.senozhatsky.work@gmail.com \
    --cc=stable@vger.kernel.org \
    --cc=suleiman@google.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.