From: Amir Goldstein <amir73il@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Trond Myklebust <trondmy@hammerspace.com>,
	"bfields@fieldses.org" <bfields@fieldses.org>,
	"khazhy@google.com" <khazhy@google.com>,
	"chuck.lever@oracle.com" <chuck.lever@oracle.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Jeff Layton <jlayton@kernel.org>
Subject: Re: [PATCH RFC] nfsd: avoid recursive locking through fsnotify
Date: Mon, 21 Mar 2022 13:56:47 +0200
Message-ID: <CAOQ4uxiLXqmAC=769ufLA2dKKfHxm=c_8B0N2y4c-aZ5Qci2hg@mail.gmail.com>
In-Reply-To: <20220321112310.vpr7oxro2xkz5llh@quack3.lan>

On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > >
> > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may result
> > > > in recursing back into nfsd, resulting in a deadlock. See the stack below.
> > > >
> > > > nfsd            D    0 1591536      2 0x80004080
> > > > Call Trace:
> > > >  __schedule+0x497/0x630
> > > >  schedule+0x67/0x90
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock+0x347/0x4b0
> > > >  fsnotify_destroy_mark+0x22/0xa0
> > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > >  do_shrink_slab+0x1f2/0x330
> > > >  shrink_slab+0x244/0x2f0
> > > >  shrink_node+0xd7/0x490
> > > >  do_try_to_free_pages+0x12f/0x3b0
> > > >  try_to_free_pages+0x43f/0x540
> > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > >  alloc_slab_page+0x32/0x2e0
> > > >  new_slab+0xa6/0x8b0
> > > >  ___slab_alloc+0x34b/0x520
> > > >  kmem_cache_alloc+0x1c4/0x250
> > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > >  fsnotify_add_mark+0x48/0x70
> > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > >  svc_process+0xf9/0x110 [sunrpc]
> > > >  nfsd+0x10e/0x180 [nfsd]
> > > >  kthread+0x130/0x140
> > > >  ret_from_fork+0x35/0x40
> > > >
> > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > > ---
> > > >  fs/nfsd/filecache.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > Marking this RFC since I haven't actually had a chance to test this, but
> > > > we're seeing this deadlock for some customers.
> > > >
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > > >         struct fsnotify_mark    *mark;
> > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > >         struct inode *inode = nf->nf_inode;
> > > > +       unsigned int pflags;
> > > >
> > > >         do {
> > > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
> > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > >                 refcount_set(&new->nfm_ref, 1);
> > > >
> > > > +               /* fsnotify allocates, avoid recursion back into nfsd */
> > > > +               pflags = memalloc_nofs_save();
> > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
> > > > +               memalloc_nofs_restore(pflags);
> > > >
> > > >                 /*
> > > >                  * If the add was successful, then return the object.
> > >
> > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > filesystem writeback situation?
> > >
> > > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > > to me that it cannot, particularly since knfsd is not a filesystem, and
> > > so does not ever handle writeback of dirty pages.
> > >
> >
> > Maybe NOFS throttles direct reclaims to the point that the problem is
> > harder to hit?
> >
> > This report came in at a good time for me.
> >
> > It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
> > As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> > frees fsnotify marks from a memory shrinker. "Volatile" fanotify marks would also
> > be evictable in that way, so they would expose fanotify to this deadlock.
> >
> > For the short term, maybe nfsd filecache can avoid the problem by checking
> > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
> > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
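
(For illustration, that stopgap could look roughly like the sketch below, as if
in fs/nfsd/filecache.c. Untested and not the exact nfsd code; the check is
over-broad because mutex_is_locked() cannot tell whether it is *this* task
holding the mutex, which is why a mutex_is_locked_by_me() helper would be
nicer.)

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	LIST_HEAD(head);
	unsigned long ret;

	/*
	 * If someone (possibly this task, allocating under mark_mutex)
	 * holds the fsnotify group lock, do not free nfsd_files now:
	 * nfsd_file_free() -> fsnotify_destroy_mark() would block on
	 * the same mutex.
	 */
	if (mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	ret = list_lru_shrink_walk(&nfsd_file_lru, sc, nfsd_file_lru_cb, &head);
	nfsd_file_lru_dispose(&head);
	return ret;
}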
> >
> > Jan,
> >
> > A relatively simple fix would be to allocate the fsnotify_mark_connector in
> > fsnotify_add_mark() and free it if a connector already exists for the object.
> > I don't think there is a good reason to optimize away this allocation
> > for the case of a non-first group setting a mark on an object?
>
> Indeed, nasty. Volatile marks will add group->mark_mutex into a set of
> locks grabbed during inode slab reclaim. So any allocation under
> group->mark_mutex has to be GFP_NOFS now. This is not just about connector
> allocations but also about mark allocations for fanotify. Moving allocations
> from under mark_mutex is also a possible solution, but passing preallocated
> memory around is kind of ugly as well.

Yes, kind of, here is how it looks:
https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3
https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7
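
The gist, heavily simplified, is to allocate the connector before taking
mark_mutex and throw it away if it turns out not to be needed. Roughly (this is
an untested sketch illustrating the idea, not the actual diffs above; the extra
parameter that lets fsnotify_add_mark_locked() consume a preallocated connector
is assumed):

/* Sketch, as if in fs/notify/mark.c. */
int fsnotify_add_mark(struct fsnotify_mark *mark, fsnotify_connp_t *connp,
		      unsigned int obj_type, int allow_dups,
		      __kernel_fsid_t *fsid)
{
	struct fsnotify_group *group = mark->group;
	struct fsnotify_mark_connector *conn;
	int ret;

	/* Allocate before taking mark_mutex so reclaim cannot deadlock on it. */
	conn = kmem_cache_alloc(fsnotify_mark_connector_cachep, GFP_KERNEL);
	if (!conn)
		return -ENOMEM;

	mutex_lock(&group->mark_mutex);
	/*
	 * Assumed change: the locked helper takes *conn only if the object
	 * does not have a connector yet, and NULLs the pointer when it does.
	 */
	ret = fsnotify_add_mark_locked(mark, connp, obj_type, allow_dups,
				       fsid, &conn);
	mutex_unlock(&group->mark_mutex);

	/* Not consumed: a connector already existed for this object. */
	if (conn)
		kmem_cache_free(fsnotify_mark_connector_cachep, conn);
	return ret;
}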

> So the cleanest solution I currently see is
> to come up with helpers like "fsnotify_lock_group() &
> fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do
> memalloc_nofs_save / restore magic.
>

Sounds good. Won't this cause a regression, i.e. more failures to set up a new
mark under memory pressure?

Should we maintain a flag in the group, e.g. FSNOTIFY_GROUP_SHRINKABLE, and set
the NOFS state only in that case, so at least we don't cause a regression for
existing applications?
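
I am thinking of something along these lines (untested sketch; the
FSNOTIFY_GROUP_SHRINKABLE bit, the group->flags field and the nofs_pflags
member for stashing the saved flags are all made up here, and
memalloc_nofs_save() needs <linux/sched/mm.h>):

/* Sketch, as if in include/linux/fsnotify_backend.h. */
static inline void fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	/* Only backends that free marks from a shrinker pay the NOFS cost. */
	if (group->flags & FSNOTIFY_GROUP_SHRINKABLE)
		group->nofs_pflags = memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group)
{
	if (group->flags & FSNOTIFY_GROUP_SHRINKABLE)
		memalloc_nofs_restore(group->nofs_pflags);
	mutex_unlock(&group->mark_mutex);
}

Stashing nofs_pflags in the group should be safe, since it is only touched
while mark_mutex is held.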

Thanks,
Amir.
