Re: [PATCH] fs: Do not check if there is a fsnotify watcher on pseudo inodes

From: Amir Goldstein <amir73il@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] fs: Do not check if there is a fsnotify watcher on pseudo inodes
Date: Mon, 15 Jun 2020 14:07:46 +0300	[thread overview]
Message-ID: <CAOQ4uxibBL7W0DR4RhC5ZP7Y68ob5Xj6VncWjepgOj=RB8jrGQ@mail.gmail.com> (raw)
In-Reply-To: <20200615081202.GE9449@quack2.suse.cz>

On Mon, Jun 15, 2020 at 11:12 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 12-06-20 23:34:16, Amir Goldstein wrote:
> > > > > So maybe it would be better to list all users of alloc_file_pseudo()
> > > > > and say that they all should be opted out of fsnotify, without mentioning
> > > > > "internal mount"?
> > > > >
> > > >
> > > > The users are DMA buffers, CXL, aio, anon inodes, hugetlbfs, anonymous
> > > > pipes, shmem and sockets although not all of them necessary end up using
> > > > a VFS operation that triggers fsnotify.  Either way, I don't think it
> > > > makes sense (or even possible) to watch any of those with fanotify so
> > > > setting the flag seems reasonable.
> > > >
> > >
> > > I also think this seems reasonable, but the more accurate reason IMO
> > > is found in the comment for d_alloc_pseudo():
> > > "allocate a dentry (for lookup-less filesystems)..."
> > >
> > > > I updated the changelog and maybe this is clearer.
> > >
> > > I still find the use of "internal mount" terminology too vague.
> > > "lookup-less filesystems" would have been more accurate,
> >
> > Only it is not really accurate for shmfs anf hugetlbfs, which are
> > not lookup-less, they just hand out un-lookable inodes.
>
> OK, but I still think we are safe setting FMODE_NONOTIFY in
> alloc_file_pseudo() and that covers all the cases we care about. Or did I
> misunderstand something in the discussion? I can see e.g.
> __shmem_file_setup() uses alloc_file_pseudo() but again that seems to be
> used only for inodes without a path and the comment before d_alloc_pseudo()
> pretty clearly states this should be the case.
>
> So is the dispute here really only about how to call files using
> d_alloc_pseudo()?
>

Yes, semantics, no technical dispute on the patch.

> > > because as you correctly point out, the user API to set a watch
> > > requires that the marked object is looked up in the filesystem.
> > >
> > > There are also some kernel internal users that set watches
> > > like audit and nfsd, but I think they are also only interested in
> > > inodes that have a path at the time that the mark is setup.
> > >
> >
> > FWIW I verified that watches can be set on anonymous pipes
> > via /proc/XX/fd, so if we are going to apply this patch, I think it
> > should be accompanied with a complimentary patch that forbids
> > setting up a mark on these sort of inodes. If someone out there
> > is doing this, at least they would get a loud message that something
> > has changed instead of silently dropping fsnotify events.
> >
> > So now the question is how do we identify/classify "these sort of
> > inodes"? If they are no common well defining characteristics, we
> > may need to blacklist pipes sockets and anon inodes explicitly
> > with S_NONOTIFY.
>
> We already do have FS_DISALLOW_NOTIFY_PERM in file_system_type->fs_flags so
> adding FS_DISALLOW_NOTIFY would be natural if there is a need for this.

Yes, it is possible, but for the specified use case, it is probably easier
to classify by inode type (and maybe IS_ROOT()) than by filesystem type.
Also, in the case of shmem, the same file_system_type is used for user
mountable tmpfs and the kernel internal shm_mnt instance - only the
latter is used for handing out anonymous shmem files.

>
> I don't think using fsnotify on pipe inodes is sane in any way. You'd
> possibly only get the MODIFY or ACCESS events and even those would not be
> quite reliable because with pipes stuff like splicing etc. is much more
> common and that currently completely bypasses fsnotify subsystem. So
> overall I'm fine with completely ignoring fsnotify on such inodes.
>

Agreed for MODIFY ACCESS. Not so sure about other events.
I see that nfsd filecache backend only marks regular files, so that's fine.
I *think* audit only marks directories and exe files, but completely unsure.

Maybe there is no need to optimize out special inodes from all events
and only exclude them from MODIFY/ACCESS, which are the only
events where performance may be a concern?
Or maybe you did not mean to skip events on special inodes in general?

I am not sure how important OPEN events are on special inodes, but
it is scary to stop sending OPEN_PERM events.

Do you agree that we should also actively disallow setting a mark
on special disconnected inodes? instead of silently dropping events
that current kernel does deliver?

We could disallow setting a mark on a disconnected inode
(one that user is trying to configure by using a /proc/$pid/fd/X path).
We can enforce this restriction for all backends in the common helper
fsnotify_add_mark_locked().

Thanks,
Amir.