Re: thoughts about fanotify and HSM

From: Amir Goldstein <amir73il@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: thoughts about fanotify and HSM
Date: Thu, 3 Nov 2022 15:38:35 +0200	[thread overview]
Message-ID: <CAOQ4uxgKwjKbH_qbzpWzzMq8bktn+DeV1OZNtrxa6jaBPSMHsg@mail.gmail.com> (raw)
In-Reply-To: <20221103125748.474y4l3vf2h62mot@quack3>

On Thu, Nov 3, 2022 at 2:57 PM Jan Kara <jack@suse.cz> wrote:
>
> I'm sorry for the really delayed response. We had an internal conference
> and some stuff around that which made me busy.
>

No problem at all.
There is a lot to process in this thread.
My follow up email about avoiding TOCTOU is worse...
I am happy for whatever feedback you can provide me when you have the time.

> On Thu 13-10-22 15:16:25, Amir Goldstein wrote:
> > On Wed, Oct 12, 2022 at 7:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Wed, Oct 12, 2022 at 6:44 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hi Amir!
> > > >
> > > > On Fri 07-10-22 16:58:21, Amir Goldstein wrote:
> > > > > > > The other use case of automatic inode marks I was thinking about,
> > > > > > > which are even more relevant for $SUBJECT is this:
> > > > > > > When instantiating a dentry with an inode that has xattr
> > > > > > > "security.fanotify.mask" (a.k.a. persistent inode mark), an inode
> > > > > > > mark could be auto created and attached to a group with a special sb
> > > > > > > mark (we can limit a single special mark per sb).
> > > > > > > This could be implemented similar to get_acl(), where i_fsnotify_mask
> > > > > > > is always initialized with a special value (i.e. FS_UNINITIALIZED)
> > > > > > > which is set to either 0 or non-zero if "security.fanotify.mask" exists.
> > > > > > >
> > > > > > > The details of how such an API would look like are very unclear to me,
> > > > > > > so I will try to focus on the recursive auto inode mark idea.
> > > > > >
> > > > > > Yeah, although initializing fanotify marks based on xattrs does not look
> > > > > > completely crazy I can see a lot of open questions there so I think
> > > > > > automatic inode mark idea has more chances for success at this point :).
> > > > >
> > > > > I realized that there is one sort of "persistent mark" who raises
> > > > > less questions - one that only has an ignore mask.
> > > > >
> > > > > ignore masks can have a "static" namespace that is not bound to any
> > > > > specific group, but rather a set of groups that join this namespace.
> > > > >
> > > > > I played with this idea and wrote some patches:
> > > > > https://github.com/amir73il/linux/commits/fan_xattr_ignore_mask
> > > >
> > > > I have glanced over the patches. In general the idea looks OK to me but I
> > > > have some concerns:
> > > >
> > > > 1) Technically, it may be challenging to call into filesystem xattr
> > > > handling code on first event generated by the inode - that may generate
> > > > some unexpected lock recursion for some filesystems and some events which
> > > > trigger the initialization...
> > >
> > > That may be a correct statement in general, but please note that
> > > - Only permission events trigger auto-init of xattr ignore mask
> > > - Permission events are called from security hooks
> > > - Security hooks may also call getxattr to get the security context
> > >
> > > Perhaps LSMs always initialize the security context in OPEN and
> > > never in ACCESS?
> > >
> > > One of the earlier versions of the patch initialized xattr ignore mask
> > > on LOOKUP permission event, if ANY object was interested in ANY
> > > permission event even if no object was interested in LOOKUP
> > > to mimic the LSM context initialization,
> > > but it was complicated and I wasn't sure if this was necessary.
> > >
> >
> > Also, permission events can sleep by definition
> > so why would getxattr not be safe in the
> > context of permission events handling?
>
> Well, I'm not afraid of sleeping. I was more worried about lock ordering
> issues. But are right that this probably isn't going to be an issue.
>
> > > > 2) What if you set the xattr while the group is already listening to
> > > > events? Currently the change will get ignored, won't it? But I guess this
> > > > could be handled by clearing the "cached" flag when the xattr is set.
> > > >
> > >
> > > I have created an API to update the xattr via
> > >   fanotify_mark(FAN_MARK_XATTR, ...
> > > which updates the cached ignore mask in the connector.
> > >
> > > I see no reason to support "direct" modifications of this xattr.
> > > If such changes are made directly it is fine to ignore them.
> > >
> > > > 3) What if multiple applications want to use the persistent mark
> > > > functionality? I think we need some way to associate a particular
> > > > fanotify group with a particular subset of fanotify xattrs so that
> > > > coexistence of multiple applications is possible...
> > > >
> > >
> > > Yeh, I thought about this as well.
> > > The API in the patches is quite naive because it implements a single
> > > global namespace for xattr ignore masks, but mostly I wanted to
> > > see if I can get the fast path and auto-init implementation done.
> > >
> > > I was generally thinking of ioctl() as the API to join an xattr marks
> > > namespace and negotiate the on-disk format of persistent marks
> > > supported by the application.
> > >
> > > I would not want to allow multiple fanotify xattrs per inode -
> > > that could have the consequence of the inode becoming a junkyard.
> > >
> > > I'd prefer to have a single xattr (say security.fanotify.mark)
> > > and that mark will have
> > > - on-disk format version
> > > - namespace id
> > > - ignore mask
> > > - etc
> > >
> > > If multiple applications want to use persistent marks they need to figure
> > > out how to work together without stepping on each other's toes.
> > > I don't think it is up to fanotify to coordinate that.
>
> I'm not sure if this really scales. Imagine you have your say backup
> application that wants to use persistent marks and then you have your HSM
> application wanting to do the same. Or you have some daemon caching
> preparsed contents of config/ directory and watching whether it needs to
> rescan the dir and rebuild the cache using persistent marks (I'm hearing
> requests like these for persistent marks from desktop people for many
> years). How exactly are these going to coordinate?
>
> I fully understand your concern about the clutter in inode xattrs but if
> we're going to limit the kernel to support only one persistent marks user,
> then IMO we also need to provide a userspace counterpart (in the form of
> some service or library like persistent change notification journal) that
> will handle the coordination. Because otherwise it will become a mess
> rather quickly.
>

The concept of singleton userspace services exists, but getting their
UAPI right is a challenge.
We can draw inspiration from Windows, which is decades a head of
Linux w.r.t persistent fs notifications:
https://learn.microsoft.com/en-us/windows/win32/cfapi/build-a-cloud-file-sync-engine

IIUC, the UAPI allows a single CloudSync engine to register per fs (or
subtree root)
to get the callbacks from a file with the persistent marks (called
reparse points)
in the "Windows.Storage.Provider" namespace.

IOW, you may have OneDrive content provider or GoogleDrive content provider
serving the persistent marks on a specific directory, never both.

But Windows does support different namespaces for reparse points,
so it's an example for both sides of our arguments.

> > > fanotify_mark() can fail with EEXIST when a group that joined namespace A
> > > is trying to update a persistent mark when a persistent mark of namespace B
> > > already exists and probably some FAN_MARK_REPLACE flag could be used
> > > to force overwrite the existing persistent mark.
> >
> > One thing that I feel a bit silly about is something that I only fully
> > noticed after writing this WIP wiki entry:
> > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#persistent-inode-marks
> >
> > Persistent marks (in xattr) with ignore mask are useful, but only a
> > little bit more useful than Evictable marks with ignore mask.
> >
> > Persistent marks (in xattr) with a "positive" event mask are the real deal.
> > Because with "positive" persistent marks, we will be able to optimize away
> > srcu_read_lock() and marks iteration for the majority of fsnotify hooks
> > by looking at objects interest masks (including the FS_XATTR_CACHED bit).
> >
> > The good thing about the POC patches [1] is that there is no technical
> > limitation in this code for using persistent marks with positive event
> > masks.  It is a bit more challenging to document the fact that a "normal"
> > fs/mount mark is needed in order to "activate" the persistent marks in
> > the inodes of this fs/mount, but the changes to the code to support that
> > would be minimal.
>
> I agree persistent positive marks are very interesting (in fact I've heard
> requests for functionality like this about 15 years ago :)). But if you'd
> like to use them e.g. for backup or HSM then you need to somehow make sure
> you didn't miss any events before you created the activation mark? That
> looks like a bit unpleasant race with possible data (backup) corruption
> consequences?
>

Yeh, positive persistent marks is quite futuristic at this point.
My POC will stick to mount mark and inode ignore marks, where these
races are rather easy to avoid using FAN_MARK_SYNC (see next email).

Anyway, I have made a decision to aim for an initial HSM UAPI implementation
with evictable ignore marks (without persistent marks) because evictable marks
are functionally sufficient and persistent marks have too many UAPI
open questions.

I've modified the POC code to work to try persistent ignore marks
and fall back to evictable ignore marks, which incur a few extra userspace
callbacks after service restart or after drop caches.

Thanks,
Amir.