linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fanotify and LSM path hooks
@ 2019-04-14 16:04 Amir Goldstein
  2019-04-14 16:39 ` Al Viro
  2019-04-16 15:45 ` Jan Kara
  0 siblings, 2 replies; 17+ messages in thread
From: Amir Goldstein @ 2019-04-14 16:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, LSM List, Serge E. Hallyn, James Morris, Al Viro,
	Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda, Tetsuo Handa,
	John Johansen

Hi Jan,

I started to look at directory pre-modification "permission" hooks
that were discussed on last year's LSFMM:
https://lwn.net/Articles/755277/

The two existing fanotify_perm() hooks are called
from security_file_permission() and security_file_open()
and depend on build time CONFIG_SECURITY.
If you look at how the fsnotify_perm() hooks are planted inside the
generic security hooks, one might wonder, why are fanotify permission
hooks getting a special treatment and are not registering as LSM hooks?

One benefit from an fanotify LSM, besides more generic code, would be
that fanotify permission hooks could be disabled with boot parameters.

I only bring this up because security hooks seems like the most natural
place to add pre-modify fanotify events for the purpose of maintaining
a filesystem change journal. It would be ugly to spray more fsnotify hooks
inside security hooks instead of registering an fanotify LSM, but maybe
there are downsides of registering fanotify as LSM that I am not aware of?

Another observation relates to the security_path_ hooks.
Let's take rename as an example.
LSM could implement security_path_rename() and/or security_inode_rename()
hooks and rename syscalls will end up calling both hooks.
The security_path_ hooks are more attractive for fanotify, because the path
information could be used to setup pre-modification permission mask on
mount marks and not only on filesystem/inode marks.

One problem with security_path_ hooks is that they require an extra
build time CONFIG_SECURITY_PATH.
Another problem is that they seem to be bypassed by several subsystems.
cachefiles, ecryptfs, overlayfs and nfsd all call the vfs_rename() helper, but
only cachefiles bothers to call the security_path_rename() hook.
This is of course true for all other security_path_ hooks.
I think that is something that requires fixing regardless of the fanotify pre
modification hooks. I wonder if tomoyo and apparmor developers
(LSM that implement security_path_ hooks) are aware of those missing
hooks?

Would love to get feedback about whether or not fanotify LSM sounds
like a good or bad idea and about the security_path_ hooks questions.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-14 16:04 fanotify and LSM path hooks Amir Goldstein
@ 2019-04-14 16:39 ` Al Viro
  2019-04-14 18:51   ` Amir Goldstein
  2019-04-16 15:45 ` Jan Kara
  1 sibling, 1 reply; 17+ messages in thread
From: Al Viro @ 2019-04-14 16:39 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, LSM List, Serge E. Hallyn, James Morris,
	Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda, Tetsuo Handa,
	John Johansen

On Sun, Apr 14, 2019 at 07:04:14PM +0300, Amir Goldstein wrote:

> Another problem is that they seem to be bypassed by several subsystems.
> cachefiles, ecryptfs, overlayfs and nfsd all call the vfs_rename() helper, but
> only cachefiles bothers to call the security_path_rename() hook.
> This is of course true for all other security_path_ hooks.
> I think that is something that requires fixing regardless of the fanotify pre
> modification hooks. I wonder if tomoyo and apparmor developers
> (LSM that implement security_path_ hooks) are aware of those missing
> hooks?

First of all, _what_ path?  You do realize that there is no such thing
as *the* pathname of dentry, right?  The same filesystem may be mounted
in any number of places, some of which might be visible in a given
namespace (including "none of them" - and you are not even guaranteed
that they are visible in any namespace at all).

It's not "bypassed", it's "inapplicable and deeply flawed in general".

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-14 16:39 ` Al Viro
@ 2019-04-14 18:51   ` Amir Goldstein
  2019-04-14 19:26     ` Al Viro
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2019-04-14 18:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, linux-fsdevel, LSM List, Serge E. Hallyn, James Morris,
	Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda, Tetsuo Handa,
	John Johansen

On Sun, Apr 14, 2019 at 7:40 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Apr 14, 2019 at 07:04:14PM +0300, Amir Goldstein wrote:
>
> > Another problem is that they seem to be bypassed by several subsystems.
> > cachefiles, ecryptfs, overlayfs and nfsd all call the vfs_rename() helper, but
> > only cachefiles bothers to call the security_path_rename() hook.
> > This is of course true for all other security_path_ hooks.
> > I think that is something that requires fixing regardless of the fanotify pre
> > modification hooks. I wonder if tomoyo and apparmor developers
> > (LSM that implement security_path_ hooks) are aware of those missing
> > hooks?
>
> First of all, _what_ path?  You do realize that there is no such thing
> as *the* pathname of dentry, right?

Yap.

> The same filesystem may be mounted
> in any number of places, some of which might be visible in a given
> namespace (including "none of them" - and you are not even guaranteed
> that they are visible in any namespace at all).
>
> It's not "bypassed", it's "inapplicable and deeply flawed in general".

I am assuming that you are referring to the security_path_ hooks in
general. I cannot speak on behalf of Tomoyo and Apparmor, but I think
I understand why you view path based security policy as flawed.

From fanotify POV, passing the path that was used to operate on
dentries to fanotify lets the users choose if they want to get events
for all operations on an inode, all operations on a specific filesystem or
all operations where the inode was accessed via a specific mount
(FAN_MARK_MOUNT).

When looking at userspace applications of kernel filesystem change
notifications, like https://facebook.github.io/watchman/
I believe what users really want is a subtree watch.
So for example, you may want to monitor your git workspace(s) for changes
on a large source tree(s) to save time on git index updates.
If you bind mount your git work tree and watch the mount for changes,
chances are nobody is messing with your source files from another
mount and that you do not have hardlinks pointing outside the git
workspace root.

But the truth is I would much rather that users have a way to mark
a subtree root and ask fanotify for events under that subtree.
As a matter of fact, I have some private POC patches that allow users to
setup a mark on a "subtree root" dentry, which really marks the super block
and keep a reference to the dentry. Than every event on that super block
is filtered with is_subdir() against the marked dentry.
I am just not convinced that is the most efficient way to implement a
subtree filter... another though I had was to filter not by subtree, but by
project id and let users worry about populating their subtrees of interest
with appropriate projids.

More comments about ideas that are deeply flawed are welcome.
Those comments would hopefully help me navigate towards the
mildly flawed ideas ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-14 18:51   ` Amir Goldstein
@ 2019-04-14 19:26     ` Al Viro
  2019-04-14 20:28       ` Amir Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Al Viro @ 2019-04-14 19:26 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, LSM List, Serge E. Hallyn, James Morris,
	Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda, Tetsuo Handa,
	John Johansen

On Sun, Apr 14, 2019 at 09:51:38PM +0300, Amir Goldstein wrote:

> But the truth is I would much rather that users have a way to mark
> a subtree root and ask fanotify for events under that subtree.
> As a matter of fact, I have some private POC patches that allow users to
> setup a mark on a "subtree root" dentry, which really marks the super block
> and keep a reference to the dentry. Than every event on that super block
> is filtered with is_subdir() against the marked dentry.

And that is_subdir() is protected by what, exactly?  And what happens
if you have many such dentries?

Or, for that matter, what happens if that dentry gets invalidated?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-14 19:26     ` Al Viro
@ 2019-04-14 20:28       ` Amir Goldstein
  0 siblings, 0 replies; 17+ messages in thread
From: Amir Goldstein @ 2019-04-14 20:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, linux-fsdevel, LSM List, Serge E. Hallyn, James Morris,
	Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda, Tetsuo Handa,
	John Johansen

On Sun, Apr 14, 2019 at 10:26 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Apr 14, 2019 at 09:51:38PM +0300, Amir Goldstein wrote:
>
> > But the truth is I would much rather that users have a way to mark
> > a subtree root and ask fanotify for events under that subtree.
> > As a matter of fact, I have some private POC patches that allow users to
> > setup a mark on a "subtree root" dentry, which really marks the super block
> > and keep a reference to the dentry. Than every event on that super block
> > is filtered with is_subdir() against the marked dentry.
>
> And that is_subdir() is protected by what, exactly?  And what happens
> if you have many such dentries?
>
> Or, for that matter, what happens if that dentry gets invalidated?

Well, as I said, its just a POC, it only supports filtering by a single dentry
and I didn't think that was the way to go forward. I am looking for the best way
to go forward.

That's why I was looking for an API to "fence" renaming objects in/out of
of a subtree root. It sounds useful to have something like this for
containers and chroots.

Let's look at the options users have today.
Users can use recursive inotify that is racy and pins too many inodes in cache.
Users can use the new fanotify sb mark to get all events on filesystem and
filter them by path is userspace (also racy w.r.t ancestry).

I donno, maybe filtering by projid or another inherited persistent
inode property
is good enough for the existing use cases out there - this seems to be the way
ext4 is going with encrypted subtrees and case insensitive subtrees.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-14 16:04 fanotify and LSM path hooks Amir Goldstein
  2019-04-14 16:39 ` Al Viro
@ 2019-04-16 15:45 ` Jan Kara
  2019-04-16 18:24   ` Amir Goldstein
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Kara @ 2019-04-16 15:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, LSM List, Serge E. Hallyn, James Morris,
	Al Viro, Miklos Szeredi, Matthew Bobrowski, Kentaro Takeda,
	Tetsuo Handa, John Johansen

Hi Amir!

On Sun 14-04-19 19:04:14, Amir Goldstein wrote:
> I started to look at directory pre-modification "permission" hooks
> that were discussed on last year's LSFMM:
> https://lwn.net/Articles/755277/
> 
> The two existing fanotify_perm() hooks are called
> from security_file_permission() and security_file_open()
> and depend on build time CONFIG_SECURITY.
> If you look at how the fsnotify_perm() hooks are planted inside the
> generic security hooks, one might wonder, why are fanotify permission
> hooks getting a special treatment and are not registering as LSM hooks?
> 
> One benefit from an fanotify LSM, besides more generic code, would be
> that fanotify permission hooks could be disabled with boot parameters.
> 
> I only bring this up because security hooks seems like the most natural
> place to add pre-modify fanotify events for the purpose of maintaining
> a filesystem change journal. It would be ugly to spray more fsnotify hooks
> inside security hooks instead of registering an fanotify LSM, but maybe
> there are downsides of registering fanotify as LSM that I am not aware of?

I kind of like the idea of generating fanotify permission events from
special LSM hooks.

I'm not so sure about directory pre-modification hooks. Given the amount of
problems we face with applications using fanotify permission events and
deadlocking the system, I'm not very fond of expanding that API... AFAIU
you want to use such hooks for recording (and persisting) that some change
is going to happen and provide crash-consistency guarantees for such
journal?

> Another observation relates to the security_path_ hooks.
> Let's take rename as an example.
> LSM could implement security_path_rename() and/or security_inode_rename()
> hooks and rename syscalls will end up calling both hooks.
> The security_path_ hooks are more attractive for fanotify, because the path
> information could be used to setup pre-modification permission mask on
> mount marks and not only on filesystem/inode marks.
> 
> One problem with security_path_ hooks is that they require an extra
> build time CONFIG_SECURITY_PATH.
> Another problem is that they seem to be bypassed by several subsystems.
> cachefiles, ecryptfs, overlayfs and nfsd all call the vfs_rename() helper, but
> only cachefiles bothers to call the security_path_rename() hook.
> This is of course true for all other security_path_ hooks.
> I think that is something that requires fixing regardless of the fanotify pre
> modification hooks. I wonder if tomoyo and apparmor developers
> (LSM that implement security_path_ hooks) are aware of those missing
> hooks?
> 
> Would love to get feedback about whether or not fanotify LSM sounds
> like a good or bad idea and about the security_path_ hooks questions.

I don't have strong opinion on using security_path_ hooks. I guess if
they're not used everywhere then it's just easier to avoid them.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-16 15:45 ` Jan Kara
@ 2019-04-16 18:24   ` Amir Goldstein
  2019-04-17 11:30     ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2019-04-16 18:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	LSM List, overlayfs

On Tue, Apr 16, 2019 at 6:45 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Sun 14-04-19 19:04:14, Amir Goldstein wrote:
> > I started to look at directory pre-modification "permission" hooks
> > that were discussed on last year's LSFMM:
> > https://lwn.net/Articles/755277/
> >
> > The two existing fanotify_perm() hooks are called
> > from security_file_permission() and security_file_open()
> > and depend on build time CONFIG_SECURITY.
> > If you look at how the fsnotify_perm() hooks are planted inside the
> > generic security hooks, one might wonder, why are fanotify permission
> > hooks getting a special treatment and are not registering as LSM hooks?
> >
> > One benefit from an fanotify LSM, besides more generic code, would be
> > that fanotify permission hooks could be disabled with boot parameters.
> >
> > I only bring this up because security hooks seems like the most natural
> > place to add pre-modify fanotify events for the purpose of maintaining
> > a filesystem change journal. It would be ugly to spray more fsnotify hooks
> > inside security hooks instead of registering an fanotify LSM, but maybe
> > there are downsides of registering fanotify as LSM that I am not aware of?
>
> I kind of like the idea of generating fanotify permission events from
> special LSM hooks.
>

Cool, I think that all we really need to do is call security_add_hooks().
[reducing LSM CC list]

> I'm not so sure about directory pre-modification hooks. Given the amount of
> problems we face with applications using fanotify permission events and
> deadlocking the system, I'm not very fond of expanding that API... AFAIU
> you want to use such hooks for recording (and persisting) that some change
> is going to happen and provide crash-consistency guarantees for such
> journal?
>

That's the general idea.
I have two use cases for pre-modification hooks:
1. VFS level snapshots
2. persistent change tracking

TBH, I did not consider implementing any of the above in userspace,
so I do not have a specific interest in extending the fanotify API.
I am actually interested in pre-modify fsnotify hooks (not fanotify),
that a snapshot or change tracking subsystem can register with.
An in-kernel fsnotify event handler can set a flag in current task
struct to circumvent system deadlocks on nested filesystem access.

My current implementation of overlayfs snapshots [1] uses a stackable
filesystem (a.k.a. snapshot fs) as a means for pre-modify hooks.
This approach has some advantages and some disadvantages
compared to fsnotify pre-modify hooks.

With fsnotify pre-modify hooks it would be possible to protect
the underlying filesystem from un-monitored changes even when
filesystem is accessed from another mount point or another namespace.

As a matter of fact, the protection to underlying filesystem changes
needed for overlayfs snapshots is also useful for standard overlayfs -
Modification to underlying overlayfs layers are strongly discouraged,
but there is nothing preventing the user from making such modifications.
If overlayfs were to register for fsnotify pre-modify hooks on underlying
filesystem, it could prevent users from modifying underlying layers.

And not only that - because security_inode_rename() hook is called
with s_vfs_rename_mutex held, it is safe to use d_ancestor() to
prevent renames in and out of overlay layer subtrees.
With that protection in place, it is safe (?) to use is_subdir() in other
hooks to check if an object is under an overlayfs layer subtree,
because the entire ancestry below the layers roots is stable.

Will see if I can sketch a POC.

Thanks,
Amir.

[1] https://github.com/amir73il/overlayfs/wiki/Overlayfs-snapshots

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-16 18:24   ` Amir Goldstein
@ 2019-04-17 11:30     ` Jan Kara
  2019-04-17 12:14       ` Miklos Szeredi
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
  0 siblings, 2 replies; 17+ messages in thread
From: Jan Kara @ 2019-04-17 11:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, LSM List, overlayfs

On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > I'm not so sure about directory pre-modification hooks. Given the amount of
> > problems we face with applications using fanotify permission events and
> > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > you want to use such hooks for recording (and persisting) that some change
> > is going to happen and provide crash-consistency guarantees for such
> > journal?
> >
> 
> That's the general idea.
> I have two use cases for pre-modification hooks:
> 1. VFS level snapshots
> 2. persistent change tracking
> 
> TBH, I did not consider implementing any of the above in userspace,
> so I do not have a specific interest in extending the fanotify API.
> I am actually interested in pre-modify fsnotify hooks (not fanotify),
> that a snapshot or change tracking subsystem can register with.
> An in-kernel fsnotify event handler can set a flag in current task
> struct to circumvent system deadlocks on nested filesystem access.

OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
handlers stay within the kernel, I'm fine with that. After all this is what
LSMs are already doing. Just exposing this to userspace for arbitration is
what I have a problem with.

> My current implementation of overlayfs snapshots [1] uses a stackable
> filesystem (a.k.a. snapshot fs) as a means for pre-modify hooks.
> This approach has some advantages and some disadvantages
> compared to fsnotify pre-modify hooks.
> 
> With fsnotify pre-modify hooks it would be possible to protect
> the underlying filesystem from un-monitored changes even when
> filesystem is accessed from another mount point or another namespace.
> 
> As a matter of fact, the protection to underlying filesystem changes
> needed for overlayfs snapshots is also useful for standard overlayfs -
> Modification to underlying overlayfs layers are strongly discouraged,
> but there is nothing preventing the user from making such modifications.
> If overlayfs were to register for fsnotify pre-modify hooks on underlying
> filesystem, it could prevent users from modifying underlying layers.
> 
> And not only that - because security_inode_rename() hook is called
> with s_vfs_rename_mutex held, it is safe to use d_ancestor() to
> prevent renames in and out of overlay layer subtrees.
> With that protection in place, it is safe (?) to use is_subdir() in other
> hooks to check if an object is under an overlayfs layer subtree,
> because the entire ancestry below the layers roots is stable.

Uf, OK, should be. But it looks subtle. Not sure what Al will say about it
;).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 11:30     ` Jan Kara
@ 2019-04-17 12:14       ` Miklos Szeredi
  2019-04-17 14:05         ` Jan Kara
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
  1 sibling, 1 reply; 17+ messages in thread
From: Miklos Szeredi @ 2019-04-17 12:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, linux-fsdevel, Al Viro, Matthew Bobrowski,
	LSM List, overlayfs

On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > problems we face with applications using fanotify permission events and
> > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > you want to use such hooks for recording (and persisting) that some change
> > > is going to happen and provide crash-consistency guarantees for such
> > > journal?
> > >
> >
> > That's the general idea.
> > I have two use cases for pre-modification hooks:
> > 1. VFS level snapshots
> > 2. persistent change tracking
> >
> > TBH, I did not consider implementing any of the above in userspace,
> > so I do not have a specific interest in extending the fanotify API.
> > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > that a snapshot or change tracking subsystem can register with.
> > An in-kernel fsnotify event handler can set a flag in current task
> > struct to circumvent system deadlocks on nested filesystem access.
>
> OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> handlers stay within the kernel, I'm fine with that. After all this is what
> LSMs are already doing. Just exposing this to userspace for arbitration is
> what I have a problem with.

There's one more usecase that I'd like to explore: providing coherent
view of host filesystem in virtualized environments.  This requires
that guest is synchronously notified when the host filesystem changes.
  I do agree, however, that adding sync hooks to userspace is
problematic.

One idea would be to use shared memory instead of a procedural
notification.  I.e. application (hypervisor) registers a pointer to a
version number that the kernel associates with the given inode.  When
the inode is changed, then the version number is incremented.  The
guest kernel can then look at the version number when verifying cache
validity.   That way perfect coherency is guaranteed between host and
guest filesystems without allowing a broken guest or even a broken
hypervisor to DoS the host.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 12:14       ` Miklos Szeredi
@ 2019-04-17 14:05         ` Jan Kara
  2019-04-17 14:14           ` Miklos Szeredi
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2019-04-17 14:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, linux-fsdevel, Al Viro,
	Matthew Bobrowski, LSM List, overlayfs

On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > problems we face with applications using fanotify permission events and
> > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > you want to use such hooks for recording (and persisting) that some change
> > > > is going to happen and provide crash-consistency guarantees for such
> > > > journal?
> > > >
> > >
> > > That's the general idea.
> > > I have two use cases for pre-modification hooks:
> > > 1. VFS level snapshots
> > > 2. persistent change tracking
> > >
> > > TBH, I did not consider implementing any of the above in userspace,
> > > so I do not have a specific interest in extending the fanotify API.
> > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > that a snapshot or change tracking subsystem can register with.
> > > An in-kernel fsnotify event handler can set a flag in current task
> > > struct to circumvent system deadlocks on nested filesystem access.
> >
> > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > handlers stay within the kernel, I'm fine with that. After all this is what
> > LSMs are already doing. Just exposing this to userspace for arbitration is
> > what I have a problem with.
> 
> There's one more usecase that I'd like to explore: providing coherent
> view of host filesystem in virtualized environments.  This requires
> that guest is synchronously notified when the host filesystem changes.
>   I do agree, however, that adding sync hooks to userspace is
> problematic.
> 
> One idea would be to use shared memory instead of a procedural
> notification.  I.e. application (hypervisor) registers a pointer to a
> version number that the kernel associates with the given inode.  When
> the inode is changed, then the version number is incremented.  The
> guest kernel can then look at the version number when verifying cache
> validity.   That way perfect coherency is guaranteed between host and
> guest filesystems without allowing a broken guest or even a broken
> hypervisor to DoS the host.

Well, statx() and looking at i_version can do this for you. So I guess
that's too slow for your purposes? Also how many inodes do you want to
monitor like this?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 14:05         ` Jan Kara
@ 2019-04-17 14:14           ` Miklos Szeredi
  2019-04-18 10:53             ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Miklos Szeredi @ 2019-04-17 14:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, linux-fsdevel, Al Viro, Matthew Bobrowski,
	LSM List, overlayfs

On Wed, Apr 17, 2019 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> > On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > problems we face with applications using fanotify permission events and
> > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > journal?
> > > > >
> > > >
> > > > That's the general idea.
> > > > I have two use cases for pre-modification hooks:
> > > > 1. VFS level snapshots
> > > > 2. persistent change tracking
> > > >
> > > > TBH, I did not consider implementing any of the above in userspace,
> > > > so I do not have a specific interest in extending the fanotify API.
> > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > that a snapshot or change tracking subsystem can register with.
> > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > struct to circumvent system deadlocks on nested filesystem access.
> > >
> > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > what I have a problem with.
> >
> > There's one more usecase that I'd like to explore: providing coherent
> > view of host filesystem in virtualized environments.  This requires
> > that guest is synchronously notified when the host filesystem changes.
> >   I do agree, however, that adding sync hooks to userspace is
> > problematic.
> >
> > One idea would be to use shared memory instead of a procedural
> > notification.  I.e. application (hypervisor) registers a pointer to a
> > version number that the kernel associates with the given inode.  When
> > the inode is changed, then the version number is incremented.  The
> > guest kernel can then look at the version number when verifying cache
> > validity.   That way perfect coherency is guaranteed between host and
> > guest filesystems without allowing a broken guest or even a broken
> > hypervisor to DoS the host.
>
> Well, statx() and looking at i_version can do this for you. So I guess
> that's too slow for your purposes?

Okay, missing piece of information: we want to make use of the dcache
and icache in the guest kernel, otherwise lookup/stat will be
painfully slow.  That would preclude doing statx() or anything else
that requires a synchronous round trip to the host for the likely case
of a valid cache.

> Also how many inodes do you want to
> monitor like this?

Everything that's in the guest caches.  Which means: a lot.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 14:14           ` Miklos Szeredi
@ 2019-04-18 10:53             ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2019-04-18 10:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, linux-fsdevel, Al Viro,
	Matthew Bobrowski, LSM List, overlayfs

On Wed 17-04-19 16:14:32, Miklos Szeredi wrote:
> On Wed, Apr 17, 2019 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> > > On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > > problems we face with applications using fanotify permission events and
> > > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > > journal?
> > > > > >
> > > > >
> > > > > That's the general idea.
> > > > > I have two use cases for pre-modification hooks:
> > > > > 1. VFS level snapshots
> > > > > 2. persistent change tracking
> > > > >
> > > > > TBH, I did not consider implementing any of the above in userspace,
> > > > > so I do not have a specific interest in extending the fanotify API.
> > > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > > that a snapshot or change tracking subsystem can register with.
> > > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > > struct to circumvent system deadlocks on nested filesystem access.
> > > >
> > > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > > what I have a problem with.
> > >
> > > There's one more usecase that I'd like to explore: providing coherent
> > > view of host filesystem in virtualized environments.  This requires
> > > that guest is synchronously notified when the host filesystem changes.
> > >   I do agree, however, that adding sync hooks to userspace is
> > > problematic.
> > >
> > > One idea would be to use shared memory instead of a procedural
> > > notification.  I.e. application (hypervisor) registers a pointer to a
> > > version number that the kernel associates with the given inode.  When
> > > the inode is changed, then the version number is incremented.  The
> > > guest kernel can then look at the version number when verifying cache
> > > validity.   That way perfect coherency is guaranteed between host and
> > > guest filesystems without allowing a broken guest or even a broken
> > > hypervisor to DoS the host.
> >
> > Well, statx() and looking at i_version can do this for you. So I guess
> > that's too slow for your purposes?
> 
> Okay, missing piece of information: we want to make use of the dcache
> and icache in the guest kernel, otherwise lookup/stat will be
> painfully slow.  That would preclude doing statx() or anything else
> that requires a synchronous round trip to the host for the likely case
> of a valid cache.

Ok, understood.
 
> > Also how many inodes do you want to
> > monitor like this?
> 
> Everything that's in the guest caches.  Which means: a lot.

Yeah, but that would mean also non-trivial amount of memory pinned for this
communication channel... And AFAIU the cost of invalidation going to the
guest isn't so critical as that isn't expected to be that frequent. It is
only the cost of 'is_valid' check that needs to be low to make the caching
in the guest useful. So won't it be better if we just had some kind of ring
buffer for invalidation events going from host to guest, host could just
queue event there (i.e., event processing from the host would have well
bounded time like processing normal fsnotify events has), guest would be
consuming events and updating its validity information. If the buffer
overflows, it means "invalidate all" which is expensive for the guest but
if buffers are reasonably sized, it should not happen frequently...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2019-04-17 11:30     ` Jan Kara
  2019-04-17 12:14       ` Miklos Szeredi
@ 2020-06-26 11:06       ` Amir Goldstein
  2020-06-30  9:20         ` Jan Kara
  1 sibling, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2020-06-26 11:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

[Subject changed and removed LSM list]

On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > problems we face with applications using fanotify permission events and
> > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > you want to use such hooks for recording (and persisting) that some change
> > > is going to happen and provide crash-consistency guarantees for such
> > > journal?
> > >
> >
> > That's the general idea.
> > I have two use cases for pre-modification hooks:
> > 1. VFS level snapshots
> > 2. persistent change tracking
> >
> > TBH, I did not consider implementing any of the above in userspace,
> > so I do not have a specific interest in extending the fanotify API.
> > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > that a snapshot or change tracking subsystem can register with.
> > An in-kernel fsnotify event handler can set a flag in current task
> > struct to circumvent system deadlocks on nested filesystem access.
>
> OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> handlers stay within the kernel, I'm fine with that. After all this is what
> LSMs are already doing. Just exposing this to userspace for arbitration is
> what I have a problem with.
>

Short update on that.

I decided to ditch the LSM hooks approach because I realized that for
the purpose of persistent change tracking, the pre-modify hooks need
to be called before the caller is taking filesystem locks.

So I added hooks inside mnt_want_write and file_start_write wrappers:
https://github.com/amir73il/linux/commits/fsnotify_pre_modify

The conversion of Overlayfs snapshots to use pre-modify events is
WIP and still has some big open questions.

The purpose of this email is to solicit early feedback on the VFS changes.
If anyone thinks this approach is wrong please shout it out.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
@ 2020-06-30  9:20         ` Jan Kara
  2020-06-30 14:28           ` Amir Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-06-30  9:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, overlayfs, Mel Gorman

On Fri 26-06-20 14:06:37, Amir Goldstein wrote:
> On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > problems we face with applications using fanotify permission events and
> > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > you want to use such hooks for recording (and persisting) that some change
> > > > is going to happen and provide crash-consistency guarantees for such
> > > > journal?
> > > >
> > >
> > > That's the general idea.
> > > I have two use cases for pre-modification hooks:
> > > 1. VFS level snapshots
> > > 2. persistent change tracking
> > >
> > > TBH, I did not consider implementing any of the above in userspace,
> > > so I do not have a specific interest in extending the fanotify API.
> > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > that a snapshot or change tracking subsystem can register with.
> > > An in-kernel fsnotify event handler can set a flag in current task
> > > struct to circumvent system deadlocks on nested filesystem access.
> >
> > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > handlers stay within the kernel, I'm fine with that. After all this is what
> > LSMs are already doing. Just exposing this to userspace for arbitration is
> > what I have a problem with.
> >
> 
> Short update on that.
> 
> I decided to ditch the LSM hooks approach because I realized that for
> the purpose of persistent change tracking, the pre-modify hooks need
> to be called before the caller is taking filesystem locks.
> 
> So I added hooks inside mnt_want_write and file_start_write wrappers:
> https://github.com/amir73il/linux/commits/fsnotify_pre_modify

FWIW I've glanced through the series. I like the choice of mnt_want_write()
and file_start_write() as a place to generate the event. I somewhat dislike
the number of variants you have to introduce and then pass NULL in some
places because you don't have the info available and then it's not
immediately clear what semantics the event consumers can expect... That
would be good to define and then verify in the code.

Also given you have the requirement "no fs locks on event generation", I'm
not sure how reliable this can be. If you don't hold fs locks when
generating event, cannot it happen that actually modified object is
different from the reported one because we raced with some other fs
operations? And can we prove that? So what exactly is the usecase and
guarantees the event needs to provide?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-30  9:20         ` Jan Kara
@ 2020-06-30 14:28           ` Amir Goldstein
  2020-07-03 13:38             ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2020-06-30 14:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 26-06-20 14:06:37, Amir Goldstein wrote:
> > On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > problems we face with applications using fanotify permission events and
> > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > journal?
> > > > >
> > > >
> > > > That's the general idea.
> > > > I have two use cases for pre-modification hooks:
> > > > 1. VFS level snapshots
> > > > 2. persistent change tracking
> > > >
> > > > TBH, I did not consider implementing any of the above in userspace,
> > > > so I do not have a specific interest in extending the fanotify API.
> > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > that a snapshot or change tracking subsystem can register with.
> > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > struct to circumvent system deadlocks on nested filesystem access.
> > >
> > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > what I have a problem with.
> > >
> >
> > Short update on that.
> >
> > I decided to ditch the LSM hooks approach because I realized that for
> > the purpose of persistent change tracking, the pre-modify hooks need
> > to be called before the caller is taking filesystem locks.
> >
> > So I added hooks inside mnt_want_write and file_start_write wrappers:
> > https://github.com/amir73il/linux/commits/fsnotify_pre_modify
>
> FWIW I've glanced through the series. I like the choice of mnt_want_write()
> and file_start_write() as a place to generate the event. I somewhat dislike

Thanks. I was looking for this initial feedback to know if direction in sane.

> the number of variants you have to introduce and then pass NULL in some
> places because you don't have the info available and then it's not
> immediately clear what semantics the event consumers can expect... That
> would be good to define and then verify in the code.
>

I am not sure I understand what you mean.
Did you mean that mnt_want_write_at() mnt_want_write_path() should be
actual functions instead of inline wrappers or something else?

> Also given you have the requirement "no fs locks on event generation", I'm
> not sure how reliable this can be. If you don't hold fs locks when
> generating event, cannot it happen that actually modified object is
> different from the reported one because we raced with some other fs
> operations? And can we prove that? So what exactly is the usecase and
> guarantees the event needs to provide?
>

That's a good question. Answer is not trivial.
The use case is "persistent change tracking snapshot".
"snapshot" because it tracks ALL changes since a point in time -
there is no concept of "consuming" events.
It is important to note that this is complementary to real time fs events.
A user library may combine the two mechanisms to a stream of changes
(either recorded or live), but that is out of scope for this effort.
Also, userspace would likely create periodic snapshots, so that e.g.
current snapshot records changes, while previous snapshot recorded
changes are being scanned.

The concept is to record every dir fid *before* an immediate child or directory
metadata itself may change, so that after a crash, all recorded dir fids
may be scanned to search for possibly missed changes.

The dir fid is stable, so races are not an issue in that respect.
When name is recorded, change tracking never examines the object at that
name, it just records the fact that there has been a change at [dir fid;name].
This is mostly needed to track creates.

Other than that, races should be handled by the backend itself, so proof is
pending the completion of the backend POC, but in hand waving:
- All name changes in the filesystem call the backend before the change
  (because backend marks sb) and backend is responsible for locking
against races
- My current implementation uses overlayfs upper/index as the change
  track storage, which has the benefit that the test "is change recorded"
  is implemented by decode_fh and/or name lookup, so it is already very much
  optimized by inode and dentry cache and shouldn't need any locking for
  most  pre_modify calls
- It is also not a coincidence that overlayfs changes to upper fs do not
  trigger pre_modify hooks because that prevents the feedback loop.
  I wrote in commit message that "is consistent with the fact that overlayfs
  sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
  because the path in underlying files is "fake" (open_with_fake_path()).

If any of this hand waving sounds terribly wrong please let me know.
Otherwise I will report back after POC is complete with a example backend.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-30 14:28           ` Amir Goldstein
@ 2020-07-03 13:38             ` Jan Kara
  2020-07-06 10:51               ` Amir Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2020-07-03 13:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, overlayfs, Mel Gorman

On Tue 30-06-20 17:28:10, Amir Goldstein wrote:
> On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
> > the number of variants you have to introduce and then pass NULL in some
> > places because you don't have the info available and then it's not
> > immediately clear what semantics the event consumers can expect... That
> > would be good to define and then verify in the code.
> >
> 
> I am not sure I understand what you mean.
> Did you mean that mnt_want_write_at() mnt_want_write_path() should be
> actual functions instead of inline wrappers or something else?

Now looking at it, I find mnt_want_write_at2() the most confusing one. Also
the distinction between mnt_want_write_at() and mnt_want_write_path() seems
somewhat arbitrary at the first sight (how do you decide where to use
what?) but there I guess I see where you are coming from...

> > Also given you have the requirement "no fs locks on event generation", I'm
> > not sure how reliable this can be. If you don't hold fs locks when
> > generating event, cannot it happen that actually modified object is
> > different from the reported one because we raced with some other fs
> > operations? And can we prove that? So what exactly is the usecase and
> > guarantees the event needs to provide?
> 
> That's a good question. Answer is not trivial.
> The use case is "persistent change tracking snapshot".
> "snapshot" because it tracks ALL changes since a point in time -
> there is no concept of "consuming" events.

So you want to answer question like: "Which paths changed since given point
in time?" 100% reliably (no false negatives) in a powerfail safe manner? So
effectively something like btrfs send-receive?

> It is important to note that this is complementary to real time fs events.
> A user library may combine the two mechanisms to a stream of changes
> (either recorded or live), but that is out of scope for this effort.
> Also, userspace would likely create periodic snapshots, so that e.g.
> current snapshot records changes, while previous snapshot recorded
> changes are being scanned.
> 
> The concept is to record every dir fid *before* an immediate child or directory
> metadata itself may change, so that after a crash, all recorded dir fids
> may be scanned to search for possibly missed changes.
> 
> The dir fid is stable, so races are not an issue in that respect.
> When name is recorded, change tracking never examines the object at that
> name, it just records the fact that there has been a change at [dir fid;name].
> This is mostly needed to track creates.

You're right that by only tracking dir fids where something changed you've
elliminated most of problems since once we lookup a path, the change is
definitely going to happen in the dir we've looked up. If it really happens
or on which inode in the dir is not determined yet but the dir fid is. I'm
not yet 100% sure how you'll link the dir fids to actual paths on query or
how the handling of leaf dir entries is going to work but it seems possible
:)

> Other than that, races should be handled by the backend itself, so proof is
> pending the completion of the backend POC, but in hand waving:
> - All name changes in the filesystem call the backend before the change
>   (because backend marks sb) and backend is responsible for locking
> against races
> - My current implementation uses overlayfs upper/index as the change
>   track storage, which has the benefit that the test "is change recorded"
>   is implemented by decode_fh and/or name lookup, so it is already very much
>   optimized by inode and dentry cache and shouldn't need any locking for
>   most  pre_modify calls
> - It is also not a coincidence that overlayfs changes to upper fs do not
>   trigger pre_modify hooks because that prevents the feedback loop.
>   I wrote in commit message that "is consistent with the fact that overlayfs
>   sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
>   because the path in underlying files is "fake" (open_with_fake_path()).
> 
> If any of this hand waving sounds terribly wrong please let me know.
> Otherwise I will report back after POC is complete with a example backend.

It sounds like it could work :).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-07-03 13:38             ` Jan Kara
@ 2020-07-06 10:51               ` Amir Goldstein
  0 siblings, 0 replies; 17+ messages in thread
From: Amir Goldstein @ 2020-07-06 10:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

On Fri, Jul 3, 2020 at 4:38 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 30-06-20 17:28:10, Amir Goldstein wrote:
> > On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
> > > the number of variants you have to introduce and then pass NULL in some
> > > places because you don't have the info available and then it's not
> > > immediately clear what semantics the event consumers can expect... That
> > > would be good to define and then verify in the code.
> > >
> >
> > I am not sure I understand what you mean.
> > Did you mean that mnt_want_write_at() mnt_want_write_path() should be
> > actual functions instead of inline wrappers or something else?
>
> Now looking at it, I find mnt_want_write_at2() the most confusing one. Also
> the distinction between mnt_want_write_at() and mnt_want_write_path() seems
> somewhat arbitrary at the first sight (how do you decide where to use
> what?) but there I guess I see where you are coming from...

OK, I renamed the wrappers and documented them as follows:

 * mnt_want_write_path - get write access to a path's mount
 * mnt_want_write_name - get write access to a path's mount before link/unlink
 * mnt_want_write_rename - get write access to a path's mount before rename

Pushed this to branch fsnotify_pre_modify.

I can see why the use of mnt_want_write_at() and mnt_want_write_path() seems
arbitrary. This is because in VFS code, @path argument can mean the path of the
object or path of the parent, depending on the function.
It is just by examining the code that you can figure that out, but in
practice path
means parent in link/unlink/rename functions.

>
> > > Also given you have the requirement "no fs locks on event generation", I'm
> > > not sure how reliable this can be. If you don't hold fs locks when
> > > generating event, cannot it happen that actually modified object is
> > > different from the reported one because we raced with some other fs
> > > operations? And can we prove that? So what exactly is the usecase and
> > > guarantees the event needs to provide?
> >
> > That's a good question. Answer is not trivial.
> > The use case is "persistent change tracking snapshot".
> > "snapshot" because it tracks ALL changes since a point in time -
> > there is no concept of "consuming" events.
>
> So you want to answer question like: "Which paths changed since given point
> in time?" 100% reliably (no false negatives) in a power fail safe manner? So
> effectively something like btrfs send-receive?
>

It's the same use case of btrfs send-receive (power fail safe
incremental backup),
but semantics are quite different.
btrfs send-receive information contains the actual changes.
Persistent change tracking only contains the information of where changes are
possible. IOW, there is potentially much more comparing to do with the target.
And of course, it is filesystem agnostic.

Thanks,
Amir.

> > It is important to note that this is complementary to real time fs events.
> > A user library may combine the two mechanisms to a stream of changes
> > (either recorded or live), but that is out of scope for this effort.
> > Also, userspace would likely create periodic snapshots, so that e.g.
> > current snapshot records changes, while previous snapshot recorded
> > changes are being scanned.
> >
> > The concept is to record every dir fid *before* an immediate child or directory
> > metadata itself may change, so that after a crash, all recorded dir fids
> > may be scanned to search for possibly missed changes.
> >
> > The dir fid is stable, so races are not an issue in that respect.
> > When name is recorded, change tracking never examines the object at that
> > name, it just records the fact that there has been a change at [dir fid;name].
> > This is mostly needed to track creates.
>
> You're right that by only tracking dir fids where something changed you've
> elliminated most of problems since once we lookup a path, the change is
> definitely going to happen in the dir we've looked up. If it really happens
> or on which inode in the dir is not determined yet but the dir fid is. I'm
> not yet 100% sure how you'll link the dir fids to actual paths on query or
> how the handling of leaf dir entries is going to work but it seems possible
> :)
>
> > Other than that, races should be handled by the backend itself, so proof is
> > pending the completion of the backend POC, but in hand waving:
> > - All name changes in the filesystem call the backend before the change
> >   (because backend marks sb) and backend is responsible for locking
> > against races
> > - My current implementation uses overlayfs upper/index as the change
> >   track storage, which has the benefit that the test "is change recorded"
> >   is implemented by decode_fh and/or name lookup, so it is already very much
> >   optimized by inode and dentry cache and shouldn't need any locking for
> >   most  pre_modify calls
> > - It is also not a coincidence that overlayfs changes to upper fs do not
> >   trigger pre_modify hooks because that prevents the feedback loop.
> >   I wrote in commit message that "is consistent with the fact that overlayfs
> >   sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
> >   because the path in underlying files is "fake" (open_with_fake_path()).
> >
> > If any of this hand waving sounds terribly wrong please let me know.
> > Otherwise I will report back after POC is complete with a example backend.
>
> It sounds like it could work :).
>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-06 10:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-14 16:04 fanotify and LSM path hooks Amir Goldstein
2019-04-14 16:39 ` Al Viro
2019-04-14 18:51   ` Amir Goldstein
2019-04-14 19:26     ` Al Viro
2019-04-14 20:28       ` Amir Goldstein
2019-04-16 15:45 ` Jan Kara
2019-04-16 18:24   ` Amir Goldstein
2019-04-17 11:30     ` Jan Kara
2019-04-17 12:14       ` Miklos Szeredi
2019-04-17 14:05         ` Jan Kara
2019-04-17 14:14           ` Miklos Szeredi
2019-04-18 10:53             ` Jan Kara
2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
2020-06-30  9:20         ` Jan Kara
2020-06-30 14:28           ` Amir Goldstein
2020-07-03 13:38             ` Jan Kara
2020-07-06 10:51               ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).