($INBOX_DIR/description missing)
 help / color / Atom feed
* Re: fanotify and LSM path hooks
       [not found] ` <20190416154513.GB13422@quack2.suse.cz>
@ 2019-04-16 18:24   ` Amir Goldstein
  2019-04-17 11:30     ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2019-04-16 18:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	LSM List, overlayfs

On Tue, Apr 16, 2019 at 6:45 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Sun 14-04-19 19:04:14, Amir Goldstein wrote:
> > I started to look at directory pre-modification "permission" hooks
> > that were discussed on last year's LSFMM:
> > https://lwn.net/Articles/755277/
> >
> > The two existing fanotify_perm() hooks are called
> > from security_file_permission() and security_file_open()
> > and depend on build time CONFIG_SECURITY.
> > If you look at how the fsnotify_perm() hooks are planted inside the
> > generic security hooks, one might wonder, why are fanotify permission
> > hooks getting a special treatment and are not registering as LSM hooks?
> >
> > One benefit from an fanotify LSM, besides more generic code, would be
> > that fanotify permission hooks could be disabled with boot parameters.
> >
> > I only bring this up because security hooks seems like the most natural
> > place to add pre-modify fanotify events for the purpose of maintaining
> > a filesystem change journal. It would be ugly to spray more fsnotify hooks
> > inside security hooks instead of registering an fanotify LSM, but maybe
> > there are downsides of registering fanotify as LSM that I am not aware of?
>
> I kind of like the idea of generating fanotify permission events from
> special LSM hooks.
>

Cool, I think that all we really need to do is call security_add_hooks().
[reducing LSM CC list]

> I'm not so sure about directory pre-modification hooks. Given the amount of
> problems we face with applications using fanotify permission events and
> deadlocking the system, I'm not very fond of expanding that API... AFAIU
> you want to use such hooks for recording (and persisting) that some change
> is going to happen and provide crash-consistency guarantees for such
> journal?
>

That's the general idea.
I have two use cases for pre-modification hooks:
1. VFS level snapshots
2. persistent change tracking

TBH, I did not consider implementing any of the above in userspace,
so I do not have a specific interest in extending the fanotify API.
I am actually interested in pre-modify fsnotify hooks (not fanotify),
that a snapshot or change tracking subsystem can register with.
An in-kernel fsnotify event handler can set a flag in current task
struct to circumvent system deadlocks on nested filesystem access.

My current implementation of overlayfs snapshots [1] uses a stackable
filesystem (a.k.a. snapshot fs) as a means for pre-modify hooks.
This approach has some advantages and some disadvantages
compared to fsnotify pre-modify hooks.

With fsnotify pre-modify hooks it would be possible to protect
the underlying filesystem from un-monitored changes even when
filesystem is accessed from another mount point or another namespace.

As a matter of fact, the protection to underlying filesystem changes
needed for overlayfs snapshots is also useful for standard overlayfs -
Modification to underlying overlayfs layers are strongly discouraged,
but there is nothing preventing the user from making such modifications.
If overlayfs were to register for fsnotify pre-modify hooks on underlying
filesystem, it could prevent users from modifying underlying layers.

And not only that - because security_inode_rename() hook is called
with s_vfs_rename_mutex held, it is safe to use d_ancestor() to
prevent renames in and out of overlay layer subtrees.
With that protection in place, it is safe (?) to use is_subdir() in other
hooks to check if an object is under an overlayfs layer subtree,
because the entire ancestry below the layers roots is stable.

Will see if I can sketch a POC.

Thanks,
Amir.

[1] https://github.com/amir73il/overlayfs/wiki/Overlayfs-snapshots

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-16 18:24   ` fanotify and LSM path hooks Amir Goldstein
@ 2019-04-17 11:30     ` Jan Kara
  2019-04-17 12:14       ` Miklos Szeredi
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
  0 siblings, 2 replies; 11+ messages in thread
From: Jan Kara @ 2019-04-17 11:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, LSM List, overlayfs

On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > I'm not so sure about directory pre-modification hooks. Given the amount of
> > problems we face with applications using fanotify permission events and
> > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > you want to use such hooks for recording (and persisting) that some change
> > is going to happen and provide crash-consistency guarantees for such
> > journal?
> >
> 
> That's the general idea.
> I have two use cases for pre-modification hooks:
> 1. VFS level snapshots
> 2. persistent change tracking
> 
> TBH, I did not consider implementing any of the above in userspace,
> so I do not have a specific interest in extending the fanotify API.
> I am actually interested in pre-modify fsnotify hooks (not fanotify),
> that a snapshot or change tracking subsystem can register with.
> An in-kernel fsnotify event handler can set a flag in current task
> struct to circumvent system deadlocks on nested filesystem access.

OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
handlers stay within the kernel, I'm fine with that. After all this is what
LSMs are already doing. Just exposing this to userspace for arbitration is
what I have a problem with.

> My current implementation of overlayfs snapshots [1] uses a stackable
> filesystem (a.k.a. snapshot fs) as a means for pre-modify hooks.
> This approach has some advantages and some disadvantages
> compared to fsnotify pre-modify hooks.
> 
> With fsnotify pre-modify hooks it would be possible to protect
> the underlying filesystem from un-monitored changes even when
> filesystem is accessed from another mount point or another namespace.
> 
> As a matter of fact, the protection to underlying filesystem changes
> needed for overlayfs snapshots is also useful for standard overlayfs -
> Modification to underlying overlayfs layers are strongly discouraged,
> but there is nothing preventing the user from making such modifications.
> If overlayfs were to register for fsnotify pre-modify hooks on underlying
> filesystem, it could prevent users from modifying underlying layers.
> 
> And not only that - because security_inode_rename() hook is called
> with s_vfs_rename_mutex held, it is safe to use d_ancestor() to
> prevent renames in and out of overlay layer subtrees.
> With that protection in place, it is safe (?) to use is_subdir() in other
> hooks to check if an object is under an overlayfs layer subtree,
> because the entire ancestry below the layers roots is stable.

Uf, OK, should be. But it looks subtle. Not sure what Al will say about it
;).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 11:30     ` Jan Kara
@ 2019-04-17 12:14       ` Miklos Szeredi
  2019-04-17 14:05         ` Jan Kara
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
  1 sibling, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2019-04-17 12:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, linux-fsdevel, Al Viro, Matthew Bobrowski,
	LSM List, overlayfs

On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > problems we face with applications using fanotify permission events and
> > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > you want to use such hooks for recording (and persisting) that some change
> > > is going to happen and provide crash-consistency guarantees for such
> > > journal?
> > >
> >
> > That's the general idea.
> > I have two use cases for pre-modification hooks:
> > 1. VFS level snapshots
> > 2. persistent change tracking
> >
> > TBH, I did not consider implementing any of the above in userspace,
> > so I do not have a specific interest in extending the fanotify API.
> > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > that a snapshot or change tracking subsystem can register with.
> > An in-kernel fsnotify event handler can set a flag in current task
> > struct to circumvent system deadlocks on nested filesystem access.
>
> OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> handlers stay within the kernel, I'm fine with that. After all this is what
> LSMs are already doing. Just exposing this to userspace for arbitration is
> what I have a problem with.

There's one more usecase that I'd like to explore: providing coherent
view of host filesystem in virtualized environments.  This requires
that guest is synchronously notified when the host filesystem changes.
  I do agree, however, that adding sync hooks to userspace is
problematic.

One idea would be to use shared memory instead of a procedural
notification.  I.e. application (hypervisor) registers a pointer to a
version number that the kernel associates with the given inode.  When
the inode is changed, then the version number is incremented.  The
guest kernel can then look at the version number when verifying cache
validity.   That way perfect coherency is guaranteed between host and
guest filesystems without allowing a broken guest or even a broken
hypervisor to DoS the host.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 12:14       ` Miklos Szeredi
@ 2019-04-17 14:05         ` Jan Kara
  2019-04-17 14:14           ` Miklos Szeredi
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2019-04-17 14:05 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, linux-fsdevel, Al Viro,
	Matthew Bobrowski, LSM List, overlayfs

On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > problems we face with applications using fanotify permission events and
> > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > you want to use such hooks for recording (and persisting) that some change
> > > > is going to happen and provide crash-consistency guarantees for such
> > > > journal?
> > > >
> > >
> > > That's the general idea.
> > > I have two use cases for pre-modification hooks:
> > > 1. VFS level snapshots
> > > 2. persistent change tracking
> > >
> > > TBH, I did not consider implementing any of the above in userspace,
> > > so I do not have a specific interest in extending the fanotify API.
> > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > that a snapshot or change tracking subsystem can register with.
> > > An in-kernel fsnotify event handler can set a flag in current task
> > > struct to circumvent system deadlocks on nested filesystem access.
> >
> > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > handlers stay within the kernel, I'm fine with that. After all this is what
> > LSMs are already doing. Just exposing this to userspace for arbitration is
> > what I have a problem with.
> 
> There's one more usecase that I'd like to explore: providing coherent
> view of host filesystem in virtualized environments.  This requires
> that guest is synchronously notified when the host filesystem changes.
>   I do agree, however, that adding sync hooks to userspace is
> problematic.
> 
> One idea would be to use shared memory instead of a procedural
> notification.  I.e. application (hypervisor) registers a pointer to a
> version number that the kernel associates with the given inode.  When
> the inode is changed, then the version number is incremented.  The
> guest kernel can then look at the version number when verifying cache
> validity.   That way perfect coherency is guaranteed between host and
> guest filesystems without allowing a broken guest or even a broken
> hypervisor to DoS the host.

Well, statx() and looking at i_version can do this for you. So I guess
that's too slow for your purposes? Also how many inodes do you want to
monitor like this?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 14:05         ` Jan Kara
@ 2019-04-17 14:14           ` Miklos Szeredi
  2019-04-18 10:53             ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Miklos Szeredi @ 2019-04-17 14:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, linux-fsdevel, Al Viro, Matthew Bobrowski,
	LSM List, overlayfs

On Wed, Apr 17, 2019 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> > On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > problems we face with applications using fanotify permission events and
> > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > journal?
> > > > >
> > > >
> > > > That's the general idea.
> > > > I have two use cases for pre-modification hooks:
> > > > 1. VFS level snapshots
> > > > 2. persistent change tracking
> > > >
> > > > TBH, I did not consider implementing any of the above in userspace,
> > > > so I do not have a specific interest in extending the fanotify API.
> > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > that a snapshot or change tracking subsystem can register with.
> > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > struct to circumvent system deadlocks on nested filesystem access.
> > >
> > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > what I have a problem with.
> >
> > There's one more usecase that I'd like to explore: providing coherent
> > view of host filesystem in virtualized environments.  This requires
> > that guest is synchronously notified when the host filesystem changes.
> >   I do agree, however, that adding sync hooks to userspace is
> > problematic.
> >
> > One idea would be to use shared memory instead of a procedural
> > notification.  I.e. application (hypervisor) registers a pointer to a
> > version number that the kernel associates with the given inode.  When
> > the inode is changed, then the version number is incremented.  The
> > guest kernel can then look at the version number when verifying cache
> > validity.   That way perfect coherency is guaranteed between host and
> > guest filesystems without allowing a broken guest or even a broken
> > hypervisor to DoS the host.
>
> Well, statx() and looking at i_version can do this for you. So I guess
> that's too slow for your purposes?

Okay, missing piece of information: we want to make use of the dcache
and icache in the guest kernel, otherwise lookup/stat will be
painfully slow.  That would preclude doing statx() or anything else
that requires a synchronous round trip to the host for the likely case
of a valid cache.

> Also how many inodes do you want to
> monitor like this?

Everything that's in the guest caches.  Which means: a lot.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fanotify and LSM path hooks
  2019-04-17 14:14           ` Miklos Szeredi
@ 2019-04-18 10:53             ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2019-04-18 10:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, linux-fsdevel, Al Viro,
	Matthew Bobrowski, LSM List, overlayfs

On Wed 17-04-19 16:14:32, Miklos Szeredi wrote:
> On Wed, Apr 17, 2019 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-04-19 14:14:58, Miklos Szeredi wrote:
> > > On Wed, Apr 17, 2019 at 1:30 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > > problems we face with applications using fanotify permission events and
> > > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > > journal?
> > > > > >
> > > > >
> > > > > That's the general idea.
> > > > > I have two use cases for pre-modification hooks:
> > > > > 1. VFS level snapshots
> > > > > 2. persistent change tracking
> > > > >
> > > > > TBH, I did not consider implementing any of the above in userspace,
> > > > > so I do not have a specific interest in extending the fanotify API.
> > > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > > that a snapshot or change tracking subsystem can register with.
> > > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > > struct to circumvent system deadlocks on nested filesystem access.
> > > >
> > > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > > what I have a problem with.
> > >
> > > There's one more usecase that I'd like to explore: providing coherent
> > > view of host filesystem in virtualized environments.  This requires
> > > that guest is synchronously notified when the host filesystem changes.
> > >   I do agree, however, that adding sync hooks to userspace is
> > > problematic.
> > >
> > > One idea would be to use shared memory instead of a procedural
> > > notification.  I.e. application (hypervisor) registers a pointer to a
> > > version number that the kernel associates with the given inode.  When
> > > the inode is changed, then the version number is incremented.  The
> > > guest kernel can then look at the version number when verifying cache
> > > validity.   That way perfect coherency is guaranteed between host and
> > > guest filesystems without allowing a broken guest or even a broken
> > > hypervisor to DoS the host.
> >
> > Well, statx() and looking at i_version can do this for you. So I guess
> > that's too slow for your purposes?
> 
> Okay, missing piece of information: we want to make use of the dcache
> and icache in the guest kernel, otherwise lookup/stat will be
> painfully slow.  That would preclude doing statx() or anything else
> that requires a synchronous round trip to the host for the likely case
> of a valid cache.

Ok, understood.
 
> > Also how many inodes do you want to
> > monitor like this?
> 
> Everything that's in the guest caches.  Which means: a lot.

Yeah, but that would mean also non-trivial amount of memory pinned for this
communication channel... And AFAIU the cost of invalidation going to the
guest isn't so critical as that isn't expected to be that frequent. It is
only the cost of 'is_valid' check that needs to be low to make the caching
in the guest useful. So won't it be better if we just had some kind of ring
buffer for invalidation events going from host to guest, host could just
queue event there (i.e., event processing from the host would have well
bounded time like processing normal fsnotify events has), guest would be
consuming events and updating its validity information. If the buffer
overflows, it means "invalidate all" which is expensive for the guest but
if buffers are reasonably sized, it should not happen frequently...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2019-04-17 11:30     ` Jan Kara
  2019-04-17 12:14       ` Miklos Szeredi
@ 2020-06-26 11:06       ` Amir Goldstein
  2020-06-30  9:20         ` Jan Kara
  1 sibling, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2020-06-26 11:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

[Subject changed and removed LSM list]

On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > problems we face with applications using fanotify permission events and
> > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > you want to use such hooks for recording (and persisting) that some change
> > > is going to happen and provide crash-consistency guarantees for such
> > > journal?
> > >
> >
> > That's the general idea.
> > I have two use cases for pre-modification hooks:
> > 1. VFS level snapshots
> > 2. persistent change tracking
> >
> > TBH, I did not consider implementing any of the above in userspace,
> > so I do not have a specific interest in extending the fanotify API.
> > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > that a snapshot or change tracking subsystem can register with.
> > An in-kernel fsnotify event handler can set a flag in current task
> > struct to circumvent system deadlocks on nested filesystem access.
>
> OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> handlers stay within the kernel, I'm fine with that. After all this is what
> LSMs are already doing. Just exposing this to userspace for arbitration is
> what I have a problem with.
>

Short update on that.

I decided to ditch the LSM hooks approach because I realized that for
the purpose of persistent change tracking, the pre-modify hooks need
to be called before the caller is taking filesystem locks.

So I added hooks inside mnt_want_write and file_start_write wrappers:
https://github.com/amir73il/linux/commits/fsnotify_pre_modify

The conversion of Overlayfs snapshots to use pre-modify events is
WIP and still has some big open questions.

The purpose of this email is to solicit early feedback on the VFS changes.
If anyone thinks this approach is wrong please shout it out.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
@ 2020-06-30  9:20         ` Jan Kara
  2020-06-30 14:28           ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-06-30  9:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, overlayfs, Mel Gorman

On Fri 26-06-20 14:06:37, Amir Goldstein wrote:
> On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > problems we face with applications using fanotify permission events and
> > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > you want to use such hooks for recording (and persisting) that some change
> > > > is going to happen and provide crash-consistency guarantees for such
> > > > journal?
> > > >
> > >
> > > That's the general idea.
> > > I have two use cases for pre-modification hooks:
> > > 1. VFS level snapshots
> > > 2. persistent change tracking
> > >
> > > TBH, I did not consider implementing any of the above in userspace,
> > > so I do not have a specific interest in extending the fanotify API.
> > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > that a snapshot or change tracking subsystem can register with.
> > > An in-kernel fsnotify event handler can set a flag in current task
> > > struct to circumvent system deadlocks on nested filesystem access.
> >
> > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > handlers stay within the kernel, I'm fine with that. After all this is what
> > LSMs are already doing. Just exposing this to userspace for arbitration is
> > what I have a problem with.
> >
> 
> Short update on that.
> 
> I decided to ditch the LSM hooks approach because I realized that for
> the purpose of persistent change tracking, the pre-modify hooks need
> to be called before the caller is taking filesystem locks.
> 
> So I added hooks inside mnt_want_write and file_start_write wrappers:
> https://github.com/amir73il/linux/commits/fsnotify_pre_modify

FWIW I've glanced through the series. I like the choice of mnt_want_write()
and file_start_write() as a place to generate the event. I somewhat dislike
the number of variants you have to introduce and then pass NULL in some
places because you don't have the info available and then it's not
immediately clear what semantics the event consumers can expect... That
would be good to define and then verify in the code.

Also given you have the requirement "no fs locks on event generation", I'm
not sure how reliable this can be. If you don't hold fs locks when
generating event, cannot it happen that actually modified object is
different from the reported one because we raced with some other fs
operations? And can we prove that? So what exactly is the usecase and
guarantees the event needs to provide?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-30  9:20         ` Jan Kara
@ 2020-06-30 14:28           ` Amir Goldstein
  2020-07-03 13:38             ` Jan Kara
  0 siblings, 1 reply; 11+ messages in thread
From: Amir Goldstein @ 2020-06-30 14:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 26-06-20 14:06:37, Amir Goldstein wrote:
> > On Wed, Apr 17, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 16-04-19 21:24:44, Amir Goldstein wrote:
> > > > > I'm not so sure about directory pre-modification hooks. Given the amount of
> > > > > problems we face with applications using fanotify permission events and
> > > > > deadlocking the system, I'm not very fond of expanding that API... AFAIU
> > > > > you want to use such hooks for recording (and persisting) that some change
> > > > > is going to happen and provide crash-consistency guarantees for such
> > > > > journal?
> > > > >
> > > >
> > > > That's the general idea.
> > > > I have two use cases for pre-modification hooks:
> > > > 1. VFS level snapshots
> > > > 2. persistent change tracking
> > > >
> > > > TBH, I did not consider implementing any of the above in userspace,
> > > > so I do not have a specific interest in extending the fanotify API.
> > > > I am actually interested in pre-modify fsnotify hooks (not fanotify),
> > > > that a snapshot or change tracking subsystem can register with.
> > > > An in-kernel fsnotify event handler can set a flag in current task
> > > > struct to circumvent system deadlocks on nested filesystem access.
> > >
> > > OK, I'm not opposed to fsnotify pre-modify hooks as such. As long as
> > > handlers stay within the kernel, I'm fine with that. After all this is what
> > > LSMs are already doing. Just exposing this to userspace for arbitration is
> > > what I have a problem with.
> > >
> >
> > Short update on that.
> >
> > I decided to ditch the LSM hooks approach because I realized that for
> > the purpose of persistent change tracking, the pre-modify hooks need
> > to be called before the caller is taking filesystem locks.
> >
> > So I added hooks inside mnt_want_write and file_start_write wrappers:
> > https://github.com/amir73il/linux/commits/fsnotify_pre_modify
>
> FWIW I've glanced through the series. I like the choice of mnt_want_write()
> and file_start_write() as a place to generate the event. I somewhat dislike

Thanks. I was looking for this initial feedback to know if direction in sane.

> the number of variants you have to introduce and then pass NULL in some
> places because you don't have the info available and then it's not
> immediately clear what semantics the event consumers can expect... That
> would be good to define and then verify in the code.
>

I am not sure I understand what you mean.
Did you mean that mnt_want_write_at() mnt_want_write_path() should be
actual functions instead of inline wrappers or something else?

> Also given you have the requirement "no fs locks on event generation", I'm
> not sure how reliable this can be. If you don't hold fs locks when
> generating event, cannot it happen that actually modified object is
> different from the reported one because we raced with some other fs
> operations? And can we prove that? So what exactly is the usecase and
> guarantees the event needs to provide?
>

That's a good question. Answer is not trivial.
The use case is "persistent change tracking snapshot".
"snapshot" because it tracks ALL changes since a point in time -
there is no concept of "consuming" events.
It is important to note that this is complementary to real time fs events.
A user library may combine the two mechanisms to a stream of changes
(either recorded or live), but that is out of scope for this effort.
Also, userspace would likely create periodic snapshots, so that e.g.
current snapshot records changes, while previous snapshot recorded
changes are being scanned.

The concept is to record every dir fid *before* an immediate child or directory
metadata itself may change, so that after a crash, all recorded dir fids
may be scanned to search for possibly missed changes.

The dir fid is stable, so races are not an issue in that respect.
When name is recorded, change tracking never examines the object at that
name, it just records the fact that there has been a change at [dir fid;name].
This is mostly needed to track creates.

Other than that, races should be handled by the backend itself, so proof is
pending the completion of the backend POC, but in hand waving:
- All name changes in the filesystem call the backend before the change
  (because backend marks sb) and backend is responsible for locking
against races
- My current implementation uses overlayfs upper/index as the change
  track storage, which has the benefit that the test "is change recorded"
  is implemented by decode_fh and/or name lookup, so it is already very much
  optimized by inode and dentry cache and shouldn't need any locking for
  most  pre_modify calls
- It is also not a coincidence that overlayfs changes to upper fs do not
  trigger pre_modify hooks because that prevents the feedback loop.
  I wrote in commit message that "is consistent with the fact that overlayfs
  sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
  because the path in underlying files is "fake" (open_with_fake_path()).

If any of this hand waving sounds terribly wrong please let me know.
Otherwise I will report back after POC is complete with a example backend.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-06-30 14:28           ` Amir Goldstein
@ 2020-07-03 13:38             ` Jan Kara
  2020-07-06 10:51               ` Amir Goldstein
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2020-07-03 13:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, linux-fsdevel, Al Viro, Miklos Szeredi,
	Matthew Bobrowski, overlayfs, Mel Gorman

On Tue 30-06-20 17:28:10, Amir Goldstein wrote:
> On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
> > the number of variants you have to introduce and then pass NULL in some
> > places because you don't have the info available and then it's not
> > immediately clear what semantics the event consumers can expect... That
> > would be good to define and then verify in the code.
> >
> 
> I am not sure I understand what you mean.
> Did you mean that mnt_want_write_at() mnt_want_write_path() should be
> actual functions instead of inline wrappers or something else?

Now looking at it, I find mnt_want_write_at2() the most confusing one. Also
the distinction between mnt_want_write_at() and mnt_want_write_path() seems
somewhat arbitrary at the first sight (how do you decide where to use
what?) but there I guess I see where you are coming from...

> > Also given you have the requirement "no fs locks on event generation", I'm
> > not sure how reliable this can be. If you don't hold fs locks when
> > generating event, cannot it happen that actually modified object is
> > different from the reported one because we raced with some other fs
> > operations? And can we prove that? So what exactly is the usecase and
> > guarantees the event needs to provide?
> 
> That's a good question. Answer is not trivial.
> The use case is "persistent change tracking snapshot".
> "snapshot" because it tracks ALL changes since a point in time -
> there is no concept of "consuming" events.

So you want to answer question like: "Which paths changed since given point
in time?" 100% reliably (no false negatives) in a powerfail safe manner? So
effectively something like btrfs send-receive?

> It is important to note that this is complementary to real time fs events.
> A user library may combine the two mechanisms to a stream of changes
> (either recorded or live), but that is out of scope for this effort.
> Also, userspace would likely create periodic snapshots, so that e.g.
> current snapshot records changes, while previous snapshot recorded
> changes are being scanned.
> 
> The concept is to record every dir fid *before* an immediate child or directory
> metadata itself may change, so that after a crash, all recorded dir fids
> may be scanned to search for possibly missed changes.
> 
> The dir fid is stable, so races are not an issue in that respect.
> When name is recorded, change tracking never examines the object at that
> name, it just records the fact that there has been a change at [dir fid;name].
> This is mostly needed to track creates.

You're right that by only tracking dir fids where something changed you've
elliminated most of problems since once we lookup a path, the change is
definitely going to happen in the dir we've looked up. If it really happens
or on which inode in the dir is not determined yet but the dir fid is. I'm
not yet 100% sure how you'll link the dir fids to actual paths on query or
how the handling of leaf dir entries is going to work but it seems possible
:)

> Other than that, races should be handled by the backend itself, so proof is
> pending the completion of the backend POC, but in hand waving:
> - All name changes in the filesystem call the backend before the change
>   (because backend marks sb) and backend is responsible for locking
> against races
> - My current implementation uses overlayfs upper/index as the change
>   track storage, which has the benefit that the test "is change recorded"
>   is implemented by decode_fh and/or name lookup, so it is already very much
>   optimized by inode and dentry cache and shouldn't need any locking for
>   most  pre_modify calls
> - It is also not a coincidence that overlayfs changes to upper fs do not
>   trigger pre_modify hooks because that prevents the feedback loop.
>   I wrote in commit message that "is consistent with the fact that overlayfs
>   sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
>   because the path in underlying files is "fake" (open_with_fake_path()).
> 
> If any of this hand waving sounds terribly wrong please let me know.
> Otherwise I will report back after POC is complete with a example backend.

It sounds like it could work :).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks)
  2020-07-03 13:38             ` Jan Kara
@ 2020-07-06 10:51               ` Amir Goldstein
  0 siblings, 0 replies; 11+ messages in thread
From: Amir Goldstein @ 2020-07-06 10:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Al Viro, Miklos Szeredi, Matthew Bobrowski,
	overlayfs, Mel Gorman

On Fri, Jul 3, 2020 at 4:38 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 30-06-20 17:28:10, Amir Goldstein wrote:
> > On Tue, Jun 30, 2020 at 12:20 PM Jan Kara <jack@suse.cz> wrote:
> > > the number of variants you have to introduce and then pass NULL in some
> > > places because you don't have the info available and then it's not
> > > immediately clear what semantics the event consumers can expect... That
> > > would be good to define and then verify in the code.
> > >
> >
> > I am not sure I understand what you mean.
> > Did you mean that mnt_want_write_at() mnt_want_write_path() should be
> > actual functions instead of inline wrappers or something else?
>
> Now looking at it, I find mnt_want_write_at2() the most confusing one. Also
> the distinction between mnt_want_write_at() and mnt_want_write_path() seems
> somewhat arbitrary at the first sight (how do you decide where to use
> what?) but there I guess I see where you are coming from...

OK, I renamed the wrappers and documented them as follows:

 * mnt_want_write_path - get write access to a path's mount
 * mnt_want_write_name - get write access to a path's mount before link/unlink
 * mnt_want_write_rename - get write access to a path's mount before rename

Pushed this to branch fsnotify_pre_modify.

I can see why the use of mnt_want_write_at() and mnt_want_write_path() seems
arbitrary. This is because in VFS code, @path argument can mean the path of the
object or path of the parent, depending on the function.
It is just by examining the code that you can figure that out, but in
practice path
means parent in link/unlink/rename functions.

>
> > > Also given you have the requirement "no fs locks on event generation", I'm
> > > not sure how reliable this can be. If you don't hold fs locks when
> > > generating event, cannot it happen that actually modified object is
> > > different from the reported one because we raced with some other fs
> > > operations? And can we prove that? So what exactly is the usecase and
> > > guarantees the event needs to provide?
> >
> > That's a good question. Answer is not trivial.
> > The use case is "persistent change tracking snapshot".
> > "snapshot" because it tracks ALL changes since a point in time -
> > there is no concept of "consuming" events.
>
> So you want to answer question like: "Which paths changed since given point
> in time?" 100% reliably (no false negatives) in a power fail safe manner? So
> effectively something like btrfs send-receive?
>

It's the same use case of btrfs send-receive (power fail safe
incremental backup),
but semantics are quite different.
btrfs send-receive information contains the actual changes.
Persistent change tracking only contains the information of where changes are
possible. IOW, there is potentially much more comparing to do with the target.
And of course, it is filesystem agnostic.

Thanks,
Amir.

> > It is important to note that this is complementary to real time fs events.
> > A user library may combine the two mechanisms to a stream of changes
> > (either recorded or live), but that is out of scope for this effort.
> > Also, userspace would likely create periodic snapshots, so that e.g.
> > current snapshot records changes, while previous snapshot recorded
> > changes are being scanned.
> >
> > The concept is to record every dir fid *before* an immediate child or directory
> > metadata itself may change, so that after a crash, all recorded dir fids
> > may be scanned to search for possibly missed changes.
> >
> > The dir fid is stable, so races are not an issue in that respect.
> > When name is recorded, change tracking never examines the object at that
> > name, it just records the fact that there has been a change at [dir fid;name].
> > This is mostly needed to track creates.
>
> You're right that by only tracking dir fids where something changed you've
> elliminated most of problems since once we lookup a path, the change is
> definitely going to happen in the dir we've looked up. If it really happens
> or on which inode in the dir is not determined yet but the dir fid is. I'm
> not yet 100% sure how you'll link the dir fids to actual paths on query or
> how the handling of leaf dir entries is going to work but it seems possible
> :)
>
> > Other than that, races should be handled by the backend itself, so proof is
> > pending the completion of the backend POC, but in hand waving:
> > - All name changes in the filesystem call the backend before the change
> >   (because backend marks sb) and backend is responsible for locking
> > against races
> > - My current implementation uses overlayfs upper/index as the change
> >   track storage, which has the benefit that the test "is change recorded"
> >   is implemented by decode_fh and/or name lookup, so it is already very much
> >   optimized by inode and dentry cache and shouldn't need any locking for
> >   most  pre_modify calls
> > - It is also not a coincidence that overlayfs changes to upper fs do not
> >   trigger pre_modify hooks because that prevents the feedback loop.
> >   I wrote in commit message that "is consistent with the fact that overlayfs
> >   sets the FMODE_NONOTIFY flag on underlying open files" - that is needed
> >   because the path in underlying files is "fake" (open_with_fake_path()).
> >
> > If any of this hand waving sounds terribly wrong please let me know.
> > Otherwise I will report back after POC is complete with a example backend.
>
> It sounds like it could work :).
>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAOQ4uxgn=YNj8cJuccx2KqxEVGZy1z3DBVYXrD=Mc7Dc=Je+-w@mail.gmail.com>
     [not found] ` <20190416154513.GB13422@quack2.suse.cz>
2019-04-16 18:24   ` fanotify and LSM path hooks Amir Goldstein
2019-04-17 11:30     ` Jan Kara
2019-04-17 12:14       ` Miklos Szeredi
2019-04-17 14:05         ` Jan Kara
2019-04-17 14:14           ` Miklos Szeredi
2019-04-18 10:53             ` Jan Kara
2020-06-26 11:06       ` fsnotify pre-modify VFS hooks (Was: fanotify and LSM path hooks) Amir Goldstein
2020-06-30  9:20         ` Jan Kara
2020-06-30 14:28           ` Amir Goldstein
2020-07-03 13:38             ` Jan Kara
2020-07-06 10:51               ` Amir Goldstein

($INBOX_DIR/description missing)

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-unionfs/0 linux-unionfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-unionfs linux-unionfs/ https://lore.kernel.org/linux-unionfs \
		linux-unionfs@vger.kernel.org
	public-inbox-index linux-unionfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-unionfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git