All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] Volatile fanotify marks
@ 2022-02-23 18:42 Amir Goldstein
  2022-02-28 14:05 ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2022-02-23 18:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

Hi Jan,

I wanted to get your feedback on an idea I have been playing with.
It started as a poor man's alternative to the old subtree watch problem.
For my employer's use case, we are watching the entire filesystem using
a filesystem mark, but would like to exclude events on a subtree
(i.e. all files underneath .private/).

At the moment, those events are filtered in userspace.
I had considered adding directory marks with an ignored mask on every
event that is received for a directory path under .private/, but that has the
undesired side effect of pinning those directory inodes to cache.

I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
kernel internal fsnotify backend. I wonder what are your thoughts on
exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fsnotify-volatile

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-02-23 18:42 [RFC] Volatile fanotify marks Amir Goldstein
@ 2022-02-28 14:05 ` Jan Kara
  2022-02-28 17:40   ` Amir Goldstein
  2022-05-13 15:30   ` Matthew Bobrowski
  0 siblings, 2 replies; 17+ messages in thread
From: Jan Kara @ 2022-02-28 14:05 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

Hi Amir!

On Wed 23-02-22 20:42:37, Amir Goldstein wrote:
> I wanted to get your feedback on an idea I have been playing with.
> It started as a poor man's alternative to the old subtree watch problem.
> For my employer's use case, we are watching the entire filesystem using
> a filesystem mark, but would like to exclude events on a subtree
> (i.e. all files underneath .private/).
> 
> At the moment, those events are filtered in userspace.
> I had considered adding directory marks with an ignored mask on every
> event that is received for a directory path under .private/, but that has the
> undesired side effect of pinning those directory inodes to cache.
> 
> I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
> kernel internal fsnotify backend. I wonder what are your thoughts on
> exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).

Interesting idea. I have some reservations wrt to the implementation (e.g.
fsnotify_add_mark_list() convention of returning EEXIST when it updated
mark's mask, or the fact that inode reclaim should now handle freeing of
mark connector and attached marks - which may get interesting locking wise)
but they are all fixable.

I'm wondering a bit whether this is really useful enough (and consequently
whether we will not get another request to extend fanotify API in some
other way to cater better to some other usecase related to subtree watches
in the near future). I understand ignore marks are mainly a performance
optimization and as such allowing inodes to be reclaimed (which means they
are not used much and hence ignored mark is not very useful anyway) makes
sense. Thinking about this more, I guess it is useful to improve efficiency
when you want to implement any userspace event-filtering scheme.

The only remaining pending question I have is whether we should not go
further and allow event filtering to happen using an eBPF program. That
would be even more efficient (both in terms of memory and CPU). What do you
think?

								Honza

> 
> Thanks,
> Amir.
> 
> [1] https://github.com/amir73il/linux/commits/fsnotify-volatile
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-02-28 14:05 ` Jan Kara
@ 2022-02-28 17:40   ` Amir Goldstein
  2022-03-01 11:07     ` Jan Kara
  2022-03-01 12:26     ` Tycho Kirchner
  2022-05-13 15:30   ` Matthew Bobrowski
  1 sibling, 2 replies; 17+ messages in thread
From: Amir Goldstein @ 2022-02-28 17:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Tycho Kirchner

On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> On Wed 23-02-22 20:42:37, Amir Goldstein wrote:
> > I wanted to get your feedback on an idea I have been playing with.
> > It started as a poor man's alternative to the old subtree watch problem.
> > For my employer's use case, we are watching the entire filesystem using
> > a filesystem mark, but would like to exclude events on a subtree
> > (i.e. all files underneath .private/).
> >
> > At the moment, those events are filtered in userspace.
> > I had considered adding directory marks with an ignored mask on every
> > event that is received for a directory path under .private/, but that has the
> > undesired side effect of pinning those directory inodes to cache.
> >
> > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
> > kernel internal fsnotify backend. I wonder what are your thoughts on
> > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).
>
> Interesting idea. I have some reservations wrt to the implementation (e.g.
> fsnotify_add_mark_list() convention of returning EEXIST when it updated
> mark's mask, or the fact that inode reclaim should now handle freeing of
> mark connector and attached marks - which may get interesting locking wise)
> but they are all fixable.

Can you give me a hint as to how to implement the freeing of marks?

>
> I'm wondering a bit whether this is really useful enough (and consequently
> whether we will not get another request to extend fanotify API in some
> other way to cater better to some other usecase related to subtree watches
> in the near future). I understand ignore marks are mainly a performance
> optimization and as such allowing inodes to be reclaimed (which means they
> are not used much and hence ignored mark is not very useful anyway) makes

The problem is that we do not know in advance which of the many dirs in
the subtree are accessed often and which are accessed rarely (and that may
change over time), so volatile ignore marks are a way to set up ignore marks
on the most accessed dirs dynamically.

> sense. Thinking about this more, I guess it is useful to improve efficiency
> when you want to implement any userspace event-filtering scheme.
>
> The only remaining pending question I have is whether we should not go
> further and allow event filtering to happen using an eBPF program. That
> would be even more efficient (both in terms of memory and CPU). What do you
> think?
>

I think that is an unrelated question.

I do agree that we should NOT add "subtree filter" functionality to fanotify
(or any other filter) and that instead, we should add support for attaching an
eBPF program that implements is_subdir().
I found this [1] convection with Tycho where you had suggested this idea.
I wonder if Tycho got to explore this path further?

But I think that it is one thing to recommend users to implement their
filters as
eBPF programs and another thing to stand in the way of users that prefer to
implement userspace event filtering. It could be that the filter
cannot be easily
described by static rules to an eBPF program (e.g. need to query a database).

In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule.
It adds resource control by stating that the ignore mark is "best effort".

Does it make sense?

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-02-28 17:40   ` Amir Goldstein
@ 2022-03-01 11:07     ` Jan Kara
  2022-03-01 11:27       ` Amir Goldstein
  2022-03-01 12:26     ` Tycho Kirchner
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Kara @ 2022-03-01 11:07 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Tycho Kirchner

On Mon 28-02-22 19:40:07, Amir Goldstein wrote:
> On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
> > On Wed 23-02-22 20:42:37, Amir Goldstein wrote:
> > > I wanted to get your feedback on an idea I have been playing with.
> > > It started as a poor man's alternative to the old subtree watch problem.
> > > For my employer's use case, we are watching the entire filesystem using
> > > a filesystem mark, but would like to exclude events on a subtree
> > > (i.e. all files underneath .private/).
> > >
> > > At the moment, those events are filtered in userspace.
> > > I had considered adding directory marks with an ignored mask on every
> > > event that is received for a directory path under .private/, but that has the
> > > undesired side effect of pinning those directory inodes to cache.
> > >
> > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
> > > kernel internal fsnotify backend. I wonder what are your thoughts on
> > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).
> >
> > Interesting idea. I have some reservations wrt to the implementation (e.g.
> > fsnotify_add_mark_list() convention of returning EEXIST when it updated
> > mark's mask, or the fact that inode reclaim should now handle freeing of
> > mark connector and attached marks - which may get interesting locking wise)
> > but they are all fixable.
> 
> Can you give me a hint as to how to implement the freeing of marks?

OK, now I can see that fsnotify_inode_delete() gets called from
__destroy_inode() and thus all marks should be freed even for inodes
released by inode reclaim. Good.

> > I'm wondering a bit whether this is really useful enough (and consequently
> > whether we will not get another request to extend fanotify API in some
> > other way to cater better to some other usecase related to subtree watches
> > in the near future). I understand ignore marks are mainly a performance
> > optimization and as such allowing inodes to be reclaimed (which means they
> > are not used much and hence ignored mark is not very useful anyway) makes
> 
> The problem is that we do not know in advance which of the many dirs in
> the subtree are accessed often and which are accessed rarely (and that may
> change over time), so volatile ignore marks are a way to set up ignore marks
> on the most accessed dirs dynamically.

Yes, I understand.

> > sense. Thinking about this more, I guess it is useful to improve efficiency
> > when you want to implement any userspace event-filtering scheme.
> >
> > The only remaining pending question I have is whether we should not go
> > further and allow event filtering to happen using an eBPF program. That
> > would be even more efficient (both in terms of memory and CPU). What do you
> > think?
> >
> 
> I think that is an unrelated question.
> 
> I do agree that we should NOT add "subtree filter" functionality to fanotify
> (or any other filter) and that instead, we should add support for attaching an
> eBPF program that implements is_subdir().
> I found this [1] convection with Tycho where you had suggested this idea.
> I wonder if Tycho got to explore this path further?
> 
> But I think that it is one thing to recommend users to implement their
> filters as
> eBPF programs and another thing to stand in the way of users that prefer to
> implement userspace event filtering. It could be that the filter
> cannot be easily
> described by static rules to an eBPF program (e.g. need to query a database).
> 
> In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule.
> It adds resource control by stating that the ignore mark is "best effort".
> 
> Does it make sense?

OK, makes sense. So I agree the functionality is worth it. Will you post
the patches for review of technical details?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-01 11:07     ` Jan Kara
@ 2022-03-01 11:27       ` Amir Goldstein
  0 siblings, 0 replies; 17+ messages in thread
From: Amir Goldstein @ 2022-03-01 11:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Tycho Kirchner

On Tue, Mar 1, 2022 at 1:07 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 28-02-22 19:40:07, Amir Goldstein wrote:
> > On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hi Amir!
> > >
> > > On Wed 23-02-22 20:42:37, Amir Goldstein wrote:
> > > > I wanted to get your feedback on an idea I have been playing with.
> > > > It started as a poor man's alternative to the old subtree watch problem.
> > > > For my employer's use case, we are watching the entire filesystem using
> > > > a filesystem mark, but would like to exclude events on a subtree
> > > > (i.e. all files underneath .private/).
> > > >
> > > > At the moment, those events are filtered in userspace.
> > > > I had considered adding directory marks with an ignored mask on every
> > > > event that is received for a directory path under .private/, but that has the
> > > > undesired side effect of pinning those directory inodes to cache.
> > > >
> > > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
> > > > kernel internal fsnotify backend. I wonder what are your thoughts on
> > > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).
> > >
> > > Interesting idea. I have some reservations wrt to the implementation (e.g.
> > > fsnotify_add_mark_list() convention of returning EEXIST when it updated
> > > mark's mask, or the fact that inode reclaim should now handle freeing of
> > > mark connector and attached marks - which may get interesting locking wise)
> > > but they are all fixable.
> >
> > Can you give me a hint as to how to implement the freeing of marks?
>
> OK, now I can see that fsnotify_inode_delete() gets called from
> __destroy_inode() and thus all marks should be freed even for inodes
> released by inode reclaim. Good.
>
> > > I'm wondering a bit whether this is really useful enough (and consequently
> > > whether we will not get another request to extend fanotify API in some
> > > other way to cater better to some other usecase related to subtree watches
> > > in the near future). I understand ignore marks are mainly a performance
> > > optimization and as such allowing inodes to be reclaimed (which means they
> > > are not used much and hence ignored mark is not very useful anyway) makes
> >
> > The problem is that we do not know in advance which of the many dirs in
> > the subtree are accessed often and which are accessed rarely (and that may
> > change over time), so volatile ignore marks are a way to set up ignore marks
> > on the most accessed dirs dynamically.
>
> Yes, I understand.
>
> > > sense. Thinking about this more, I guess it is useful to improve efficiency
> > > when you want to implement any userspace event-filtering scheme.
> > >
> > > The only remaining pending question I have is whether we should not go
> > > further and allow event filtering to happen using an eBPF program. That
> > > would be even more efficient (both in terms of memory and CPU). What do you
> > > think?
> > >
> >
> > I think that is an unrelated question.
> >
> > I do agree that we should NOT add "subtree filter" functionality to fanotify
> > (or any other filter) and that instead, we should add support for attaching an
> > eBPF program that implements is_subdir().
> > I found this [1] convection with Tycho where you had suggested this idea.
> > I wonder if Tycho got to explore this path further?
> >
> > But I think that it is one thing to recommend users to implement their
> > filters as
> > eBPF programs and another thing to stand in the way of users that prefer to
> > implement userspace event filtering. It could be that the filter
> > cannot be easily
> > described by static rules to an eBPF program (e.g. need to query a database).
> >
> > In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule.
> > It adds resource control by stating that the ignore mark is "best effort".
> >
> > Does it make sense?
>
> OK, makes sense. So I agree the functionality is worth it. Will you post
> the patches for review of technical details?

Of course, I need to add a patch for UAPI change and write a test.
I just wanted to get a tentative ACK before I put in more effort.
I will address your comment about -EEXIST return value.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-02-28 17:40   ` Amir Goldstein
  2022-03-01 11:07     ` Jan Kara
@ 2022-03-01 12:26     ` Tycho Kirchner
  2022-03-01 16:58       ` Amir Goldstein
  1 sibling, 1 reply; 17+ messages in thread
From: Tycho Kirchner @ 2022-03-01 12:26 UTC (permalink / raw)
  To: Amir Goldstein, Jan Kara; +Cc: linux-fsdevel



>>> I wanted to get your feedback on an idea I have been playing with.
>>> It started as a poor man's alternative to the old subtree watch problem.


> I do agree that we should NOT add "subtree filter" functionality to fanotify
> (or any other filter) and that instead, we should add support for attaching an
> eBPF program that implements is_subdir().
> I found this [1] convection with Tycho where you had suggested this idea.
> I wonder if Tycho got to explore this path further?
> 
> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/

Hi Amir, Hi Jan,
Thanks for pinging back on me. Indeed I did "explore this path further".
In my project
https://github.com/tycho-kirchner/shournal

the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user.


Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing.
Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?) but what do you think about the following approach, in order to solve the subtree watch problem:
- Store the include/exlude-paths of interest as *strings* in a hashset.
- on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This
   can either happen with a reference on the path (clear older paths periodically in a work queue)
   or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused).
   The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient
   subpath-lookup.


Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
That way existing inotify-users could finally switch to the cleaner/more powerful fanotify.


Thanks and kind regards
Tycho

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-01 12:26     ` Tycho Kirchner
@ 2022-03-01 16:58       ` Amir Goldstein
  2022-03-02 10:04         ` Tycho Kirchner
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2022-03-01 16:58 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel

On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
>
>
> >>> I wanted to get your feedback on an idea I have been playing with.
> >>> It started as a poor man's alternative to the old subtree watch problem.
>
>
> > I do agree that we should NOT add "subtree filter" functionality to fanotify
> > (or any other filter) and that instead, we should add support for attaching an
> > eBPF program that implements is_subdir().
> > I found this [1] convection with Tycho where you had suggested this idea.
> > I wonder if Tycho got to explore this path further?
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/
>
> Hi Amir, Hi Jan,
> Thanks for pinging back on me. Indeed I did "explore this path further".
> In my project
> https://github.com/tycho-kirchner/shournal
>
> the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user.
>
>
> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing.
> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?)

FAN_MARK_VOLATILE is not a solution for "include" filters.
It is a solution for "exclude" filters implemented in userspace.
If monitoring program gets an event and decides that its path should be excluded
it may set a "volatile" exclude mark on that directory that will
suppress further
events from that directory for as long as the directory inode remains
in inode cache.
After directory inode has not been accessed for a while and evicted
from inode cache
the monitoring program can get an event in that directory again and then it can
re-install the volatile ignore mark if it wants to.

> but what do you think about the following approach, in order to solve the subtree watch problem:
> - Store the include/exlude-paths of interest as *strings* in a hashset.
> - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This
>    can either happen with a reference on the path (clear older paths periodically in a work queue)
>    or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused).
>    The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient
>    subpath-lookup.

I would implement filtering with is_subdir() and not with d_path(),
but there are
advantages to either approach.
In any case, I see there is BPF_FUNC_d_path, so why can't your approach be
implemented using an eBPF program?

>
> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify.

We already have partial support for unprivileged fanotify.
Which features are you missing with unprivileged fanotify?
and why do you think that filtering by process tree will allow those
features to be enabled?
A child process may well have more privileges to read directories than
its parent.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-01 16:58       ` Amir Goldstein
@ 2022-03-02 10:04         ` Tycho Kirchner
  2022-03-02 18:14           ` Amir Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Tycho Kirchner @ 2022-03-02 10:04 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel



Am 01.03.22 um 17:58 schrieb Amir Goldstein:
> On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>
>>
>>
>>>>> I wanted to get your feedback on an idea I have been playing with.
>>>>> It started as a poor man's alternative to the old subtree watch problem.
>>
>>
>>> I do agree that we should NOT add "subtree filter" functionality to fanotify
>>> (or any other filter) and that instead, we should add support for attaching an
>>> eBPF program that implements is_subdir().
>>> I found this [1] convection with Tycho where you had suggested this idea.
>>> I wonder if Tycho got to explore this path further?
>>>
>>> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/
>>
>> Hi Amir, Hi Jan,
>> Thanks for pinging back on me. Indeed I did "explore this path further".
>> In my project
>> https://github.com/tycho-kirchner/shournal
>>
>> the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user.
>>
>>
>> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing.
>> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?)
> 
> FAN_MARK_VOLATILE is not a solution for "include" filters.
> It is a solution for "exclude" filters implemented in userspace.
> If monitoring program gets an event and decides that its path should be excluded
> it may set a "volatile" exclude mark on that directory that will
> suppress further
> events from that directory for as long as the directory inode remains
> in inode cache.
> After directory inode has not been accessed for a while and evicted
> from inode cache
> the monitoring program can get an event in that directory again and then it can
> re-install the volatile ignore mark if it wants to.
> 
Thanks for this explanation. Regarding few exclude-directories this sounds useful. However, if a whole directory-tree of filesystem events shall be excluded I guess the performance benefit will be rather small. A benchmark may clarify this ( I have some yet unpublished code ready, in case you are interested). If an efficient algorithm can be found I would rather vote for "include" dirs with unlimited depth. Btw. similar to the process-filter approach by unshared mount namespaces about which I wrote in our last correspondence, you may be able to exclude your .private/ directory by bind-mounting over it and otherwise only marking only those mounts of interest instead of the entire filesystem. But yeah, this is kinda messy.

>> but what do you think about the following approach, in order to solve the subtree watch problem:
>> - Store the include/exlude-paths of interest as *strings* in a hashset.
>> - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This
>>     can either happen with a reference on the path (clear older paths periodically in a work queue)
>>     or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused).
>>     The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient
>>     subpath-lookup.
> 
> I would implement filtering with is_subdir() and not with d_path(),
> but there are
> advantages to either approach.
> In any case, I see there is BPF_FUNC_d_path, so why can't your approach be
> implemented using an eBPF program?
>It seems that bpf_d_path was introduced with v5.10 (6e22ab9da79343532cd3cde39df25e5a5478c692), however, shournal must still run on older kernels (e.g. openSUSE Leap v5.3.18). Further, as far as I remember, at least in Linux 4.19 there was quite some overhead to just install the fd into the eBPF user-space process, but I have to re-check that once that functionality is more widespread.


>>
>> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
>> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify.
> 
> We already have partial support for unprivileged fanotify.
> Which features are you missing with unprivileged fanotify?
> and why do you think that filtering by process tree will allow those
> features to be enabled?


I am missing the ability to filter for (close-)events of large directory trees in a race-free manner, so that no events are lost on newly created dirs. Even without the race, monitoring my home-directory is impossible (without privileges) as I have far more than 8192 directories (393941 as of writing (; ).
Monitoring mounts solves these problems but introduces two others:
First it requires privileges, second a potentially large number of events *not of interest* have to be copied to user-space (except unshared mount namespaces are used). Allowing a user to only monitor his/her own processes would make mark_mount privileges unnecessary (please correct me if I'm wrong). While still events above the directory of interest are reported, at least events from other users are filtered beforehand.

> A child process may well have more privileges to read directories than
> its parent.
> 
Similar to ptrace fanotiy should then not follow suid-programs, so this case should not occur.

After all I totally understand that you do not want to feature-bloat fanotify and maybe my use-case is already too far from the one casual users have. On the other hand, pid- or path-filtering is maybe basic enough and fanotify does offer the ability to filter for paths - it is just quite limited due to the mark-concept. I think it should not be necessary in order to monitor a directory tree, to touch every single directory inside beforehand. Maybe a hybrid-solution fits best here: hard-code pid-filtering as a security feature into fanotify, allow marking of mounts for the user's own processes and allow for eBPF filter rules afterwards.

Thanks and kind regards
Tycho






^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-02 10:04         ` Tycho Kirchner
@ 2022-03-02 18:14           ` Amir Goldstein
  2022-03-03  9:24             ` Jan Kara
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2022-03-02 18:14 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel

On Wed, Mar 2, 2022 at 12:04 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
>
>
> Am 01.03.22 um 17:58 schrieb Amir Goldstein:
> > On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
> >>
> >>
> >>
> >>>>> I wanted to get your feedback on an idea I have been playing with.
> >>>>> It started as a poor man's alternative to the old subtree watch problem.
> >>
> >>
> >>> I do agree that we should NOT add "subtree filter" functionality to fanotify
> >>> (or any other filter) and that instead, we should add support for attaching an
> >>> eBPF program that implements is_subdir().
> >>> I found this [1] convection with Tycho where you had suggested this idea.
> >>> I wonder if Tycho got to explore this path further?
> >>>
> >>> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/
> >>
> >> Hi Amir, Hi Jan,
> >> Thanks for pinging back on me. Indeed I did "explore this path further".
> >> In my project
> >> https://github.com/tycho-kirchner/shournal
> >>
> >> the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user.
> >>
> >>
> >> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing.
> >> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?)
> >
> > FAN_MARK_VOLATILE is not a solution for "include" filters.
> > It is a solution for "exclude" filters implemented in userspace.
> > If monitoring program gets an event and decides that its path should be excluded
> > it may set a "volatile" exclude mark on that directory that will
> > suppress further
> > events from that directory for as long as the directory inode remains
> > in inode cache.
> > After directory inode has not been accessed for a while and evicted
> > from inode cache
> > the monitoring program can get an event in that directory again and then it can
> > re-install the volatile ignore mark if it wants to.
> >
> Thanks for this explanation. Regarding few exclude-directories this sounds useful.
> However, if a whole directory-tree of filesystem events shall be excluded I guess the
> performance benefit will be rather small. A benchmark may clarify this
> ( I have some yet unpublished code ready, in case you are interested).

Code for what?

> If an efficient algorithm can be found I would rather vote for "include" dirs with unlimited depth.

As I said, this is desirable, difficult and completely orthogonal to
FAN_MARK_VOLATILE
functionality.

> Btw. similar to the process-filter approach by unshared mount namespaces about which
> I wrote in our last correspondence, you may be able to exclude your .private/ directory
> by bind-mounting over it and otherwise only marking only those mounts of interest
> instead of the entire filesystem. But yeah, this is kinda messy.
>

Marking bind mounts could be a good option for some use cases.
But unlike your monitoring app, my app needs to track create/unlink/rename as
well and those events are not currently available for mount marks.
I had several attempts to tackle that but they did not work out yet.
Partly, because marking a bind mount is not as useful as filtering by subtree.

> >> but what do you think about the following approach, in order to solve the subtree watch problem:
> >> - Store the include/exlude-paths of interest as *strings* in a hashset.
> >> - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This
> >>     can either happen with a reference on the path (clear older paths periodically in a work queue)
> >>     or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused).
> >>     The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient
> >>     subpath-lookup.
> >
> > I would implement filtering with is_subdir() and not with d_path(),
> > but there are
> > advantages to either approach.
> > In any case, I see there is BPF_FUNC_d_path, so why can't your approach be
> > implemented using an eBPF program?
> >It seems that bpf_d_path was introduced with v5.10 (6e22ab9da79343532cd3cde39df25e5a5478c692), however, shournal must still run on older kernels (e.g. openSUSE Leap v5.3.18). Further, as far as I remember, at least in Linux 4.19 there was quite some overhead to just install the fd into the eBPF user-space process, but I have to re-check that once that functionality is more widespread.
>

There is no need to install any fd.
The program should hook a function that has access to a struct path.

>
> >>
> >> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
> >> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify.
> >
> > We already have partial support for unprivileged fanotify.
> > Which features are you missing with unprivileged fanotify?
> > and why do you think that filtering by process tree will allow those
> > features to be enabled?
>
>
> I am missing the ability to filter for (close-)events of large directory trees in a race-free manner, so that no events are lost on newly created dirs. Even without the race, monitoring my home-directory is impossible (without privileges) as I have far more than 8192 directories (393941 as of writing (; ).
> Monitoring mounts solves these problems but introduces two others:
> First it requires privileges, second a potentially large number of events *not of interest* have to be copied to user-space (except unshared mount namespaces are used). Allowing a user to only monitor his/her own processes would make mark_mount privileges unnecessary (please correct me if I'm wrong). While still events above the directory of interest are reported, at least events from other users are filtered beforehand.
>

I don't know. Security model is hard.
What do you mean by "his/her own processes"? processes owned by the same uid?
With simple look it sounds right, but other security policy may be in
play (e.g. sepolicy)
which can grand different processes owned by same user different file access
permissions and not any process may be allowed to ptrace other processes.
userns has more clear semantics, so monitoring all processes/mounts inside
an unprivileged userns may be easier to prove.

> > A child process may well have more privileges to read directories than
> > its parent.
> >
> Similar to ptrace fanotiy should then not follow suid-programs, so this case should not occur.
>
> After all I totally understand that you do not want to feature-bloat fanotify and maybe my use-case is already too far from the one casual users have. On the other hand, pid- or path-filtering is maybe basic enough and fanotify does offer the ability to filter for paths - it is just quite limited due to the mark-concept. I think it should not be necessary in order to monitor a directory tree, to touch every single directory inside beforehand. Maybe a hybrid-solution fits best here: hard-code pid-filtering as a security feature into fanotify, allow marking of mounts for the user's own processes and allow for eBPF filter rules afterwards.
>

This is a bit too hand waving for me to understand.
In the end, it's all in the details.
Need to see a whole design and/or implementation to be able to say
what are its benefits and how doable it is.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-02 18:14           ` Amir Goldstein
@ 2022-03-03  9:24             ` Jan Kara
  2022-05-02  9:13               ` Tycho Kirchner
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2022-03-03  9:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Tycho Kirchner, Jan Kara, linux-fsdevel

On Wed 02-03-22 20:14:29, Amir Goldstein wrote:
> On Wed, Mar 2, 2022 at 12:04 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
> > >>
> > >> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes.
> > >> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify.
> > >
> > > We already have partial support for unprivileged fanotify.
> > > Which features are you missing with unprivileged fanotify?
> > > and why do you think that filtering by process tree will allow those
> > > features to be enabled?
> >
> >
> > I am missing the ability to filter for (close-)events of large
> > directory trees in a race-free manner, so that no events are lost on
> > newly created dirs. Even without the race, monitoring my home-directory
> > is impossible (without privileges) as I have far more than 8192
> > directories (393941 as of writing (; ).  Monitoring mounts solves these
> > problems but introduces two others: First it requires privileges,
> > second a potentially large number of events *not of interest* have to
> > be copied to user-space (except unshared mount namespaces are used).
> > Allowing a user to only monitor his/her own processes would make
> > mark_mount privileges unnecessary (please correct me if I'm wrong).
> > While still events above the directory of interest are reported, at
> > least events from other users are filtered beforehand.
> 
> I don't know. Security model is hard.
> What do you mean by "his/her own processes"? processes owned by the same uid?
> With simple look it sounds right, but other security policy may be in
> play (e.g. sepolicy)
> which can grand different processes owned by same user different file access
> permissions and not any process may be allowed to ptrace other processes.
> userns has more clear semantics, so monitoring all processes/mounts inside
> an unprivileged userns may be easier to prove.

I see two problems with limiting events to those generated by a particular
process / user:

1) Fanotify is a filesystem notification system. As such it is primarily
aimed at (more or less efficient) answering of a question - did something
in the filesystem change, was some data from the filesystem used? If you
start to limit visible events to processes / users you are no longer able
to reliably answer this question. As such we would get complaints "but this
is not good enough for our usecase" sooner rather than later. In filesystem
change notification space we have a long history of partial solutions that
then forced us into full redesign and all the pain associated with that.

2) Limiting events to those generated by a particular user may somewhat
reduce the amount of generated events but for a lot of usecases that is not
really significant. So the push for some middle ground between - watching a
file / dir and watching whole fs will still stay. 

Regarding the security model for unpriviledged watches: IMO a sensible
security model for fs notification could be like: "If you can read the
file, you should be able to watch for changes. If you own the file, you
should be able to watch for accesses." But the trouble is that for
filesystem wide marks, it is not easy to verify these conditions.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-03-03  9:24             ` Jan Kara
@ 2022-05-02  9:13               ` Tycho Kirchner
  2022-05-04  6:13                 ` Amir Goldstein
  0 siblings, 1 reply; 17+ messages in thread
From: Tycho Kirchner @ 2022-05-02  9:13 UTC (permalink / raw)
  To: Jan Kara, Amir Goldstein; +Cc: linux-fsdevel

All right, I thought a bit more about that and returned to your
original BPF idea you mentioned on 2020-08-28:

> I was thinking that we could add a BPF hook to fanotify_handle_event()
> (similar to what's happening in packet filtering code) and you could attach
> BPF programs to this hook to do filtering of events. That way we don't have
> to introduce new group flags for various filtering options. The question is
> whether eBPF is strong enough so that filters useful for fanotify users
> could be implemented with it but this particular check seems implementable.
>
> 								Honza

Instead of changing fanotify's filesystem notification functionality,
I suggest to rather **add a tracing mode (fantrace)**.

The synchronous handling of syscalls via ptrace is of course required
for debugging purposes, however that introduces a major slowdown (even
with seccomp-bpf filters). There are a number of cases, including
[1-3], where async processing of file events of specific tasks would be
fine but is not readily available in Linux. Fanotify already ships
important infrastructure in this regard: it provides very fast
event-buffering and, by using file descriptors instead of resolved
paths, a clean and race-free API to process the events later. However,
as already stated, fanotify does not provide a clean way, to monitor
only a subset of tasks. Therefore please consider the following
proposed architecture of fantrace:

Each taks gets its own struct fsnotify_group. Within
fsnotify.c:fsnotify() it is checked if the given task has a
fsnotify_group attached where events of interest are buffered as usual.
Note that this is an additional hook - sysadmins being subscribed to
filesystem events rather than task-filesystem-events are notified as
usual - in that case two hooks possibly run. The fsnotify_group is
extended by a field optionally pointing to a BPF program which allows
for custom filters to be run.

Some implementation details:
- To let the tracee return quickly, run BPF filter program within tracer
   context during read(fan_fd) but before events are copied to userspace
- only one fantracer per task, which overrides existing ones if any
- task->fsnotify_group refcount increment on fork, decrement on exit (run
   after exit_files(tsk) to not miss final close events). When last task
   exited, send EOF to listener.
- on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
- lazy check when event occurs, if listener is still alive (refcount > 1)
- for the beginning, to keep things simple and to "solve" the cleanup of
   filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
   (only allow FAN_MARK_FILESYSTEM), as that functionality can be
   implemented within the user-provided BPF-program.

A working implementation of this concept, which effectively does the
same using hardcoded filter rules can be found in my kernel module
shournalk [4]. For instance In kernel/event_handler.c:event_handler_fput()
it is checked, if the task is observed using a hashtable, and if so,
the event is stored to a buffer corresponding to that process tree.

Thanks
Tycho


[1] Chirigati F, Rampin R, Shasha D, Freire J. (2016). ReproZip:
Computational Reproducibility with Ease. Paper presented at the
Proceedings of the 2016 International Conference on Management of Data.
San Francisco, CA: New York: Association for Computing Technology.
https://github.com/VIDA-NYU/reprozip
[2] Guo, P. (2012). CDE: A Tool For Creating Portable Experimental
Software Packages. Computing in Science & Engineering 14, 332–35
[3] Tycho Kirchner, Konstantin Riege, Steve Hoffmann (2020). Bashing
irreproducibility with shournal bioRxiv 2020.08.03.232843; doi:
https://doi.org/10.1101/2020.08.03.232843
[4] https://github.com/tycho-kirchner/shournal





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-05-02  9:13               ` Tycho Kirchner
@ 2022-05-04  6:13                 ` Amir Goldstein
  2022-05-04 10:01                   ` Tycho Kirchner
  0 siblings, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2022-05-04  6:13 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel

On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
> All right, I thought a bit more about that and returned to your
> original BPF idea you mentioned on 2020-08-28:
>
> > I was thinking that we could add a BPF hook to fanotify_handle_event()
> > (similar to what's happening in packet filtering code) and you could attach
> > BPF programs to this hook to do filtering of events. That way we don't have
> > to introduce new group flags for various filtering options. The question is
> > whether eBPF is strong enough so that filters useful for fanotify users
> > could be implemented with it but this particular check seems implementable.
> >
> >                                                               Honza
>
> Instead of changing fanotify's filesystem notification functionality,
> I suggest to rather **add a tracing mode (fantrace)**.
>
> The synchronous handling of syscalls via ptrace is of course required
> for debugging purposes, however that introduces a major slowdown (even
> with seccomp-bpf filters). There are a number of cases, including
> [1-3], where async processing of file events of specific tasks would be
> fine but is not readily available in Linux. Fanotify already ships
> important infrastructure in this regard: it provides very fast
> event-buffering and, by using file descriptors instead of resolved
> paths, a clean and race-free API to process the events later. However,
> as already stated, fanotify does not provide a clean way, to monitor
> only a subset of tasks. Therefore please consider the following
> proposed architecture of fantrace:
>
> Each taks gets its own struct fsnotify_group. Within
> fsnotify.c:fsnotify() it is checked if the given task has a
> fsnotify_group attached where events of interest are buffered as usual.
> Note that this is an additional hook - sysadmins being subscribed to
> filesystem events rather than task-filesystem-events are notified as
> usual - in that case two hooks possibly run. The fsnotify_group is
> extended by a field optionally pointing to a BPF program which allows
> for custom filters to be run.
>
> Some implementation details:
> - To let the tracee return quickly, run BPF filter program within tracer
>    context during read(fan_fd) but before events are copied to userspace
> - only one fantracer per task, which overrides existing ones if any
> - task->fsnotify_group refcount increment on fork, decrement on exit (run
>    after exit_files(tsk) to not miss final close events). When last task
>    exited, send EOF to listener.
> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
> - lazy check when event occurs, if listener is still alive (refcount > 1)
> - for the beginning, to keep things simple and to "solve" the cleanup of
>    filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
>    (only allow FAN_MARK_FILESYSTEM), as that functionality can be
>    implemented within the user-provided BPF-program.
>

Maybe I am slow, but I did not understand the need for this task fsnotify_group.

What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
that hook is supposed to filter by pid so why all this extra complexity?

We may consider the option to have another BFP hook when reading
events if there is
good justification, but subtree filters will have to be in handle_event().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-05-04  6:13                 ` Amir Goldstein
@ 2022-05-04 10:01                   ` Tycho Kirchner
  2022-05-04 14:37                     ` Amir Goldstein
  2022-05-05 10:42                     ` Jan Kara
  0 siblings, 2 replies; 17+ messages in thread
From: Tycho Kirchner @ 2022-05-04 10:01 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel



Am 04.05.22 um 08:13 schrieb Amir Goldstein:
> On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>
>> All right, I thought a bit more about that and returned to your
>> original BPF idea you mentioned on 2020-08-28:
>>
>>> I was thinking that we could add a BPF hook to fanotify_handle_event()
>>> (similar to what's happening in packet filtering code) and you could attach
>>> BPF programs to this hook to do filtering of events. That way we don't have
>>> to introduce new group flags for various filtering options. The question is
>>> whether eBPF is strong enough so that filters useful for fanotify users
>>> could be implemented with it but this particular check seems implementable.
>>>
>>>                                                                Honza
>>
>> Instead of changing fanotify's filesystem notification functionality,
>> I suggest to rather **add a tracing mode (fantrace)**.
>>
>> The synchronous handling of syscalls via ptrace is of course required
>> for debugging purposes, however that introduces a major slowdown (even
>> with seccomp-bpf filters). There are a number of cases, including
>> [1-3], where async processing of file events of specific tasks would be
>> fine but is not readily available in Linux. Fanotify already ships
>> important infrastructure in this regard: it provides very fast
>> event-buffering and, by using file descriptors instead of resolved
>> paths, a clean and race-free API to process the events later. However,
>> as already stated, fanotify does not provide a clean way, to monitor
>> only a subset of tasks. Therefore please consider the following
>> proposed architecture of fantrace:
>>
>> Each taks gets its own struct fsnotify_group. Within
>> fsnotify.c:fsnotify() it is checked if the given task has a
>> fsnotify_group attached where events of interest are buffered as usual.
>> Note that this is an additional hook - sysadmins being subscribed to
>> filesystem events rather than task-filesystem-events are notified as
>> usual - in that case two hooks possibly run. The fsnotify_group is
>> extended by a field optionally pointing to a BPF program which allows
>> for custom filters to be run.
>>
>> Some implementation details:
>> - To let the tracee return quickly, run BPF filter program within tracer
>>     context during read(fan_fd) but before events are copied to userspace
>> - only one fantracer per task, which overrides existing ones if any
>> - task->fsnotify_group refcount increment on fork, decrement on exit (run
>>     after exit_files(tsk) to not miss final close events). When last task
>>     exited, send EOF to listener.
>> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
>> - lazy check when event occurs, if listener is still alive (refcount > 1)
>> - for the beginning, to keep things simple and to "solve" the cleanup of
>>     filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
>>     (only allow FAN_MARK_FILESYSTEM), as that functionality can be
>>     implemented within the user-provided BPF-program.
>>
> 
> Maybe I am slow, but I did not understand the need for this task fsnotify_group.
> 
> What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
> that hook is supposed to filter by pid so why all this extra complexity?
> 
> We may consider the option to have another BFP hook when reading
> events if there is
> good justification, but subtree filters will have to be in handle_event().
> 
> Thanks,
> Amir.

To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip)
file-events from all paths have to be reported, which is difficult
using i_fsnotify_marks, because
- marking whole mountpoints requires privileges
- marking the whole filesystem using directory marks is unfeasible

However, we need a quick way to find out, if a file event is of
interest (find its beloning fsnotify_group). For the purpose of tracing
it appears reasonable to consider all file events of a traced task as
"interesting" in the first place. So, in this way, we allow a user to
trace file events of his own tasks without slowing down other,
non-traced tasks.

After all, it's all about the order of running filters - first inode,
then pid or reverse. With my proposed architecture for the purpose of
tracing I would hand the inode-filter to the user in form of an
optional BPF hook. Performance-wise that's also the "fair" solution.
Let's assume we allow marking the whole filesystem (via mountpoints).
Now, the BPF-pid-filter code has to run for every single file event (of
all users!), if multiple users trace the filesystem even multiple hooks
have to run, slowing down the whole system.


Thanks,
Tycho

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-05-04 10:01                   ` Tycho Kirchner
@ 2022-05-04 14:37                     ` Amir Goldstein
  2022-05-06  9:59                       ` Tycho Kirchner
  2022-05-05 10:42                     ` Jan Kara
  1 sibling, 1 reply; 17+ messages in thread
From: Amir Goldstein @ 2022-05-04 14:37 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel

On Wed, May 4, 2022 at 1:01 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
>
>
> Am 04.05.22 um 08:13 schrieb Amir Goldstein:
> > On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
> >>
> >> All right, I thought a bit more about that and returned to your
> >> original BPF idea you mentioned on 2020-08-28:
> >>
> >>> I was thinking that we could add a BPF hook to fanotify_handle_event()
> >>> (similar to what's happening in packet filtering code) and you could attach
> >>> BPF programs to this hook to do filtering of events. That way we don't have
> >>> to introduce new group flags for various filtering options. The question is
> >>> whether eBPF is strong enough so that filters useful for fanotify users
> >>> could be implemented with it but this particular check seems implementable.
> >>>
> >>>                                                                Honza
> >>
> >> Instead of changing fanotify's filesystem notification functionality,
> >> I suggest to rather **add a tracing mode (fantrace)**.
> >>
> >> The synchronous handling of syscalls via ptrace is of course required
> >> for debugging purposes, however that introduces a major slowdown (even
> >> with seccomp-bpf filters). There are a number of cases, including
> >> [1-3], where async processing of file events of specific tasks would be
> >> fine but is not readily available in Linux. Fanotify already ships
> >> important infrastructure in this regard: it provides very fast
> >> event-buffering and, by using file descriptors instead of resolved
> >> paths, a clean and race-free API to process the events later. However,
> >> as already stated, fanotify does not provide a clean way, to monitor
> >> only a subset of tasks. Therefore please consider the following
> >> proposed architecture of fantrace:
> >>
> >> Each taks gets its own struct fsnotify_group. Within
> >> fsnotify.c:fsnotify() it is checked if the given task has a
> >> fsnotify_group attached where events of interest are buffered as usual.
> >> Note that this is an additional hook - sysadmins being subscribed to
> >> filesystem events rather than task-filesystem-events are notified as
> >> usual - in that case two hooks possibly run. The fsnotify_group is
> >> extended by a field optionally pointing to a BPF program which allows
> >> for custom filters to be run.
> >>
> >> Some implementation details:
> >> - To let the tracee return quickly, run BPF filter program within tracer
> >>     context during read(fan_fd) but before events are copied to userspace
> >> - only one fantracer per task, which overrides existing ones if any
> >> - task->fsnotify_group refcount increment on fork, decrement on exit (run
> >>     after exit_files(tsk) to not miss final close events). When last task
> >>     exited, send EOF to listener.
> >> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
> >> - lazy check when event occurs, if listener is still alive (refcount > 1)
> >> - for the beginning, to keep things simple and to "solve" the cleanup of
> >>     filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
> >>     (only allow FAN_MARK_FILESYSTEM), as that functionality can be
> >>     implemented within the user-provided BPF-program.
> >>
> >
> > Maybe I am slow, but I did not understand the need for this task fsnotify_group.
> >
> > What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
> > that hook is supposed to filter by pid so why all this extra complexity?
> >
> > We may consider the option to have another BFP hook when reading
> > events if there is
> > good justification, but subtree filters will have to be in handle_event().
> >
> > Thanks,
> > Amir.
>
> To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip)
> file-events from all paths have to be reported, which is difficult
> using i_fsnotify_marks, because
> - marking whole mountpoints requires privileges
> - marking the whole filesystem using directory marks is unfeasible
>
> However, we need a quick way to find out, if a file event is of
> interest (find its beloning fsnotify_group). For the purpose of tracing
> it appears reasonable to consider all file events of a traced task as
> "interesting" in the first place. So, in this way, we allow a user to
> trace file events of his own tasks without slowing down other,
> non-traced tasks.
>
> After all, it's all about the order of running filters - first inode,
> then pid or reverse. With my proposed architecture for the purpose of
> tracing I would hand the inode-filter to the user in form of an
> optional BPF hook. Performance-wise that's also the "fair" solution.
> Let's assume we allow marking the whole filesystem (via mountpoints).
> Now, the BPF-pid-filter code has to run for every single file event (of
> all users!), if multiple users trace the filesystem even multiple hooks
> have to run, slowing down the whole system.
>

I understand now what you were trying to do, but still not convinced
that the added complexity to the kernel is worth it, because you may be
able to achieve the same with userspace LD_PRELOAD hooking.

BTW, the fanotify_userns patch set [1] was an attempt to allow unprivileged
user watch over subtree with low performance impact on the rest of the
file system.
This is not exactly what you need, but maybe it could be made
into what you need.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/fanotify_userns

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-05-04 10:01                   ` Tycho Kirchner
  2022-05-04 14:37                     ` Amir Goldstein
@ 2022-05-05 10:42                     ` Jan Kara
  1 sibling, 0 replies; 17+ messages in thread
From: Jan Kara @ 2022-05-05 10:42 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel

On Wed 04-05-22 12:01:49, Tycho Kirchner wrote:
> Am 04.05.22 um 08:13 schrieb Amir Goldstein:
> > On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
> > > 
> > > All right, I thought a bit more about that and returned to your
> > > original BPF idea you mentioned on 2020-08-28:
> > > 
> > > > I was thinking that we could add a BPF hook to fanotify_handle_event()
> > > > (similar to what's happening in packet filtering code) and you could attach
> > > > BPF programs to this hook to do filtering of events. That way we don't have
> > > > to introduce new group flags for various filtering options. The question is
> > > > whether eBPF is strong enough so that filters useful for fanotify users
> > > > could be implemented with it but this particular check seems implementable.
> > > > 
> > > >                                                                Honza
> > > 
> > > Instead of changing fanotify's filesystem notification functionality,
> > > I suggest to rather **add a tracing mode (fantrace)**.
> > > 
> > > The synchronous handling of syscalls via ptrace is of course required
> > > for debugging purposes, however that introduces a major slowdown (even
> > > with seccomp-bpf filters). There are a number of cases, including
> > > [1-3], where async processing of file events of specific tasks would be
> > > fine but is not readily available in Linux. Fanotify already ships
> > > important infrastructure in this regard: it provides very fast
> > > event-buffering and, by using file descriptors instead of resolved
> > > paths, a clean and race-free API to process the events later. However,
> > > as already stated, fanotify does not provide a clean way, to monitor
> > > only a subset of tasks. Therefore please consider the following
> > > proposed architecture of fantrace:
> > > 
> > > Each taks gets its own struct fsnotify_group. Within
> > > fsnotify.c:fsnotify() it is checked if the given task has a
> > > fsnotify_group attached where events of interest are buffered as usual.
> > > Note that this is an additional hook - sysadmins being subscribed to
> > > filesystem events rather than task-filesystem-events are notified as
> > > usual - in that case two hooks possibly run. The fsnotify_group is
> > > extended by a field optionally pointing to a BPF program which allows
> > > for custom filters to be run.
> > > 
> > > Some implementation details:
> > > - To let the tracee return quickly, run BPF filter program within tracer
> > >     context during read(fan_fd) but before events are copied to userspace
> > > - only one fantracer per task, which overrides existing ones if any
> > > - task->fsnotify_group refcount increment on fork, decrement on exit (run
> > >     after exit_files(tsk) to not miss final close events). When last task
> > >     exited, send EOF to listener.
> > > - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
> > > - lazy check when event occurs, if listener is still alive (refcount > 1)
> > > - for the beginning, to keep things simple and to "solve" the cleanup of
> > >     filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
> > >     (only allow FAN_MARK_FILESYSTEM), as that functionality can be
> > >     implemented within the user-provided BPF-program.
> > > 
> > 
> > Maybe I am slow, but I did not understand the need for this task fsnotify_group.
> > 
> > What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
> > that hook is supposed to filter by pid so why all this extra complexity?
> > 
> > We may consider the option to have another BFP hook when reading
> > events if there is
> > good justification, but subtree filters will have to be in handle_event().
> > 
> > Thanks,
> > Amir.
> 
> To be a reasonable async replacement for ptrace (see e.g. mentioned
> reprozip) file-events from all paths have to be reported, which is
> difficult using i_fsnotify_marks, because
> - marking whole mountpoints requires privileges
> - marking the whole filesystem using directory marks is unfeasible

I agree with both points here but I guess the point is that fanotify simply
is not meant to be "async replacement for ptrace". It is a filesystem
change monitoring API. Things like Linux tracing (e.g. perf-trace(1)) could
be considered as such, but they require priviledges as well.

> However, we need a quick way to find out, if a file event is of
> interest (find its beloning fsnotify_group). For the purpose of tracing
> it appears reasonable to consider all file events of a traced task as
> "interesting" in the first place. So, in this way, we allow a user to
> trace file events of his own tasks without slowing down other,
> non-traced tasks.

So I agree that the idea of storing some "owner" information (be it user or
particular pid) in fsnotify_group and notify group about events only from
"owner" has some appeal because it nicely solves the question "is user
priviledged enough to see this event?" But it does not quite fit in the
philosophy of a filesystem monitor because fanotify filesystem changes from
other "owners" will not generate notification although you can see them
when looking at the filesystem. I have not completely made up my mind how
much that matters ;)

> After all, it's all about the order of running filters - first inode,
> then pid or reverse. With my proposed architecture for the purpose of
> tracing I would hand the inode-filter to the user in form of an
> optional BPF hook. Performance-wise that's also the "fair" solution.
> Let's assume we allow marking the whole filesystem (via mountpoints).
> Now, the BPF-pid-filter code has to run for every single file event (of
> all users!), if multiple users trace the filesystem even multiple hooks
> have to run, slowing down the whole system.

I understand what you'd like to achieve but things don't work as you
imagine it. Fsnotify has hooks in paths modifying filesystem. When the hook
gets called with the modified object (file, dir, ..) we need to quickly
tell whether the event may be of interest or not - for that we consult the
inode, parent directory, mountpoint, superblock. At this point we don't
know we should limit anything to a particular task because we do not know
yet which notification groups are interested in the event. Only if we find
someone is interested, we walk lists of notification groups, identify which
notification group is exactly interested and at that point we could know
this is your special group limited to a particular task and filter by task.

What you are suggesting could be also implemented by adding another object
type to watch - a task. But honestly it does not seem like a good idea to
expand fanotify in such direction because fanotify is all about filesystem
objects so adding so vastly different objects such as originating task does
not fit well with the API.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-05-04 14:37                     ` Amir Goldstein
@ 2022-05-06  9:59                       ` Tycho Kirchner
  0 siblings, 0 replies; 17+ messages in thread
From: Tycho Kirchner @ 2022-05-06  9:59 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel



>>> On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>>>
>>>> All right, I thought a bit more about that and returned to your
>>>> original BPF idea you mentioned on 2020-08-28:
>>>>
>>>>> I was thinking that we could add a BPF hook to fanotify_handle_event()
>>>>> (similar to what's happening in packet filtering code) and you could attach
>>>>> BPF programs to this hook to do filtering of events. That way we don't have
>>>>> to introduce new group flags for various filtering options. The question is
>>>>> whether eBPF is strong enough so that filters useful for fanotify users
>>>>> could be implemented with it but this particular check seems implementable.
>>>>>
>>>>>                                                                 Honza
>>>>
>>>> Instead of changing fanotify's filesystem notification functionality,
>>>> I suggest to rather **add a tracing mode (fantrace)**.
>>>>
>>>> The synchronous handling of syscalls via ptrace is of course required
>>>> for debugging purposes, however that introduces a major slowdown (even
>>>> with seccomp-bpf filters). There are a number of cases, including
>>>> [1-3], where async processing of file events of specific tasks would be
>>>> fine but is not readily available in Linux. Fanotify already ships
>>>> important infrastructure in this regard: it provides very fast
>>>> event-buffering and, by using file descriptors instead of resolved
>>>> paths, a clean and race-free API to process the events later. However,
>>>> as already stated, fanotify does not provide a clean way, to monitor
>>>> only a subset of tasks. Therefore please consider the following
>>>> proposed architecture of fantrace:
>>>>
>>>> Each taks gets its own struct fsnotify_group. Within
>>>> fsnotify.c:fsnotify() it is checked if the given task has a
>>>> fsnotify_group attached where events of interest are buffered as usual.
>>>> Note that this is an additional hook - sysadmins being subscribed to
>>>> filesystem events rather than task-filesystem-events are notified as
>>>> usual - in that case two hooks possibly run. The fsnotify_group is
>>>> extended by a field optionally pointing to a BPF program which allows
>>>> for custom filters to be run.
>>>>
>>>> Some implementation details:
>>>> - To let the tracee return quickly, run BPF filter program within tracer
>>>>      context during read(fan_fd) but before events are copied to userspace
>>>> - only one fantracer per task, which overrides existing ones if any
>>>> - task->fsnotify_group refcount increment on fork, decrement on exit (run
>>>>      after exit_files(tsk) to not miss final close events). When last task
>>>>      exited, send EOF to listener.
>>>> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace)
>>>> - lazy check when event occurs, if listener is still alive (refcount > 1)
>>>> - for the beginning, to keep things simple and to "solve" the cleanup of
>>>>      filesystem marks, I suggest to disable i_fsnotify_marks for fantrace
>>>>      (only allow FAN_MARK_FILESYSTEM), as that functionality can be
>>>>      implemented within the user-provided BPF-program.
>>>>
>>>
>>> Maybe I am slow, but I did not understand the need for this task fsnotify_group.
>>>
>>> What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event())
>>> that hook is supposed to filter by pid so why all this extra complexity?
>>>
>>> We may consider the option to have another BFP hook when reading
>>> events if there is
>>> good justification, but subtree filters will have to be in handle_event().
>>>
>>> Thanks,
>>> Amir.
>>
>> To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip)
>> file-events from all paths have to be reported, which is difficult
>> using i_fsnotify_marks, because
>> - marking whole mountpoints requires privileges
>> - marking the whole filesystem using directory marks is unfeasible
>>
>> However, we need a quick way to find out, if a file event is of
>> interest (find its beloning fsnotify_group). For the purpose of tracing
>> it appears reasonable to consider all file events of a traced task as
>> "interesting" in the first place. So, in this way, we allow a user to
>> trace file events of his own tasks without slowing down other,
>> non-traced tasks.
>>
>> After all, it's all about the order of running filters - first inode,
>> then pid or reverse. With my proposed architecture for the purpose of
>> tracing I would hand the inode-filter to the user in form of an
>> optional BPF hook. Performance-wise that's also the "fair" solution.
>> Let's assume we allow marking the whole filesystem (via mountpoints).
>> Now, the BPF-pid-filter code has to run for every single file event (of
>> all users!), if multiple users trace the filesystem even multiple hooks
>> have to run, slowing down the whole system.
>>
>
> I understand now what you were trying to do, but still not convinced
> that the added complexity to the kernel is worth it, because you may be
> able to achieve the same with userspace LD_PRELOAD hooking.

In fact, I use that approach in shournal's fanotify backend, as the
shell process cannot join the mount namespace of the currently executing
command (without re-exec). See here, for example:
https://github.com/tycho-kirchner/shournal/blob/master/src/shell-integration-fanotify/libshournal-shellwatch.cpp

However, while this works for the supported shells bash and zsh,
the LD_PRELOAD approach in general is problematic:

- Static compiled executables cannot be traced (rare, ptrace with
   seccomp-bpf filters might be ok for those)
- killed tasks may still have fd's open, so signals also have to be hooked.
   SIGKILL or crashing tasks (e.g. segfault) cannot be handled at all (though
   this would be ok)
- close- or other library-calls are potentially inlined. Some programs may
   even use syscalls directly (probably a rare issue, but I'm not sure)
- quite some hooks required: close, fclose (glibc does not use the close wrapper),
   fcloseall, dup2, dup3, exit, _exit and possibly others

Actually, this sounds more of a complexity. I guess I could make a
proof-of-principle implementation of my proposed fantrace in a few days,
but of course this makes only sense if there is a slight chance this
could be accepted.

>
> BTW, the fanotify_userns patch set [1] was an attempt to allow unprivileged
> user watch over subtree with low performance impact on the rest of the
> file system.
> This is not exactly what you need, but maybe it could be made
> into what you need.
> [1] https://github.com/amir73il/linux/commits/fanotify_userns

I'll take at look, but while I'm fine in compiling a custom kernel
users of my tool are not. Further, using user namespaces in the
outermost shell layer (where my tool runs) is problematic for
administrators or users of setuid binaries.

Thanks,
Tycho


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC] Volatile fanotify marks
  2022-02-28 14:05 ` Jan Kara
  2022-02-28 17:40   ` Amir Goldstein
@ 2022-05-13 15:30   ` Matthew Bobrowski
  1 sibling, 0 replies; 17+ messages in thread
From: Matthew Bobrowski @ 2022-05-13 15:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, linux-fsdevel

On Mon, Feb 28, 2022 at 03:05:56PM +0100, Jan Kara wrote:
> Hi Amir!
> 
> On Wed 23-02-22 20:42:37, Amir Goldstein wrote:
> > I wanted to get your feedback on an idea I have been playing with.
> > It started as a poor man's alternative to the old subtree watch problem.
> > For my employer's use case, we are watching the entire filesystem using
> > a filesystem mark, but would like to exclude events on a subtree
> > (i.e. all files underneath .private/).
> > 
> > At the moment, those events are filtered in userspace.
> > I had considered adding directory marks with an ignored mask on every
> > event that is received for a directory path under .private/, but that has the
> > undesired side effect of pinning those directory inodes to cache.
> > 
> > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs
> > kernel internal fsnotify backend. I wonder what are your thoughts on
> > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE).
> 
> Interesting idea. I have some reservations wrt to the implementation (e.g.
> fsnotify_add_mark_list() convention of returning EEXIST when it updated
> mark's mask, or the fact that inode reclaim should now handle freeing of
> mark connector and attached marks - which may get interesting locking wise)
> but they are all fixable.
> 
> I'm wondering a bit whether this is really useful enough (and consequently
> whether we will not get another request to extend fanotify API in some
> other way to cater better to some other usecase related to subtree watches
> in the near future). I understand ignore marks are mainly a performance
> optimization and as such allowing inodes to be reclaimed (which means they
> are not used much and hence ignored mark is not very useful anyway) makes
> sense. Thinking about this more, I guess it is useful to improve efficiency
> when you want to implement any userspace event-filtering scheme.
> 
> The only remaining pending question I have is whether we should not go
> further and allow event filtering to happen using an eBPF program. That
> would be even more efficient (both in terms of memory and CPU). What do you
> think?

Wait. Did I just read that Jan is open to implementing in kernel event
filtering through eBPF? This feature is something that I'd definitely
be interested in and perhaps also open to running with doing the
design/implementation. One of the really obvious filtering semantics
that immediately comes to mind for me would be to filter on expected
processes/binary images touching an expected set of files. Currently,
this level of filtering has traditionally been done in userspace, but
if we could offload that it'd be nice...

/M

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-05-13 15:30 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 18:42 [RFC] Volatile fanotify marks Amir Goldstein
2022-02-28 14:05 ` Jan Kara
2022-02-28 17:40   ` Amir Goldstein
2022-03-01 11:07     ` Jan Kara
2022-03-01 11:27       ` Amir Goldstein
2022-03-01 12:26     ` Tycho Kirchner
2022-03-01 16:58       ` Amir Goldstein
2022-03-02 10:04         ` Tycho Kirchner
2022-03-02 18:14           ` Amir Goldstein
2022-03-03  9:24             ` Jan Kara
2022-05-02  9:13               ` Tycho Kirchner
2022-05-04  6:13                 ` Amir Goldstein
2022-05-04 10:01                   ` Tycho Kirchner
2022-05-04 14:37                     ` Amir Goldstein
2022-05-06  9:59                       ` Tycho Kirchner
2022-05-05 10:42                     ` Jan Kara
2022-05-13 15:30   ` Matthew Bobrowski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.