* [RFC] Volatile fanotify marks @ 2022-02-23 18:42 Amir Goldstein 2022-02-28 14:05 ` Jan Kara 0 siblings, 1 reply; 17+ messages in thread From: Amir Goldstein @ 2022-02-23 18:42 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel Hi Jan, I wanted to get your feedback on an idea I have been playing with. It started as a poor man's alternative to the old subtree watch problem. For my employer's use case, we are watching the entire filesystem using a filesystem mark, but would like to exclude events on a subtree (i.e. all files underneath .private/). At the moment, those events are filtered in userspace. I had considered adding directory marks with an ignored mask on every event that is received for a directory path under .private/, but that has the undesired side effect of pinning those directory inodes to cache. I have this old fsnotify-volatile branch [1] that I am using for an overlayfs kernel internal fsnotify backend. I wonder what are your thoughts on exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fsnotify-volatile ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-02-23 18:42 [RFC] Volatile fanotify marks Amir Goldstein @ 2022-02-28 14:05 ` Jan Kara 2022-02-28 17:40 ` Amir Goldstein 2022-05-13 15:30 ` Matthew Bobrowski 0 siblings, 2 replies; 17+ messages in thread From: Jan Kara @ 2022-02-28 14:05 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel Hi Amir! On Wed 23-02-22 20:42:37, Amir Goldstein wrote: > I wanted to get your feedback on an idea I have been playing with. > It started as a poor man's alternative to the old subtree watch problem. > For my employer's use case, we are watching the entire filesystem using > a filesystem mark, but would like to exclude events on a subtree > (i.e. all files underneath .private/). > > At the moment, those events are filtered in userspace. > I had considered adding directory marks with an ignored mask on every > event that is received for a directory path under .private/, but that has the > undesired side effect of pinning those directory inodes to cache. > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs > kernel internal fsnotify backend. I wonder what are your thoughts on > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). Interesting idea. I have some reservations wrt to the implementation (e.g. fsnotify_add_mark_list() convention of returning EEXIST when it updated mark's mask, or the fact that inode reclaim should now handle freeing of mark connector and attached marks - which may get interesting locking wise) but they are all fixable. I'm wondering a bit whether this is really useful enough (and consequently whether we will not get another request to extend fanotify API in some other way to cater better to some other usecase related to subtree watches in the near future). I understand ignore marks are mainly a performance optimization and as such allowing inodes to be reclaimed (which means they are not used much and hence ignored mark is not very useful anyway) makes sense. Thinking about this more, I guess it is useful to improve efficiency when you want to implement any userspace event-filtering scheme. The only remaining pending question I have is whether we should not go further and allow event filtering to happen using an eBPF program. That would be even more efficient (both in terms of memory and CPU). What do you think? Honza > > Thanks, > Amir. > > [1] https://github.com/amir73il/linux/commits/fsnotify-volatile -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-02-28 14:05 ` Jan Kara @ 2022-02-28 17:40 ` Amir Goldstein 2022-03-01 11:07 ` Jan Kara 2022-03-01 12:26 ` Tycho Kirchner 2022-05-13 15:30 ` Matthew Bobrowski 1 sibling, 2 replies; 17+ messages in thread From: Amir Goldstein @ 2022-02-28 17:40 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Tycho Kirchner On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote: > > Hi Amir! > > On Wed 23-02-22 20:42:37, Amir Goldstein wrote: > > I wanted to get your feedback on an idea I have been playing with. > > It started as a poor man's alternative to the old subtree watch problem. > > For my employer's use case, we are watching the entire filesystem using > > a filesystem mark, but would like to exclude events on a subtree > > (i.e. all files underneath .private/). > > > > At the moment, those events are filtered in userspace. > > I had considered adding directory marks with an ignored mask on every > > event that is received for a directory path under .private/, but that has the > > undesired side effect of pinning those directory inodes to cache. > > > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs > > kernel internal fsnotify backend. I wonder what are your thoughts on > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). > > Interesting idea. I have some reservations wrt to the implementation (e.g. > fsnotify_add_mark_list() convention of returning EEXIST when it updated > mark's mask, or the fact that inode reclaim should now handle freeing of > mark connector and attached marks - which may get interesting locking wise) > but they are all fixable. Can you give me a hint as to how to implement the freeing of marks? > > I'm wondering a bit whether this is really useful enough (and consequently > whether we will not get another request to extend fanotify API in some > other way to cater better to some other usecase related to subtree watches > in the near future). I understand ignore marks are mainly a performance > optimization and as such allowing inodes to be reclaimed (which means they > are not used much and hence ignored mark is not very useful anyway) makes The problem is that we do not know in advance which of the many dirs in the subtree are accessed often and which are accessed rarely (and that may change over time), so volatile ignore marks are a way to set up ignore marks on the most accessed dirs dynamically. > sense. Thinking about this more, I guess it is useful to improve efficiency > when you want to implement any userspace event-filtering scheme. > > The only remaining pending question I have is whether we should not go > further and allow event filtering to happen using an eBPF program. That > would be even more efficient (both in terms of memory and CPU). What do you > think? > I think that is an unrelated question. I do agree that we should NOT add "subtree filter" functionality to fanotify (or any other filter) and that instead, we should add support for attaching an eBPF program that implements is_subdir(). I found this [1] convection with Tycho where you had suggested this idea. I wonder if Tycho got to explore this path further? But I think that it is one thing to recommend users to implement their filters as eBPF programs and another thing to stand in the way of users that prefer to implement userspace event filtering. It could be that the filter cannot be easily described by static rules to an eBPF program (e.g. need to query a database). In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule. It adds resource control by stating that the ignore mark is "best effort". Does it make sense? Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-02-28 17:40 ` Amir Goldstein @ 2022-03-01 11:07 ` Jan Kara 2022-03-01 11:27 ` Amir Goldstein 2022-03-01 12:26 ` Tycho Kirchner 1 sibling, 1 reply; 17+ messages in thread From: Jan Kara @ 2022-03-01 11:07 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel, Tycho Kirchner On Mon 28-02-22 19:40:07, Amir Goldstein wrote: > On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote: > > > > Hi Amir! > > > > On Wed 23-02-22 20:42:37, Amir Goldstein wrote: > > > I wanted to get your feedback on an idea I have been playing with. > > > It started as a poor man's alternative to the old subtree watch problem. > > > For my employer's use case, we are watching the entire filesystem using > > > a filesystem mark, but would like to exclude events on a subtree > > > (i.e. all files underneath .private/). > > > > > > At the moment, those events are filtered in userspace. > > > I had considered adding directory marks with an ignored mask on every > > > event that is received for a directory path under .private/, but that has the > > > undesired side effect of pinning those directory inodes to cache. > > > > > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs > > > kernel internal fsnotify backend. I wonder what are your thoughts on > > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). > > > > Interesting idea. I have some reservations wrt to the implementation (e.g. > > fsnotify_add_mark_list() convention of returning EEXIST when it updated > > mark's mask, or the fact that inode reclaim should now handle freeing of > > mark connector and attached marks - which may get interesting locking wise) > > but they are all fixable. > > Can you give me a hint as to how to implement the freeing of marks? OK, now I can see that fsnotify_inode_delete() gets called from __destroy_inode() and thus all marks should be freed even for inodes released by inode reclaim. Good. > > I'm wondering a bit whether this is really useful enough (and consequently > > whether we will not get another request to extend fanotify API in some > > other way to cater better to some other usecase related to subtree watches > > in the near future). I understand ignore marks are mainly a performance > > optimization and as such allowing inodes to be reclaimed (which means they > > are not used much and hence ignored mark is not very useful anyway) makes > > The problem is that we do not know in advance which of the many dirs in > the subtree are accessed often and which are accessed rarely (and that may > change over time), so volatile ignore marks are a way to set up ignore marks > on the most accessed dirs dynamically. Yes, I understand. > > sense. Thinking about this more, I guess it is useful to improve efficiency > > when you want to implement any userspace event-filtering scheme. > > > > The only remaining pending question I have is whether we should not go > > further and allow event filtering to happen using an eBPF program. That > > would be even more efficient (both in terms of memory and CPU). What do you > > think? > > > > I think that is an unrelated question. > > I do agree that we should NOT add "subtree filter" functionality to fanotify > (or any other filter) and that instead, we should add support for attaching an > eBPF program that implements is_subdir(). > I found this [1] convection with Tycho where you had suggested this idea. > I wonder if Tycho got to explore this path further? > > But I think that it is one thing to recommend users to implement their > filters as > eBPF programs and another thing to stand in the way of users that prefer to > implement userspace event filtering. It could be that the filter > cannot be easily > described by static rules to an eBPF program (e.g. need to query a database). > > In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule. > It adds resource control by stating that the ignore mark is "best effort". > > Does it make sense? OK, makes sense. So I agree the functionality is worth it. Will you post the patches for review of technical details? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-01 11:07 ` Jan Kara @ 2022-03-01 11:27 ` Amir Goldstein 0 siblings, 0 replies; 17+ messages in thread From: Amir Goldstein @ 2022-03-01 11:27 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, Tycho Kirchner On Tue, Mar 1, 2022 at 1:07 PM Jan Kara <jack@suse.cz> wrote: > > On Mon 28-02-22 19:40:07, Amir Goldstein wrote: > > On Mon, Feb 28, 2022 at 4:06 PM Jan Kara <jack@suse.cz> wrote: > > > > > > Hi Amir! > > > > > > On Wed 23-02-22 20:42:37, Amir Goldstein wrote: > > > > I wanted to get your feedback on an idea I have been playing with. > > > > It started as a poor man's alternative to the old subtree watch problem. > > > > For my employer's use case, we are watching the entire filesystem using > > > > a filesystem mark, but would like to exclude events on a subtree > > > > (i.e. all files underneath .private/). > > > > > > > > At the moment, those events are filtered in userspace. > > > > I had considered adding directory marks with an ignored mask on every > > > > event that is received for a directory path under .private/, but that has the > > > > undesired side effect of pinning those directory inodes to cache. > > > > > > > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs > > > > kernel internal fsnotify backend. I wonder what are your thoughts on > > > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). > > > > > > Interesting idea. I have some reservations wrt to the implementation (e.g. > > > fsnotify_add_mark_list() convention of returning EEXIST when it updated > > > mark's mask, or the fact that inode reclaim should now handle freeing of > > > mark connector and attached marks - which may get interesting locking wise) > > > but they are all fixable. > > > > Can you give me a hint as to how to implement the freeing of marks? > > OK, now I can see that fsnotify_inode_delete() gets called from > __destroy_inode() and thus all marks should be freed even for inodes > released by inode reclaim. Good. > > > > I'm wondering a bit whether this is really useful enough (and consequently > > > whether we will not get another request to extend fanotify API in some > > > other way to cater better to some other usecase related to subtree watches > > > in the near future). I understand ignore marks are mainly a performance > > > optimization and as such allowing inodes to be reclaimed (which means they > > > are not used much and hence ignored mark is not very useful anyway) makes > > > > The problem is that we do not know in advance which of the many dirs in > > the subtree are accessed often and which are accessed rarely (and that may > > change over time), so volatile ignore marks are a way to set up ignore marks > > on the most accessed dirs dynamically. > > Yes, I understand. > > > > sense. Thinking about this more, I guess it is useful to improve efficiency > > > when you want to implement any userspace event-filtering scheme. > > > > > > The only remaining pending question I have is whether we should not go > > > further and allow event filtering to happen using an eBPF program. That > > > would be even more efficient (both in terms of memory and CPU). What do you > > > think? > > > > > > > I think that is an unrelated question. > > > > I do agree that we should NOT add "subtree filter" functionality to fanotify > > (or any other filter) and that instead, we should add support for attaching an > > eBPF program that implements is_subdir(). > > I found this [1] convection with Tycho where you had suggested this idea. > > I wonder if Tycho got to explore this path further? > > > > But I think that it is one thing to recommend users to implement their > > filters as > > eBPF programs and another thing to stand in the way of users that prefer to > > implement userspace event filtering. It could be that the filter > > cannot be easily > > described by static rules to an eBPF program (e.g. need to query a database). > > > > In my POV, FAN_MARK_VOLATILE does not add any new logic/filtering rule. > > It adds resource control by stating that the ignore mark is "best effort". > > > > Does it make sense? > > OK, makes sense. So I agree the functionality is worth it. Will you post > the patches for review of technical details? Of course, I need to add a patch for UAPI change and write a test. I just wanted to get a tentative ACK before I put in more effort. I will address your comment about -EEXIST return value. Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-02-28 17:40 ` Amir Goldstein 2022-03-01 11:07 ` Jan Kara @ 2022-03-01 12:26 ` Tycho Kirchner 2022-03-01 16:58 ` Amir Goldstein 1 sibling, 1 reply; 17+ messages in thread From: Tycho Kirchner @ 2022-03-01 12:26 UTC (permalink / raw) To: Amir Goldstein, Jan Kara; +Cc: linux-fsdevel >>> I wanted to get your feedback on an idea I have been playing with. >>> It started as a poor man's alternative to the old subtree watch problem. > I do agree that we should NOT add "subtree filter" functionality to fanotify > (or any other filter) and that instead, we should add support for attaching an > eBPF program that implements is_subdir(). > I found this [1] convection with Tycho where you had suggested this idea. > I wonder if Tycho got to explore this path further? > > [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ Hi Amir, Hi Jan, Thanks for pinging back on me. Indeed I did "explore this path further". In my project https://github.com/tycho-kirchner/shournal the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user. Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing. Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?) but what do you think about the following approach, in order to solve the subtree watch problem: - Store the include/exlude-paths of interest as *strings* in a hashset. - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This can either happen with a reference on the path (clear older paths periodically in a work queue) or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused). The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient subpath-lookup. Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes. That way existing inotify-users could finally switch to the cleaner/more powerful fanotify. Thanks and kind regards Tycho ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-01 12:26 ` Tycho Kirchner @ 2022-03-01 16:58 ` Amir Goldstein 2022-03-02 10:04 ` Tycho Kirchner 0 siblings, 1 reply; 17+ messages in thread From: Amir Goldstein @ 2022-03-01 16:58 UTC (permalink / raw) To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > > > >>> I wanted to get your feedback on an idea I have been playing with. > >>> It started as a poor man's alternative to the old subtree watch problem. > > > > I do agree that we should NOT add "subtree filter" functionality to fanotify > > (or any other filter) and that instead, we should add support for attaching an > > eBPF program that implements is_subdir(). > > I found this [1] convection with Tycho where you had suggested this idea. > > I wonder if Tycho got to explore this path further? > > > > [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ > > Hi Amir, Hi Jan, > Thanks for pinging back on me. Indeed I did "explore this path further". > In my project > https://github.com/tycho-kirchner/shournal > > the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user. > > > Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing. > Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?) FAN_MARK_VOLATILE is not a solution for "include" filters. It is a solution for "exclude" filters implemented in userspace. If monitoring program gets an event and decides that its path should be excluded it may set a "volatile" exclude mark on that directory that will suppress further events from that directory for as long as the directory inode remains in inode cache. After directory inode has not been accessed for a while and evicted from inode cache the monitoring program can get an event in that directory again and then it can re-install the volatile ignore mark if it wants to. > but what do you think about the following approach, in order to solve the subtree watch problem: > - Store the include/exlude-paths of interest as *strings* in a hashset. > - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This > can either happen with a reference on the path (clear older paths periodically in a work queue) > or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused). > The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient > subpath-lookup. I would implement filtering with is_subdir() and not with d_path(), but there are advantages to either approach. In any case, I see there is BPF_FUNC_d_path, so why can't your approach be implemented using an eBPF program? > > Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes. > That way existing inotify-users could finally switch to the cleaner/more powerful fanotify. We already have partial support for unprivileged fanotify. Which features are you missing with unprivileged fanotify? and why do you think that filtering by process tree will allow those features to be enabled? A child process may well have more privileges to read directories than its parent. Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-01 16:58 ` Amir Goldstein @ 2022-03-02 10:04 ` Tycho Kirchner 2022-03-02 18:14 ` Amir Goldstein 0 siblings, 1 reply; 17+ messages in thread From: Tycho Kirchner @ 2022-03-02 10:04 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel Am 01.03.22 um 17:58 schrieb Amir Goldstein: > On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote: >> >> >> >>>>> I wanted to get your feedback on an idea I have been playing with. >>>>> It started as a poor man's alternative to the old subtree watch problem. >> >> >>> I do agree that we should NOT add "subtree filter" functionality to fanotify >>> (or any other filter) and that instead, we should add support for attaching an >>> eBPF program that implements is_subdir(). >>> I found this [1] convection with Tycho where you had suggested this idea. >>> I wonder if Tycho got to explore this path further? >>> >>> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ >> >> Hi Amir, Hi Jan, >> Thanks for pinging back on me. Indeed I did "explore this path further". >> In my project >> https://github.com/tycho-kirchner/shournal >> >> the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user. >> >> >> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing. >> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?) > > FAN_MARK_VOLATILE is not a solution for "include" filters. > It is a solution for "exclude" filters implemented in userspace. > If monitoring program gets an event and decides that its path should be excluded > it may set a "volatile" exclude mark on that directory that will > suppress further > events from that directory for as long as the directory inode remains > in inode cache. > After directory inode has not been accessed for a while and evicted > from inode cache > the monitoring program can get an event in that directory again and then it can > re-install the volatile ignore mark if it wants to. > Thanks for this explanation. Regarding few exclude-directories this sounds useful. However, if a whole directory-tree of filesystem events shall be excluded I guess the performance benefit will be rather small. A benchmark may clarify this ( I have some yet unpublished code ready, in case you are interested). If an efficient algorithm can be found I would rather vote for "include" dirs with unlimited depth. Btw. similar to the process-filter approach by unshared mount namespaces about which I wrote in our last correspondence, you may be able to exclude your .private/ directory by bind-mounting over it and otherwise only marking only those mounts of interest instead of the entire filesystem. But yeah, this is kinda messy. >> but what do you think about the following approach, in order to solve the subtree watch problem: >> - Store the include/exlude-paths of interest as *strings* in a hashset. >> - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This >> can either happen with a reference on the path (clear older paths periodically in a work queue) >> or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused). >> The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient >> subpath-lookup. > > I would implement filtering with is_subdir() and not with d_path(), > but there are > advantages to either approach. > In any case, I see there is BPF_FUNC_d_path, so why can't your approach be > implemented using an eBPF program? >It seems that bpf_d_path was introduced with v5.10 (6e22ab9da79343532cd3cde39df25e5a5478c692), however, shournal must still run on older kernels (e.g. openSUSE Leap v5.3.18). Further, as far as I remember, at least in Linux 4.19 there was quite some overhead to just install the fd into the eBPF user-space process, but I have to re-check that once that functionality is more widespread. >> >> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes. >> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify. > > We already have partial support for unprivileged fanotify. > Which features are you missing with unprivileged fanotify? > and why do you think that filtering by process tree will allow those > features to be enabled? I am missing the ability to filter for (close-)events of large directory trees in a race-free manner, so that no events are lost on newly created dirs. Even without the race, monitoring my home-directory is impossible (without privileges) as I have far more than 8192 directories (393941 as of writing (; ). Monitoring mounts solves these problems but introduces two others: First it requires privileges, second a potentially large number of events *not of interest* have to be copied to user-space (except unshared mount namespaces are used). Allowing a user to only monitor his/her own processes would make mark_mount privileges unnecessary (please correct me if I'm wrong). While still events above the directory of interest are reported, at least events from other users are filtered beforehand. > A child process may well have more privileges to read directories than > its parent. > Similar to ptrace fanotiy should then not follow suid-programs, so this case should not occur. After all I totally understand that you do not want to feature-bloat fanotify and maybe my use-case is already too far from the one casual users have. On the other hand, pid- or path-filtering is maybe basic enough and fanotify does offer the ability to filter for paths - it is just quite limited due to the mark-concept. I think it should not be necessary in order to monitor a directory tree, to touch every single directory inside beforehand. Maybe a hybrid-solution fits best here: hard-code pid-filtering as a security feature into fanotify, allow marking of mounts for the user's own processes and allow for eBPF filter rules afterwards. Thanks and kind regards Tycho ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-02 10:04 ` Tycho Kirchner @ 2022-03-02 18:14 ` Amir Goldstein 2022-03-03 9:24 ` Jan Kara 0 siblings, 1 reply; 17+ messages in thread From: Amir Goldstein @ 2022-03-02 18:14 UTC (permalink / raw) To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel On Wed, Mar 2, 2022 at 12:04 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > > > Am 01.03.22 um 17:58 schrieb Amir Goldstein: > > On Tue, Mar 1, 2022 at 2:26 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > >> > >> > >> > >>>>> I wanted to get your feedback on an idea I have been playing with. > >>>>> It started as a poor man's alternative to the old subtree watch problem. > >> > >> > >>> I do agree that we should NOT add "subtree filter" functionality to fanotify > >>> (or any other filter) and that instead, we should add support for attaching an > >>> eBPF program that implements is_subdir(). > >>> I found this [1] convection with Tycho where you had suggested this idea. > >>> I wonder if Tycho got to explore this path further? > >>> > >>> [1] https://lore.kernel.org/linux-fsdevel/20200828084603.GA7072@quack2.suse.cz/ > >> > >> Hi Amir, Hi Jan, > >> Thanks for pinging back on me. Indeed I did "explore this path further". > >> In my project > >> https://github.com/tycho-kirchner/shournal > >> > >> the goal is to track read/written files of a process tree and all it's child-processes and connect this data to a given shell-command. In fact after Amir's and mine last correspondence I implemented a kernel module which instruments ftrace and tracepoints to trace fput-events (kernel/event_handler.c:event_handler_fput) of specific tasks, which are then further processed in a dedicated kernel thread. I considered eBPF for this task but found no satisfying approach to have dynamic, different filter-rules (e.g. include-paths) for each process tree of each user. > >> > >> > >> Regarding improvement of fanotify let's discriminate two cases: system-monitoring and tracing. > >> Regarding system-monitoring: I'm not sure how exactly FAN_MARK_VOLATILE would work (Amir, could you please elaborate?) > > > > FAN_MARK_VOLATILE is not a solution for "include" filters. > > It is a solution for "exclude" filters implemented in userspace. > > If monitoring program gets an event and decides that its path should be excluded > > it may set a "volatile" exclude mark on that directory that will > > suppress further > > events from that directory for as long as the directory inode remains > > in inode cache. > > After directory inode has not been accessed for a while and evicted > > from inode cache > > the monitoring program can get an event in that directory again and then it can > > re-install the volatile ignore mark if it wants to. > > > Thanks for this explanation. Regarding few exclude-directories this sounds useful. > However, if a whole directory-tree of filesystem events shall be excluded I guess the > performance benefit will be rather small. A benchmark may clarify this > ( I have some yet unpublished code ready, in case you are interested). Code for what? > If an efficient algorithm can be found I would rather vote for "include" dirs with unlimited depth. As I said, this is desirable, difficult and completely orthogonal to FAN_MARK_VOLATILE functionality. > Btw. similar to the process-filter approach by unshared mount namespaces about which > I wrote in our last correspondence, you may be able to exclude your .private/ directory > by bind-mounting over it and otherwise only marking only those mounts of interest > instead of the entire filesystem. But yeah, this is kinda messy. > Marking bind mounts could be a good option for some use cases. But unlike your monitoring app, my app needs to track create/unlink/rename as well and those events are not currently available for mount marks. I had several attempts to tackle that but they did not work out yet. Partly, because marking a bind mount is not as useful as filtering by subtree. > >> but what do you think about the following approach, in order to solve the subtree watch problem: > >> - Store the include/exlude-paths of interest as *strings* in a hashset. > >> - on fsevent, lookup the path by calling d_path() only once and cache, whether events for the given path are of interest. This > >> can either happen with a reference on the path (clear older paths periodically in a work queue) > >> or with a timelimit in which potentially wrong paths are accepted (path pointer freed and address reused). > >> The second approach I use myself in kernel/event_consumer_cache.c. See also kpathtree.c for a somewhat efficient > >> subpath-lookup. > > > > I would implement filtering with is_subdir() and not with d_path(), > > but there are > > advantages to either approach. > > In any case, I see there is BPF_FUNC_d_path, so why can't your approach be > > implemented using an eBPF program? > >It seems that bpf_d_path was introduced with v5.10 (6e22ab9da79343532cd3cde39df25e5a5478c692), however, shournal must still run on older kernels (e.g. openSUSE Leap v5.3.18). Further, as far as I remember, at least in Linux 4.19 there was quite some overhead to just install the fd into the eBPF user-space process, but I have to re-check that once that functionality is more widespread. > There is no need to install any fd. The program should hook a function that has access to a struct path. > > >> > >> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes. > >> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify. > > > > We already have partial support for unprivileged fanotify. > > Which features are you missing with unprivileged fanotify? > > and why do you think that filtering by process tree will allow those > > features to be enabled? > > > I am missing the ability to filter for (close-)events of large directory trees in a race-free manner, so that no events are lost on newly created dirs. Even without the race, monitoring my home-directory is impossible (without privileges) as I have far more than 8192 directories (393941 as of writing (; ). > Monitoring mounts solves these problems but introduces two others: > First it requires privileges, second a potentially large number of events *not of interest* have to be copied to user-space (except unshared mount namespaces are used). Allowing a user to only monitor his/her own processes would make mark_mount privileges unnecessary (please correct me if I'm wrong). While still events above the directory of interest are reported, at least events from other users are filtered beforehand. > I don't know. Security model is hard. What do you mean by "his/her own processes"? processes owned by the same uid? With simple look it sounds right, but other security policy may be in play (e.g. sepolicy) which can grand different processes owned by same user different file access permissions and not any process may be allowed to ptrace other processes. userns has more clear semantics, so monitoring all processes/mounts inside an unprivileged userns may be easier to prove. > > A child process may well have more privileges to read directories than > > its parent. > > > Similar to ptrace fanotiy should then not follow suid-programs, so this case should not occur. > > After all I totally understand that you do not want to feature-bloat fanotify and maybe my use-case is already too far from the one casual users have. On the other hand, pid- or path-filtering is maybe basic enough and fanotify does offer the ability to filter for paths - it is just quite limited due to the mark-concept. I think it should not be necessary in order to monitor a directory tree, to touch every single directory inside beforehand. Maybe a hybrid-solution fits best here: hard-code pid-filtering as a security feature into fanotify, allow marking of mounts for the user's own processes and allow for eBPF filter rules afterwards. > This is a bit too hand waving for me to understand. In the end, it's all in the details. Need to see a whole design and/or implementation to be able to say what are its benefits and how doable it is. Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-02 18:14 ` Amir Goldstein @ 2022-03-03 9:24 ` Jan Kara 2022-05-02 9:13 ` Tycho Kirchner 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2022-03-03 9:24 UTC (permalink / raw) To: Amir Goldstein; +Cc: Tycho Kirchner, Jan Kara, linux-fsdevel On Wed 02-03-22 20:14:29, Amir Goldstein wrote: > On Wed, Mar 2, 2022 at 12:04 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > >> > > >> Regarding tracing I think fanotify would really benefit from a FAN_MARK_PID (with optional follow fork-mode). That way one of the first filter-steps would be whether events for the given task are of interest, so we have no performance problem for all other tasks. The possibility to mark specific processes would also have another substantial benefit: fanotify could be used without root privileges by only allowing the user to mark his/her own processes. > > >> That way existing inotify-users could finally switch to the cleaner/more powerful fanotify. > > > > > > We already have partial support for unprivileged fanotify. > > > Which features are you missing with unprivileged fanotify? > > > and why do you think that filtering by process tree will allow those > > > features to be enabled? > > > > > > I am missing the ability to filter for (close-)events of large > > directory trees in a race-free manner, so that no events are lost on > > newly created dirs. Even without the race, monitoring my home-directory > > is impossible (without privileges) as I have far more than 8192 > > directories (393941 as of writing (; ). Monitoring mounts solves these > > problems but introduces two others: First it requires privileges, > > second a potentially large number of events *not of interest* have to > > be copied to user-space (except unshared mount namespaces are used). > > Allowing a user to only monitor his/her own processes would make > > mark_mount privileges unnecessary (please correct me if I'm wrong). > > While still events above the directory of interest are reported, at > > least events from other users are filtered beforehand. > > I don't know. Security model is hard. > What do you mean by "his/her own processes"? processes owned by the same uid? > With simple look it sounds right, but other security policy may be in > play (e.g. sepolicy) > which can grand different processes owned by same user different file access > permissions and not any process may be allowed to ptrace other processes. > userns has more clear semantics, so monitoring all processes/mounts inside > an unprivileged userns may be easier to prove. I see two problems with limiting events to those generated by a particular process / user: 1) Fanotify is a filesystem notification system. As such it is primarily aimed at (more or less efficient) answering of a question - did something in the filesystem change, was some data from the filesystem used? If you start to limit visible events to processes / users you are no longer able to reliably answer this question. As such we would get complaints "but this is not good enough for our usecase" sooner rather than later. In filesystem change notification space we have a long history of partial solutions that then forced us into full redesign and all the pain associated with that. 2) Limiting events to those generated by a particular user may somewhat reduce the amount of generated events but for a lot of usecases that is not really significant. So the push for some middle ground between - watching a file / dir and watching whole fs will still stay. Regarding the security model for unpriviledged watches: IMO a sensible security model for fs notification could be like: "If you can read the file, you should be able to watch for changes. If you own the file, you should be able to watch for accesses." But the trouble is that for filesystem wide marks, it is not easy to verify these conditions. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-03-03 9:24 ` Jan Kara @ 2022-05-02 9:13 ` Tycho Kirchner 2022-05-04 6:13 ` Amir Goldstein 0 siblings, 1 reply; 17+ messages in thread From: Tycho Kirchner @ 2022-05-02 9:13 UTC (permalink / raw) To: Jan Kara, Amir Goldstein; +Cc: linux-fsdevel All right, I thought a bit more about that and returned to your original BPF idea you mentioned on 2020-08-28: > I was thinking that we could add a BPF hook to fanotify_handle_event() > (similar to what's happening in packet filtering code) and you could attach > BPF programs to this hook to do filtering of events. That way we don't have > to introduce new group flags for various filtering options. The question is > whether eBPF is strong enough so that filters useful for fanotify users > could be implemented with it but this particular check seems implementable. > > Honza Instead of changing fanotify's filesystem notification functionality, I suggest to rather **add a tracing mode (fantrace)**. The synchronous handling of syscalls via ptrace is of course required for debugging purposes, however that introduces a major slowdown (even with seccomp-bpf filters). There are a number of cases, including [1-3], where async processing of file events of specific tasks would be fine but is not readily available in Linux. Fanotify already ships important infrastructure in this regard: it provides very fast event-buffering and, by using file descriptors instead of resolved paths, a clean and race-free API to process the events later. However, as already stated, fanotify does not provide a clean way, to monitor only a subset of tasks. Therefore please consider the following proposed architecture of fantrace: Each taks gets its own struct fsnotify_group. Within fsnotify.c:fsnotify() it is checked if the given task has a fsnotify_group attached where events of interest are buffered as usual. Note that this is an additional hook - sysadmins being subscribed to filesystem events rather than task-filesystem-events are notified as usual - in that case two hooks possibly run. The fsnotify_group is extended by a field optionally pointing to a BPF program which allows for custom filters to be run. Some implementation details: - To let the tracee return quickly, run BPF filter program within tracer context during read(fan_fd) but before events are copied to userspace - only one fantracer per task, which overrides existing ones if any - task->fsnotify_group refcount increment on fork, decrement on exit (run after exit_files(tsk) to not miss final close events). When last task exited, send EOF to listener. - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) - lazy check when event occurs, if listener is still alive (refcount > 1) - for the beginning, to keep things simple and to "solve" the cleanup of filesystem marks, I suggest to disable i_fsnotify_marks for fantrace (only allow FAN_MARK_FILESYSTEM), as that functionality can be implemented within the user-provided BPF-program. A working implementation of this concept, which effectively does the same using hardcoded filter rules can be found in my kernel module shournalk [4]. For instance In kernel/event_handler.c:event_handler_fput() it is checked, if the task is observed using a hashtable, and if so, the event is stored to a buffer corresponding to that process tree. Thanks Tycho [1] Chirigati F, Rampin R, Shasha D, Freire J. (2016). ReproZip: Computational Reproducibility with Ease. Paper presented at the Proceedings of the 2016 International Conference on Management of Data. San Francisco, CA: New York: Association for Computing Technology. https://github.com/VIDA-NYU/reprozip [2] Guo, P. (2012). CDE: A Tool For Creating Portable Experimental Software Packages. Computing in Science & Engineering 14, 332–35 [3] Tycho Kirchner, Konstantin Riege, Steve Hoffmann (2020). Bashing irreproducibility with shournal bioRxiv 2020.08.03.232843; doi: https://doi.org/10.1101/2020.08.03.232843 [4] https://github.com/tycho-kirchner/shournal ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-05-02 9:13 ` Tycho Kirchner @ 2022-05-04 6:13 ` Amir Goldstein 2022-05-04 10:01 ` Tycho Kirchner 0 siblings, 1 reply; 17+ messages in thread From: Amir Goldstein @ 2022-05-04 6:13 UTC (permalink / raw) To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > All right, I thought a bit more about that and returned to your > original BPF idea you mentioned on 2020-08-28: > > > I was thinking that we could add a BPF hook to fanotify_handle_event() > > (similar to what's happening in packet filtering code) and you could attach > > BPF programs to this hook to do filtering of events. That way we don't have > > to introduce new group flags for various filtering options. The question is > > whether eBPF is strong enough so that filters useful for fanotify users > > could be implemented with it but this particular check seems implementable. > > > > Honza > > Instead of changing fanotify's filesystem notification functionality, > I suggest to rather **add a tracing mode (fantrace)**. > > The synchronous handling of syscalls via ptrace is of course required > for debugging purposes, however that introduces a major slowdown (even > with seccomp-bpf filters). There are a number of cases, including > [1-3], where async processing of file events of specific tasks would be > fine but is not readily available in Linux. Fanotify already ships > important infrastructure in this regard: it provides very fast > event-buffering and, by using file descriptors instead of resolved > paths, a clean and race-free API to process the events later. However, > as already stated, fanotify does not provide a clean way, to monitor > only a subset of tasks. Therefore please consider the following > proposed architecture of fantrace: > > Each taks gets its own struct fsnotify_group. Within > fsnotify.c:fsnotify() it is checked if the given task has a > fsnotify_group attached where events of interest are buffered as usual. > Note that this is an additional hook - sysadmins being subscribed to > filesystem events rather than task-filesystem-events are notified as > usual - in that case two hooks possibly run. The fsnotify_group is > extended by a field optionally pointing to a BPF program which allows > for custom filters to be run. > > Some implementation details: > - To let the tracee return quickly, run BPF filter program within tracer > context during read(fan_fd) but before events are copied to userspace > - only one fantracer per task, which overrides existing ones if any > - task->fsnotify_group refcount increment on fork, decrement on exit (run > after exit_files(tsk) to not miss final close events). When last task > exited, send EOF to listener. > - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) > - lazy check when event occurs, if listener is still alive (refcount > 1) > - for the beginning, to keep things simple and to "solve" the cleanup of > filesystem marks, I suggest to disable i_fsnotify_marks for fantrace > (only allow FAN_MARK_FILESYSTEM), as that functionality can be > implemented within the user-provided BPF-program. > Maybe I am slow, but I did not understand the need for this task fsnotify_group. What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event()) that hook is supposed to filter by pid so why all this extra complexity? We may consider the option to have another BFP hook when reading events if there is good justification, but subtree filters will have to be in handle_event(). Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-05-04 6:13 ` Amir Goldstein @ 2022-05-04 10:01 ` Tycho Kirchner 2022-05-04 14:37 ` Amir Goldstein 2022-05-05 10:42 ` Jan Kara 0 siblings, 2 replies; 17+ messages in thread From: Tycho Kirchner @ 2022-05-04 10:01 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel Am 04.05.22 um 08:13 schrieb Amir Goldstein: > On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote: >> >> All right, I thought a bit more about that and returned to your >> original BPF idea you mentioned on 2020-08-28: >> >>> I was thinking that we could add a BPF hook to fanotify_handle_event() >>> (similar to what's happening in packet filtering code) and you could attach >>> BPF programs to this hook to do filtering of events. That way we don't have >>> to introduce new group flags for various filtering options. The question is >>> whether eBPF is strong enough so that filters useful for fanotify users >>> could be implemented with it but this particular check seems implementable. >>> >>> Honza >> >> Instead of changing fanotify's filesystem notification functionality, >> I suggest to rather **add a tracing mode (fantrace)**. >> >> The synchronous handling of syscalls via ptrace is of course required >> for debugging purposes, however that introduces a major slowdown (even >> with seccomp-bpf filters). There are a number of cases, including >> [1-3], where async processing of file events of specific tasks would be >> fine but is not readily available in Linux. Fanotify already ships >> important infrastructure in this regard: it provides very fast >> event-buffering and, by using file descriptors instead of resolved >> paths, a clean and race-free API to process the events later. However, >> as already stated, fanotify does not provide a clean way, to monitor >> only a subset of tasks. Therefore please consider the following >> proposed architecture of fantrace: >> >> Each taks gets its own struct fsnotify_group. Within >> fsnotify.c:fsnotify() it is checked if the given task has a >> fsnotify_group attached where events of interest are buffered as usual. >> Note that this is an additional hook - sysadmins being subscribed to >> filesystem events rather than task-filesystem-events are notified as >> usual - in that case two hooks possibly run. The fsnotify_group is >> extended by a field optionally pointing to a BPF program which allows >> for custom filters to be run. >> >> Some implementation details: >> - To let the tracee return quickly, run BPF filter program within tracer >> context during read(fan_fd) but before events are copied to userspace >> - only one fantracer per task, which overrides existing ones if any >> - task->fsnotify_group refcount increment on fork, decrement on exit (run >> after exit_files(tsk) to not miss final close events). When last task >> exited, send EOF to listener. >> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) >> - lazy check when event occurs, if listener is still alive (refcount > 1) >> - for the beginning, to keep things simple and to "solve" the cleanup of >> filesystem marks, I suggest to disable i_fsnotify_marks for fantrace >> (only allow FAN_MARK_FILESYSTEM), as that functionality can be >> implemented within the user-provided BPF-program. >> > > Maybe I am slow, but I did not understand the need for this task fsnotify_group. > > What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event()) > that hook is supposed to filter by pid so why all this extra complexity? > > We may consider the option to have another BFP hook when reading > events if there is > good justification, but subtree filters will have to be in handle_event(). > > Thanks, > Amir. To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip) file-events from all paths have to be reported, which is difficult using i_fsnotify_marks, because - marking whole mountpoints requires privileges - marking the whole filesystem using directory marks is unfeasible However, we need a quick way to find out, if a file event is of interest (find its beloning fsnotify_group). For the purpose of tracing it appears reasonable to consider all file events of a traced task as "interesting" in the first place. So, in this way, we allow a user to trace file events of his own tasks without slowing down other, non-traced tasks. After all, it's all about the order of running filters - first inode, then pid or reverse. With my proposed architecture for the purpose of tracing I would hand the inode-filter to the user in form of an optional BPF hook. Performance-wise that's also the "fair" solution. Let's assume we allow marking the whole filesystem (via mountpoints). Now, the BPF-pid-filter code has to run for every single file event (of all users!), if multiple users trace the filesystem even multiple hooks have to run, slowing down the whole system. Thanks, Tycho ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-05-04 10:01 ` Tycho Kirchner @ 2022-05-04 14:37 ` Amir Goldstein 2022-05-06 9:59 ` Tycho Kirchner 2022-05-05 10:42 ` Jan Kara 1 sibling, 1 reply; 17+ messages in thread From: Amir Goldstein @ 2022-05-04 14:37 UTC (permalink / raw) To: Tycho Kirchner; +Cc: Jan Kara, linux-fsdevel On Wed, May 4, 2022 at 1:01 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > > > Am 04.05.22 um 08:13 schrieb Amir Goldstein: > > On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > >> > >> All right, I thought a bit more about that and returned to your > >> original BPF idea you mentioned on 2020-08-28: > >> > >>> I was thinking that we could add a BPF hook to fanotify_handle_event() > >>> (similar to what's happening in packet filtering code) and you could attach > >>> BPF programs to this hook to do filtering of events. That way we don't have > >>> to introduce new group flags for various filtering options. The question is > >>> whether eBPF is strong enough so that filters useful for fanotify users > >>> could be implemented with it but this particular check seems implementable. > >>> > >>> Honza > >> > >> Instead of changing fanotify's filesystem notification functionality, > >> I suggest to rather **add a tracing mode (fantrace)**. > >> > >> The synchronous handling of syscalls via ptrace is of course required > >> for debugging purposes, however that introduces a major slowdown (even > >> with seccomp-bpf filters). There are a number of cases, including > >> [1-3], where async processing of file events of specific tasks would be > >> fine but is not readily available in Linux. Fanotify already ships > >> important infrastructure in this regard: it provides very fast > >> event-buffering and, by using file descriptors instead of resolved > >> paths, a clean and race-free API to process the events later. However, > >> as already stated, fanotify does not provide a clean way, to monitor > >> only a subset of tasks. Therefore please consider the following > >> proposed architecture of fantrace: > >> > >> Each taks gets its own struct fsnotify_group. Within > >> fsnotify.c:fsnotify() it is checked if the given task has a > >> fsnotify_group attached where events of interest are buffered as usual. > >> Note that this is an additional hook - sysadmins being subscribed to > >> filesystem events rather than task-filesystem-events are notified as > >> usual - in that case two hooks possibly run. The fsnotify_group is > >> extended by a field optionally pointing to a BPF program which allows > >> for custom filters to be run. > >> > >> Some implementation details: > >> - To let the tracee return quickly, run BPF filter program within tracer > >> context during read(fan_fd) but before events are copied to userspace > >> - only one fantracer per task, which overrides existing ones if any > >> - task->fsnotify_group refcount increment on fork, decrement on exit (run > >> after exit_files(tsk) to not miss final close events). When last task > >> exited, send EOF to listener. > >> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) > >> - lazy check when event occurs, if listener is still alive (refcount > 1) > >> - for the beginning, to keep things simple and to "solve" the cleanup of > >> filesystem marks, I suggest to disable i_fsnotify_marks for fantrace > >> (only allow FAN_MARK_FILESYSTEM), as that functionality can be > >> implemented within the user-provided BPF-program. > >> > > > > Maybe I am slow, but I did not understand the need for this task fsnotify_group. > > > > What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event()) > > that hook is supposed to filter by pid so why all this extra complexity? > > > > We may consider the option to have another BFP hook when reading > > events if there is > > good justification, but subtree filters will have to be in handle_event(). > > > > Thanks, > > Amir. > > To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip) > file-events from all paths have to be reported, which is difficult > using i_fsnotify_marks, because > - marking whole mountpoints requires privileges > - marking the whole filesystem using directory marks is unfeasible > > However, we need a quick way to find out, if a file event is of > interest (find its beloning fsnotify_group). For the purpose of tracing > it appears reasonable to consider all file events of a traced task as > "interesting" in the first place. So, in this way, we allow a user to > trace file events of his own tasks without slowing down other, > non-traced tasks. > > After all, it's all about the order of running filters - first inode, > then pid or reverse. With my proposed architecture for the purpose of > tracing I would hand the inode-filter to the user in form of an > optional BPF hook. Performance-wise that's also the "fair" solution. > Let's assume we allow marking the whole filesystem (via mountpoints). > Now, the BPF-pid-filter code has to run for every single file event (of > all users!), if multiple users trace the filesystem even multiple hooks > have to run, slowing down the whole system. > I understand now what you were trying to do, but still not convinced that the added complexity to the kernel is worth it, because you may be able to achieve the same with userspace LD_PRELOAD hooking. BTW, the fanotify_userns patch set [1] was an attempt to allow unprivileged user watch over subtree with low performance impact on the rest of the file system. This is not exactly what you need, but maybe it could be made into what you need. Thanks, Amir. [1] https://github.com/amir73il/linux/commits/fanotify_userns ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-05-04 14:37 ` Amir Goldstein @ 2022-05-06 9:59 ` Tycho Kirchner 0 siblings, 0 replies; 17+ messages in thread From: Tycho Kirchner @ 2022-05-06 9:59 UTC (permalink / raw) To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel >>> On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote: >>>> >>>> All right, I thought a bit more about that and returned to your >>>> original BPF idea you mentioned on 2020-08-28: >>>> >>>>> I was thinking that we could add a BPF hook to fanotify_handle_event() >>>>> (similar to what's happening in packet filtering code) and you could attach >>>>> BPF programs to this hook to do filtering of events. That way we don't have >>>>> to introduce new group flags for various filtering options. The question is >>>>> whether eBPF is strong enough so that filters useful for fanotify users >>>>> could be implemented with it but this particular check seems implementable. >>>>> >>>>> Honza >>>> >>>> Instead of changing fanotify's filesystem notification functionality, >>>> I suggest to rather **add a tracing mode (fantrace)**. >>>> >>>> The synchronous handling of syscalls via ptrace is of course required >>>> for debugging purposes, however that introduces a major slowdown (even >>>> with seccomp-bpf filters). There are a number of cases, including >>>> [1-3], where async processing of file events of specific tasks would be >>>> fine but is not readily available in Linux. Fanotify already ships >>>> important infrastructure in this regard: it provides very fast >>>> event-buffering and, by using file descriptors instead of resolved >>>> paths, a clean and race-free API to process the events later. However, >>>> as already stated, fanotify does not provide a clean way, to monitor >>>> only a subset of tasks. Therefore please consider the following >>>> proposed architecture of fantrace: >>>> >>>> Each taks gets its own struct fsnotify_group. Within >>>> fsnotify.c:fsnotify() it is checked if the given task has a >>>> fsnotify_group attached where events of interest are buffered as usual. >>>> Note that this is an additional hook - sysadmins being subscribed to >>>> filesystem events rather than task-filesystem-events are notified as >>>> usual - in that case two hooks possibly run. The fsnotify_group is >>>> extended by a field optionally pointing to a BPF program which allows >>>> for custom filters to be run. >>>> >>>> Some implementation details: >>>> - To let the tracee return quickly, run BPF filter program within tracer >>>> context during read(fan_fd) but before events are copied to userspace >>>> - only one fantracer per task, which overrides existing ones if any >>>> - task->fsnotify_group refcount increment on fork, decrement on exit (run >>>> after exit_files(tsk) to not miss final close events). When last task >>>> exited, send EOF to listener. >>>> - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) >>>> - lazy check when event occurs, if listener is still alive (refcount > 1) >>>> - for the beginning, to keep things simple and to "solve" the cleanup of >>>> filesystem marks, I suggest to disable i_fsnotify_marks for fantrace >>>> (only allow FAN_MARK_FILESYSTEM), as that functionality can be >>>> implemented within the user-provided BPF-program. >>>> >>> >>> Maybe I am slow, but I did not understand the need for this task fsnotify_group. >>> >>> What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event()) >>> that hook is supposed to filter by pid so why all this extra complexity? >>> >>> We may consider the option to have another BFP hook when reading >>> events if there is >>> good justification, but subtree filters will have to be in handle_event(). >>> >>> Thanks, >>> Amir. >> >> To be a reasonable async replacement for ptrace (see e.g. mentioned reprozip) >> file-events from all paths have to be reported, which is difficult >> using i_fsnotify_marks, because >> - marking whole mountpoints requires privileges >> - marking the whole filesystem using directory marks is unfeasible >> >> However, we need a quick way to find out, if a file event is of >> interest (find its beloning fsnotify_group). For the purpose of tracing >> it appears reasonable to consider all file events of a traced task as >> "interesting" in the first place. So, in this way, we allow a user to >> trace file events of his own tasks without slowing down other, >> non-traced tasks. >> >> After all, it's all about the order of running filters - first inode, >> then pid or reverse. With my proposed architecture for the purpose of >> tracing I would hand the inode-filter to the user in form of an >> optional BPF hook. Performance-wise that's also the "fair" solution. >> Let's assume we allow marking the whole filesystem (via mountpoints). >> Now, the BPF-pid-filter code has to run for every single file event (of >> all users!), if multiple users trace the filesystem even multiple hooks >> have to run, slowing down the whole system. >> > > I understand now what you were trying to do, but still not convinced > that the added complexity to the kernel is worth it, because you may be > able to achieve the same with userspace LD_PRELOAD hooking. In fact, I use that approach in shournal's fanotify backend, as the shell process cannot join the mount namespace of the currently executing command (without re-exec). See here, for example: https://github.com/tycho-kirchner/shournal/blob/master/src/shell-integration-fanotify/libshournal-shellwatch.cpp However, while this works for the supported shells bash and zsh, the LD_PRELOAD approach in general is problematic: - Static compiled executables cannot be traced (rare, ptrace with seccomp-bpf filters might be ok for those) - killed tasks may still have fd's open, so signals also have to be hooked. SIGKILL or crashing tasks (e.g. segfault) cannot be handled at all (though this would be ok) - close- or other library-calls are potentially inlined. Some programs may even use syscalls directly (probably a rare issue, but I'm not sure) - quite some hooks required: close, fclose (glibc does not use the close wrapper), fcloseall, dup2, dup3, exit, _exit and possibly others Actually, this sounds more of a complexity. I guess I could make a proof-of-principle implementation of my proposed fantrace in a few days, but of course this makes only sense if there is a slight chance this could be accepted. > > BTW, the fanotify_userns patch set [1] was an attempt to allow unprivileged > user watch over subtree with low performance impact on the rest of the > file system. > This is not exactly what you need, but maybe it could be made > into what you need. > [1] https://github.com/amir73il/linux/commits/fanotify_userns I'll take at look, but while I'm fine in compiling a custom kernel users of my tool are not. Further, using user namespaces in the outermost shell layer (where my tool runs) is problematic for administrators or users of setuid binaries. Thanks, Tycho ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-05-04 10:01 ` Tycho Kirchner 2022-05-04 14:37 ` Amir Goldstein @ 2022-05-05 10:42 ` Jan Kara 1 sibling, 0 replies; 17+ messages in thread From: Jan Kara @ 2022-05-05 10:42 UTC (permalink / raw) To: Tycho Kirchner; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel On Wed 04-05-22 12:01:49, Tycho Kirchner wrote: > Am 04.05.22 um 08:13 schrieb Amir Goldstein: > > On Mon, May 2, 2022 at 12:13 PM Tycho Kirchner <tychokirchner@mail.de> wrote: > > > > > > All right, I thought a bit more about that and returned to your > > > original BPF idea you mentioned on 2020-08-28: > > > > > > > I was thinking that we could add a BPF hook to fanotify_handle_event() > > > > (similar to what's happening in packet filtering code) and you could attach > > > > BPF programs to this hook to do filtering of events. That way we don't have > > > > to introduce new group flags for various filtering options. The question is > > > > whether eBPF is strong enough so that filters useful for fanotify users > > > > could be implemented with it but this particular check seems implementable. > > > > > > > > Honza > > > > > > Instead of changing fanotify's filesystem notification functionality, > > > I suggest to rather **add a tracing mode (fantrace)**. > > > > > > The synchronous handling of syscalls via ptrace is of course required > > > for debugging purposes, however that introduces a major slowdown (even > > > with seccomp-bpf filters). There are a number of cases, including > > > [1-3], where async processing of file events of specific tasks would be > > > fine but is not readily available in Linux. Fanotify already ships > > > important infrastructure in this regard: it provides very fast > > > event-buffering and, by using file descriptors instead of resolved > > > paths, a clean and race-free API to process the events later. However, > > > as already stated, fanotify does not provide a clean way, to monitor > > > only a subset of tasks. Therefore please consider the following > > > proposed architecture of fantrace: > > > > > > Each taks gets its own struct fsnotify_group. Within > > > fsnotify.c:fsnotify() it is checked if the given task has a > > > fsnotify_group attached where events of interest are buffered as usual. > > > Note that this is an additional hook - sysadmins being subscribed to > > > filesystem events rather than task-filesystem-events are notified as > > > usual - in that case two hooks possibly run. The fsnotify_group is > > > extended by a field optionally pointing to a BPF program which allows > > > for custom filters to be run. > > > > > > Some implementation details: > > > - To let the tracee return quickly, run BPF filter program within tracer > > > context during read(fan_fd) but before events are copied to userspace > > > - only one fantracer per task, which overrides existing ones if any > > > - task->fsnotify_group refcount increment on fork, decrement on exit (run > > > after exit_files(tsk) to not miss final close events). When last task > > > exited, send EOF to listener. > > > - on exec of seuid-programs the fsnotify_group is cleared (like in ptrace) > > > - lazy check when event occurs, if listener is still alive (refcount > 1) > > > - for the beginning, to keep things simple and to "solve" the cleanup of > > > filesystem marks, I suggest to disable i_fsnotify_marks for fantrace > > > (only allow FAN_MARK_FILESYSTEM), as that functionality can be > > > implemented within the user-provided BPF-program. > > > > > > > Maybe I am slow, but I did not understand the need for this task fsnotify_group. > > > > What's wrong with Jan's suggestion? (add a BPF hook to fanotify_handle_event()) > > that hook is supposed to filter by pid so why all this extra complexity? > > > > We may consider the option to have another BFP hook when reading > > events if there is > > good justification, but subtree filters will have to be in handle_event(). > > > > Thanks, > > Amir. > > To be a reasonable async replacement for ptrace (see e.g. mentioned > reprozip) file-events from all paths have to be reported, which is > difficult using i_fsnotify_marks, because > - marking whole mountpoints requires privileges > - marking the whole filesystem using directory marks is unfeasible I agree with both points here but I guess the point is that fanotify simply is not meant to be "async replacement for ptrace". It is a filesystem change monitoring API. Things like Linux tracing (e.g. perf-trace(1)) could be considered as such, but they require priviledges as well. > However, we need a quick way to find out, if a file event is of > interest (find its beloning fsnotify_group). For the purpose of tracing > it appears reasonable to consider all file events of a traced task as > "interesting" in the first place. So, in this way, we allow a user to > trace file events of his own tasks without slowing down other, > non-traced tasks. So I agree that the idea of storing some "owner" information (be it user or particular pid) in fsnotify_group and notify group about events only from "owner" has some appeal because it nicely solves the question "is user priviledged enough to see this event?" But it does not quite fit in the philosophy of a filesystem monitor because fanotify filesystem changes from other "owners" will not generate notification although you can see them when looking at the filesystem. I have not completely made up my mind how much that matters ;) > After all, it's all about the order of running filters - first inode, > then pid or reverse. With my proposed architecture for the purpose of > tracing I would hand the inode-filter to the user in form of an > optional BPF hook. Performance-wise that's also the "fair" solution. > Let's assume we allow marking the whole filesystem (via mountpoints). > Now, the BPF-pid-filter code has to run for every single file event (of > all users!), if multiple users trace the filesystem even multiple hooks > have to run, slowing down the whole system. I understand what you'd like to achieve but things don't work as you imagine it. Fsnotify has hooks in paths modifying filesystem. When the hook gets called with the modified object (file, dir, ..) we need to quickly tell whether the event may be of interest or not - for that we consult the inode, parent directory, mountpoint, superblock. At this point we don't know we should limit anything to a particular task because we do not know yet which notification groups are interested in the event. Only if we find someone is interested, we walk lists of notification groups, identify which notification group is exactly interested and at that point we could know this is your special group limited to a particular task and filter by task. What you are suggesting could be also implemented by adding another object type to watch - a task. But honestly it does not seem like a good idea to expand fanotify in such direction because fanotify is all about filesystem objects so adding so vastly different objects such as originating task does not fit well with the API. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Volatile fanotify marks 2022-02-28 14:05 ` Jan Kara 2022-02-28 17:40 ` Amir Goldstein @ 2022-05-13 15:30 ` Matthew Bobrowski 1 sibling, 0 replies; 17+ messages in thread From: Matthew Bobrowski @ 2022-05-13 15:30 UTC (permalink / raw) To: Jan Kara; +Cc: Amir Goldstein, linux-fsdevel On Mon, Feb 28, 2022 at 03:05:56PM +0100, Jan Kara wrote: > Hi Amir! > > On Wed 23-02-22 20:42:37, Amir Goldstein wrote: > > I wanted to get your feedback on an idea I have been playing with. > > It started as a poor man's alternative to the old subtree watch problem. > > For my employer's use case, we are watching the entire filesystem using > > a filesystem mark, but would like to exclude events on a subtree > > (i.e. all files underneath .private/). > > > > At the moment, those events are filtered in userspace. > > I had considered adding directory marks with an ignored mask on every > > event that is received for a directory path under .private/, but that has the > > undesired side effect of pinning those directory inodes to cache. > > > > I have this old fsnotify-volatile branch [1] that I am using for an overlayfs > > kernel internal fsnotify backend. I wonder what are your thoughts on > > exposing this functionally to fanotify UAPI (i.e. FAN_MARK_VOLATILE). > > Interesting idea. I have some reservations wrt to the implementation (e.g. > fsnotify_add_mark_list() convention of returning EEXIST when it updated > mark's mask, or the fact that inode reclaim should now handle freeing of > mark connector and attached marks - which may get interesting locking wise) > but they are all fixable. > > I'm wondering a bit whether this is really useful enough (and consequently > whether we will not get another request to extend fanotify API in some > other way to cater better to some other usecase related to subtree watches > in the near future). I understand ignore marks are mainly a performance > optimization and as such allowing inodes to be reclaimed (which means they > are not used much and hence ignored mark is not very useful anyway) makes > sense. Thinking about this more, I guess it is useful to improve efficiency > when you want to implement any userspace event-filtering scheme. > > The only remaining pending question I have is whether we should not go > further and allow event filtering to happen using an eBPF program. That > would be even more efficient (both in terms of memory and CPU). What do you > think? Wait. Did I just read that Jan is open to implementing in kernel event filtering through eBPF? This feature is something that I'd definitely be interested in and perhaps also open to running with doing the design/implementation. One of the really obvious filtering semantics that immediately comes to mind for me would be to filter on expected processes/binary images touching an expected set of files. Currently, this level of filtering has traditionally been done in userspace, but if we could offload that it'd be nice... /M ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2022-05-13 15:30 UTC | newest] Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-02-23 18:42 [RFC] Volatile fanotify marks Amir Goldstein 2022-02-28 14:05 ` Jan Kara 2022-02-28 17:40 ` Amir Goldstein 2022-03-01 11:07 ` Jan Kara 2022-03-01 11:27 ` Amir Goldstein 2022-03-01 12:26 ` Tycho Kirchner 2022-03-01 16:58 ` Amir Goldstein 2022-03-02 10:04 ` Tycho Kirchner 2022-03-02 18:14 ` Amir Goldstein 2022-03-03 9:24 ` Jan Kara 2022-05-02 9:13 ` Tycho Kirchner 2022-05-04 6:13 ` Amir Goldstein 2022-05-04 10:01 ` Tycho Kirchner 2022-05-04 14:37 ` Amir Goldstein 2022-05-06 9:59 ` Tycho Kirchner 2022-05-05 10:42 ` Jan Kara 2022-05-13 15:30 ` Matthew Bobrowski
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.