* fanotify feature request FAN_MARK_PID
From: Tycho Kirchner @ 2020-08-17 16:08 UTC
To: amir73il, mbobrowski, linux-fsdevel

Dear Amir Goldstein,
Dear Matthew Bobrowski,
Dear developers of the kernel filesystem,

First of all, thanks for your effort in improving Linux, especially your
work regarding fanotify, which I heavily use in one of my projects:

https://github.com/tycho-kirchner/shournal

For a more scientific introduction please take a look at
Bashing irreproducibility with shournal
https://doi.org/10.1101/2020.08.03.232843

I wanted to kindly ask whether it is possible for you to add
another feature to fanotify, namely reporting only the events of a PID or
any of its children.
This would be very useful, because especially in the world of
bioinformatics there is a huge need to automatically and efficiently
track file events on the shell: you enter a command on the
shell (bash) and then track which files were modified by the
shell or any of its child processes.

Right now this is realized in shournal by joining a mount namespace
which is unique for each entered command and listening to file events of
these mountpoints using fanotify. This works great so far in most cases,
but joining another mount namespace is actually something I would like
to avoid, because

i. Some applications (gdb and possibly others) do not play well in
controlling applications across mount namespaces (see also
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940563)
ii. Joining the mount namespace has performance implications, because a
setuid binary, which joins the mount namespace, must be called
beforehand. Further, care must be taken to preserve the environment (env).
iii. setuid binaries always impose a security risk.

I imagine e.g. the following syscalls:

1.
Use fanotify_mark to restrict the fanotify notification group to a
specific PID, optionally marking forked children as well.
    fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
                  pid, NULL);
    // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.

2.
Use fanotify_mark to remove a PID from the notification group.
    fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);

3.
When reading from a fan_fd which is marked for PIDs that have all
ended or were removed, return e.g. ENOENT.


Independent of that, it would also be useful to be able to track
applications which unshare their mount namespace as well (e.g.
flatpak). So in case a process whose mount points are observed
unshares, the new mount IDs should also be added to the same fanotify
notification group. To preserve backwards compatibility I suggest
introducing a new flag FAN_MARK_MOUNT_REC:
    fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
                  FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);

Thanks in advance

Kind regards
Tycho Kirchner
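
For comparison with the request above, a PID restriction can be approximated in userspace today by discarding events whose reporting PID is neither the tracked process nor one of its descendants. The following is only a sketch under stated assumptions: it is not shournal's code and not a proposed kernel API, the helper names `parent_of`, `is_self_or_descendant` and `event_is_tracked` are hypothetical, and it assumes /proc is mounted. Walking /proc after the fact is racy (a short-lived child may already have exited when the event is read), which is one drawback of filtering outside the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: parent PID of `pid` as read from /proc/<pid>/stat,
 * or -1 if the process is gone or /proc is unavailable. */
static pid_t parent_of(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Format: "pid (comm) state ppid ...". comm may itself contain ')',
     * so parsing starts after the last closing parenthesis. */
    char *p = strrchr(buf, ')');
    char state;
    int ppid;
    if (!p || sscanf(p + 1, " %c %d", &state, &ppid) != 2)
        return -1;
    return (pid_t)ppid;
}

/* 1 if `pid` equals `root` or has `root` among its ancestors, else 0.
 * The depth limit guards against odd cycles caused by PID reuse. */
int is_self_or_descendant(pid_t pid, pid_t root)
{
    assert(root > 0);
    for (int depth = 0; pid > 0 && depth < 1024; depth++) {
        if (pid == root)
            return 1;
        pid = parent_of(pid);
    }
    return 0;
}

/* Userspace stand-in for the requested filter: keep an event only if it
 * was generated by the tracked process tree. */
int event_is_tracked(const struct fanotify_event_metadata *ev, pid_t root)
{
    return is_self_or_descendant((pid_t)ev->pid, root);
}
```

A listener would call `event_is_tracked()` on each `struct fanotify_event_metadata` it reads and close the event fd without further processing when the result is 0.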
* Re: fanotify feature request FAN_MARK_PID
From: Amir Goldstein @ 2020-08-17 17:02 UTC
To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Aug 17, 2020 at 7:08 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
> Dear Amir Goldstein,
>

Hi Tycho,

> Dear Matthew Bobrowski,
>
> Dear developers of the kernel filesystem,
>
> First of all, thanks for your effort in improving Linux, especially your
> work regarding fanotify, which I heavily use in one of my projects:
>
> https://github.com/tycho-kirchner/shournal
>

Nice project!

> For a more scientific introduction please take a look at
> Bashing irreproducibility with shournal
> https://doi.org/10.1101/2020.08.03.232843
>
> I wanted to kindly ask whether it is possible for you to add
> another feature to fanotify, namely reporting only the events of a PID or
> any of its children.
> This would be very useful, because especially in the world of
> bioinformatics there is a huge need to automatically and efficiently
> track file events on the shell: you enter a command on the
> shell (bash) and then track which files were modified by the
> shell or any of its child processes.

I am not sure if fanotify is the right tool for the job.
fanotify is a *system* monitoring tool and its functionality is very limited.
If you want to watch what file operations a process and its children are doing,
you can use more powerful tracing tools like strace, seccomp, and eBPF.
For starters, did you look at bcc tools, for example:
https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py

[...]

> I imagine e.g. the following syscalls:
>
> 1.
> Use fanotify_mark to restrict the fanotify notification group to a
> specific PID, optionally marking forked children as well.
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
> pid, NULL);
> // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
>

Technically, it is quite easy to filter out events generated by
processes outside the pid namespace (which would report pid 0), but I
doubt that the use case you presented justifies that.
Maybe there are other use cases...

> 2.
> Use fanotify_mark to remove a PID from the notification group.
> fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
>
> 3.
> When reading from a fan_fd which is marked for PIDs that have all
> ended or were removed, return e.g. ENOENT.
>
>
> Independent of that, it would also be useful to be able to track
> applications which unshare their mount namespace as well (e.g.
> flatpak). So in case a process whose mount points are observed
> unshares, the new mount IDs should also be added to the same fanotify
> notification group. To preserve backwards compatibility I suggest
> introducing a new flag FAN_MARK_MOUNT_REC:
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
> FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
>

The inherited mark concept sounds useful.
I also thought of a similar flag for directories.
The question is if and how you clean up all the inherited marks when the
program removes the original mark. It's an API question. Not a trivial one IMO.

The thing is, with FAN_MARK_FILESYSTEM (v5.1), you can sort of implement
what you want in userspace with the opposite approach:
1. Watch events on the filesystem regardless of which mount
2. When getting an event with an open fd, resolve the mount
3. If you are NOT interested in that mount, add a FAN_MARK_IGNORED
    mask on that mount
4. Soon, you will be left with only the events you care about
5. When a mount is unshared, you will get the events generated on that mount

But that will only work if the unshared mount is visible in the mount namespace
of the listener, so it is not a complete solution, but maybe it works for some
of your use cases.

Thanks,
Amir.
* Re: fanotify feature request FAN_MARK_PID
From: Tycho Kirchner @ 2020-08-22 22:47 UTC
To: Amir Goldstein; +Cc: Matthew Bobrowski, linux-fsdevel

Hi Amir,
and thanks for the quick response!

> strace, seccomp, and eBPF

Also thanks for these tips. However:

strace
Is a performance killer. As shournal tracks everyday work on the shell
and also runs e.g. during expensive genomic analysis, ptrace-based
approaches are not acceptable here.

seccomp and eBPF
Thanks - I took a deeper look at BPF and the PID filtering is very nice;
following child processes/forks also looks doable; maybe it's better to
implement it with cgroups(?).
However, with the current fanotify approach, files are xxHashed
(partially, based on size) when the FAN_CLOSE_WRITE event is consumed, so
that they can be recognized later. The hash is computed via the passed
file descriptor, which is very nice, because renaming etc. does no harm
(resolving a path and opening the fd later introduces a race condition).
Thus, with BPF, one might try to trace fs/file.c:__close_fd. Calculating
the hash in kernel mode would be ideal, but reading bytes from files is
not allowed in BPF programs. As far as I can tell, BPF also does not
support sending the fd to the userspace process (like fanotify does).
The last acceptable resort would have been to resolve the path (within
BPF) using the fdtable from files_struct *files, but this is not allowed
within a BPF program, because it might produce a page fault (see [1] -
a kernel patch with bpf_fd2path is available, but not in mainline).
Resolving the path in userspace with the known pid and fd number using
/proc/$pid/fd/$fdnum is possible, but the process might be gone already.

Any further help is appreciated.

Thanks,
Tycho

[1]: https://github.com/iovisor/bcc/issues/2538#issuecomment-541393483
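
The fd-based hashing described in the message above can be sketched as follows. This is not shournal's algorithm: FNV-1a is used purely as a dependency-free stand-in for xxHash, and the sampling scheme (at most `nchunks` chunks spread evenly over the file) is an assumption made for illustration, as are the names `fnv1a_update` and `partial_hash_fd`. The point it shows is that the hash only needs the descriptor handed over with the event, so no path has to be resolved or reopened.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define FNV_OFFSET_BASIS 14695981039346656037ULL
#define FNV_PRIME        1099511628211ULL

/* 64-bit FNV-1a over `len` bytes, continuing from state `h`. */
uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * FNV_PRIME;
    return h;
}

/* Hash at most `nchunks` chunks of up to `chunk` bytes, spread evenly
 * over the file behind `fd`. pread() leaves the fd's offset untouched. */
uint64_t partial_hash_fd(int fd, size_t chunk, unsigned nchunks)
{
    unsigned char buf[4096];
    struct stat st;

    assert(fd >= 0);
    if (chunk == 0 || chunk > sizeof buf)
        chunk = sizeof buf;
    if (nchunks == 0)
        nchunks = 1;
    if (fstat(fd, &st) != 0)
        return FNV_OFFSET_BASIS;

    /* Distance between chunk starts, rounded up so that the loop below
     * runs at most `nchunks` times. */
    off_t step = (st.st_size + (off_t)nchunks - 1) / (off_t)nchunks;
    if (step < (off_t)chunk)
        step = (off_t)chunk;      /* small file: read it contiguously */

    uint64_t h = FNV_OFFSET_BASIS;
    for (off_t off = 0; off < st.st_size; off += step) {
        ssize_t n = pread(fd, buf, chunk, off);
        if (n <= 0)
            break;                /* unreadable or unseekable fd */
        h = fnv1a_update(h, buf, (size_t)n);
    }
    return h;
}
```

A fanotify listener would call `partial_hash_fd(ev->fd, ...)` on the descriptor delivered with a FAN_CLOSE_WRITE event and then close that descriptor.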
* Re: fanotify feature request FAN_MARK_PID
From: Amir Goldstein @ 2020-08-23 13:04 UTC
To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel

> Any further help is appreciated.
>

A patch along those lines (fill in the missing pieces) looks useful to me.
It could serve a use case where applications are using a fanotify filesystem
mark, but the developer would like to limit the scope of those applications
to "system containers".

Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
FAN_FILTER_PID_NS effectively means that the kernel will drop events
that are expected to report pid 0.
FAN_FILTER_MOUNT_NS would mean that the kernel will drop events that
are expected to report an fd whose /proc/<pid>/fd/<fd> symlink cannot
be resolved (it shows "/") because the file's mount is outside the scope
of the listener's mount namespace.

The burden of proof that this will be useful is still on you ;-)

Thanks,
Amir.

--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -685,6 +685,11 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
 
 	pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);
 
+	/* Interested only in events from group's pid ns */
+	if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
+	    !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
+		return 0;
+
 	if (fanotify_is_perm_event(mask)) {
 		/*
 		 * fsnotify_prepare_user_wait() fails if we race with mark
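
The two filter semantics discussed in the message above can be mimicked by a listener in userspace, which may help when judging whether in-kernel flags are worth adding. The sketch below rests on the statements in the mail (events from outside the pid namespace report pid 0; an fd whose mount is not visible shows "/" under /proc); the function names are hypothetical and /proc is assumed to be mounted. The "/" check is only a heuristic, since an fd that is legitimately open on the root directory reads the same way.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/fanotify.h>
#include <sys/types.h>
#include <unistd.h>

/* Userspace counterpart of the FAN_FILTER_PID_NS idea: 1 if the event
 * reports pid 0, i.e. its originator is outside the listener's pid ns. */
int event_outside_pid_ns(const struct fanotify_event_metadata *ev)
{
    assert(ev != NULL);
    return ev->pid == 0;
}

/* Userspace counterpart of the FAN_FILTER_MOUNT_NS idea: 1 if the
 * /proc/self/fd/<fd> symlink reads "/", 0 if it reads anything else,
 * -1 if the link cannot be read at all. */
int fd_outside_mount_ns(int fd)
{
    char link[64], target[4096];
    snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link, target, sizeof target - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return strcmp(target, "/") == 0;
}
```

Unlike the kernel-side patch, this still pays the cost of queueing and reading every event before it is dropped.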
* Re: fanotify feature request FAN_MARK_PID
From: Jan Kara @ 2020-08-28 8:46 UTC
To: Amir Goldstein; +Cc: Tycho Kirchner, Matthew Bobrowski, linux-fsdevel

On Sun 23-08-20 16:04:39, Amir Goldstein wrote:
> > Any further help is appreciated.
> >
>
> A patch along those lines (fill in the missing pieces) looks useful to me.
> It could serve a use case where applications are using a fanotify filesystem
> mark, but the developer would like to limit the scope of those applications
> to "system containers".
>
> Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
> FAN_FILTER_PID_NS effectively means that the kernel will drop events
> that are expected to report pid 0.
> FAN_FILTER_MOUNT_NS would mean that the kernel will drop events that
> are expected to report an fd whose /proc/<pid>/fd/<fd> symlink cannot
> be resolved (it shows "/") because the file's mount is outside the scope
> of the listener's mount namespace.
>
> The burden of proof that this will be useful is still on you ;-)

I was thinking that we could add a BPF hook to fanotify_handle_event()
(similar to what's happening in packet filtering code) and you could attach
BPF programs to this hook to do filtering of events. That way we don't have
to introduce new group flags for various filtering options. The question is
whether eBPF is strong enough so that filters useful for fanotify users
could be implemented with it, but this particular check seems implementable.

								Honza

> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -685,6 +685,11 @@ static int fanotify_handle_event(struct fsnotify_group *group, u32 mask,
>
>         pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);
>
> +       /* Interested only in events from group's pid ns */
> +       if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
> +           !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
> +               return 0;
> +
>         if (fanotify_is_perm_event(mask)) {
>                 /*
>                  * fsnotify_prepare_user_wait() fails if we race with mark

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
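
The design point in the reply above (attachable filter programs instead of one group flag per filtering option) can be illustrated with a plain userspace analogy. The sketch below is neither BPF nor a kernel interface, and no such hook is defined in this thread; the type `fan_filter_fn`, the struct `fan_filter` and all function names are hypothetical. It only shows how the pid-namespace check from the patch becomes one predicate among many once filters are pluggable.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/fanotify.h>

/* Hypothetical predicate type: return nonzero to keep the event. */
typedef int (*fan_filter_fn)(const struct fanotify_event_metadata *ev,
                             void *ctx);

struct fan_filter {
    fan_filter_fn fn;   /* predicate to run */
    void *ctx;          /* opaque per-filter argument */
};

/* Keep an event only if every attached filter accepts it; an empty
 * chain keeps everything. */
int run_filters(const struct fan_filter *filters, size_t n,
                const struct fanotify_event_metadata *ev)
{
    assert(ev != NULL);
    for (size_t i = 0; i < n; i++)
        if (!filters[i].fn(ev, filters[i].ctx))
            return 0;
    return 1;
}

/* The pid-namespace check from the patch, expressed as a predicate:
 * events from outside the namespace report pid 0 and are dropped. */
int keep_same_pid_ns(const struct fanotify_event_metadata *ev, void *ctx)
{
    (void)ctx;
    return ev->pid != 0;
}

/* Keep only events whose mask intersects the mask that `ctx` points to. */
int keep_mask(const struct fanotify_event_metadata *ev, void *ctx)
{
    return (ev->mask & *(const unsigned long long *)ctx) != 0;
}
```

Adding a new filtering option in this scheme means writing another predicate and appending it to the chain, without touching the dispatch function, which mirrors the argument against new group flags.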