* fanotify feature request FAN_MARK_PID
@ 2020-08-17 16:08 Tycho Kirchner
2020-08-17 17:02 ` Amir Goldstein
0 siblings, 1 reply; 5+ messages in thread
From: Tycho Kirchner @ 2020-08-17 16:08 UTC (permalink / raw)
To: amir73il, mbobrowski, linux-fsdevel
Dear Amir Goldstein,
Dear Matthew Bobrowski,
Dear developers of the kernel filesystem,
First of all, thanks for your effort in improving Linux, especially your
work regarding fanotify, which I heavily use in one of my projects:
https://github.com/tycho-kirchner/shournal
For a more scientfic introduction please take a look at
Bashing irreproducibility with shournal
https://doi.org/10.1101/2020.08.03.232843
I wanted to kindly ask you, whether it is possible for you to add
another feature to fanotify, that is reporting only events of a PID or
any of its children.
This would be very useful, because especially in the world of
bioinformatics there is a huge need to automatically and efficiently
track file events on the shell, that is, you enter a command on the
shell (bash) and then track, which file events were modified by the
shell or any of its child-processes.
Right now this is realized in shournal by joining a mount namespace
which is unique for each entered command and listening to file events of
these mountpoints using fanotify.
This works great so far in most cases, but joining another mount
namespace is actually something I would like to avoid, because
i.
Some applications (gdb and possibly others) do not play well in
controlling applications across mount namespaces (see also
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940563)
ii.
Joining the mount-namespace has performance-implications, because a
setuid-binary, which joins the mount-namespace, must be called
beforehand. Further, care must be taken to preserve the environment (env).
iii.
setuid-binaries always impose a security-risk.
I imagine e.g. the following syscalls:
1.
Use fanotify_mark to restrict the fanotify notification group to a
specific PID, optionally marking forked children as well.
fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
pid, NULL);
// FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
2.
Use fanotify_mark to remove a PID from the notification group.
fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
3.
When reading from a fan_fd, which is marked for PID's which have all
ended or were removed, return e.g. ENOENT.
Independent of that it would be also useful, to be able to track
applications, which unshare their mount namespace as well (e.g.
flatpak). So in case a process, whose mount points are observed,
unshares, the new mount id's should also be added to the same fanotify
notification group. To preserve backwards compatibility I suggest
introducing a new flag FAN_MARK_MOUNT_REC:
fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
Thanks in Advance
Kind Regards
Tycho Kirchner
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: fanotify feature request FAN_MARK_PID
2020-08-17 16:08 fanotify feature request FAN_MARK_PID Tycho Kirchner
@ 2020-08-17 17:02 ` Amir Goldstein
2020-08-22 22:47 ` Tycho Kirchner
0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-08-17 17:02 UTC (permalink / raw)
To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel
On Mon, Aug 17, 2020 at 7:08 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
> Dear Amir Goldstein,
>
Hi Tycho,
> Dear Matthew Bobrowski,
>
> Dear developers of the kernel filesystem,
>
> First of all, thanks for your effort in improving Linux, especially your
> work regarding fanotify, which I heavily use in one of my projects:
>
> https://github.com/tycho-kirchner/shournal
>
Nice project!
> For a more scientfic introduction please take a look at
> Bashing irreproducibility with shournal
> https://doi.org/10.1101/2020.08.03.232843
>
> I wanted to kindly ask you, whether it is possible for you to add
> another feature to fanotify, that is reporting only events of a PID or
> any of its children.
> This would be very useful, because especially in the world of
> bioinformatics there is a huge need to automatically and efficiently
> track file events on the shell, that is, you enter a command on the
> shell (bash) and then track, which file events were modified by the
> shell or any of its child-processes.
I am not sure if fanotify is the right tool for the job.
fanotify is a *system* monitoring tool and its functionality is very limited.
If you want to watch what file operations a process and its children are doing,
you can use more powerful tracing tools like strace, seccomp, and eBPF.
For starters, did you look at bcc tools, for example:
https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py
[...]
> I imagine e.g. the following syscalls:
>
> 1.
> Use fanotify_mark to restrict the fanotify notification group to a
> specific PID, optionally marking forked children as well.
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
> pid, NULL);
> // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
>
Technically, it is quite easy to filter out events generated by
processes outside
pid namespace (which would report pid 0), but I doubt if the use case you
presented justifies that. Maybe there are other use cases...
> 2.
> Use fanotify_mark to remove a PID from the notification group.
> fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
>
> 3.
> When reading from a fan_fd, which is marked for PID's which have all
> ended or were removed, return e.g. ENOENT.
>
>
> Independent of that it would be also useful, to be able to track
> applications, which unshare their mount namespace as well (e.g.
> flatpak). So in case a process, whose mount points are observed,
> unshares, the new mount id's should also be added to the same fanotify
> notification group. To preserve backwards compatibility I suggest
> introducing a new flag FAN_MARK_MOUNT_REC:
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
> FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
>
The inherited mark concept sounds useful.
I also thought of a likewise flag for directories.
The question is if and how you clean all the inherited marks when program
removes the original mark. It's an API question. Not a trivial one IMO.
The thing is, with FAN_MARK_FILESYSTEM (v5.1), you can sort of implement
what you want in userspace with the opposite approach:
1. Watch events on filesystem regardless of which mount
2. When getting an event with an open fd, resolve the mount
3. If you are NOT interested in that mount add a FAN_MARK_IGNORED
mask on that mount
4. Soon, you will be left with only the events you care about
5. When mount is unshared, you will get the events generated on that mount
But that will only work if the unshared mount is visible in the mount namespace
of the listener, so it is not a complete solution, but maybe it works for some
of your use cases.
Thanks,
Amir.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: fanotify feature request FAN_MARK_PID
2020-08-17 17:02 ` Amir Goldstein
@ 2020-08-22 22:47 ` Tycho Kirchner
2020-08-23 13:04 ` Amir Goldstein
0 siblings, 1 reply; 5+ messages in thread
From: Tycho Kirchner @ 2020-08-22 22:47 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Matthew Bobrowski, linux-fsdevel
Hi Amir, and Thanks for the quick response!
> strace, seccomp, and eBPF
Also thanks for these tips. However:
strace
Is a performance killer. As shournal tracks everyday work on the shell
and also runs e.g. during expensive genomic analysis, ptrace-based
approaches are not acceptable here.
seccomp and eBPF
Thanks - I took a deeper look at BPF and the PID-filtering is very nice,
following child-processes/forks looks also doable; maybe it's better to
implement it with cgroups(?)..
However, using the current fanotify-approach, to recognize files later,
consuming the FAN_CLOSE_WRITE-event they are xxHashed (partially, based
on size) using the passed file-descriptor, which is very nice, because
renaming/etc does no harm (resolving a path and opening the fd later
introduces a race condition).
Thus, with BPF, one might try to trace fs/file.c:__close_fd.
Calculating the hash in kernel-mode would be ideal but reading bytes
from files is not allowed in BPF-programs.
As far as I can tell, BPF also does not support sending the fd to the
user-space-process (like fanotify does).
The last acceptable resort would have been to resolve the path (within
BPF) using the fdtable from files_struct *files, but this is not allowed
within a BPF-program, because it might produce a page fault (see [1] -
kernel-patch with bpf_fd2path is available, but not in mainline).
Resoling the path in userspace with the known pid and fd-number using
/proc/$pid/fd/$fdnum is possible but the process might be gone already.
Any further help is appreciated.
Thanks,
Tycho
[1]: https://github.com/iovisor/bcc/issues/2538#issuecomment-541393483
Am 17.08.20 um 19:02 schrieb Amir Goldstein:
> On Mon, Aug 17, 2020 at 7:08 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>
>> Dear Amir Goldstein,
>>
>
> Hi Tycho,
>
>
>> Dear Matthew Bobrowski,
>>
>> Dear developers of the kernel filesystem,
>>
>> First of all, thanks for your effort in improving Linux, especially your
>> work regarding fanotify, which I heavily use in one of my projects:
>>
>> https://github.com/tycho-kirchner/shournal
>>
>
> Nice project!
>
>> For a more scientfic introduction please take a look at
>> Bashing irreproducibility with shournal
>> https://doi.org/10.1101/2020.08.03.232843
>>
>> I wanted to kindly ask you, whether it is possible for you to add
>> another feature to fanotify, that is reporting only events of a PID or
>> any of its children.
>> This would be very useful, because especially in the world of
>> bioinformatics there is a huge need to automatically and efficiently
>> track file events on the shell, that is, you enter a command on the
>> shell (bash) and then track, which file events were modified by the
>> shell or any of its child-processes.
>
> I am not sure if fanotify is the right tool for the job.
> fanotify is a *system* monitoring tool and its functionality is very limited.
> If you want to watch what file operations a process and its children are doing,
> you can use more powerful tracing tools like strace, seccomp, and eBPF.
> For starters, did you look at bcc tools, for example:
> https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py
>
> [...]
>
>> I imagine e.g. the following syscalls:
>>
>> 1.
>> Use fanotify_mark to restrict the fanotify notification group to a
>> specific PID, optionally marking forked children as well.
>> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
>> pid, NULL);
>> // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
>>
>
> Technically, it is quite easy to filter out events generated by
> processes outside
> pid namespace (which would report pid 0), but I doubt if the use case you
> presented justifies that. Maybe there are other use cases...
>
>> 2.
>> Use fanotify_mark to remove a PID from the notification group.
>> fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
>>
>> 3.
>> When reading from a fan_fd, which is marked for PID's which have all
>> ended or were removed, return e.g. ENOENT.
>>
>>
>> Independent of that it would be also useful, to be able to track
>> applications, which unshare their mount namespace as well (e.g.
>> flatpak). So in case a process, whose mount points are observed,
>> unshares, the new mount id's should also be added to the same fanotify
>> notification group. To preserve backwards compatibility I suggest
>> introducing a new flag FAN_MARK_MOUNT_REC:
>> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
>> FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
>>
>
> The inherited mark concept sounds useful.
> I also thought of a likewise flag for directories.
> The question is if and how you clean all the inherited marks when program
> removes the original mark. It's an API question. Not a trivial one IMO.
>
> The thing is, with FAN_MARK_FILESYSTEM (v5.1), you can sort of implement
> what you want in userspace with the opposite approach:
> 1. Watch events on filesystem regardless of which mount
> 2. When getting an event with an open fd, resolve the mount
> 3. If you are NOT interested in that mount add a FAN_MARK_IGNORED
> mask on that mount
> 4. Soon, you will be left with only the events you care about
> 5. When mount is unshared, you will get the events generated on that mount
>
> But that will only work if the unshared mount is visible in the mount namespace
> of the listener, so it is not a complete solution, but maybe it works for some
> of your use cases.
>
> Thanks,
> Amir.
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: fanotify feature request FAN_MARK_PID
2020-08-22 22:47 ` Tycho Kirchner
@ 2020-08-23 13:04 ` Amir Goldstein
2020-08-28 8:46 ` Jan Kara
0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-08-23 13:04 UTC (permalink / raw)
To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel
> Any further help is appreciated.
>
A patch along those line (fill in the missing pieces) looks useful to me.
It could serve a use case where applications are using fanotify filesystem
mark, but developer would like to limit those application's scope inside
"system containers".
Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
FAN_FILTER_PID_NS effectively means that kernel will drop events
that are expected to report pid 0.
FAN_FILTER_MOUNT_NS would mean that kernel will drop events that
are expected to report an fd, whose /proc/<pid>/fd/<fd> symlink cannot
be resolved (it shows "/") because the file's mount is outside the scope
of the listener's mount namespace.
The burden of proof that this will be useful is still on you ;-)
Thanks,
Amir.
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -685,6 +685,11 @@ static int fanotify_handle_event(struct
fsnotify_group *group, u32 mask,
pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);
+ /* Interested only in events from group's pid ns */
+ if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
+ !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
+ return 0;
+
if (fanotify_is_perm_event(mask)) {
/*
* fsnotify_prepare_user_wait() fails if we race with mark
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: fanotify feature request FAN_MARK_PID
2020-08-23 13:04 ` Amir Goldstein
@ 2020-08-28 8:46 ` Jan Kara
0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2020-08-28 8:46 UTC (permalink / raw)
To: Amir Goldstein; +Cc: Tycho Kirchner, Matthew Bobrowski, linux-fsdevel
On Sun 23-08-20 16:04:39, Amir Goldstein wrote:
> > Any further help is appreciated.
> >
>
> A patch along those line (fill in the missing pieces) looks useful to me.
> It could serve a use case where applications are using fanotify filesystem
> mark, but developer would like to limit those application's scope inside
> "system containers".
>
> Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
> FAN_FILTER_PID_NS effectively means that kernel will drop events
> that are expected to report pid 0.
> FAN_FILTER_MOUNT_NS would mean that kernel will drop events that
> are expected to report an fd, whose /proc/<pid>/fd/<fd> symlink cannot
> be resolved (it shows "/") because the file's mount is outside the scope
> of the listener's mount namespace.
>
> The burden of proof that this will be useful is still on you ;-)
I was thinking that we could add a BPF hook to fanotify_handle_event()
(similar to what's happening in packet filtering code) and you could attach
BPF programs to this hook to do filtering of events. That way we don't have
to introduce new group flags for various filtering options. The question is
whether eBPF is strong enough so that filters useful for fanotify users
could be implemented with it but this particular check seems implementable.
Honza
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -685,6 +685,11 @@ static int fanotify_handle_event(struct
> fsnotify_group *group, u32 mask,
>
> pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);
>
> + /* Interested only in events from group's pid ns */
> + if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
> + !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
> + return 0;
> +
> if (fanotify_is_perm_event(mask)) {
> /*
> * fsnotify_prepare_user_wait() fails if we race with mark
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2020-08-28 8:46 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 16:08 fanotify feature request FAN_MARK_PID Tycho Kirchner
2020-08-17 17:02 ` Amir Goldstein
2020-08-22 22:47 ` Tycho Kirchner
2020-08-23 13:04 ` Amir Goldstein
2020-08-28 8:46 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).