linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fanotify feature request FAN_MARK_PID
@ 2020-08-17 16:08 Tycho Kirchner
  2020-08-17 17:02 ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Tycho Kirchner @ 2020-08-17 16:08 UTC (permalink / raw)
  To: amir73il, mbobrowski, linux-fsdevel

Dear Amir Goldstein,

Dear Matthew Bobrowski,

Dear developers of the kernel filesystem,

First of all, thanks for your effort in improving Linux, especially your 
work regarding fanotify, which I heavily use in one of my projects:

https://github.com/tycho-kirchner/shournal

For a more scientfic introduction please take a look at
Bashing irreproducibility with shournal
https://doi.org/10.1101/2020.08.03.232843

I wanted to kindly ask you, whether it is possible for you to add 
another feature to fanotify, that is reporting only events of a PID or 
any of its children.
This would be very useful, because especially in the world of 
bioinformatics there is a huge need to automatically and efficiently 
track file events on the shell, that is, you enter a command on the 
shell (bash) and then track, which file events were modified by the 
shell or any of its child-processes.
Right now this is realized in shournal by joining a mount namespace 
which is unique for each entered command and listening to file events of 
these mountpoints using fanotify.
This works great so far in most cases, but joining another mount 
namespace is actually something I would like to avoid, because

i.
Some applications (gdb and possibly others) do not play well in 
controlling applications across mount namespaces (see also 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=940563)

ii.
Joining the mount-namespace has performance-implications, because a 
setuid-binary, which joins the mount-namespace, must be called 
beforehand. Further, care must be taken to preserve the environment (env).

iii.
setuid-binaries always impose a security-risk.


I imagine e.g. the following syscalls:

1.
Use fanotify_mark to restrict the fanotify notification group to a 
specific PID, optionally marking forked children as well.
fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD, 
pid, NULL);
// FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.

2.
Use fanotify_mark to remove a PID from the notification group.
fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);

3.
When reading from a fan_fd, which is marked for PID's which have all 
ended or were removed, return e.g. ENOENT.


Independent of that it would be also useful, to be able to track 
applications, which unshare their mount namespace as well (e.g. 
flatpak). So in case a process, whose mount points are observed, 
unshares, the new mount id's should also be added to the same fanotify 
notification group. To preserve backwards compatibility I suggest 
introducing a new flag FAN_MARK_MOUNT_REC:
fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT | 
FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);


Thanks in Advance
Kind Regards
Tycho Kirchner




^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fanotify feature request FAN_MARK_PID
  2020-08-17 16:08 fanotify feature request FAN_MARK_PID Tycho Kirchner
@ 2020-08-17 17:02 ` Amir Goldstein
  2020-08-22 22:47   ` Tycho Kirchner
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-08-17 17:02 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Aug 17, 2020 at 7:08 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>
> Dear Amir Goldstein,
>

Hi Tycho,


> Dear Matthew Bobrowski,
>
> Dear developers of the kernel filesystem,
>
> First of all, thanks for your effort in improving Linux, especially your
> work regarding fanotify, which I heavily use in one of my projects:
>
> https://github.com/tycho-kirchner/shournal
>

Nice project!

> For a more scientfic introduction please take a look at
> Bashing irreproducibility with shournal
> https://doi.org/10.1101/2020.08.03.232843
>
> I wanted to kindly ask you, whether it is possible for you to add
> another feature to fanotify, that is reporting only events of a PID or
> any of its children.
> This would be very useful, because especially in the world of
> bioinformatics there is a huge need to automatically and efficiently
> track file events on the shell, that is, you enter a command on the
> shell (bash) and then track, which file events were modified by the
> shell or any of its child-processes.

I am not sure if fanotify is the right tool for the job.
fanotify is a *system* monitoring tool and its functionality is very limited.
If you want to watch what file operations a process and its children are doing,
you can use more powerful tracing tools like strace, seccomp, and eBPF.
For starters, did you look at bcc tools, for example:
https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py

[...]

> I imagine e.g. the following syscalls:
>
> 1.
> Use fanotify_mark to restrict the fanotify notification group to a
> specific PID, optionally marking forked children as well.
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
> pid, NULL);
> // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
>

Technically, it is quite easy to filter out events generated by
processes outside
pid namespace (which would report pid 0), but I doubt if the use case you
presented justifies that. Maybe there are other use cases...

> 2.
> Use fanotify_mark to remove a PID from the notification group.
> fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
>
> 3.
> When reading from a fan_fd, which is marked for PID's which have all
> ended or were removed, return e.g. ENOENT.
>
>
> Independent of that it would be also useful, to be able to track
> applications, which unshare their mount namespace as well (e.g.
> flatpak). So in case a process, whose mount points are observed,
> unshares, the new mount id's should also be added to the same fanotify
> notification group. To preserve backwards compatibility I suggest
> introducing a new flag FAN_MARK_MOUNT_REC:
> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
> FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
>

The inherited mark concept sounds useful.
I also thought of a likewise flag for directories.
The question is if and how you clean all the inherited marks when program
removes the original mark. It's an API question. Not a trivial one IMO.

The thing is, with FAN_MARK_FILESYSTEM (v5.1), you can sort of implement
what you want in userspace with the opposite approach:
1. Watch events on filesystem regardless of which mount
2. When getting an event with an open fd, resolve the mount
3. If you are NOT interested in that mount add a FAN_MARK_IGNORED
    mask on that mount
4. Soon, you will be left with only the events you care about
5. When mount is unshared, you will get the events generated on that mount

But that will only work if the unshared mount is visible in the mount namespace
of the listener, so it is not a complete solution, but maybe it works for some
of your use cases.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fanotify feature request FAN_MARK_PID
  2020-08-17 17:02 ` Amir Goldstein
@ 2020-08-22 22:47   ` Tycho Kirchner
  2020-08-23 13:04     ` Amir Goldstein
  0 siblings, 1 reply; 5+ messages in thread
From: Tycho Kirchner @ 2020-08-22 22:47 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Matthew Bobrowski, linux-fsdevel

Hi Amir, and Thanks for the quick response!

> strace, seccomp, and eBPF


Also thanks for these tips. However:

strace
Is a performance killer. As shournal tracks everyday work on the shell 
and also runs e.g. during expensive genomic analysis, ptrace-based 
approaches are not acceptable here.

seccomp and eBPF
Thanks - I took a deeper look at BPF and the PID-filtering is very nice, 
following child-processes/forks looks also doable; maybe it's better to 
implement it with cgroups(?)..

However, using the current fanotify-approach, to recognize files later, 
consuming the FAN_CLOSE_WRITE-event they are xxHashed (partially, based 
on size) using the passed file-descriptor, which is very nice, because 
renaming/etc does no harm (resolving a path and opening the fd later 
introduces a race condition).
Thus, with BPF, one might try to trace fs/file.c:__close_fd.
Calculating the hash in kernel-mode would be ideal but reading bytes 
from files is not allowed in BPF-programs.
As far as I can tell, BPF also does not support sending the fd to the 
user-space-process (like fanotify does).
The last acceptable resort would have been to resolve the path (within 
BPF) using the fdtable from files_struct *files, but this is not allowed 
within a BPF-program, because it might produce a page fault (see [1] - 
kernel-patch with bpf_fd2path is available, but not in mainline).
Resoling the path in userspace with the known pid and fd-number using 
/proc/$pid/fd/$fdnum is possible but the process might be gone already.

Any further help is appreciated.

Thanks,
Tycho


[1]: https://github.com/iovisor/bcc/issues/2538#issuecomment-541393483



Am 17.08.20 um 19:02 schrieb Amir Goldstein:
> On Mon, Aug 17, 2020 at 7:08 PM Tycho Kirchner <tychokirchner@mail.de> wrote:
>>
>> Dear Amir Goldstein,
>>
> 
> Hi Tycho,
> 
> 
>> Dear Matthew Bobrowski,
>>
>> Dear developers of the kernel filesystem,
>>
>> First of all, thanks for your effort in improving Linux, especially your
>> work regarding fanotify, which I heavily use in one of my projects:
>>
>> https://github.com/tycho-kirchner/shournal
>>
> 
> Nice project!
> 
>> For a more scientfic introduction please take a look at
>> Bashing irreproducibility with shournal
>> https://doi.org/10.1101/2020.08.03.232843
>>
>> I wanted to kindly ask you, whether it is possible for you to add
>> another feature to fanotify, that is reporting only events of a PID or
>> any of its children.
>> This would be very useful, because especially in the world of
>> bioinformatics there is a huge need to automatically and efficiently
>> track file events on the shell, that is, you enter a command on the
>> shell (bash) and then track, which file events were modified by the
>> shell or any of its child-processes.
> 
> I am not sure if fanotify is the right tool for the job.
> fanotify is a *system* monitoring tool and its functionality is very limited.
> If you want to watch what file operations a process and its children are doing,
> you can use more powerful tracing tools like strace, seccomp, and eBPF.
> For starters, did you look at bcc tools, for example:
> https://github.com/iovisor/bcc/blob/master/tools/opensnoop.py
> 
> [...]
> 
>> I imagine e.g. the following syscalls:
>>
>> 1.
>> Use fanotify_mark to restrict the fanotify notification group to a
>> specific PID, optionally marking forked children as well.
>> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_PID, FAN_EVENT_ON_CHILD,
>> pid, NULL);
>> // FAN_EVENT_ON_CHILD -> additional meaning: also forked child processes.
>>
> 
> Technically, it is quite easy to filter out events generated by
> processes outside
> pid namespace (which would report pid 0), but I doubt if the use case you
> presented justifies that. Maybe there are other use cases...
> 
>> 2.
>> Use fanotify_mark to remove a PID from the notification group.
>> fanotify_mark(fan_fd, FAN_MARK_REMOVE | FAN_MARK_PID, 0, pid, NULL);
>>
>> 3.
>> When reading from a fan_fd, which is marked for PID's which have all
>> ended or were removed, return e.g. ENOENT.
>>
>>
>> Independent of that it would be also useful, to be able to track
>> applications, which unshare their mount namespace as well (e.g.
>> flatpak). So in case a process, whose mount points are observed,
>> unshares, the new mount id's should also be added to the same fanotify
>> notification group. To preserve backwards compatibility I suggest
>> introducing a new flag FAN_MARK_MOUNT_REC:
>> fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT |
>> FAN_MARK_MOUNT_REC, mask, AT_FDCWD, path);
>>
> 
> The inherited mark concept sounds useful.
> I also thought of a likewise flag for directories.
> The question is if and how you clean all the inherited marks when program
> removes the original mark. It's an API question. Not a trivial one IMO.
> 
> The thing is, with FAN_MARK_FILESYSTEM (v5.1), you can sort of implement
> what you want in userspace with the opposite approach:
> 1. Watch events on filesystem regardless of which mount
> 2. When getting an event with an open fd, resolve the mount
> 3. If you are NOT interested in that mount add a FAN_MARK_IGNORED
>      mask on that mount
> 4. Soon, you will be left with only the events you care about
> 5. When mount is unshared, you will get the events generated on that mount
> 
> But that will only work if the unshared mount is visible in the mount namespace
> of the listener, so it is not a complete solution, but maybe it works for some
> of your use cases.
> 
> Thanks,
> Amir.
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fanotify feature request FAN_MARK_PID
  2020-08-22 22:47   ` Tycho Kirchner
@ 2020-08-23 13:04     ` Amir Goldstein
  2020-08-28  8:46       ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Amir Goldstein @ 2020-08-23 13:04 UTC (permalink / raw)
  To: Tycho Kirchner; +Cc: Matthew Bobrowski, linux-fsdevel

> Any further help is appreciated.
>

A patch along those line (fill in the missing pieces) looks useful to me.
It could serve a use case where applications are using fanotify filesystem
mark, but developer would like to limit those application's scope inside
"system containers".

Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
FAN_FILTER_PID_NS effectively means that kernel will drop events
that are expected to report pid 0.
FAN_FILTER_MOUNT_NS would mean that kernel will drop events that
are expected to report an fd, whose /proc/<pid>/fd/<fd> symlink cannot
be resolved (it shows "/") because the file's mount is outside the scope
of the listener's mount namespace.

The burden of proof that this will be useful is still on you ;-)

Thanks,
Amir.

--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -685,6 +685,11 @@ static int fanotify_handle_event(struct
fsnotify_group *group, u32 mask,

        pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);

+       /* Interested only in events from group's pid ns */
+       if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
+           !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
+               return 0;
+
        if (fanotify_is_perm_event(mask)) {
                /*
                 * fsnotify_prepare_user_wait() fails if we race with mark

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: fanotify feature request FAN_MARK_PID
  2020-08-23 13:04     ` Amir Goldstein
@ 2020-08-28  8:46       ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2020-08-28  8:46 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Tycho Kirchner, Matthew Bobrowski, linux-fsdevel

On Sun 23-08-20 16:04:39, Amir Goldstein wrote:
> > Any further help is appreciated.
> >
> 
> A patch along those line (fill in the missing pieces) looks useful to me.
> It could serve a use case where applications are using fanotify filesystem
> mark, but developer would like to limit those application's scope inside
> "system containers".
> 
> Perhaps an even more useful API would be FAN_FILTER_MOUNT_NS.
> FAN_FILTER_PID_NS effectively means that kernel will drop events
> that are expected to report pid 0.
> FAN_FILTER_MOUNT_NS would mean that kernel will drop events that
> are expected to report an fd, whose /proc/<pid>/fd/<fd> symlink cannot
> be resolved (it shows "/") because the file's mount is outside the scope
> of the listener's mount namespace.
> 
> The burden of proof that this will be useful is still on you ;-)

I was thinking that we could add a BPF hook to fanotify_handle_event()
(similar to what's happening in packet filtering code) and you could attach
BPF programs to this hook to do filtering of events. That way we don't have
to introduce new group flags for various filtering options. The question is
whether eBPF is strong enough so that filters useful for fanotify users
could be implemented with it but this particular check seems implementable.

								Honza

> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -685,6 +685,11 @@ static int fanotify_handle_event(struct
> fsnotify_group *group, u32 mask,
> 
>         pr_debug("%s: group=%p mask=%x\n", __func__, group, mask);
> 
> +       /* Interested only in events from group's pid ns */
> +       if (FAN_GROUP_FLAG(group, FAN_FILTER_PID_NS) &&
> +           !pid_nr_ns(task_pid(current), group->fanotify_data.pid_ns))
> +               return 0;
> +
>         if (fanotify_is_perm_event(mask)) {
>                 /*
>                  * fsnotify_prepare_user_wait() fails if we race with mark
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-08-28  8:46 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-17 16:08 fanotify feature request FAN_MARK_PID Tycho Kirchner
2020-08-17 17:02 ` Amir Goldstein
2020-08-22 22:47   ` Tycho Kirchner
2020-08-23 13:04     ` Amir Goldstein
2020-08-28  8:46       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).