linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* fsnotify_mark_srcu wtf?
@ 2016-11-02 22:09 Miklos Szeredi
  2016-11-03 10:25 ` Amir Goldstein
  2016-11-05 21:34 ` Jan Kara
  0 siblings, 2 replies; 19+ messages in thread
From: Miklos Szeredi @ 2016-11-02 22:09 UTC (permalink / raw)
  To: Jan Kara, Eric Paris; +Cc: linux-fsdevel, linux-kernel

We've got a report where a fanotify daemon that implements permission checks
screws up and doesn't send a reply.  This then causes widespread hangs due to
fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
called from e.g. inotify_release()-> fsnotify_destroy_group()->
fsnotify_mark_destroy_list() to block.

Below program demonstrates the issue.  It should output a single line:

close(inotify_fd): success

Instead it outputs nothing, which means that close(inotify_fd) got blocked by
the waiting permission event.

Wouldn't making the srcu per-group fix this?  Would that be too expensive?

Thanks,
Miklos
---

#include <err.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/fanotify.h>
#include <sys/inotify.h>

int main(void)
{
	int fanotify_fd, inotify_fd;
	int ret, pid1, pid2;
	const char *testfile = "t";

	ret = unlink(testfile);
	if (ret == -1 && errno != ENOENT)
		err(1, "unlink()");

	ret = mknod(testfile, S_IFREG | 0666, 0);
	if (ret == -1)
		err(1, "mknod()");

	fanotify_fd = fanotify_init(FAN_CLASS_PRE_CONTENT, O_RDONLY);
	if (fanotify_fd == -1)
		err(1, "fanotify_init()");

	ret = fanotify_mark(fanotify_fd, FAN_MARK_ADD, FAN_OPEN_PERM, AT_FDCWD, testfile);
	if (ret == -1)
		err(1, "fanotify_mark()");

	pid1 = fork();
	if (pid1 == -1)
		err(1, "fork()");

	if (pid1 == 0) {
		close(fanotify_fd);
		ret = open(testfile, O_RDONLY);
		if (ret == -1)
			err(1, "open()");
		fprintf(stderr, "something went wrong: open succeeded\n");
		exit(1);
	}
	sleep(1);

	pid2 = fork();
	if (pid2 == -1)
		err(1, "fork()");

	if (pid2 == 0) {
		close(fanotify_fd);
		inotify_fd = inotify_init();
		if (inotify_fd == -1)
			err(1, "inotify_init()");
		close(inotify_fd);
		fprintf(stderr, "close(inotify_fd): success\n");
		exit(0);
	}

	sleep(1);
	kill(pid1, SIGKILL);
	kill(pid2, SIGKILL);

	return 0;
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-02 22:09 fsnotify_mark_srcu wtf? Miklos Szeredi
@ 2016-11-03 10:25 ` Amir Goldstein
  2016-11-05 21:43   ` Jan Kara
  2016-11-05 21:34 ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2016-11-03 10:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jan Kara, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 3, 2016 at 12:09 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> We've got a report where a fanotify daemon that implements permission checks
> screws up and doesn't send a reply.  This then causes widespread hangs due to
> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> fsnotify_mark_destroy_list() to block.
>
> Below program demonstrates the issue.  It should output a single line:
>
> close(inotify_fd): success
>
> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> the waiting permission event.
>
> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
>

IIUC, the purpose of fsnotify_mark_srcu is to protect the inode and vfsmount
mark lists, which are used to iterate the groups to send events to.
So you cannot make it per group as far as I can tell.

OTOH, specifically, for fanotify, once you can passed
fanotify_should_send_event(),
there is no need to keep a reference to the mark, so it may be safely destroyed,
you only need a reference to the group.

I tested this patch on top of my fanotify super block watch development branch
and it seems to fix the problem you reported, but I am not savvy enough with
srcu to say that this is correct.
Jan, what do you think?

>From 28da34cdf9bf71fe9bbac13ded11a19da3b7a48e Mon Sep 17 00:00:00 2001
From: Amir Goldstein <amir73il@gmail.com>
Date: Thu, 3 Nov 2016 11:55:46 +0200
Subject: [PATCH] fanotify: handle permission events without holding
 fsnotify_mark_srcu

Handling fanotify events does not entail dereferencing fsnotify_mask
beyond the point of fanotify_should_send_event().

For the case of permission events, which may block indefinetely,
return -EAGAIN and then fsnotify() will call handle_event() again
without a reference to the mark.

Without a reference to the mark, there is no need to call
handle_event() under fsnotify_mark_srcu read side lock
which is protecting the inode/vfsmount mark lists.
We just need to hold a reference to the group

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 14 +++++++++++---
 fs/notify/fsnotify.c          | 26 ++++++++++++++++++++++----
 2 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 604e307..0ffa9ed 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -298,9 +298,17 @@ static int fanotify_handle_event(struct
fsnotify_group *group,
        BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
        BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);

-       if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask, data
-                                       data_type))
-               return 0;
+       if (inode_mark || vfsmnt_mark) {
+               if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
+                                               data, data_type))
+                       return 0;
+
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+               /* Ask to be called again without a reference to mark */
+               if (mask & FAN_ALL_PERM_EVENTS)
+                       return -EAGAIN;
+#endif
+       }

        pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
                 mask);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 34bccbe..dc0c86e 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -199,7 +199,7 @@ int fsnotify(struct inode *to_tell, __u32 mask,
void *data, int data_is,
 {
        struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
        struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
-       struct fsnotify_group *inode_group, *vfsmount_group;
+       struct fsnotify_group *inode_group, *vfsmount_group, *group;
        struct mount *mnt;
        int idx, ret = 0;
        /* global tests shouldn't care about events on child only the
specific event */
@@ -282,15 +282,33 @@ int fsnotify(struct inode *to_tell, __u32 mask,
void *data, int data_is,
                ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
                                    data, data_is, cookie, file_name);

-               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
-                       goto out;
-
                if (inode_group)
                        inode_node = srcu_dereference(inode_node->next,
                                                      &fsnotify_mark_srcu);
                if (vfsmount_group)
                        vfsmount_node = srcu_dereference(vfsmount_node->next,
                                                         &fsnotify_mark_srcu);
+
+               if (ret != -EAGAIN)
+                       continue;
+
+               group = inode_group ? : vfsmount_group;
+               fsnotify_get_group(group);
+               srcu_read_unlock(&fsnotify_mark_srcu, idx);
+               /*
+                * Calling handle_event() again without reference to mark,
+                * so we do not need to call it under fsnotify_mark_srcu
+                * which is protecting the inode/vfsmount mark lists.
+                * We just need to hold a reference to the group
+                */
+               return group->ops->handle_event(group, to_tell, NULL, NULL,
+                                               mask, data, data_is,
+                                               file_name, cookie);
+               idx = srcu_read_lock(&fsnotify_mark_srcu);
+               fsnotify_put_group(group);
+
+               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
+                       goto out;
        }
        ret = 0;
 out:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-02 22:09 fsnotify_mark_srcu wtf? Miklos Szeredi
  2016-11-03 10:25 ` Amir Goldstein
@ 2016-11-05 21:34 ` Jan Kara
  2016-11-06  6:45   ` Amir Goldstein
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2016-11-05 21:34 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jan Kara, Eric Paris, linux-fsdevel, linux-kernel

On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
> We've got a report where a fanotify daemon that implements permission checks
> screws up and doesn't send a reply.  This then causes widespread hangs due to
> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> fsnotify_mark_destroy_list() to block.

Yes. But if a program implementing permission checks does not reply, your
system is likely hosed anyway. We can only try to somewhat limit the
damage...

> Below program demonstrates the issue.  It should output a single line:
> 
> close(inotify_fd): success
> 
> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> the waiting permission event.
> 
> Wouldn't making the srcu per-group fix this?  Would that be too expensive?

Per-group would be IMHO too expensive. You can have lots of groups and I'm
not sure srcu would scale to that. Furthermore the SRCU protects the list
of groups that need to get notification so it would not even be easily
possible. Also Amir's solution is buggy - I'll comment on that as a reply
to his patch. I'll try to find something to improve the situation but so
far I have no good idea...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-03 10:25 ` Amir Goldstein
@ 2016-11-05 21:43   ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-11-05 21:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Jan Kara, Eric Paris, linux-fsdevel, linux-kernel

On Thu 03-11-16 12:25:13, Amir Goldstein wrote:
> On Thu, Nov 3, 2016 at 12:09 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> > We've got a report where a fanotify daemon that implements permission checks
> > screws up and doesn't send a reply.  This then causes widespread hangs due to
> > fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> > called from e.g. inotify_release()-> fsnotify_destroy_group()->
> > fsnotify_mark_destroy_list() to block.
> >
> > Below program demonstrates the issue.  It should output a single line:
> >
> > close(inotify_fd): success
> >
> > Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> > the waiting permission event.
> >
> > Wouldn't making the srcu per-group fix this?  Would that be too expensive?
> >
> 
> IIUC, the purpose of fsnotify_mark_srcu is to protect the inode and vfsmount
> mark lists, which are used to iterate the groups to send events to.
> So you cannot make it per group as far as I can tell.
> 
> OTOH, specifically, for fanotify, once you can passed
> fanotify_should_send_event(),
> there is no need to keep a reference to the mark, so it may be safely destroyed,
> you only need a reference to the group.
> 
> I tested this patch on top of my fanotify super block watch development branch
> and it seems to fix the problem you reported, but I am not savvy enough with
> srcu to say that this is correct.
> Jan, what do you think?

So the way you've written it is definitely buggy. The inode_node and
vfsmount_node pointers may become invalid once you drop SRCU lock and so
you cannot dereference them even after regaining that lock. Also frankly
your solution looks a bit ugly. I'll think whether we can somehow fix the
problem...

								Honza

> 
> From 28da34cdf9bf71fe9bbac13ded11a19da3b7a48e Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Thu, 3 Nov 2016 11:55:46 +0200
> Subject: [PATCH] fanotify: handle permission events without holding
>  fsnotify_mark_srcu
> 
> Handling fanotify events does not entail dereferencing fsnotify_mask
> beyond the point of fanotify_should_send_event().
> 
> For the case of permission events, which may block indefinetely,
> return -EAGAIN and then fsnotify() will call handle_event() again
> without a reference to the mark.
> 
> Without a reference to the mark, there is no need to call
> handle_event() under fsnotify_mark_srcu read side lock
> which is protecting the inode/vfsmount mark lists.
> We just need to hold a reference to the group
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.c | 14 +++++++++++---
>  fs/notify/fsnotify.c          | 26 ++++++++++++++++++++++----
>  2 files changed, 33 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 604e307..0ffa9ed 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -298,9 +298,17 @@ static int fanotify_handle_event(struct
> fsnotify_group *group,
>         BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
>         BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
> 
> -       if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask, data
> -                                       data_type))
> -               return 0;
> +       if (inode_mark || vfsmnt_mark) {
> +               if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
> +                                               data, data_type))
> +                       return 0;
> +
> +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +               /* Ask to be called again without a reference to mark */
> +               if (mask & FAN_ALL_PERM_EVENTS)
> +                       return -EAGAIN;
> +#endif
> +       }
> 
>         pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
>                  mask);
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 34bccbe..dc0c86e 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -199,7 +199,7 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
>  {
>         struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
>         struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
> -       struct fsnotify_group *inode_group, *vfsmount_group;
> +       struct fsnotify_group *inode_group, *vfsmount_group, *group;
>         struct mount *mnt;
>         int idx, ret = 0;
>         /* global tests shouldn't care about events on child only the
> specific event */
> @@ -282,15 +282,33 @@ int fsnotify(struct inode *to_tell, __u32 mask,
> void *data, int data_is,
>                 ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
>                                     data, data_is, cookie, file_name);
> 
> -               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> -                       goto out;
> -
>                 if (inode_group)
>                         inode_node = srcu_dereference(inode_node->next,
>                                                       &fsnotify_mark_srcu);
>                 if (vfsmount_group)
>                         vfsmount_node = srcu_dereference(vfsmount_node->next,
>                                                          &fsnotify_mark_srcu);
> +
> +               if (ret != -EAGAIN)
> +                       continue;
> +
> +               group = inode_group ? : vfsmount_group;
> +               fsnotify_get_group(group);
> +               srcu_read_unlock(&fsnotify_mark_srcu, idx);
> +               /*
> +                * Calling handle_event() again without reference to mark,
> +                * so we do not need to call it under fsnotify_mark_srcu
> +                * which is protecting the inode/vfsmount mark lists.
> +                * We just need to hold a reference to the group
> +                */
> +               return group->ops->handle_event(group, to_tell, NULL, NULL,
> +                                               mask, data, data_is,
> +                                               file_name, cookie);
> +               idx = srcu_read_lock(&fsnotify_mark_srcu);
> +               fsnotify_put_group(group);
> +
> +               if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
> +                       goto out;
>         }
>         ret = 0;
>  out:
> -- 
> 2.7.4
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-05 21:34 ` Jan Kara
@ 2016-11-06  6:45   ` Amir Goldstein
  2016-11-09 11:10     ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2016-11-06  6:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
>> We've got a report where a fanotify daemon that implements permission checks
>> screws up and doesn't send a reply.  This then causes widespread hangs due to
>> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
>> called from e.g. inotify_release()-> fsnotify_destroy_group()->
>> fsnotify_mark_destroy_list() to block.
>
> Yes. But if a program implementing permission checks does not reply, your
> system is likely hosed anyway. We can only try to somewhat limit the
> damage...
>

That was my initial thought as well, but at least with the sample code
Miklos sent
the only thing that gets hosed is the one process watching that one file.
You could think of a use case of fanotify being used to watch over files
in a specific user directory, where the damage on the entire system
should/could be limited. No?

>> Below program demonstrates the issue.  It should output a single line:
>>
>> close(inotify_fd): success
>>
>> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
>> the waiting permission event.
>>
>> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
>
> Per-group would be IMHO too expensive. You can have lots of groups and I'm
> not sure srcu would scale to that. Furthermore the SRCU protects the list
> of groups that need to get notification so it would not even be easily
> possible. Also Amir's solution is buggy - I'll comment on that as a reply
> to his patch. I'll try to find something to improve the situation but so
> far I have no good idea...
>

Yes, very much buggy indeed :/
Anyway, the reason I drafted it quickly was to highlight the fact that the
marks only need to live to the point of decision whether or not the event
should be sent to the group and afterwards, its sufficient to grab the
group reference, without having impact on the entire system.

Yet another possible ugly (but less buggy) solution would be
to iterate all marks under SRCU read protection.
If any group is about to block (either by suggested return value
EAGAIN or another
by using a new op should_handle_event_deferred), defer event handling to post
marks iteration, by keeping a few group references on stack.

But hopefully, you'll find some less ugly solution.

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-06  6:45   ` Amir Goldstein
@ 2016-11-09 11:10     ` Jan Kara
  2016-11-09 18:26       ` Amir Goldstein
  0 siblings, 1 reply; 19+ messages in thread
From: Jan Kara @ 2016-11-09 11:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
> >> We've got a report where a fanotify daemon that implements permission checks
> >> screws up and doesn't send a reply.  This then causes widespread hangs due to
> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> >> fsnotify_mark_destroy_list() to block.
> >
> > Yes. But if a program implementing permission checks does not reply, your
> > system is likely hosed anyway. We can only try to somewhat limit the
> > damage...
> >
> 
> That was my initial thought as well, but at least with the sample code
> Miklos sent
> the only thing that gets hosed is the one process watching that one file.
> You could think of a use case of fanotify being used to watch over files
> in a specific user directory, where the damage on the entire system
> should/could be limited. No?

Yes, the damage could at least theoretically be limited only to those files
/ dirs watched by the buggy process.

> >> Below program demonstrates the issue.  It should output a single line:
> >>
> >> close(inotify_fd): success
> >>
> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> >> the waiting permission event.
> >>
> >> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
> >
> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
> > not sure srcu would scale to that. Furthermore the SRCU protects the list
> > of groups that need to get notification so it would not even be easily
> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
> > to his patch. I'll try to find something to improve the situation but so
> > far I have no good idea...
> >
> 
> Yes, very much buggy indeed :/
> Anyway, the reason I drafted it quickly was to highlight the fact that the
> marks only need to live to the point of decision whether or not the event
> should be sent to the group and afterwards, its sufficient to grab the
> group reference, without having impact on the entire system.

Yes, fanotify code as such does not need the marks anymore. But the core
fsnotify code does...

> Yet another possible ugly (but less buggy) solution would be
> to iterate all marks under SRCU read protection.
> If any group is about to block (either by suggested return value
> EAGAIN or another
> by using a new op should_handle_event_deferred), defer event handling to post
> marks iteration, by keeping a few group references on stack.

And this does not work as well... Fanotify must notify groups by their
priority so you cannot arbitrarily reorder ordering in which groups get
notified. I'm currently pondering on using mark refcount to pin it when
processing permission event but there are still some details to check.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-09 11:10     ` Jan Kara
@ 2016-11-09 18:26       ` Amir Goldstein
  2016-11-10 19:46         ` Jan Kara
  0 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2016-11-09 18:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
> On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
>> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
>> >> We've got a report where a fanotify daemon that implements permission checks
>> >> screws up and doesn't send a reply.  This then causes widespread hangs due to
>> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
>> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
>> >> fsnotify_mark_destroy_list() to block.
>> >
>> > Yes. But if a program implementing permission checks does not reply, your
>> > system is likely hosed anyway. We can only try to somewhat limit the
>> > damage...
>> >
>>
>> That was my initial thought as well, but at least with the sample code
>> Miklos sent
>> the only thing that gets hosed is the one process watching that one file.
>> You could think of a use case of fanotify being used to watch over files
>> in a specific user directory, where the damage on the entire system
>> should/could be limited. No?
>
> Yes, the damage could at least theoretically be limited only to those files
> / dirs watched by the buggy process.
>
>> >> Below program demonstrates the issue.  It should output a single line:
>> >>
>> >> close(inotify_fd): success
>> >>
>> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
>> >> the waiting permission event.
>> >>
>> >> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
>> >
>> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
>> > not sure srcu would scale to that. Furthermore the SRCU protects the list
>> > of groups that need to get notification so it would not even be easily
>> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
>> > to his patch. I'll try to find something to improve the situation but so
>> > far I have no good idea...
>> >
>>
>> Yes, very much buggy indeed :/
>> Anyway, the reason I drafted it quickly was to highlight the fact that the
>> marks only need to live to the point of decision whether or not the event
>> should be sent to the group and afterwards, its sufficient to grab the
>> group reference, without having impact on the entire system.
>
> Yes, fanotify code as such does not need the marks anymore. But the core
> fsnotify code does...
>
>> Yet another possible ugly (but less buggy) solution would be
>> to iterate all marks under SRCU read protection.
>> If any group is about to block (either by suggested return value
>> EAGAIN or another
>> by using a new op should_handle_event_deferred), defer event handling to post
>> marks iteration, by keeping a few group references on stack.
>
> And this does not work as well... Fanotify must notify groups by their
> priority so you cannot arbitrarily reorder ordering in which groups get
> notified. I'm currently pondering on using mark refcount to pin it when
> processing permission event but there are still some details to check.
>

All right, mark refcount sound like the proper solution.

In case this approach doesn't work out for some reason, you may want
to consider fsnotify_mark_srcu (and destroy_list) per priority.
Or just 2 srcu, one for for priority 0 and one for other.
Because priority > 0 may block and priority 0 may not block.

The priority set on an inode/vfsmount can be easily encoded into the
i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
#define FS_GROUP_PRIO_1          0x00040000      /* fanotify content
based access control */
#define FS_GROUP_PRIO_2          0x00080000      /* fanotify
pre-content access */

in fsnotify(), only need to take the read side srcu of relevant priority groups,
but need to take extra care to set the priority bit in the inode/mnt
mask *before*
adding the mark to the list, with a relevant memory barrier before checking
the priority bits in fsnotify().

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-09 18:26       ` Amir Goldstein
@ 2016-11-10 19:46         ` Jan Kara
  2016-11-10 20:02           ` Amir Goldstein
                             ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Jan Kara @ 2016-11-10 19:46 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
> >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
> >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
> >> >> We've got a report where a fanotify daemon that implements permission checks
> >> >> screws up and doesn't send a reply.  This then causes widespread hangs due to
> >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
> >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
> >> >> fsnotify_mark_destroy_list() to block.
> >> >
> >> > Yes. But if a program implementing permission checks does not reply, your
> >> > system is likely hosed anyway. We can only try to somewhat limit the
> >> > damage...
> >> >
> >>
> >> That was my initial thought as well, but at least with the sample code
> >> Miklos sent
> >> the only thing that gets hosed is the one process watching that one file.
> >> You could think of a use case of fanotify being used to watch over files
> >> in a specific user directory, where the damage on the entire system
> >> should/could be limited. No?
> >
> > Yes, the damage could at least theoretically be limited only to those files
> > / dirs watched by the buggy process.
> >
> >> >> Below program demonstrates the issue.  It should output a single line:
> >> >>
> >> >> close(inotify_fd): success
> >> >>
> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
> >> >> the waiting permission event.
> >> >>
> >> >> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
> >> >
> >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
> >> > not sure srcu would scale to that. Furthermore the SRCU protects the list
> >> > of groups that need to get notification so it would not even be easily
> >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
> >> > to his patch. I'll try to find something to improve the situation but so
> >> > far I have no good idea...
> >> >
> >>
> >> Yes, very much buggy indeed :/
> >> Anyway, the reason I drafted it quickly was to highlight the fact that the
> >> marks only need to live to the point of decision whether or not the event
> >> should be sent to the group and afterwards, its sufficient to grab the
> >> group reference, without having impact on the entire system.
> >
> > Yes, fanotify code as such does not need the marks anymore. But the core
> > fsnotify code does...
> >
> >> Yet another possible ugly (but less buggy) solution would be
> >> to iterate all marks under SRCU read protection.
> >> If any group is about to block (either by suggested return value
> >> EAGAIN or another
> >> by using a new op should_handle_event_deferred), defer event handling to post
> >> marks iteration, by keeping a few group references on stack.
> >
> > And this does not work as well... Fanotify must notify groups by their
> > priority so you cannot arbitrarily reorder ordering in which groups get
> > notified. I'm currently pondering on using mark refcount to pin it when
> > processing permission event but there are still some details to check.
> >
> 
> All right, mark refcount sound like the proper solution.

Except it doesn't quite work. We can pin the current marks by a refcount
but they can still be removed from the list so after we regain srcu lock,
we are not sure their ->next pointers still point to still allocated marks
:-| Sadly I realized this only after implementing all this.

> In case this approach doesn't work out for some reason, you may want
> to consider fsnotify_mark_srcu (and destroy_list) per priority.
> Or just 2 srcu, one for for priority 0 and one for other.
> Because priority > 0 may block and priority 0 may not block.
> 
> The priority set on an inode/vfsmount can be easily encoded into the
> i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
> #define FS_GROUP_PRIO_1          0x00040000      /* fanotify content
> based access control */
> #define FS_GROUP_PRIO_2          0x00080000      /* fanotify
> pre-content access */
> 
> in fsnotify(), only need to take the read side srcu of relevant priority groups,
> but need to take extra care to set the priority bit in the inode/mnt
> mask *before*
> adding the mark to the list, with a relevant memory barrier before checking
> the priority bits in fsnotify().

Well but how would you like to protect the mark list hanging off the inode
/ mountpoint with two SRCUs? You'd need two lists hanging off the inode &
mountpoint (for different priorities) and that's too big cost to pay to
accomodate broken userspace...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-10 19:46         ` Jan Kara
@ 2016-11-10 20:02           ` Amir Goldstein
  2016-11-10 20:44           ` Miklos Szeredi
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Amir Goldstein @ 2016-11-10 20:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 10, 2016 at 9:46 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
>> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Sun 06-11-16 08:45:54, Amir Goldstein wrote:
>> >> On Sat, Nov 5, 2016 at 11:34 PM, Jan Kara <jack@suse.cz> wrote:
>> >> > On Wed 02-11-16 23:09:26, Miklos Szeredi wrote:
>> >> >> We've got a report where a fanotify daemon that implements permission checks
>> >> >> screws up and doesn't send a reply.  This then causes widespread hangs due to
>> >> >> fsnotify_mark_srcu read side lock being held and thus causing synchronize_srcu()
>> >> >> called from e.g. inotify_release()-> fsnotify_destroy_group()->
>> >> >> fsnotify_mark_destroy_list() to block.
>> >> >
>> >> > Yes. But if a program implementing permission checks does not reply, your
>> >> > system is likely hosed anyway. We can only try to somewhat limit the
>> >> > damage...
>> >> >
>> >>
>> >> That was my initial thought as well, but at least with the sample code
>> >> Miklos sent
>> >> the only thing that gets hosed is the one process watching that one file.
>> >> You could think of a use case of fanotify being used to watch over files
>> >> in a specific user directory, where the damage on the entire system
>> >> should/could be limited. No?
>> >
>> > Yes, the damage could at least theoretically be limited only to those files
>> > / dirs watched by the buggy process.
>> >
>> >> >> Below program demonstrates the issue.  It should output a single line:
>> >> >>
>> >> >> close(inotify_fd): success
>> >> >>
>> >> >> Instead it outputs nothing, which means that close(inotify_fd) got blocked by
>> >> >> the waiting permission event.
>> >> >>
>> >> >> Wouldn't making the srcu per-group fix this?  Would that be too expensive?
>> >> >
>> >> > Per-group would be IMHO too expensive. You can have lots of groups and I'm
>> >> > not sure srcu would scale to that. Furthermore the SRCU protects the list
>> >> > of groups that need to get notification so it would not even be easily
>> >> > possible. Also Amir's solution is buggy - I'll comment on that as a reply
>> >> > to his patch. I'll try to find something to improve the situation but so
>> >> > far I have no good idea...
>> >> >
>> >>
>> >> Yes, very much buggy indeed :/
>> >> Anyway, the reason I drafted it quickly was to highlight the fact that the
>> >> marks only need to live to the point of decision whether or not the event
>> >> should be sent to the group and afterwards, its sufficient to grab the
>> >> group reference, without having impact on the entire system.
>> >
>> > Yes, fanotify code as such does not need the marks anymore. But the core
>> > fsnotify code does...
>> >
>> >> Yet another possible ugly (but less buggy) solution would be
>> >> to iterate all marks under SRCU read protection.
>> >> If any group is about to block (either by suggested return value
>> >> EAGAIN or another
>> >> by using a new op should_handle_event_deferred), defer event handling to post
>> >> marks iteration, by keeping a few group references on stack.
>> >
>> > And this does not work as well... Fanotify must notify groups by their
>> > priority so you cannot arbitrarily reorder ordering in which groups get
>> > notified. I'm currently pondering on using mark refcount to pin it when
>> > processing permission event but there are still some details to check.
>> >
>>
>> All right, mark refcount sound like the proper solution.
>
> Except it doesn't quite work. We can pin the current marks by a refcount
> but they can still be removed from the list so after we regain srcu lock,
> we are not sure their ->next pointers still point to still allocated marks
> :-| Sadly I realized this only after implementing all this.
>
>> In case this approach doesn't work out for some reason, you may want
>> to consider fsnotify_mark_srcu (and destroy_list) per priority.
>> Or just 2 srcu, one for for priority 0 and one for other.
>> Because priority > 0 may block and priority 0 may not block.
>>
>> The priority set on an inode/vfsmount can be easily encoded into the
>> i_fsnotify_mask/mnt_fsnotify_mask, e.g.:
>> #define FS_GROUP_PRIO_1          0x00040000      /* fanotify content
>> based access control */
>> #define FS_GROUP_PRIO_2          0x00080000      /* fanotify
>> pre-content access */
>>
>> in fsnotify(), only need to take the read side srcu of relevant priority groups,
>> but need to take extra care to set the priority bit in the inode/mnt
>> mask *before*
>> adding the mark to the list, with a relevant memory barrier before checking
>> the priority bits in fsnotify().
>
> Well but how would you like to protect the mark list hanging off the inode
> / mountpoint with two SRCUs? You'd need two lists hanging off the inode &
> mountpoint (for different priorities) and that's too big cost to pay to
> accomodate broken userspace...
>

My idea was to avoid another list on the inode and use the fact that the inode
mark mask can indicate which is the largest priority on the list.
Theoretically speaking, and I know this is playing close to fire,
if you manage find a way to start traversing the first element of the
list with the
correct set of srcu held, say (0 and 1), then you should not be bothered with
reaching higher priority marks (say 2) anymore, because those would be added to
the head of the list and never after lower priority marks.

Also, with the approach of SRCU per priority, one misbehaved fanotify permission
checker will block all other permission checkers, so it's not the best outcome,
but at least it would save all the inotify and fanotify passive observers.

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-10 19:46         ` Jan Kara
  2016-11-10 20:02           ` Amir Goldstein
@ 2016-11-10 20:44           ` Miklos Szeredi
  2016-11-10 22:41             ` Amir Goldstein
  2016-11-13 18:43           ` Amir Goldstein
  2016-12-02  8:26           ` Miklos Szeredi
  3 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2016-11-10 20:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
> Except it doesn't quite work. We can pin the current marks by a refcount
> but they can still be removed from the list so after we regain srcu lock,
> we are not sure their ->next pointers still point to still allocated marks
> :-| Sadly I realized this only after implementing all this.

I think the real problem is the synchronous nature of the notification
interface, that really doesn't fit the permission events model very
well.

If it were to be changed around to an async interface then all the
marks could be iterated, the permission events queued and then the
srcu lock can be released for good.

Did I miss something?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-10 20:44           ` Miklos Szeredi
@ 2016-11-10 22:41             ` Amir Goldstein
  0 siblings, 0 replies; 19+ messages in thread
From: Amir Goldstein @ 2016-11-10 22:41 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: Jan Kara, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 10, 2016 at 10:44 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
>> Except it doesn't quite work. We can pin the current marks by a refcount
>> but they can still be removed from the list so after we regain srcu lock,
>> we are not sure their ->next pointers still point to still allocated marks
>> :-| Sadly I realized this only after implementing all this.
>
> I think the real problem is the synchronous nature of the notification
> interface, that really doesn't fit the permission events model very
> well.
>
> If it were to be changed around to an async interface then all the
> marks could be iterated, the permission events queued and then the
> srcu lock can be released for good.
>
> Did I miss something?
>

Just one annoying fact - the the spec says that permission events must
be delivered
before the lower priority class events (i.e. the passive listeners).
But unlike fanotify, that doesn't need the marks after they have been
iterated, inotify
does need the marks further along and dnotify goes further by taking a
lock, which
is on the mark itself, throughout the emittion of the event.

This is the reason I suggested to walk the mark list while taking
different SRCUs,
but as I said, that's easier said then done, so I am not sure its a
feasible idea.

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-10 19:46         ` Jan Kara
  2016-11-10 20:02           ` Amir Goldstein
  2016-11-10 20:44           ` Miklos Szeredi
@ 2016-11-13 18:43           ` Amir Goldstein
  2016-11-14 11:59             ` Amir Goldstein
  2016-12-02  8:26           ` Miklos Szeredi
  3 siblings, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2016-11-13 18:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 10, 2016 at 9:46 PM, Jan Kara <jack@suse.cz> wrote:

...

>
> Well but how would you like to protect the mark list hanging off the inode
> / mountpoint with two SRCUs? You'd need two lists hanging off the inode &
> mountpoint (for different priorities) and that's too big cost to pay to
> accomodate broken userspace...
>

This is the plan.
I have a rough implementation, which still requires some testing:

/*
 * fsnotify_perm_mark[1] protects reads of the first half of inode/vfsmount
 * mark lists, which consist of the high priority (>0) marks, whose masks
 * may contain permission events.
 *
 * fsnotify_perm_mark[0] protects reads of the second half of inode/vfsmount
 * mark lists, which consist of the low priority (0) marks, whose masks
 * do not contain permission events.
 *
 * High priority marks and low priority marks are added to different
 * destroy lists and freed by different reapers, who synchronize only
 * the relevant srcu.
 *
 * Before traversing to first mark on the list we must hold both srcu.
 * While traversing first half, we may temporary drop fsnotify_perm_mark[0]
 * to handle a blocking permission event.
 * When we reach the first low level mark, we may drop fsnotify_perm_mark[1]
 * for good.
 */

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-13 18:43           ` Amir Goldstein
@ 2016-11-14 11:59             ` Amir Goldstein
  0 siblings, 0 replies; 19+ messages in thread
From: Amir Goldstein @ 2016-11-14 11:59 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Sun, Nov 13, 2016 at 8:43 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 9:46 PM, Jan Kara <jack@suse.cz> wrote:
>
> ...
>
>>
>> Well but how would you like to protect the mark list hanging off the inode
>> / mountpoint with two SRCUs? You'd need two lists hanging off the inode &
>> mountpoint (for different priorities) and that's too big cost to pay to
>> accomodate broken userspace...
>>
>
> This is the plan.
> I have a rough implementation, which still requires some testing:
>

2 patches posted:
 fsnotify: separate fsnotify_mark_srcu for groups with permission events
 fsnotify: handle permission events without holding fsnotify_mark_srcu[0]

Miklos,

Your test program actually passes after the first patch (split to 2 SRCU)
but only because it does not add any inotify mark, so I modified your test
a bit and made it available on my github:
https://github.com/amir73il/fsnotify-utils/blob/master/fanotify_bug.c

Jan,

Please let me know what you think of this version.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-11-10 19:46         ` Jan Kara
                             ` (2 preceding siblings ...)
  2016-11-13 18:43           ` Amir Goldstein
@ 2016-12-02  8:26           ` Miklos Szeredi
  2016-12-02 10:48             ` Jan Kara
  3 siblings, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2016-12-02  8:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, Eric Paris, linux-fsdevel, linux-kernel

On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
> On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
>> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:

>> > And this does not work as well... Fanotify must notify groups by their
>> > priority so you cannot arbitrarily reorder ordering in which groups get
>> > notified. I'm currently pondering on using mark refcount to pin it when
>> > processing permission event but there are still some details to check.
>> >
>>
>> All right, mark refcount sound like the proper solution.
>
> Except it doesn't quite work. We can pin the current marks by a refcount
> but they can still be removed from the list so after we regain srcu lock,
> we are not sure their ->next pointers still point to still allocated marks
> :-| Sadly I realized this only after implementing all this.

Hmm, how about this: when removing mark from inode, drop refcount.  If
refcount is zero can remove from list.  Otherwise mark the mark "dead"
and leave it on the list.

And fsnotify can just skip dead marks.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-12-02  8:26           ` Miklos Szeredi
@ 2016-12-02 10:48             ` Jan Kara
  2016-12-02 11:02               ` Miklos Szeredi
  2016-12-02 11:41               ` Amir Goldstein
  0 siblings, 2 replies; 19+ messages in thread
From: Jan Kara @ 2016-12-02 10:48 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, Eric Paris, linux-fsdevel, linux-kernel

On Fri 02-12-16 09:26:51, Miklos Szeredi wrote:
> On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
> >> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
> 
> >> > And this does not work as well... Fanotify must notify groups by their
> >> > priority so you cannot arbitrarily reorder ordering in which groups get
> >> > notified. I'm currently pondering on using mark refcount to pin it when
> >> > processing permission event but there are still some details to check.
> >> >
> >>
> >> All right, mark refcount sound like the proper solution.
> >
> > Except it doesn't quite work. We can pin the current marks by a refcount
> > but they can still be removed from the list so after we regain srcu lock,
> > we are not sure their ->next pointers still point to still allocated marks
> > :-| Sadly I realized this only after implementing all this.
> 
> Hmm, how about this: when removing mark from inode, drop refcount.  If
> refcount is zero can remove from list.  Otherwise mark the mark "dead"
> and leave it on the list.
> 
> And fsnotify can just skip dead marks.

I had this idea as well and when trying to implement this, I've stumbled
over some problems. I think the biggest problem was that destruction of a
notification mark is relatively complex operation (doing iput() for
example) and quite a few places dropping mark references are in a context
where this can cause problems. Also I don't want to defer iput() to a
workqueue as that will have unexpected consequences such as unlinked
watched inode lingering in the system (possibly colliding with umount
etc.).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-12-02 10:48             ` Jan Kara
@ 2016-12-02 11:02               ` Miklos Szeredi
  2016-12-06 17:07                 ` Jan Kara
  2016-12-02 11:41               ` Amir Goldstein
  1 sibling, 1 reply; 19+ messages in thread
From: Miklos Szeredi @ 2016-12-02 11:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, Eric Paris, linux-fsdevel, linux-kernel

On Fri, Dec 2, 2016 at 11:48 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 02-12-16 09:26:51, Miklos Szeredi wrote:
>> On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
>> > On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
>> >> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
>>
>> >> > And this does not work as well... Fanotify must notify groups by their
>> >> > priority so you cannot arbitrarily reorder ordering in which groups get
>> >> > notified. I'm currently pondering on using mark refcount to pin it when
>> >> > processing permission event but there are still some details to check.
>> >> >
>> >>
>> >> All right, mark refcount sound like the proper solution.
>> >
>> > Except it doesn't quite work. We can pin the current marks by a refcount
>> > but they can still be removed from the list so after we regain srcu lock,
>> > we are not sure their ->next pointers still point to still allocated marks
>> > :-| Sadly I realized this only after implementing all this.
>>
>> Hmm, how about this: when removing mark from inode, drop refcount.  If
>> refcount is zero can remove from list.  Otherwise mark the mark "dead"
>> and leave it on the list.
>>
>> And fsnotify can just skip dead marks.
>
> I had this idea as well and when trying to implement this, I've stumbled
> over some problems. I think the biggest problem was that destruction of a
> notification mark is relatively complex operation (doing iput() for
> example) and quite a few places dropping mark references are in a context
> where this can cause problems. Also I don't want to defer iput() to a
> workqueue as that will have unexpected consequences such as unlinked
> watched inode lingering in the system (possibly colliding with umount
> etc.).

Okay, but all we need from the deleted mark is the ->next pointer, no?
 So everything else related to destruction can be done.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-12-02 10:48             ` Jan Kara
  2016-12-02 11:02               ` Miklos Szeredi
@ 2016-12-02 11:41               ` Amir Goldstein
  2016-12-02 11:57                 ` Amir Goldstein
  1 sibling, 1 reply; 19+ messages in thread
From: Amir Goldstein @ 2016-12-02 11:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Fri, Dec 2, 2016 at 12:48 PM, Jan Kara <jack@suse.cz> wrote:
> On Fri 02-12-16 09:26:51, Miklos Szeredi wrote:
...
>>
>> Hmm, how about this: when removing mark from inode, drop refcount.  If
>> refcount is zero can remove from list.  Otherwise mark the mark "dead"
>> and leave it on the list.
>>
>> And fsnotify can just skip dead marks.
>
> I had this idea as well and when trying to implement this, I've stumbled
> over some problems. I think the biggest problem was that destruction of a
> notification mark is relatively complex operation (doing iput() for
> example) and quite a few places dropping mark references are in a context
> where this can cause problems. Also I don't want to defer iput() to a
> workqueue as that will have unexpected consequences such as unlinked
> watched inode lingering in the system (possibly colliding with umount
> etc.).
>

I am wondering out loud if we are trying to solve a real problem or a made
up test case. I wonder if Miklos' test program truly represents the original
bug report. I am asking because fanotify permission events are usually
associated with system security software and it usually makes sense on
a vfsmount_mark and not an inode_mark.

Maybe the break even solution is not to split destroy lists per group priority,
but to split destroy lists by inode marks and vfsmount marks
and also keep 2 separate lists per group.

I am only asking this because you mentioned iput as a thorn in the solution.
Since vfsmount mark does not pin the mount, nor hold an elevated reference,
perhaps dealing with simpler destruction of vfsmount marks can solve the
problem for "rogue fanotify permission mount watch" and maybe that is
enough for all practical matters?

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-12-02 11:41               ` Amir Goldstein
@ 2016-12-02 11:57                 ` Amir Goldstein
  0 siblings, 0 replies; 19+ messages in thread
From: Amir Goldstein @ 2016-12-02 11:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Miklos Szeredi, Eric Paris, linux-fsdevel, linux-kernel

On Fri, Dec 2, 2016 at 1:41 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Dec 2, 2016 at 12:48 PM, Jan Kara <jack@suse.cz> wrote:
>> On Fri 02-12-16 09:26:51, Miklos Szeredi wrote:
> ...
>>>
>>> Hmm, how about this: when removing mark from inode, drop refcount.  If
>>> refcount is zero can remove from list.  Otherwise mark the mark "dead"
>>> and leave it on the list.
>>>
>>> And fsnotify can just skip dead marks.
>>
>> I had this idea as well and when trying to implement this, I've stumbled
>> over some problems. I think the biggest problem was that destruction of a
>> notification mark is relatively complex operation (doing iput() for
>> example) and quite a few places dropping mark references are in a context
>> where this can cause problems. Also I don't want to defer iput() to a
>> workqueue as that will have unexpected consequences such as unlinked
>> watched inode lingering in the system (possibly colliding with umount
>> etc.).
>>
>
> I am wondering out loud if we are trying to solve a real problem or a made
> up test case. I wonder if Miklos' test program truly represents the original
> bug report. I am asking because fanotify permission events are usually
> associated with system security software and it usually makes sense on
> a vfsmount_mark and not an inode_mark.
>
> Maybe the break even solution is not to split destroy lists per group priority,
> but to split destroy lists by inode marks and vfsmount marks
> and also keep 2 separate lists per group.
>
> I am only asking this because you mentioned iput as a thorn in the solution.
> Since vfsmount mark does not pin the mount, nor hold an elevated reference,
> perhaps dealing with simpler destruction of vfsmount marks can solve the
> problem for "rogue fanotify permission mount watch" and maybe that is
> enough for all practical matters?
>

And before you comment about the need to merge the inode and vfsmount
lists by priority I'll suggest:

- Check if head of inode mark list is priority 0
  If it is, there is no need to merge the lists:
-- first iterate vfsmount list with vfsmount mark srcu
-- then iterate inode list with inode mark srcu
--- if high priority inode mark is found on the list we can either skip it or
     process it out of priority order, because it was just added, so we could
     have missed it anyway

If inode list head is high priority then resort to old problem, as I said,
this is supposed to be a break even solution for practical use cases.

Amir.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: fsnotify_mark_srcu wtf?
  2016-12-02 11:02               ` Miklos Szeredi
@ 2016-12-06 17:07                 ` Jan Kara
  0 siblings, 0 replies; 19+ messages in thread
From: Jan Kara @ 2016-12-06 17:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jan Kara, Amir Goldstein, Eric Paris, linux-fsdevel, linux-kernel

On Fri 02-12-16 12:02:33, Miklos Szeredi wrote:
> On Fri, Dec 2, 2016 at 11:48 AM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 02-12-16 09:26:51, Miklos Szeredi wrote:
> >> On Thu, Nov 10, 2016 at 8:46 PM, Jan Kara <jack@suse.cz> wrote:
> >> > On Wed 09-11-16 20:26:16, Amir Goldstein wrote:
> >> >> On Wed, Nov 9, 2016 at 1:10 PM, Jan Kara <jack@suse.cz> wrote:
> >>
> >> >> > And this does not work as well... Fanotify must notify groups by their
> >> >> > priority so you cannot arbitrarily reorder ordering in which groups get
> >> >> > notified. I'm currently pondering on using mark refcount to pin it when
> >> >> > processing permission event but there are still some details to check.
> >> >> >
> >> >>
> >> >> All right, mark refcount sound like the proper solution.
> >> >
> >> > Except it doesn't quite work. We can pin the current marks by a refcount
> >> > but they can still be removed from the list so after we regain srcu lock,
> >> > we are not sure their ->next pointers still point to still allocated marks
> >> > :-| Sadly I realized this only after implementing all this.
> >>
> >> Hmm, how about this: when removing mark from inode, drop refcount.  If
> >> refcount is zero can remove from list.  Otherwise mark the mark "dead"
> >> and leave it on the list.
> >>
> >> And fsnotify can just skip dead marks.
> >
> > I had this idea as well and when trying to implement this, I've stumbled
> > over some problems. I think the biggest problem was that destruction of a
> > notification mark is relatively complex operation (doing iput() for
> > example) and quite a few places dropping mark references are in a context
> > where this can cause problems. Also I don't want to defer iput() to a
> > workqueue as that will have unexpected consequences such as unlinked
> > watched inode lingering in the system (possibly colliding with umount
> > etc.).
> 
> Okay, but all we need from the deleted mark is the ->next pointer, no?
>  So everything else related to destruction can be done.

Yes, all we need for continuing the iteration itself is a ->next pointer and
possibly a ->group pointer. However the list itself is starting at the
inode->i_fsnotify_marks and the removal of the mark from the list needs to
be protected by inode->i_lock so we do need inode reference while the mark
is still attached to the inode list. So more plausible looks to wait while
detaching mark from the object for all responses for that mark.

Another possibility would be to create separate object for the head of the
list of fsnotify marks (containing also the lock for the list). That would
allow us to detach list of marks from the object and thus we should not
have issues with conflicting lifetime needs... And the additional memory
overhead (about 3 longs) is not that big compared to the size of fsnotify
mark (currently 9 longs). We'll see.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2016-12-07  8:56 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-02 22:09 fsnotify_mark_srcu wtf? Miklos Szeredi
2016-11-03 10:25 ` Amir Goldstein
2016-11-05 21:43   ` Jan Kara
2016-11-05 21:34 ` Jan Kara
2016-11-06  6:45   ` Amir Goldstein
2016-11-09 11:10     ` Jan Kara
2016-11-09 18:26       ` Amir Goldstein
2016-11-10 19:46         ` Jan Kara
2016-11-10 20:02           ` Amir Goldstein
2016-11-10 20:44           ` Miklos Szeredi
2016-11-10 22:41             ` Amir Goldstein
2016-11-13 18:43           ` Amir Goldstein
2016-11-14 11:59             ` Amir Goldstein
2016-12-02  8:26           ` Miklos Szeredi
2016-12-02 10:48             ` Jan Kara
2016-12-02 11:02               ` Miklos Szeredi
2016-12-06 17:07                 ` Jan Kara
2016-12-02 11:41               ` Amir Goldstein
2016-12-02 11:57                 ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).