All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 0/2] fsnotify: reduce coupling of permission and non permission events
@ 2016-11-14 11:48 Amir Goldstein
  2016-11-14 11:48 ` [RFC][PATCH 1/2] fsnotify: separate fsnotify_mark_srcu for groups with " Amir Goldstein
  2016-11-14 11:48 ` [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0] Amir Goldstein
  0 siblings, 2 replies; 6+ messages in thread
From: Amir Goldstein @ 2016-11-14 11:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan, linux-kernel,
	linux-fsdevel

The issue reported by Miklos Szeredi was processes blocking when trying
to close an inotify file descriptor, while another fanotify listener is
failing to handle permission events.

Miklos included a test program with the report.
His test program with some modifications is available on my github:
https://github.com/amir73il/fsnotify-utils/blob/master/fanotify_bug.c

Ideally, we would want that destruction of a group would only block
if an event handled by this group is in progress, but this is not easy
to achieve. Instead, we make sure that destrurction of a group would
only block if an event handled by a group of similar class (priority)
is in progress.

Amir Goldstein (2):
  fsnotify: separate fsnotify_mark_srcu for groups with permission
    events
  fsnotify: handle permission events without holding
    fsnotify_mark_srcu[0]

 fs/notify/fanotify/fanotify.c | 15 ++++++---
 fs/notify/fsnotify.c          | 57 ++++++++++++++++++++++++++++----
 fs/notify/fsnotify.h          | 17 ++++++++--
 fs/notify/group.c             |  2 +-
 fs/notify/mark.c              | 77 ++++++++++++++++++++++++++++++++-----------
 5 files changed, 134 insertions(+), 34 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC][PATCH 1/2] fsnotify: separate fsnotify_mark_srcu for groups with permission events
  2016-11-14 11:48 [RFC][PATCH 0/2] fsnotify: reduce coupling of permission and non permission events Amir Goldstein
@ 2016-11-14 11:48 ` Amir Goldstein
  2016-11-14 11:48 ` [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0] Amir Goldstein
  1 sibling, 0 replies; 6+ messages in thread
From: Amir Goldstein @ 2016-11-14 11:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan, linux-kernel,
	linux-fsdevel

fsnotify_mark_srcu[1] protects reads of the first half of inode/vfsmount
mark lists, which consist of the high priority (>0) marks, whose masks
may contain permission events.

fsnotify_mark_srcu[0] protects reads of the second half of inode/vfsmount
mark lists, which consist of the low priority (0) marks, whose masks
do not contain permission events.

High priority marks and low priority marks are added to different
destroy lists and freed by different reapers, who synchronize only
the relevant srcu.

A follow up change will use this separation to guaranty that
destroying low priority groups will not block on handling of
permission events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fsnotify.c | 34 ++++++++++++++++++-----
 fs/notify/fsnotify.h | 17 ++++++++++--
 fs/notify/group.c    |  2 +-
 fs/notify/mark.c     | 77 +++++++++++++++++++++++++++++++++++++++-------------
 4 files changed, 100 insertions(+), 30 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index db39de2..af5c523a 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -192,9 +192,10 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 {
 	struct hlist_node *inode_node = NULL, *vfsmount_node = NULL;
 	struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL;
-	struct fsnotify_group *inode_group, *vfsmount_group;
+	struct fsnotify_group *inode_group, *vfsmount_group, *group;
 	struct mount *mnt;
-	int idx, ret = 0;
+	int perm_idx, idx;
+	int ret = 0;
 	/* global tests shouldn't care about events on child only the specific event */
 	__u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);
 
@@ -223,7 +224,14 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 	    !(mnt && test_mask & mnt->mnt_fsnotify_mask))
 		return 0;
 
-	idx = srcu_read_lock(&fsnotify_mark_srcu);
+	/*
+	 * First mark on the list may be either low or high priority, so
+	 * when traversing to first mark we must hold read side of both srcu.
+	 * When we reach the first low priority mark, we may drop the
+	 * read side of fsnotify_mark_srcu[1]
+	 */
+	idx = srcu_read_lock(&fsnotify_mark_srcu[0]);
+	perm_idx = srcu_read_lock(&fsnotify_mark_srcu[1]);
 
 	if ((mask & FS_MODIFY) ||
 	    (test_mask & to_tell->i_fsnotify_mask))
@@ -252,13 +260,13 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 		if (inode_node) {
 			inode_mark = hlist_entry(srcu_dereference(inode_node, &fsnotify_mark_srcu),
 						 struct fsnotify_mark, obj_list);
-			inode_group = inode_mark->group;
+			group = inode_group = inode_mark->group;
 		}
 
 		if (vfsmount_node) {
 			vfsmount_mark = hlist_entry(srcu_dereference(vfsmount_node, &fsnotify_mark_srcu),
 						    struct fsnotify_mark, obj_list);
-			vfsmount_group = vfsmount_mark->group;
+			group = vfsmount_group = vfsmount_mark->group;
 		}
 
 		if (inode_group && vfsmount_group) {
@@ -270,8 +278,16 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 			} else if (cmp < 0) {
 				vfsmount_group = NULL;
 				vfsmount_mark = NULL;
+				group = inode_group;
 			}
 		}
+
+		if (group->priority == 0 && perm_idx >= 0) {
+			/* We are done iterating the high priority marks */
+			srcu_read_unlock(&fsnotify_mark_srcu[1], perm_idx);
+			perm_idx = -1;
+		}
+
 		ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
 				    data, data_is, cookie, file_name);
 
@@ -287,7 +303,9 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 	}
 	ret = 0;
 out:
-	srcu_read_unlock(&fsnotify_mark_srcu, idx);
+	srcu_read_unlock(&fsnotify_mark_srcu[0], idx);
+	if (perm_idx >= 0)
+		srcu_read_unlock(&fsnotify_mark_srcu[1], perm_idx);
 
 	return ret;
 }
@@ -299,7 +317,9 @@ static __init int fsnotify_init(void)
 
 	BUG_ON(hweight32(ALL_FSNOTIFY_EVENTS) != 23);
 
-	ret = init_srcu_struct(&fsnotify_mark_srcu);
+	ret = init_srcu_struct(&fsnotify_mark_srcu[0]);
+	if (!ret)
+		ret = init_srcu_struct(&fsnotify_mark_srcu[1]);
 	if (ret)
 		panic("initializing fsnotify_mark_srcu");
 
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index 0a3bc2c..c1d6ae6 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -11,8 +11,19 @@
 /* destroy all events sitting in this groups notification queue */
 extern void fsnotify_flush_notify(struct fsnotify_group *group);
 
-/* protects reads of inode and vfsmount marks list */
-extern struct srcu_struct fsnotify_mark_srcu;
+/*
+ * fsnotify_mark_srcu[1] protects reads of the first half of inode/vfsmount
+ * mark lists, which consist of the high priority (>0) marks, whose masks
+ * may contain permission events.
+ *
+ * fsnotify_mark_srcu[0] protects reads of the second half of inode/vfsmount
+ * mark lists, which consist of the low priority (0) marks, whose masks
+ * do not contain permission events.
+ */
+#define FS_PRIO_SRCU(prio)	((int)(prio > FS_PRIO_0))
+#define FS_PRIO_SRCU_NUM	2
+
+extern struct srcu_struct fsnotify_mark_srcu[FS_PRIO_SRCU_NUM];
 
 /* Calculate mask of events for a list of marks */
 extern u32 fsnotify_recalc_mask(struct hlist_head *head);
@@ -61,7 +72,7 @@ extern void fsnotify_detach_group_marks(struct fsnotify_group *group);
 /*
  * wait for fsnotify_mark_srcu period to end and free all marks in destroy_list
  */
-extern void fsnotify_mark_destroy_list(void);
+extern void fsnotify_mark_destroy_list(int priority);
 
 /*
  * update the dentry->d_flags of all of inode's children to indicate if inode cares
diff --git a/fs/notify/group.c b/fs/notify/group.c
index fbe3cbe..ef946c0 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -73,7 +73,7 @@ void fsnotify_destroy_group(struct fsnotify_group *group)
 	 * Wait for fsnotify_mark_srcu period to end and free all marks in
 	 * destroy_list
 	 */
-	fsnotify_mark_destroy_list();
+	fsnotify_mark_destroy_list(group->priority);
 
 	/*
 	 * Since we have waited for fsnotify_mark_srcu in
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index d3fea0b..44e39a5 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -93,12 +93,50 @@
 
 #define FSNOTIFY_REAPER_DELAY	(1)	/* 1 jiffy */
 
-struct srcu_struct fsnotify_mark_srcu;
+/*
+ * fsnotify_mark_srcu[1] protects reads of the first half of inode/vfsmount
+ * mark lists, which consist of the high priority (>0) marks, whose masks
+ * may contain permission events.
+ *
+ * fsnotify_mark_srcu[0] protects reads of the second half of inode/vfsmount
+ * mark lists, which consist of the low priority (0) marks, whose masks
+ * do not contain permission events.
+ *
+ * High priority marks and low priority marks are added to different
+ * destroy lists and freed by different reapers, who synchronize only
+ * the relevant srcu.
+ */
+struct srcu_struct fsnotify_mark_srcu[FS_PRIO_SRCU_NUM];
 static DEFINE_SPINLOCK(destroy_lock);
-static LIST_HEAD(destroy_list);
+static LIST_HEAD(destroy_list_0);
+static LIST_HEAD(destroy_list_1);
+static struct list_head *destroy_lists[FS_PRIO_SRCU_NUM] = {
+	&destroy_list_0,
+	&destroy_list_1
+};
 
 static void fsnotify_mark_destroy_workfn(struct work_struct *work);
-static DECLARE_DELAYED_WORK(reaper_work, fsnotify_mark_destroy_workfn);
+static DECLARE_DELAYED_WORK(reaper_work_0, fsnotify_mark_destroy_workfn);
+static DECLARE_DELAYED_WORK(reaper_work_1, fsnotify_mark_destroy_workfn);
+static struct delayed_work *reapers[FS_PRIO_SRCU_NUM] = {
+	&reaper_work_0,
+	&reaper_work_1
+};
+
+static inline void fsnotify_destroy_list_add(struct fsnotify_mark *mark,
+					     int priority)
+{
+	spin_lock(&destroy_lock);
+	list_add(&mark->g_list, destroy_lists[FS_PRIO_SRCU(priority)]);
+	spin_unlock(&destroy_lock);
+}
+
+static inline void fsnotify_destroy_list_reap(int priority)
+{
+	queue_delayed_work(system_unbound_wq,
+			   reapers[FS_PRIO_SRCU(priority)],
+			   FSNOTIFY_REAPER_DELAY);
+}
 
 void fsnotify_get_mark(struct fsnotify_mark *mark)
 {
@@ -202,9 +240,7 @@ static bool __fsnotify_free_mark(struct fsnotify_mark *mark)
 	if (group->ops->freeing_mark)
 		group->ops->freeing_mark(mark, group);
 
-	spin_lock(&destroy_lock);
-	list_add(&mark->g_list, &destroy_list);
-	spin_unlock(&destroy_lock);
+	fsnotify_destroy_list_add(mark, group->priority);
 
 	return true;
 }
@@ -216,10 +252,8 @@ static bool __fsnotify_free_mark(struct fsnotify_mark *mark)
  */
 void fsnotify_free_mark(struct fsnotify_mark *mark)
 {
-	if (__fsnotify_free_mark(mark)) {
-		queue_delayed_work(system_unbound_wq, &reaper_work,
-				   FSNOTIFY_REAPER_DELAY);
-	}
+	if (__fsnotify_free_mark(mark))
+		fsnotify_destroy_list_reap(mark->group->priority);
 }
 
 void fsnotify_destroy_mark(struct fsnotify_mark *mark,
@@ -407,11 +441,8 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 
 	spin_unlock(&mark->lock);
 
-	spin_lock(&destroy_lock);
-	list_add(&mark->g_list, &destroy_list);
-	spin_unlock(&destroy_lock);
-	queue_delayed_work(system_unbound_wq, &reaper_work,
-				FSNOTIFY_REAPER_DELAY);
+	fsnotify_destroy_list_add(mark, group->priority);
+	fsnotify_destroy_list_reap(group->priority);
 
 	return ret;
 }
@@ -537,18 +568,26 @@ void fsnotify_init_mark(struct fsnotify_mark *mark,
 /*
  * Destroy all marks in destroy_list, waits for SRCU period to finish before
  * actually freeing marks.
+ * High priority (>0) marks are on destroy_lists[1] and protected by
+ * fsnotify_mark_srcu[1].
+ * Low priority (0) marks are on destroy_lists[0] and protected by
+ * fsnotify_mark_srcu[0].
  */
-void fsnotify_mark_destroy_list(void)
+void fsnotify_mark_destroy_list(int priority)
 {
 	struct fsnotify_mark *mark, *next;
 	struct list_head private_destroy_list;
+	int id = FS_PRIO_SRCU(priority);
 
 	spin_lock(&destroy_lock);
 	/* exchange the list head */
-	list_replace_init(&destroy_list, &private_destroy_list);
+	list_replace_init(destroy_lists[id], &private_destroy_list);
 	spin_unlock(&destroy_lock);
 
-	synchronize_srcu(&fsnotify_mark_srcu);
+	if (list_empty(&private_destroy_list))
+		return;
+
+	synchronize_srcu(&fsnotify_mark_srcu[id]);
 
 	list_for_each_entry_safe(mark, next, &private_destroy_list, g_list) {
 		list_del_init(&mark->g_list);
@@ -558,5 +597,5 @@ void fsnotify_mark_destroy_list(void)
 
 static void fsnotify_mark_destroy_workfn(struct work_struct *work)
 {
-	fsnotify_mark_destroy_list();
+	fsnotify_mark_destroy_list((void *)work != reapers[0]);
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0]
  2016-11-14 11:48 [RFC][PATCH 0/2] fsnotify: reduce coupling of permission and non permission events Amir Goldstein
  2016-11-14 11:48 ` [RFC][PATCH 1/2] fsnotify: separate fsnotify_mark_srcu for groups with " Amir Goldstein
@ 2016-11-14 11:48 ` Amir Goldstein
  2016-11-14 13:20   ` Jan Kara
  1 sibling, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2016-11-14 11:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan, linux-kernel,
	linux-fsdevel

Handling fanotify events does not entail dereferencing fsnotify_mark
beyond the point of fanotify_should_send_event().

For the case of permission events, which may block indefinitely,
return -EAGAIN and then fsnotify() will call handle_event() again
without a reference to the mark.

Without a reference to the mark, there is no need to call
handle_event() under fsnotify_mark_srcu[0] read side lock,
so we drop fsnotify_mark_srcu[0] while handling the event
and grab it back before continuing to the next mark.

After this change, a blocking permission event will no longer
block closing of any file descriptors of 0 priority groups,
i.e: inotify and fanotify groups of class FAN_CLASS_NOTIF.

Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 15 +++++++++++----
 fs/notify/fsnotify.c          | 23 +++++++++++++++++++++++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index e0e5f7c..c7689ad 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -176,7 +176,7 @@ init: __maybe_unused
 static int fanotify_handle_event(struct fsnotify_group *group,
 				 struct inode *inode,
 				 struct fsnotify_mark *inode_mark,
-				 struct fsnotify_mark *fanotify_mark,
+				 struct fsnotify_mark *vfsmnt_mark,
 				 u32 mask, void *data, int data_type,
 				 const unsigned char *file_name, u32 cookie)
 {
@@ -195,9 +195,16 @@ static int fanotify_handle_event(struct fsnotify_group *group,
 	BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
 	BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
 
-	if (!fanotify_should_send_event(inode_mark, fanotify_mark, mask, data,
-					data_type))
-		return 0;
+	if (inode_mark || vfsmnt_mark) {
+		if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
+						data, data_type))
+			return 0;
+#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
+		/* Ask to be called again without a reference to mark */
+		if (mask & FAN_ALL_PERM_EVENTS)
+			return -EAGAIN;
+#endif
+	}
 
 	pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
 		 mask);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index af5c523a..5b9a248 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -291,6 +291,29 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
 		ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
 				    data, data_is, cookie, file_name);
 
+		/*
+		 * If handle_event() is going to block, we call it again
+		 * witout holding fsnotify_mark_srcu[0], which is protecting
+		 * the low priority mark lists.
+		 * We are still holding fsnotify_mark_srcu[1], which
+		 * is protecting the high priority marks in the first half
+		 * of the mark list, which is where we are at.
+		 */
+		if (group->priority > 0 && ret == -EAGAIN) {
+			srcu_read_unlock(&fsnotify_mark_srcu[0], idx);
+
+			ret = group->ops->handle_event(group, to_tell,
+						       NULL, NULL,
+						       mask, data, data_is,
+						       file_name, cookie);
+
+			/*
+			 * We need to hold fsnotify_mark_srcu[0], because
+			 * next mark may be low priority.
+			 */
+			idx = srcu_read_lock(&fsnotify_mark_srcu[0]);
+		}
+
 		if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
 			goto out;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0]
  2016-11-14 11:48 ` [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0] Amir Goldstein
@ 2016-11-14 13:20   ` Jan Kara
  2016-11-14 15:09     ` Amir Goldstein
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Kara @ 2016-11-14 13:20 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan,
	linux-kernel, linux-fsdevel

On Mon 14-11-16 13:48:27, Amir Goldstein wrote:
> Handling fanotify events does not entail dereferencing fsnotify_mark
> beyond the point of fanotify_should_send_event().
> 
> For the case of permission events, which may block indefinitely,
> return -EAGAIN and then fsnotify() will call handle_event() again
> without a reference to the mark.
> 
> Without a reference to the mark, there is no need to call
> handle_event() under fsnotify_mark_srcu[0] read side lock,
> so we drop fsnotify_mark_srcu[0] while handling the event
> and grab it back before continuing to the next mark.
> 
> After this change, a blocking permission event will no longer
> block closing of any file descriptors of 0 priority groups,
> i.e: inotify and fanotify groups of class FAN_CLASS_NOTIF.
> 
> Reported-by: Miklos Szeredi <miklos@szeredi.hu>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Well, this has a similar problem as my attempt to fix the issue. The
current mark can get removed from the mark list while waiting for userspace
response. ->next pointer is still valid at that moment but ->next->pprev
already points to mark preceding us (that's how rcu lists work). When
->next mark then gets removed from the list and destroyed (it may be
protected by the second srcu), our ->next pointer points to freed memory.

Furthermore I don't like the scheme of ->handle_event returning -EAGAIN and
then dropping the srcu lock - I'd rather have some helpers provided by the
generic fsnotify code to drop srcu lock. That needs some propagation of
information inside the ->handle_event and then the helper but that's IMO
cleaner. Anyway, that is just a technical detail.

I have some ideas how to fix up issues with my refcounting approach -
refcount would pin marks not only in memory but also in lists but I have
yet to see whether that works out sensibly (it would mean that dropping
mark reference would then need to take group->mark_mutex and that may cause
lock ordering issues).

								Honza


> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index e0e5f7c..c7689ad 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -176,7 +176,7 @@ init: __maybe_unused
>  static int fanotify_handle_event(struct fsnotify_group *group,
>  				 struct inode *inode,
>  				 struct fsnotify_mark *inode_mark,
> -				 struct fsnotify_mark *fanotify_mark,
> +				 struct fsnotify_mark *vfsmnt_mark,
>  				 u32 mask, void *data, int data_type,
>  				 const unsigned char *file_name, u32 cookie)
>  {
> @@ -195,9 +195,16 @@ static int fanotify_handle_event(struct fsnotify_group *group,
>  	BUILD_BUG_ON(FAN_ACCESS_PERM != FS_ACCESS_PERM);
>  	BUILD_BUG_ON(FAN_ONDIR != FS_ISDIR);
>  
> -	if (!fanotify_should_send_event(inode_mark, fanotify_mark, mask, data,
> -					data_type))
> -		return 0;
> +	if (inode_mark || vfsmnt_mark) {
> +		if (!fanotify_should_send_event(inode_mark, vfsmnt_mark, mask,
> +						data, data_type))
> +			return 0;
> +#ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
> +		/* Ask to be called again without a reference to mark */
> +		if (mask & FAN_ALL_PERM_EVENTS)
> +			return -EAGAIN;
> +#endif
> +	}
>  
>  	pr_debug("%s: group=%p inode=%p mask=%x\n", __func__, group, inode,
>  		 mask);
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index af5c523a..5b9a248 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -291,6 +291,29 @@ int fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is,
>  		ret = send_to_group(to_tell, inode_mark, vfsmount_mark, mask,
>  				    data, data_is, cookie, file_name);
>  
> +		/*
> +		 * If handle_event() is going to block, we call it again
> +		 * witout holding fsnotify_mark_srcu[0], which is protecting
> +		 * the low priority mark lists.
> +		 * We are still holding fsnotify_mark_srcu[1], which
> +		 * is protecting the high priority marks in the first half
> +		 * of the mark list, which is where we are at.
> +		 */
> +		if (group->priority > 0 && ret == -EAGAIN) {
> +			srcu_read_unlock(&fsnotify_mark_srcu[0], idx);
> +
> +			ret = group->ops->handle_event(group, to_tell,
> +						       NULL, NULL,
> +						       mask, data, data_is,
> +						       file_name, cookie);
> +
> +			/*
> +			 * We need to hold fsnotify_mark_srcu[0], because
> +			 * next mark may be low priority.
> +			 */
> +			idx = srcu_read_lock(&fsnotify_mark_srcu[0]);
> +		}
> +
>  		if (ret && (mask & ALL_FSNOTIFY_PERM_EVENTS))
>  			goto out;
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0]
  2016-11-14 13:20   ` Jan Kara
@ 2016-11-14 15:09     ` Amir Goldstein
  2016-11-16  9:35       ` Jan Kara
  0 siblings, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2016-11-14 15:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan, linux-kernel,
	linux-fsdevel

On Mon, Nov 14, 2016 at 3:20 PM, Jan Kara <jack@suse.cz> wrote:
> On Mon 14-11-16 13:48:27, Amir Goldstein wrote:
>> Handling fanotify events does not entail dereferencing fsnotify_mark
>> beyond the point of fanotify_should_send_event().
>>
>> For the case of permission events, which may block indefinitely,
>> return -EAGAIN and then fsnotify() will call handle_event() again
>> without a reference to the mark.
>>
>> Without a reference to the mark, there is no need to call
>> handle_event() under fsnotify_mark_srcu[0] read side lock,
>> so we drop fsnotify_mark_srcu[0] while handling the event
>> and grab it back before continuing to the next mark.
>>
>> After this change, a blocking permission event will no longer
>> block closing of any file descriptors of 0 priority groups,
>> i.e: inotify and fanotify groups of class FAN_CLASS_NOTIF.
>>
>> Reported-by: Miklos Szeredi <miklos@szeredi.hu>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Well, this has a similar problem as my attempt to fix the issue. The
> current mark can get removed from the mark list while waiting for userspace
> response. ->next pointer is still valid at that moment but ->next->pprev
> already points to mark preceding us (that's how rcu lists work). When
> ->next mark then gets removed from the list and destroyed (it may be
> protected by the second srcu), our ->next pointer points to freed memory.

Oh! I missed the fact that the SRCU does not protect mark->obj_list.
Can resurrecting mark->free_list buy us anything?
Perhaps keep the marks on obj_list without FLAG_ATTACHED
and then remove marks from both free_list and obj_list post srcu_synchronize()?
I think that removing mark->obj_list was optimization of good faith
and not because it really hurts the system's memory usage that much?

>
> Furthermore I don't like the scheme of ->handle_event returning -EAGAIN and
> then dropping the srcu lock - I'd rather have some helpers provided by the
> generic fsnotify code to drop srcu lock. That needs some propagation of
> information inside the ->handle_event and then the helper but that's IMO
> cleaner. Anyway, that is just a technical detail.
>

Yeh, that's just the minimal LOC implementation. I figured the implementation
may be rejected for more profound flaws. Although personally,
I do like the so called O_NONBLOCKING semantics here.
I debated with myself if I should use EAGAIN or EWOULDBLOCK
for sake of readability.

> I have some ideas how to fix up issues with my refcounting approach -
> refcount would pin marks not only in memory but also in lists but I have
> yet to see whether that works out sensibly (it would mean that dropping
> mark reference would then need to take group->mark_mutex and that may cause
> lock ordering issues).
>

It sounds like a chicken and egg problem, but I suppose you don't mean
the actual mark refcount, but a secondary "pinned refcount"?

Anyway, if you have something ready, my test setup is already in place..

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0]
  2016-11-14 15:09     ` Amir Goldstein
@ 2016-11-16  9:35       ` Jan Kara
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Kara @ 2016-11-16  9:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Jeff Layton, Miklos Szeredi, Eric Paris, Eryu Guan,
	linux-kernel, linux-fsdevel

On Mon 14-11-16 17:09:47, Amir Goldstein wrote:
> On Mon, Nov 14, 2016 at 3:20 PM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 14-11-16 13:48:27, Amir Goldstein wrote:
> >> Handling fanotify events does not entail dereferencing fsnotify_mark
> >> beyond the point of fanotify_should_send_event().
> >>
> >> For the case of permission events, which may block indefinitely,
> >> return -EAGAIN and then fsnotify() will call handle_event() again
> >> without a reference to the mark.
> >>
> >> Without a reference to the mark, there is no need to call
> >> handle_event() under fsnotify_mark_srcu[0] read side lock,
> >> so we drop fsnotify_mark_srcu[0] while handling the event
> >> and grab it back before continuing to the next mark.
> >>
> >> After this change, a blocking permission event will no longer
> >> block closing of any file descriptors of 0 priority groups,
> >> i.e: inotify and fanotify groups of class FAN_CLASS_NOTIF.
> >>
> >> Reported-by: Miklos Szeredi <miklos@szeredi.hu>
> >> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > Well, this has a similar problem as my attempt to fix the issue. The
> > current mark can get removed from the mark list while waiting for userspace
> > response. ->next pointer is still valid at that moment but ->next->pprev
> > already points to mark preceding us (that's how rcu lists work). When
> > ->next mark then gets removed from the list and destroyed (it may be
> > protected by the second srcu), our ->next pointer points to freed memory.
> 
> Oh! I missed the fact that the SRCU does not protect mark->obj_list.
> Can resurrecting mark->free_list buy us anything?
> Perhaps keep the marks on obj_list without FLAG_ATTACHED
> and then remove marks from both free_list and obj_list post srcu_synchronize()?
> I think that removing mark->obj_list was optimization of good faith
> and not because it really hurts the system's memory usage that much?

You have to be really careful here. Minimally you'd need then another
srcu_synchronize() call after removing mark from the list and before
freeing the mark so that you are sure no process iterating the list can
have a reference to a mark being freed but even then it needs careful
checking whether it would work. The joy of lockless algorithms...

> > I have some ideas how to fix up issues with my refcounting approach -
> > refcount would pin marks not only in memory but also in lists but I have
> > yet to see whether that works out sensibly (it would mean that dropping
> > mark reference would then need to take group->mark_mutex and that may cause
> > lock ordering issues).
> 
> It sounds like a chicken and egg problem, but I suppose you don't mean
> the actual mark refcount, but a secondary "pinned refcount"?

No, I mean the original refcount. I have patches that already postpone part
of the mark cleanup to happen only once the refcount drops to 0.

> Anyway, if you have something ready, my test setup is already in place..

Thanks for an offer. I don't have anything yet, hopefully later today or
tomorrow.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-11-16  9:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-14 11:48 [RFC][PATCH 0/2] fsnotify: reduce coupling of permission and non permission events Amir Goldstein
2016-11-14 11:48 ` [RFC][PATCH 1/2] fsnotify: separate fsnotify_mark_srcu for groups with " Amir Goldstein
2016-11-14 11:48 ` [RFC][PATCH 2/2] fsnotify: handle permission events without holding fsnotify_mark_srcu[0] Amir Goldstein
2016-11-14 13:20   ` Jan Kara
2016-11-14 15:09     ` Amir Goldstein
2016-11-16  9:35       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.