* [PATCH 0/7] Performance improvement for fanotify merge
@ 2021-02-02 16:20 Amir Goldstein
  2021-02-02 16:20 ` [PATCH 1/7] fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue Amir Goldstein
                   ` (7 more replies)
  0 siblings, 8 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

Jan,

fanotify_merge() has been observed [1] to consume a lot of CPU.
This is not surprising, considering that it is implemented as a linear
search over a list of practically unbounded size.

The following series improves the linear search for an event to merge
in three different ways (see the sketch after this list):
1. Hash events into as many as 128 lists
2. Limit the linear search to the last 128 list elements
3. Use a better key - instead of the victim inode ptr, use a hash of
   all the compared fields
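
As a rough sketch, the merge key and bucket selection end up working
along these lines (combining helpers from patches 2 and 6 of this
series; a condensed illustration, not a drop-in excerpt):

	/* 32bit merge key mixes object id, event info and pid */
	key = hash_ptr(id, 32) ^ info_hash ^ hash_ptr(pid, 32);

	/* High bits are better for hash; max_bucket = 2^q_hash_bits - 1 */
	bucket = (key >> (32 - group->q_hash_bits)) & group->max_bucket;
	list = &group->notification_list[bucket];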

The end result can be observed in the test run times below.
The test is an extension of your queue overflow LTP test [2].
The timing results are from the 2nd run of -i 2, where the files
already exist in the test fs.

With an unlimited queue, queueing 16385 events on unique objects
is ~3 times faster than before the change.

In fact, the run time of queueing 16385 events (~600ms) is almost the
same as the run time of rejecting 16385 events (~550ms) due to a full
queue, which suggests a very low overhead for merging events.

The test runs two passes to exercise event merging: the "create" pass
and the "open" pass.

Before the change (v5.11-rc2), 100% of the events of the "open" pass
are merged (16385 files and 16385 events).

After the change, only ~50% of the events of the "open" pass are
merged (16385 files and 25462 events).

This is because 16384 is the maximum number of events that we can
merge when the hash table is fully balanced (128 lists, each merged
within a 128-element window).
When reducing the number of unique objects to 8192, all events
in the "open" pass are merged.

Thanks,
Amir.

v5.11-rc2, run #2 of ./fanotify05 -i 2:

fanotify05.c:109: TINFO: Test #0: Limited queue
fanotify05.c:98: TINFO: Created 16385 files in 1653ms
fanotify05.c:98: TINFO: Opened 16385 files in 543ms
fanotify05.c:77: TINFO: Got event #0 filename=fname_0
fanotify05.c:176: TPASS: Got an overflow event: pid=0 fd=-1
fanotify05.c:182: TINFO: Got 16385 events

fanotify05.c:109: TINFO: Test #1: Unlimited queue
fanotify05.c:98: TINFO: Created 16385 files in 1683ms
fanotify05.c:98: TINFO: Opened 16385 files in 1647ms
fanotify05.c:77: TINFO: Got event #0 filename=fname_0
fanotify05.c:138: TPASS: Overflow event not generated!
fanotify05.c:182: TINFO: Got 16385 events

fanotify_merge branch, run #2 of ./fanotify05 -i 2:

fanotify05.c:109: TINFO: Test #0: Limited queue
fanotify05.c:98: TINFO: Created 16385 files in 616ms
fanotify05.c:98: TINFO: Opened 16385 files in 549ms
fanotify05.c:77: TINFO: Got event #0 filename=fname_0
fanotify05.c:176: TPASS: Got an overflow event: pid=0 fd=-1
fanotify05.c:182: TINFO: Got 16385 events

fanotify05.c:109: TINFO: Test #1: Unlimited queue
fanotify05.c:98: TINFO: Created 16385 files in 614ms
fanotify05.c:98: TINFO: Opened 16385 files in 599ms
fanotify05.c:77: TINFO: Got event #0 filename=fname_0
fanotify05.c:138: TPASS: Overflow event not generated!
fanotify05.c:182: TINFO: Got 25462 events

[1] https://lore.kernel.org/linux-fsdevel/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com/
[2] https://github.com/amir73il/ltp/commits/fanotify_merge

Amir Goldstein (7):
  fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
  fsnotify: support hashed notification queue
  fsnotify: read events from hashed notification queue by order of
    insertion
  fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups
  fanotify: limit number of event merge attempts
  fanotify: mix event info into merge key hash
  fsnotify: print some debug stats on hashed queue overflow

 fs/notify/fanotify/fanotify.c      |  40 ++++++-
 fs/notify/fanotify/fanotify.h      |  24 +++-
 fs/notify/fanotify/fanotify_user.c |  55 ++++++---
 fs/notify/group.c                  |  37 ++++--
 fs/notify/inotify/inotify_user.c   |  22 ++--
 fs/notify/notification.c           | 175 +++++++++++++++++++++++++----
 include/linux/fsnotify_backend.h   | 105 +++++++++++++++--
 7 files changed, 383 insertions(+), 75 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/7] fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-02 16:20 ` [PATCH 2/7] fsnotify: support hashed notification queue Amir Goldstein
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

The current code assumes that fsnotify_notify_queue_is_empty() is
called to verify that the queue is not empty before trying to peek or
remove an event from the queue.

Remove this assumption by moving the fsnotify_notify_queue_is_empty()
check into the functions themselves, allowing them to return NULL, and
check the return value in all callers.

This is a preparation patch for multi-list event queues.
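
The calling convention changes along these lines (a sketch of the
pattern, mirroring the hunks below):

	/* Before: caller must check for an empty queue first */
	if (fsnotify_notify_queue_is_empty(group))
		return NULL;
	event = fsnotify_peek_first_event(group);

	/* After: peek/remove return NULL on an empty queue */
	event = fsnotify_peek_first_event(group);
	if (!event)
		return NULL;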

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c | 26 ++++++++++++++++----------
 fs/notify/inotify/inotify_user.c   |  5 ++---
 fs/notify/notification.c           |  6 ++++++
 3 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index b78dd1f88f7c..8ff27d92d32c 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -100,24 +100,30 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 {
 	size_t event_size = FAN_EVENT_METADATA_LEN;
 	struct fanotify_event *event = NULL;
+	struct fsnotify_event *fsn_event;
 	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
 
 	pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
 
 	spin_lock(&group->notification_lock);
-	if (fsnotify_notify_queue_is_empty(group))
+	fsn_event = fsnotify_peek_first_event(group);
+	if (!fsn_event)
 		goto out;
 
-	if (fid_mode) {
-		event_size += fanotify_event_info_len(fid_mode,
-			FANOTIFY_E(fsnotify_peek_first_event(group)));
-	}
+	event = FANOTIFY_E(fsn_event);
+	if (fid_mode)
+		event_size += fanotify_event_info_len(fid_mode, event);
 
 	if (event_size > count) {
 		event = ERR_PTR(-EINVAL);
 		goto out;
 	}
-	event = FANOTIFY_E(fsnotify_remove_first_event(group));
+
+	/*
+	 * Held the notification_lock the whole time, so this is the
+	 * same event we peeked above.
+	 */
+	fsnotify_remove_first_event(group);
 	if (fanotify_is_perm_event(event->mask))
 		FANOTIFY_PERM(event)->state = FAN_EVENT_REPORTED;
 out:
@@ -573,6 +579,7 @@ static ssize_t fanotify_write(struct file *file, const char __user *buf, size_t
 static int fanotify_release(struct inode *ignored, struct file *file)
 {
 	struct fsnotify_group *group = file->private_data;
+	struct fsnotify_event *fsn_event;
 
 	/*
 	 * Stop new events from arriving in the notification queue. since
@@ -601,13 +608,12 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	 * dequeue them and set the response. They will be freed once the
 	 * response is consumed and fanotify_get_response() returns.
 	 */
-	while (!fsnotify_notify_queue_is_empty(group)) {
-		struct fanotify_event *event;
+	while ((fsn_event = fsnotify_remove_first_event(group))) {
+		struct fanotify_event *event = FANOTIFY_E(fsn_event);
 
-		event = FANOTIFY_E(fsnotify_remove_first_event(group));
 		if (!(event->mask & FANOTIFY_PERM_EVENTS)) {
 			spin_unlock(&group->notification_lock);
-			fsnotify_destroy_event(group, &event->fse);
+			fsnotify_destroy_event(group, fsn_event);
 		} else {
 			finish_permission_event(group, FANOTIFY_PERM(event),
 						FAN_ALLOW);
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 266d17e8ecb9..d8830be60a9b 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -146,10 +146,9 @@ static struct fsnotify_event *get_one_event(struct fsnotify_group *group,
 	size_t event_size = sizeof(struct inotify_event);
 	struct fsnotify_event *event;
 
-	if (fsnotify_notify_queue_is_empty(group))
-		return NULL;
-
 	event = fsnotify_peek_first_event(group);
+	if (!event)
+		return NULL;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 75d79d6d3ef0..fcac2d72cbf5 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -150,6 +150,9 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
 
 	assert_spin_locked(&group->notification_lock);
 
+	if (fsnotify_notify_queue_is_empty(group))
+		return NULL;
+
 	pr_debug("%s: group=%p\n", __func__, group);
 
 	event = list_first_entry(&group->notification_list,
@@ -166,6 +169,9 @@ struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
 {
 	assert_spin_locked(&group->notification_lock);
 
+	if (fsnotify_notify_queue_is_empty(group))
+		return NULL;
+
 	return list_first_entry(&group->notification_list,
 				struct fsnotify_event, list);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
  2021-02-02 16:20 ` [PATCH 1/7] fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-16 15:02   ` Jan Kara
  2021-02-02 16:20 ` [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion Amir Goldstein
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

In order to improve event merge performance, support sharding the
notification queue into several notification lists, hashed by an event
merge key.

The fanotify event merge key is the object id reduced to a 32bit hash.

At the moment, both inotify and fanotify still use a single
notification list.

At the moment, reading events from the hashed queue is not in event
insertion order.  A non-empty list is read until it is empty.

The max events limit accounts for the total number of events in all
lists.
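
The queue becomes a flexible array of list heads at the end of the
group, sized at allocation time (a condensed sketch of the structures
touched below, with unrelated fields elided):

	struct fsnotify_group {
		...
		unsigned int q_hash_bits;	/* >0 means hashed queue */
		unsigned int max_bucket;	/* (1 << q_hash_bits) - 1 */
		unsigned int num_events;	/* events in all lists */
		struct list_head notification_list[];
	};

	/* allocation is sized for 2^q_hash_bits lists */
	group = kzalloc(fsnotify_group_size(q_hash_bits), gfp);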

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      |   5 +-
 fs/notify/fanotify/fanotify.h      |   4 +-
 fs/notify/fanotify/fanotify_user.c |  14 ++-
 fs/notify/group.c                  |  37 ++++++--
 fs/notify/inotify/inotify_user.c   |  17 ++--
 fs/notify/notification.c           | 131 ++++++++++++++++++++++++-----
 include/linux/fsnotify_backend.h   |  63 +++++++++++---
 7 files changed, 217 insertions(+), 54 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 1192c9953620..12df6957e4d8 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -97,7 +97,7 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
 	old = FANOTIFY_E(old_fsn);
 	new = FANOTIFY_E(new_fsn);
 
-	if (old_fsn->objectid != new_fsn->objectid ||
+	if (old_fsn->key != new_fsn->key ||
 	    old->type != new->type || old->pid != new->pid)
 		return false;
 
@@ -600,8 +600,9 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	 * Use the victim inode instead of the watching inode as the id for
 	 * event queue, so event reported on parent is merged with event
 	 * reported on child when both directory and child watches exist.
+	 * Reduce object id to 32bit hash for hashed queue merge.
 	 */
-	fanotify_init_event(event, (unsigned long)id, mask);
+	fanotify_init_event(event, hash_ptr(id, 32), mask);
 	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
 		event->pid = get_pid(task_pid(current));
 	else
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 896c819a1786..2e856372ffc8 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -145,9 +145,9 @@ struct fanotify_event {
 };
 
 static inline void fanotify_init_event(struct fanotify_event *event,
-				       unsigned long id, u32 mask)
+				       unsigned int key, u32 mask)
 {
-	fsnotify_init_event(&event->fse, id);
+	fsnotify_init_event(&event->fse, key);
 	event->mask = mask;
 	event->pid = NULL;
 }
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 8ff27d92d32c..dee12d927f8d 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -635,6 +635,7 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
 {
 	struct fsnotify_group *group;
 	struct fsnotify_event *fsn_event;
+	struct list_head *list;
 	void __user *p;
 	int ret = -ENOTTY;
 	size_t send_len = 0;
@@ -646,8 +647,15 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
 	switch (cmd) {
 	case FIONREAD:
 		spin_lock(&group->notification_lock);
-		list_for_each_entry(fsn_event, &group->notification_list, list)
-			send_len += FAN_EVENT_METADATA_LEN;
+		list = fsnotify_first_notification_list(group);
+		/*
+		 * With multi queue, send_len will be a lower bound
+		 * on total events size.
+		 */
+		if (list) {
+			list_for_each_entry(fsn_event, list, list)
+				send_len += FAN_EVENT_METADATA_LEN;
+		}
 		spin_unlock(&group->notification_lock);
 		ret = put_user(send_len, (int __user *) p);
 		break;
@@ -982,7 +990,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 		f_flags |= O_NONBLOCK;
 
 	/* fsnotify_alloc_group takes a ref.  Dropped in fanotify_release */
-	group = fsnotify_alloc_user_group(&fanotify_fsnotify_ops);
+	group = fsnotify_alloc_user_group(0, &fanotify_fsnotify_ops);
 	if (IS_ERR(group)) {
 		free_uid(user);
 		return PTR_ERR(group);
diff --git a/fs/notify/group.c b/fs/notify/group.c
index ffd723ffe46d..b41ed68f9ff9 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -21,6 +21,11 @@
  */
 static void fsnotify_final_destroy_group(struct fsnotify_group *group)
 {
+	int i;
+
+	for (i = 0; i <= group->max_bucket; i++)
+		WARN_ON_ONCE(!list_empty(&group->notification_list[i]));
+
 	if (group->ops->free_group_priv)
 		group->ops->free_group_priv(group);
 
@@ -111,12 +116,20 @@ void fsnotify_put_group(struct fsnotify_group *group)
 }
 EXPORT_SYMBOL_GPL(fsnotify_put_group);
 
-static struct fsnotify_group *__fsnotify_alloc_group(
+static struct fsnotify_group *__fsnotify_alloc_group(unsigned int q_hash_bits,
 				const struct fsnotify_ops *ops, gfp_t gfp)
 {
 	struct fsnotify_group *group;
+	int i;
+
+#ifdef FSNOTIFY_HASHED_QUEUE
+	if (WARN_ON_ONCE(q_hash_bits > FSNOTIFY_HASHED_QUEUE_MAX_BITS))
+		q_hash_bits = FSNOTIFY_HASHED_QUEUE_MAX_BITS;
+#else
+	q_hash_bits = 0;
+#endif
 
-	group = kzalloc(sizeof(struct fsnotify_group), gfp);
+	group = kzalloc(fsnotify_group_size(q_hash_bits), gfp);
 	if (!group)
 		return ERR_PTR(-ENOMEM);
 
@@ -126,8 +139,12 @@ static struct fsnotify_group *__fsnotify_alloc_group(
 	atomic_set(&group->user_waits, 0);
 
 	spin_lock_init(&group->notification_lock);
-	INIT_LIST_HEAD(&group->notification_list);
 	init_waitqueue_head(&group->notification_waitq);
+	/* Initialize one or more notification lists */
+	group->q_hash_bits = q_hash_bits;
+	group->max_bucket = (1 << q_hash_bits) - 1;
+	for (i = 0; i <= group->max_bucket; i++)
+		INIT_LIST_HEAD(&group->notification_list[i]);
 	group->max_events = UINT_MAX;
 
 	mutex_init(&group->mark_mutex);
@@ -139,20 +156,24 @@ static struct fsnotify_group *__fsnotify_alloc_group(
 }
 
 /*
- * Create a new fsnotify_group and hold a reference for the group returned.
+ * Create a new fsnotify_group with no events queue.
+ * Hold a reference for the group returned.
  */
 struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
 {
-	return __fsnotify_alloc_group(ops, GFP_KERNEL);
+	return __fsnotify_alloc_group(0, ops, GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(fsnotify_alloc_group);
 
 /*
- * Create a new fsnotify_group and hold a reference for the group returned.
+ * Create a new fsnotify_group with an events queue.
+ * If @q_hash_bits > 0, the queue is sharded into several notification lists.
+ * Hold a reference for the group returned.
  */
-struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
+struct fsnotify_group *fsnotify_alloc_user_group(unsigned int q_hash_bits,
+						 const struct fsnotify_ops *ops)
 {
-	return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
+	return __fsnotify_alloc_group(q_hash_bits, ops, GFP_KERNEL_ACCOUNT);
 }
 EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);
 
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index d8830be60a9b..1c476b9485dc 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -288,6 +288,7 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
 {
 	struct fsnotify_group *group;
 	struct fsnotify_event *fsn_event;
+	struct list_head *list;
 	void __user *p;
 	int ret = -ENOTTY;
 	size_t send_len = 0;
@@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
 	switch (cmd) {
 	case FIONREAD:
 		spin_lock(&group->notification_lock);
-		list_for_each_entry(fsn_event, &group->notification_list,
-				    list) {
-			send_len += sizeof(struct inotify_event);
-			send_len += round_event_name_len(fsn_event);
+		list = fsnotify_first_notification_list(group);
+		/*
+		 * With multi queue, send_len will be a lower bound
+		 * on total events size.
+		 */
+		if (list) {
+			list_for_each_entry(fsn_event, list, list) {
+				send_len += sizeof(struct inotify_event);
+				send_len += round_event_name_len(fsn_event);
+			}
 		}
 		spin_unlock(&group->notification_lock);
 		ret = put_user(send_len, (int __user *) p);
@@ -631,7 +638,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
 	struct fsnotify_group *group;
 	struct inotify_event_info *oevent;
 
-	group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
+	group = fsnotify_alloc_user_group(0, &inotify_fsnotify_ops);
 	if (IS_ERR(group))
 		return group;
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index fcac2d72cbf5..58c8f6c1be1b 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -47,13 +47,6 @@ u32 fsnotify_get_cookie(void)
 }
 EXPORT_SYMBOL_GPL(fsnotify_get_cookie);
 
-/* return true if the notify queue is empty, false otherwise */
-bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
-{
-	assert_spin_locked(&group->notification_lock);
-	return list_empty(&group->notification_list) ? true : false;
-}
-
 void fsnotify_destroy_event(struct fsnotify_group *group,
 			    struct fsnotify_event *event)
 {
@@ -74,10 +67,75 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	group->ops->free_event(event);
 }
 
+/* Check and fix inconsistencies in hashed queue */
+static void fsnotify_queue_check(struct fsnotify_group *group)
+{
+#ifdef FSNOTIFY_HASHED_QUEUE
+	struct list_head *list;
+	int i, nbuckets = 0;
+	bool first_empty, last_empty;
+
+	assert_spin_locked(&group->notification_lock);
+
+	pr_debug("%s: group=%p events: num=%u max=%u buckets: first=%u last=%u max=%u\n",
+		 __func__, group, group->num_events, group->max_events,
+		 group->first_bucket, group->last_bucket, group->max_bucket);
+
+	if (fsnotify_notify_queue_is_empty(group))
+		return;
+
+	first_empty = list_empty(&group->notification_list[group->first_bucket]);
+	last_empty = list_empty(&group->notification_list[group->last_bucket]);
+
+	list = &group->notification_list[0];
+	for (i = 0; i <= group->max_bucket; i++, list++) {
+		if (list_empty(list))
+			continue;
+		if (nbuckets++)
+			continue;
+		if (first_empty)
+			group->first_bucket = i;
+		if (last_empty)
+			group->last_bucket = i;
+	}
+
+	pr_debug("%s: %u non-empty buckets\n", __func__, nbuckets);
+
+	/* All buckets are empty, but non-zero num_events? */
+	if (WARN_ON_ONCE(!nbuckets && group->num_events))
+		group->num_events = 0;
+#endif
+}
+
 /*
- * Add an event to the group notification queue.  The group can later pull this
- * event off the queue to deal with.  The function returns 0 if the event was
- * added to the queue, 1 if the event was merged with some other queued event,
+ * Add an event to the group notification queue (no merge and no failure).
+ */
+static void fsnotify_queue_event(struct fsnotify_group *group,
+				struct fsnotify_event *event)
+{
+	/* Choose list to add event to */
+	unsigned int b = fsnotify_event_bucket(group, event);
+	struct list_head *list = &group->notification_list[b];
+
+	assert_spin_locked(&group->notification_lock);
+
+	pr_debug("%s: group=%p event=%p bucket=%u\n", __func__, group, event, b);
+
+	/*
+	 * TODO: set next_bucket of last event.
+	 */
+	group->last_bucket = b;
+	if (!group->num_events)
+		group->first_bucket = b;
+	group->num_events++;
+	list_add_tail(&event->list, list);
+}
+
+/*
+ * Try to add an event to the group notification queue.
+ * The group can later pull this event off the queue to deal with.
+ * The function returns 0 if the event was added to a queue,
+ * 1 if the event was merged with some other queued event,
  * 2 if the event was not queued - either the queue of events has overflown
  * or the group is shutting down.
  */
@@ -87,7 +145,7 @@ int fsnotify_add_event(struct fsnotify_group *group,
 				    struct fsnotify_event *))
 {
 	int ret = 0;
-	struct list_head *list = &group->notification_list;
+	struct list_head *list;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
@@ -99,7 +157,7 @@ int fsnotify_add_event(struct fsnotify_group *group,
 	}
 
 	if (event == group->overflow_event ||
-	    group->q_len >= group->max_events) {
+	    group->num_events >= group->max_events) {
 		ret = 2;
 		/* Queue overflow event only if it isn't already queued */
 		if (!list_empty(&group->overflow_event->list)) {
@@ -110,6 +168,7 @@ int fsnotify_add_event(struct fsnotify_group *group,
 		goto queue;
 	}
 
+	list = fsnotify_event_notification_list(group, event);
 	if (!list_empty(list) && merge) {
 		ret = merge(list, event);
 		if (ret) {
@@ -119,8 +178,7 @@ int fsnotify_add_event(struct fsnotify_group *group,
 	}
 
 queue:
-	group->q_len++;
-	list_add_tail(&event->list, list);
+	fsnotify_queue_event(group, event);
 	spin_unlock(&group->notification_lock);
 
 	wake_up(&group->notification_waitq);
@@ -137,7 +195,30 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
 	 * check in fsnotify_add_event() works
 	 */
 	list_del_init(&event->list);
-	group->q_len--;
+	group->num_events--;
+}
+
+/* Return the notification list of the first event */
+struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group)
+{
+	struct list_head *list;
+
+	assert_spin_locked(&group->notification_lock);
+
+	if (fsnotify_notify_queue_is_empty(group))
+		return NULL;
+
+	list = &group->notification_list[group->first_bucket];
+	if (likely(!list_empty(list)))
+		return list;
+
+	/*
+	 * Look for any non-empty bucket.
+	 */
+	fsnotify_queue_check(group);
+	list = &group->notification_list[group->first_bucket];
+
+	return list_empty(list) ? NULL : list;
 }
 
 /*
@@ -147,17 +228,21 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
 struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
 {
 	struct fsnotify_event *event;
+	struct list_head *list;
 
 	assert_spin_locked(&group->notification_lock);
 
-	if (fsnotify_notify_queue_is_empty(group))
+	list = fsnotify_first_notification_list(group);
+	if (!list)
 		return NULL;
 
-	pr_debug("%s: group=%p\n", __func__, group);
+	pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket);
 
-	event = list_first_entry(&group->notification_list,
-				 struct fsnotify_event, list);
+	event = list_first_entry(list, struct fsnotify_event, list);
 	fsnotify_remove_queued_event(group, event);
+	/*
+	 * TODO: update group->first_bucket to next_bucket in first event.
+	 */
 	return event;
 }
 
@@ -167,13 +252,15 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
  */
 struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
 {
+	struct list_head *list;
+
 	assert_spin_locked(&group->notification_lock);
 
-	if (fsnotify_notify_queue_is_empty(group))
+	list = fsnotify_first_notification_list(group);
+	if (!list)
 		return NULL;
 
-	return list_first_entry(&group->notification_list,
-				struct fsnotify_event, list);
+	return list_first_entry(list, struct fsnotify_event, list);
 }
 
 /*
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index e5409b83e731..b2a80bc00108 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -160,6 +160,11 @@ struct fsnotify_ops {
 	void (*free_mark)(struct fsnotify_mark *mark);
 };
 
+#ifdef CONFIG_FANOTIFY
+#define FSNOTIFY_HASHED_QUEUE
+#define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8
+#endif
+
 /*
  * all of the information about the original object we want to now send to
  * a group.  If you want to carry more info from the accessing task to the
@@ -167,7 +172,7 @@ struct fsnotify_ops {
  */
 struct fsnotify_event {
 	struct list_head list;
-	unsigned long objectid;	/* identifier for queue merges */
+	unsigned int key;		/* Key for hashed queue add/merge */
 };
 
 /*
@@ -189,12 +194,10 @@ struct fsnotify_group {
 	 */
 	refcount_t refcnt;		/* things with interest in this group */
 
-	/* needed to send notification to userspace */
-	spinlock_t notification_lock;		/* protect the notification_list */
-	struct list_head notification_list;	/* list of event_holder this group needs to send to userspace */
-	wait_queue_head_t notification_waitq;	/* read() on the notification file blocks on this waitq */
-	unsigned int q_len;			/* events on the queue */
-	unsigned int max_events;		/* maximum events allowed on the list */
+	/* needed to send notification to userspace and wait on group shutdown */
+	spinlock_t notification_lock;		/* protect the event queues */
+	wait_queue_head_t notification_waitq;	/* read() the events blocks on this waitq */
+
 	/*
 	 * Valid fsnotify group priorities.  Events are send in order from highest
 	 * priority to lowest priority.  We default to the lowest priority.
@@ -244,8 +247,36 @@ struct fsnotify_group {
 		} fanotify_data;
 #endif /* CONFIG_FANOTIFY */
 	};
+
+	/* Only relevant for groups that use events queue: */
+	unsigned int q_hash_bits;	/* >0 means hashed notification queue */
+	unsigned int max_bucket;	/* notification_list[] range is [0..max_bucket] */
+	unsigned int first_bucket;	/* List to read first event from */
+	unsigned int last_bucket;	/* List of last added event */
+	unsigned int num_events;	/* Number of events in all lists */
+	unsigned int max_events;	/* Maximum events allowed in all lists */
+	struct list_head notification_list[];	/* Queue of events sharded into several lists */
 };
 
+static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
+{
+	return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
+}
+
+static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
+						 struct fsnotify_event *event)
+{
+	/* High bits are better for hash */
+	return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
+}
+
+static inline struct list_head *fsnotify_event_notification_list(
+						struct fsnotify_group *group,
+						struct fsnotify_event *event)
+{
+	return &group->notification_list[fsnotify_event_bucket(group, event)];
+}
+
 /* When calling fsnotify tell it if the data is a path or inode */
 enum fsnotify_data_type {
 	FSNOTIFY_EVENT_NONE,
@@ -470,7 +501,8 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
 
 /* create a new group */
 extern struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops);
-extern struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops);
+extern struct fsnotify_group *fsnotify_alloc_user_group(unsigned int q_hash_bits,
+							const struct fsnotify_ops *ops);
 /* get reference to a group */
 extern void fsnotify_get_group(struct fsnotify_group *group);
 /* drop reference on a group from fsnotify_alloc_group */
@@ -495,8 +527,15 @@ static inline void fsnotify_queue_overflow(struct fsnotify_group *group)
 	fsnotify_add_event(group, group->overflow_event, NULL);
 }
 
-/* true if the group notification queue is empty */
-extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
+/* True if all the group event queues are empty */
+static inline bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group)
+{
+	assert_spin_locked(&group->notification_lock);
+	return group->num_events == 0;
+}
+
+/* Return the notification list of the first event */
+extern struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
 extern struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group);
 /* return AND dequeue the first event on the notification queue */
@@ -577,10 +616,10 @@ extern void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info);
 extern bool fsnotify_prepare_user_wait(struct fsnotify_iter_info *iter_info);
 
 static inline void fsnotify_init_event(struct fsnotify_event *event,
-				       unsigned long objectid)
+				       unsigned int key)
 {
 	INIT_LIST_HEAD(&event->list);
-	event->objectid = objectid;
+	event->key = key;
 }
 
 #else
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
  2021-02-02 16:20 ` [PATCH 1/7] fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue Amir Goldstein
  2021-02-02 16:20 ` [PATCH 2/7] fsnotify: support hashed notification queue Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-16 15:10   ` Jan Kara
  2021-02-02 16:20 ` [PATCH 4/7] fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups Amir Goldstein
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On a 64bit arch, use the available 32 bits in the event for a
next_bucket field, which is used to chain all events in order of
insertion.

The group has a cursor for the bucket containing the first event and
every event stores the bucket of the next event to read.

On a 32bit arch, the hashed notification queue is disabled.
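
A minimal sketch of the chaining, using the helpers added below:

	/* enqueue to bucket b: point the previous tail at this list */
	last_e = list_last_entry(&group->notification_list[group->last_bucket],
				 struct fsnotify_event, list);
	fsnotify_event_set_next_bucket(last_e, b);
	group->last_bucket = b;

	/* dequeue: the removed event tells us which list to read next */
	group->first_bucket = fsnotify_event_next_bucket(event);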

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/notification.c         | 49 ++++++++++++++++++++++++--------
 include/linux/fsnotify_backend.h | 42 +++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 12 deletions(-)

diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 58c8f6c1be1b..d98f4c8bfb0e 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -84,8 +84,8 @@ static void fsnotify_queue_check(struct fsnotify_group *group)
 	if (fsnotify_notify_queue_is_empty(group))
 		return;
 
-	first_empty = list_empty(&group->notification_list[group->first_bucket]);
-	last_empty = list_empty(&group->notification_list[group->last_bucket]);
+	first_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->first_bucket]));
+	last_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->last_bucket]));
 
 	list = &group->notification_list[0];
 	for (i = 0; i <= group->max_bucket; i++, list++) {
@@ -121,12 +121,23 @@ static void fsnotify_queue_event(struct fsnotify_group *group,
 
 	pr_debug("%s: group=%p event=%p bucket=%u\n", __func__, group, event, b);
 
-	/*
-	 * TODO: set next_bucket of last event.
-	 */
-	group->last_bucket = b;
-	if (!group->num_events)
-		group->first_bucket = b;
+	if (fsnotify_notify_queue_is_hashed(group)) {
+		/*
+		 * On first insert, set this event's list as the list to read first event.
+		 * Otherwise, point from last event to this event's list.
+		 */
+		struct list_head *last_l = &group->notification_list[group->last_bucket];
+
+		if (!group->num_events) {
+			group->first_bucket = b;
+		} else if (!WARN_ON_ONCE(list_empty(last_l))) {
+			struct fsnotify_event *last_e;
+
+			last_e = list_last_entry(last_l, struct fsnotify_event, list);
+			fsnotify_event_set_next_bucket(last_e, b);
+		}
+		group->last_bucket = b;
+	}
 	group->num_events++;
 	list_add_tail(&event->list, list);
 }
@@ -186,8 +197,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
 	return ret;
 }
 
-void fsnotify_remove_queued_event(struct fsnotify_group *group,
-				  struct fsnotify_event *event)
+static void __fsnotify_remove_queued_event(struct fsnotify_group *group,
+					   struct fsnotify_event *event)
 {
 	assert_spin_locked(&group->notification_lock);
 	/*
@@ -198,6 +209,17 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
 	group->num_events--;
 }
 
+void fsnotify_remove_queued_event(struct fsnotify_group *group,
+				  struct fsnotify_event *event)
+{
+	/*
+	 * if called for removal of event in the middle of a hashed queue,
+	 * events may be read not in insertion order.
+	 */
+	WARN_ON_ONCE(fsnotify_notify_queue_is_hashed(group));
+	__fsnotify_remove_queued_event(group, event);
+}
+
 /* Return the notification list of the first event */
 struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group)
 {
@@ -213,6 +235,7 @@ struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group)
 		return list;
 
 	/*
+	 * Oops... first bucket is not supposed to be empty.
 	 * Look for any non-empty bucket.
 	 */
 	fsnotify_queue_check(group);
@@ -239,10 +262,12 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
 	pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket);
 
 	event = list_first_entry(list, struct fsnotify_event, list);
-	fsnotify_remove_queued_event(group, event);
+	__fsnotify_remove_queued_event(group, event);
 	/*
-	 * TODO: update group->first_bucket to next_bucket in first event.
+	 * Removed event points to the next list to read from.
 	 */
+	group->first_bucket = fsnotify_event_next_bucket(event);
+
 	return event;
 }
 
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index b2a80bc00108..3fc3c9e4d21c 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -161,9 +161,12 @@ struct fsnotify_ops {
 };
 
 #ifdef CONFIG_FANOTIFY
+#if BITS_PER_LONG == 64
+/* Use available 32bit of event for hashed queue support */
 #define FSNOTIFY_HASHED_QUEUE
 #define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8
 #endif
+#endif
 
 /*
  * all of the information about the original object we want to now send to
@@ -173,6 +176,9 @@ struct fsnotify_ops {
 struct fsnotify_event {
 	struct list_head list;
 	unsigned int key;		/* Key for hashed queue add/merge */
+#ifdef FSNOTIFY_HASHED_QUEUE
+	unsigned int next_bucket;	/* Bucket to read next event from */
+#endif
 };
 
 /*
@@ -277,6 +283,41 @@ static inline struct list_head *fsnotify_event_notification_list(
 	return &group->notification_list[fsnotify_event_bucket(group, event)];
 }
 
+#ifdef FSNOTIFY_HASHED_QUEUE
+static inline bool fsnotify_notify_queue_is_hashed(struct fsnotify_group *group)
+{
+	return group->max_bucket > 0;
+}
+
+static inline unsigned int fsnotify_event_next_bucket(struct fsnotify_event *event)
+{
+	return event->next_bucket;
+}
+
+static inline void fsnotify_event_set_next_bucket(struct fsnotify_event *event,
+						  unsigned int b)
+{
+	event->next_bucket = b;
+}
+
+#else
+static inline bool fsnotify_notify_queue_is_hashed(struct fsnotify_group *group)
+{
+	return false;
+}
+
+static inline unsigned int fsnotify_event_next_bucket(struct fsnotify_event *event)
+{
+	return 0;
+}
+
+static inline void fsnotify_event_set_next_bucket(struct fsnotify_event *event,
+						  unsigned int b)
+{
+}
+
+#endif
+
 /* When calling fsnotify tell it if the data is a path or inode */
 enum fsnotify_data_type {
 	FSNOTIFY_EVENT_NONE,
@@ -620,6 +661,7 @@ static inline void fsnotify_init_event(struct fsnotify_event *event,
 {
 	INIT_LIST_HEAD(&event->list);
 	event->key = key;
+	fsnotify_event_set_next_bucket(event, 0);
 }
 
 #else
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/7] fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
                   ` (2 preceding siblings ...)
  2021-02-02 16:20 ` [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-02 16:20 ` [PATCH 5/7] fanotify: limit number of event merge attempts Amir Goldstein
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

To improve performance of merging events when there is a large queue
of events, enable the hashed notification queue for FAN_CLASS_NOTIF
groups.

Use 128 hash buckets to reduce the avg. event list size by a factor of
128.  With the default queue size of 16384, that leads to an avg. max
list size of 128 (16384 / 128).

When a fanotify permission event is canceled, the event can be removed
from the middle of the queue and that can break the read order of
events in a hashed queue.  For now, we do not support a hashed queue
with any class that supports permission events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index dee12d927f8d..641c20ad96e4 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -29,6 +29,11 @@
 #define FANOTIFY_DEFAULT_MAX_EVENTS	16384
 #define FANOTIFY_DEFAULT_MAX_MARKS	8192
 #define FANOTIFY_DEFAULT_MAX_LISTENERS	128
+/*
+ * 128 hash buckets for fast event merges.
+ * With the default queue size, that leads to an avg. list size of 128.
+ */
+#define FANOTIFY_NOTIF_Q_HASH_BITS	7
 
 /*
  * All flags that may be specified in parameter event_f_flags of fanotify_init.
@@ -941,6 +946,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	struct user_struct *user;
 	unsigned int fid_mode = flags & FANOTIFY_FID_BITS;
 	unsigned int class = flags & FANOTIFY_CLASS_BITS;
+	unsigned int q_hash_bits = 0;
 
 	pr_debug("%s: flags=%x event_f_flags=%x\n",
 		 __func__, flags, event_f_flags);
@@ -989,8 +995,17 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 	if (flags & FAN_NONBLOCK)
 		f_flags |= O_NONBLOCK;
 
+	/*
+	 * When a fanotify permission event is canceled, the event can be
+	 * removed from the middle of the queue and that can break the read
+	 * order of events in a hashed queue.  For now, we do not support a
+	 * hashed queue with any class that supports permission events.
+	 */
+	if (class == FAN_CLASS_NOTIF)
+		q_hash_bits = FANOTIFY_NOTIF_Q_HASH_BITS;
+
 	/* fsnotify_alloc_group takes a ref.  Dropped in fanotify_release */
-	group = fsnotify_alloc_user_group(0, &fanotify_fsnotify_ops);
+	group = fsnotify_alloc_user_group(q_hash_bits, &fanotify_fsnotify_ops);
 	if (IS_ERR(group)) {
 		free_uid(user);
 		return PTR_ERR(group);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
                   ` (3 preceding siblings ...)
  2021-02-02 16:20 ` [PATCH 4/7] fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-27  8:31   ` Amir Goldstein
  2021-02-02 16:20 ` [PATCH 6/7] fanotify: mix event info into merge key hash Amir Goldstein
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

Event merges are expensive when the event queue size is large.
Limit the linear search to 128 merge tests.
In combination with 128 hash lists, there is a potential to merge with
up to 16K (128 * 128) events in the hashed queue.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 12df6957e4d8..6d3807012851 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
 	return false;
 }
 
+/* Limit event merges to limit CPU overhead per event */
+#define FANOTIFY_MAX_MERGE_EVENTS 128
+
 /* and the list better be locked by something too! */
 static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 {
 	struct fsnotify_event *test_event;
 	struct fanotify_event *new;
+	int i = 0;
 
 	pr_debug("%s: list=%p event=%p\n", __func__, list, event);
 	new = FANOTIFY_E(event);
@@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
 		return 0;
 
 	list_for_each_entry_reverse(test_event, list, list) {
+		if (++i > FANOTIFY_MAX_MERGE_EVENTS)
+			break;
 		if (fanotify_should_merge(test_event, event)) {
 			FANOTIFY_E(test_event)->mask |= new->mask;
 			return 1;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
                   ` (4 preceding siblings ...)
  2021-02-02 16:20 ` [PATCH 5/7] fanotify: limit number of event merge attempts Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-16 15:39   ` Jan Kara
  2021-02-02 16:20 ` [PATCH 7/7] fsnotify: print some debug stats on hashed queue overflow Amir Goldstein
  2021-02-16 16:02 ` [PATCH 0/7] Performance improvement for fanotify merge Jan Kara
  7 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

Improve the balance of the hashed queue lists by mixing more
merge-relevant event info into the merge key.

For example, all FAN_CREATE name events in the same dir used to have
the same merge key, based on the dir inode.  With this change, the
created file's name is mixed into the merge key.
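
The resulting merge key is, roughly (a condensed sketch of the combined
hash as implemented below):

	/* object id, event-specific info and pid are all mixed in */
	key = hash_ptr(id, 32) ^ event->info_hash ^ hash_ptr(pid, 32);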

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c | 33 +++++++++++++++++++++++++++------
 fs/notify/fanotify/fanotify.h | 20 +++++++++++++++++---
 2 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 6d3807012851..b19fef1c6f64 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -14,6 +14,7 @@
 #include <linux/audit.h>
 #include <linux/sched/mm.h>
 #include <linux/statfs.h>
+#include <linux/stringhash.h>
 
 #include "fanotify.h"
 
@@ -443,6 +444,10 @@ static struct fanotify_event *fanotify_alloc_path_event(const struct path *path,
 	pevent->path = *path;
 	path_get(path);
 
+	/* Mix path info into event merge key */
+	pevent->fae.info_hash = hash_ptr(path->dentry, 32) ^
+				hash_ptr(path->mnt, 32);
+
 	return &pevent->fae;
 }
 
@@ -461,6 +466,9 @@ static struct fanotify_event *fanotify_alloc_perm_event(const struct path *path,
 	pevent->path = *path;
 	path_get(path);
 
+	/* Permission events are not merged */
+	pevent->fae.info_hash = 0;
+
 	return &pevent->fae;
 }
 
@@ -469,6 +477,7 @@ static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
 						       gfp_t gfp)
 {
 	struct fanotify_fid_event *ffe;
+	struct fanotify_fh *fh;
 
 	ffe = kmem_cache_alloc(fanotify_fid_event_cachep, gfp);
 	if (!ffe)
@@ -476,8 +485,11 @@ static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
 
 	ffe->fae.type = FANOTIFY_EVENT_TYPE_FID;
 	ffe->fsid = *fsid;
-	fanotify_encode_fh(&ffe->object_fh, id, fanotify_encode_fh_len(id),
-			   gfp);
+	fh = &ffe->object_fh;
+	fanotify_encode_fh(fh, id, fanotify_encode_fh_len(id), gfp);
+
+	/* Mix fsid+fid info into event merge key */
+	ffe->fae.info_hash = full_name_hash(ffe->fskey, fanotify_fh_buf(fh), fh->len);
 
 	return &ffe->fae;
 }
@@ -517,6 +529,9 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
 	if (file_name)
 		fanotify_info_copy_name(info, file_name);
 
+	/* Mix fsid+dfid+name+fid info into event merge key */
+	fne->fae.info_hash = full_name_hash(fne->fskey, info->buf, fanotify_info_len(info));
+
 	pr_debug("%s: ino=%lu size=%u dir_fh_len=%u child_fh_len=%u name_len=%u name='%.*s'\n",
 		 __func__, id->i_ino, size, dir_fh_len, child_fh_len,
 		 info->name_len, info->name_len, fanotify_info_name(info));
@@ -539,6 +554,8 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	struct mem_cgroup *old_memcg;
 	struct inode *child = NULL;
 	bool name_event = false;
+	unsigned int hash = 0;
+	struct pid *pid;
 
 	if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
 		/*
@@ -606,13 +623,17 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
 	 * Use the victim inode instead of the watching inode as the id for
 	 * event queue, so event reported on parent is merged with event
 	 * reported on child when both directory and child watches exist.
-	 * Reduce object id to 32bit hash for hashed queue merge.
+	 * Reduce object id and event info to 32bit hash for hashed queue merge.
 	 */
-	fanotify_init_event(event, hash_ptr(id, 32), mask);
+	hash = event->info_hash ^ hash_ptr(id, 32);
 	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
-		event->pid = get_pid(task_pid(current));
+		pid = get_pid(task_pid(current));
 	else
-		event->pid = get_pid(task_tgid(current));
+		pid = get_pid(task_tgid(current));
+	/* Mix pid info into event merge key */
+	hash ^= hash_ptr(pid, 32);
+	fanotify_init_event(event, hash, mask);
+	event->pid = pid;
 
 out:
 	set_active_memcg(old_memcg);
diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 2e856372ffc8..522fb1a68b30 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -115,6 +115,11 @@ static inline void fanotify_info_init(struct fanotify_info *info)
 	info->name_len = 0;
 }
 
+static inline unsigned int fanotify_info_len(struct fanotify_info *info)
+{
+	return info->dir_fh_totlen + info->file_fh_totlen + info->name_len;
+}
+
 static inline void fanotify_info_copy_name(struct fanotify_info *info,
 					   const struct qstr *name)
 {
@@ -138,7 +143,10 @@ enum fanotify_event_type {
 };
 
 struct fanotify_event {
-	struct fsnotify_event fse;
+	union {
+		struct fsnotify_event fse;
+		unsigned int info_hash;
+	};
 	u32 mask;
 	enum fanotify_event_type type;
 	struct pid *pid;
@@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
 
 struct fanotify_fid_event {
 	struct fanotify_event fae;
-	__kernel_fsid_t fsid;
+	union {
+		__kernel_fsid_t fsid;
+		void *fskey;	/* 64 or 32 bits of fsid used for salt */
+	};
 	struct fanotify_fh object_fh;
 	/* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
 	unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
@@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
 
 struct fanotify_name_event {
 	struct fanotify_event fae;
-	__kernel_fsid_t fsid;
+	union {
+		__kernel_fsid_t fsid;
+		void *fskey;	/* 64 or 32 bits of fsid used for salt */
+	};
 	struct fanotify_info info;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 7/7] fsnotify: print some debug stats on hashed queue overflow
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
                   ` (5 preceding siblings ...)
  2021-02-02 16:20 ` [PATCH 6/7] fanotify: mix event info into merge key hash Amir Goldstein
@ 2021-02-02 16:20 ` Amir Goldstein
  2021-02-16 16:02 ` [PATCH 0/7] Performance improvement for fanotify merge Jan Kara
  7 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-02 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

We could also consider exporting these stats via sysfs.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/notification.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index d98f4c8bfb0e..998f8753d358 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -72,9 +72,12 @@ static void fsnotify_queue_check(struct fsnotify_group *group)
 {
 #ifdef FSNOTIFY_HASHED_QUEUE
 	struct list_head *list;
+	unsigned int bitmap[8];
 	int i, nbuckets = 0;
 	bool first_empty, last_empty;
 
+	BUILD_BUG_ON((1 << FSNOTIFY_HASHED_QUEUE_MAX_BITS) > 32 * 8);
+
 	assert_spin_locked(&group->notification_lock);
 
 	pr_debug("%s: group=%p events: num=%u max=%u buckets: first=%u last=%u max=%u\n",
@@ -87,10 +90,14 @@ static void fsnotify_queue_check(struct fsnotify_group *group)
 	first_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->first_bucket]));
 	last_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->last_bucket]));
 
+	for (i = 0; i < 8; i++)
+		bitmap[i] = 0;
+
 	list = &group->notification_list[0];
 	for (i = 0; i <= group->max_bucket; i++, list++) {
 		if (list_empty(list))
 			continue;
+		bitmap[i / 32] |= 1 << (i % 32);
 		if (nbuckets++)
 			continue;
 		if (first_empty)
@@ -99,7 +106,9 @@ static void fsnotify_queue_check(struct fsnotify_group *group)
 			group->last_bucket = i;
 	}
 
-	pr_debug("%s: %u non-empty buckets\n", __func__, nbuckets);
+	pr_debug("%s: %u non-empty buckets %x %x %x %x %x %x %x %x\n", __func__, nbuckets,
+		 bitmap[0], bitmap[1], bitmap[2], bitmap[3],
+		 bitmap[4], bitmap[5], bitmap[6], bitmap[7]);
 
 	/* All buckets are empty, but non-zero num_events? */
 	if (WARN_ON_ONCE(!nbuckets && group->num_events))
@@ -175,6 +184,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
 			spin_unlock(&group->notification_lock);
 			return ret;
 		}
+		/* Run debug sanity checks on full queue */
+		fsnotify_queue_check(group);
 		event = group->overflow_event;
 		goto queue;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-02 16:20 ` [PATCH 2/7] fsnotify: support hashed notification queue Amir Goldstein
@ 2021-02-16 15:02   ` Jan Kara
  2021-02-17 12:33     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-16 15:02 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Tue 02-02-21 18:20:05, Amir Goldstein wrote:
> In order to improve event merge performance, support sharding the
> notification queue to several notification lists, hashed by an event
> merge key.
> 
> The fanotify event merge key is the object id reduced to 32bit hash.
> 
> At the moment, both inotify and fanotify still use a single notification
> list.
> 
> At the moment, reading events from the hashed queue is not by event
> insertion order.  A non-empty list is read until it is empty.
> 
> The max events limit is accounted for total events in all lists.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Style nit: Please try to stay within 80 columns...

Also I think this change would be more comprehensible if it was merged
with the following patch 3/7. Having code with TODOs is kind of awkward
and makes it more difficult to verify that the result is actually
correct.

> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 8ff27d92d32c..dee12d927f8d 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -635,6 +635,7 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
>  {
>  	struct fsnotify_group *group;
>  	struct fsnotify_event *fsn_event;
> +	struct list_head *list;
>  	void __user *p;
>  	int ret = -ENOTTY;
>  	size_t send_len = 0;
> @@ -646,8 +647,15 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
>  	switch (cmd) {
>  	case FIONREAD:
>  		spin_lock(&group->notification_lock);
> -		list_for_each_entry(fsn_event, &group->notification_list, list)
> -			send_len += FAN_EVENT_METADATA_LEN;
> +		list = fsnotify_first_notification_list(group);
> +		/*
> +		 * With multi queue, send_len will be a lower bound
> +		 * on total events size.
> +		 */
> +		if (list) {
> +			list_for_each_entry(fsn_event, list, list)
> +				send_len += FAN_EVENT_METADATA_LEN;
> +		}

IMO we should just walk all the lists here. Otherwise the reported
length would be just odd. That being said, the returned value already
is odd because we don't properly account for different event sizes (an
unrelated problem). So if we want to keep it simple for now, we can
just return group->num_events * FAN_EVENT_METADATA_LEN, can't we?
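
Something along these lines (a sketch of the suggestion, not a tested
patch):

	case FIONREAD:
		spin_lock(&group->notification_lock);
		send_len = group->num_events * FAN_EVENT_METADATA_LEN;
		spin_unlock(&group->notification_lock);
		ret = put_user(send_len, (int __user *) p);
		break;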

> diff --git a/fs/notify/group.c b/fs/notify/group.c
> index ffd723ffe46d..b41ed68f9ff9 100644
> --- a/fs/notify/group.c
> +++ b/fs/notify/group.c
> @@ -111,12 +116,20 @@ void fsnotify_put_group(struct fsnotify_group *group)
>  }
>  EXPORT_SYMBOL_GPL(fsnotify_put_group);
>  
> -static struct fsnotify_group *__fsnotify_alloc_group(
> +static struct fsnotify_group *__fsnotify_alloc_group(unsigned int q_hash_bits,
>  				const struct fsnotify_ops *ops, gfp_t gfp)
>  {
>  	struct fsnotify_group *group;
> +	int i;
> +
> +#ifdef FSNOTIFY_HASHED_QUEUE
> +	if (WARN_ON_ONCE(q_hash_bits > FSNOTIFY_HASHED_QUEUE_MAX_BITS))
> +		q_hash_bits = FSNOTIFY_HASHED_QUEUE_MAX_BITS;
> +#else
> +	q_hash_bits = 0;
> +#endif

Why the check for q_hash_bits? We have in-kernel users only, so I would
not be that paranoid :) Maybe I'd just let the group specify whether it
wants a hashed queue or not, and for hashed queues always set the number
of buckets to 128. So far we don't need anything more flexible, and if
we ever do, it's easy enough to change the in-kernel API...

Also, honestly, I find these ifdefs ugly. I'd just leave hashed queues
unconditionally compiled in.

> -	group = kzalloc(sizeof(struct fsnotify_group), gfp);
> +	group = kzalloc(fsnotify_group_size(q_hash_bits), gfp);
>  	if (!group)
>  		return ERR_PTR(-ENOMEM);
>  
> @@ -126,8 +139,12 @@ static struct fsnotify_group *__fsnotify_alloc_group(
>  	atomic_set(&group->user_waits, 0);
>  
>  	spin_lock_init(&group->notification_lock);
> -	INIT_LIST_HEAD(&group->notification_list);
>  	init_waitqueue_head(&group->notification_waitq);
> +	/* Initialize one or more notification lists */
> +	group->q_hash_bits = q_hash_bits;
> +	group->max_bucket = (1 << q_hash_bits) - 1;
> +	for (i = 0; i <= group->max_bucket; i++)
> +		INIT_LIST_HEAD(&group->notification_list[i]);
>  	group->max_events = UINT_MAX;
>  
>  	mutex_init(&group->mark_mutex);
> @@ -139,20 +156,24 @@ static struct fsnotify_group *__fsnotify_alloc_group(
>  }
>  
>  /*
> - * Create a new fsnotify_group and hold a reference for the group returned.
> + * Create a new fsnotify_group with no events queue.

How come "no events queue"? There will always be at least one queue...

> + * Hold a reference for the group returned.
>   */
>  struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
>  {
> -	return __fsnotify_alloc_group(ops, GFP_KERNEL);
> +	return __fsnotify_alloc_group(0, ops, GFP_KERNEL);
>  }
>  EXPORT_SYMBOL_GPL(fsnotify_alloc_group);
>  
>  /*
> - * Create a new fsnotify_group and hold a reference for the group returned.
> + * Create a new fsnotify_group with an events queue.
> + * If @q_hash_bits > 0, the queue is sharded into several notification lists.
> + * Hold a reference for the group returned.
>   */
> -struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
> +struct fsnotify_group *fsnotify_alloc_user_group(unsigned int q_hash_bits,
> +						 const struct fsnotify_ops *ops)
>  {
> -	return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
> +	return __fsnotify_alloc_group(q_hash_bits, ops, GFP_KERNEL_ACCOUNT);
>  }
>  EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);
>  
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index d8830be60a9b..1c476b9485dc 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -288,6 +288,7 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>  {
>  	struct fsnotify_group *group;
>  	struct fsnotify_event *fsn_event;
> +	struct list_head *list;
>  	void __user *p;
>  	int ret = -ENOTTY;
>  	size_t send_len = 0;
> @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>  	switch (cmd) {
>  	case FIONREAD:
>  		spin_lock(&group->notification_lock);
> -		list_for_each_entry(fsn_event, &group->notification_list,
> -				    list) {
> -			send_len += sizeof(struct inotify_event);
> -			send_len += round_event_name_len(fsn_event);
> +		list = fsnotify_first_notification_list(group);
> +		/*
> +		 * With multi queue, send_len will be a lower bound
> +		 * on total events size.
> +		 */
> +		if (list) {
> +			list_for_each_entry(fsn_event, list, list) {
> +				send_len += sizeof(struct inotify_event);
> +				send_len += round_event_name_len(fsn_event);
> +			}

As I write below IMO we should enable hashed queues also for inotify (is
there good reason not to?) and here we should just walk through all the
lists.
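
I.e. something like this (sketch; assumes an extra loop variable i and the
notification_list[] array added by this series):

	for (i = 0; i <= group->max_bucket; i++) {
		list_for_each_entry(fsn_event,
				    &group->notification_list[i], list) {
			send_len += sizeof(struct inotify_event);
			send_len += round_event_name_len(fsn_event);
		}
	}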

>  		}
>  		spin_unlock(&group->notification_lock);
>  		ret = put_user(send_len, (int __user *) p);
> @@ -631,7 +638,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
>  	struct fsnotify_group *group;
>  	struct inotify_event_info *oevent;
>  
> -	group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
> +	group = fsnotify_alloc_user_group(0, &inotify_fsnotify_ops);
>  	if (IS_ERR(group))
>  		return group;
>  
> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> index fcac2d72cbf5..58c8f6c1be1b 100644
> --- a/fs/notify/notification.c
> +++ b/fs/notify/notification.c
...
> @@ -74,10 +67,75 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
>  	group->ops->free_event(event);
>  }
>  
> +/* Check and fix inconsistencies in hashed queue */
> +static void fsnotify_queue_check(struct fsnotify_group *group)
> +{
> +#ifdef FSNOTIFY_HASHED_QUEUE
> +	struct list_head *list;
> +	int i, nbuckets = 0;
> +	bool first_empty, last_empty;
> +
> +	assert_spin_locked(&group->notification_lock);
> +
> +	pr_debug("%s: group=%p events: num=%u max=%u buckets: first=%u last=%u max=%u\n",
> +		 __func__, group, group->num_events, group->max_events,
> +		 group->first_bucket, group->last_bucket, group->max_bucket);
> +
> +	if (fsnotify_notify_queue_is_empty(group))
> +		return;
> +
> +	first_empty = list_empty(&group->notification_list[group->first_bucket]);
> +	last_empty = list_empty(&group->notification_list[group->last_bucket]);
> +
> +	list = &group->notification_list[0];
> +	for (i = 0; i <= group->max_bucket; i++, list++) {
> +		if (list_empty(list))
> +			continue;
> +		if (nbuckets++)
> +			continue;
> +		if (first_empty)
> +			group->first_bucket = i;
> +		if (last_empty)
> +			group->last_bucket = i;
> +	}
> +
> +	pr_debug("%s: %u non-empty buckets\n", __func__, nbuckets);
> +
> +	/* All buckets are empty, but non-zero num_events? */
> +	if (WARN_ON_ONCE(!nbuckets && group->num_events))
> +		group->num_events = 0;
> +#endif

Hum, what is this function about? I understand you might want to have this
around for debugging when developing the patch but is there a legitimate
reason for this in production?

I understand the current patch uses this function to search for the next
non-empty list, but after this patch is merged with the next one, is there
still a need for it?

> @@ -147,17 +228,21 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
>  struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
>  {
>  	struct fsnotify_event *event;
> +	struct list_head *list;
>  
>  	assert_spin_locked(&group->notification_lock);
>  
> -	if (fsnotify_notify_queue_is_empty(group))
> +	list = fsnotify_first_notification_list(group);
> +	if (!list)
>  		return NULL;
>  
> -	pr_debug("%s: group=%p\n", __func__, group);
> +	pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket);
>  
> -	event = list_first_entry(&group->notification_list,
> -				 struct fsnotify_event, list);
> +	event = list_first_entry(list, struct fsnotify_event, list);

Perhaps the above could just reuse fsnotify_peek_first_event()? It is now
complex enough to be worth it, I guess.
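
I.e. roughly (untested):

	struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
	{
		struct fsnotify_event *event = fsnotify_peek_first_event(group);

		if (!event)
			return NULL;

		fsnotify_remove_queued_event(group, event);
		/* plus the first_bucket update that the next patch adds here */
		return event;
	}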

>  	fsnotify_remove_queued_event(group, event);
> +	/*
> +	 * TODO: update group->first_bucket to next_bucket in first event.
> +	 */
>  	return event;
>  }
>  
> @@ -167,13 +252,15 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
>   */
>  struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
>  {
> +	struct list_head *list;
> +
>  	assert_spin_locked(&group->notification_lock);
>  
> -	if (fsnotify_notify_queue_is_empty(group))
> +	list = fsnotify_first_notification_list(group);
> +	if (!list)
>  		return NULL;
>  
> -	return list_first_entry(&group->notification_list,
> -				struct fsnotify_event, list);
> +	return list_first_entry(list, struct fsnotify_event, list);
>  }
>  
>  /*
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index e5409b83e731..b2a80bc00108 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -160,6 +160,11 @@ struct fsnotify_ops {
>  	void (*free_mark)(struct fsnotify_mark *mark);
>  };
>  
> +#ifdef CONFIG_FANOTIFY
> +#define FSNOTIFY_HASHED_QUEUE
> +#define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8
> +#endif
> +

Would not inotify benefit from this work as well? Also I'd just compile in
hashed queues unconditionally. The ifdefs are ugly and IMHO not worth it.

...

> +static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
> +{
> +	return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
> +}
> +
> +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
> +						 struct fsnotify_event *event)
> +{
> +	/* High bits are better for hash */
> +	return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
> +}

Why not use hash_32() here? IMHO better than just stripping bits...
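
I.e. (sketch):

	return hash_32(event->key, group->q_hash_bits);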

> +
> +static inline struct list_head *fsnotify_event_notification_list(
> +						struct fsnotify_group *group,
> +						struct fsnotify_event *event)
> +{
> +	return &group->notification_list[fsnotify_event_bucket(group, event)];
> +}
> +

Any reason for this to be in a header? Will it ever have another user than
fsnotify_add_event()?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion
  2021-02-02 16:20 ` [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion Amir Goldstein
@ 2021-02-16 15:10   ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2021-02-16 15:10 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Tue 02-02-21 18:20:06, Amir Goldstein wrote:
> On 64bit arch, use the available 32bit in event for next_bucket field,
> that is used to chain all events by order of insertion.
> 
> The group has a cursor for the bucket containing the first event and
> every event stores the bucket of the next event to read.
> 
> On 32bit arch, hashed notification queue is disabled.

I would not bother with saving 4 bytes per event on 32-bit archs... IMHO it
is just not worth the additional code complexity and ifdefs...

Otherwise the patch looks fine.

								Honza

> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/notification.c         | 49 ++++++++++++++++++++++++--------
>  include/linux/fsnotify_backend.h | 42 +++++++++++++++++++++++++++
>  2 files changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> index 58c8f6c1be1b..d98f4c8bfb0e 100644
> --- a/fs/notify/notification.c
> +++ b/fs/notify/notification.c
> @@ -84,8 +84,8 @@ static void fsnotify_queue_check(struct fsnotify_group *group)
>  	if (fsnotify_notify_queue_is_empty(group))
>  		return;
>  
> -	first_empty = list_empty(&group->notification_list[group->first_bucket]);
> -	last_empty = list_empty(&group->notification_list[group->last_bucket]);
> +	first_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->first_bucket]));
> +	last_empty = WARN_ON_ONCE(list_empty(&group->notification_list[group->last_bucket]));
>  
>  	list = &group->notification_list[0];
>  	for (i = 0; i <= group->max_bucket; i++, list++) {
> @@ -121,12 +121,23 @@ static void fsnotify_queue_event(struct fsnotify_group *group,
>  
>  	pr_debug("%s: group=%p event=%p bucket=%u\n", __func__, group, event, b);
>  
> -	/*
> -	 * TODO: set next_bucket of last event.
> -	 */
> -	group->last_bucket = b;
> -	if (!group->num_events)
> -		group->first_bucket = b;
> +	if (fsnotify_notify_queue_is_hashed(group)) {
> +		/*
> +		 * On first insert, set this event's list as the list to read first event.
> +		 * Otherwise, point from last event to this event's list.
> +		 */
> +		struct list_head *last_l = &group->notification_list[group->last_bucket];
> +
> +		if (!group->num_events) {
> +			group->first_bucket = b;
> +		} else if (!WARN_ON_ONCE(list_empty(last_l))) {
> +			struct fsnotify_event *last_e;
> +
> +			last_e = list_last_entry(last_l, struct fsnotify_event, list);
> +			fsnotify_event_set_next_bucket(last_e, b);
> +		}
> +		group->last_bucket = b;
> +	}
>  	group->num_events++;
>  	list_add_tail(&event->list, list);
>  }
> @@ -186,8 +197,8 @@ int fsnotify_add_event(struct fsnotify_group *group,
>  	return ret;
>  }
>  
> -void fsnotify_remove_queued_event(struct fsnotify_group *group,
> -				  struct fsnotify_event *event)
> +static void __fsnotify_remove_queued_event(struct fsnotify_group *group,
> +					   struct fsnotify_event *event)
>  {
>  	assert_spin_locked(&group->notification_lock);
>  	/*
> @@ -198,6 +209,17 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
>  	group->num_events--;
>  }
>  
> +void fsnotify_remove_queued_event(struct fsnotify_group *group,
> +				  struct fsnotify_event *event)
> +{
> +	/*
> +	 * if called for removal of event in the middle of a hashed queue,
> +	 * events may be read not in insertion order.
> +	 */
> +	WARN_ON_ONCE(fsnotify_notify_queue_is_hashed(group));
> +	__fsnotify_remove_queued_event(group, event);
> +}
> +
>  /* Return the notification list of the first event */
>  struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group)
>  {
> @@ -213,6 +235,7 @@ struct list_head *fsnotify_first_notification_list(struct fsnotify_group *group)
>  		return list;
>  
>  	/*
> +	 * Oops... first bucket is not supposed to be empty.
>  	 * Look for any non-empty bucket.
>  	 */
>  	fsnotify_queue_check(group);
> @@ -239,10 +262,12 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
>  	pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket);
>  
>  	event = list_first_entry(list, struct fsnotify_event, list);
> -	fsnotify_remove_queued_event(group, event);
> +	__fsnotify_remove_queued_event(group, event);
>  	/*
> -	 * TODO: update group->first_bucket to next_bucket in first event.
> +	 * Removed event points to the next list to read from.
>  	 */
> +	group->first_bucket = fsnotify_event_next_bucket(event);
> +
>  	return event;
>  }
>  
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index b2a80bc00108..3fc3c9e4d21c 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -161,9 +161,12 @@ struct fsnotify_ops {
>  };
>  
>  #ifdef CONFIG_FANOTIFY
> +#if BITS_PER_LONG == 64
> +/* Use available 32bit of event for hashed queue support */
>  #define FSNOTIFY_HASHED_QUEUE
>  #define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8
>  #endif
> +#endif
>  
>  /*
>   * all of the information about the original object we want to now send to
> @@ -173,6 +176,9 @@ struct fsnotify_ops {
>  struct fsnotify_event {
>  	struct list_head list;
>  	unsigned int key;		/* Key for hashed queue add/merge */
> +#ifdef FSNOTIFY_HASHED_QUEUE
> +	unsigned int next_bucket;	/* Bucket to read next event from */
> +#endif
>  };
>  
>  /*
> @@ -277,6 +283,41 @@ static inline struct list_head *fsnotify_event_notification_list(
>  	return &group->notification_list[fsnotify_event_bucket(group, event)];
>  }
>  
> +#ifdef FSNOTIFY_HASHED_QUEUE
> +static inline bool fsnotify_notify_queue_is_hashed(struct fsnotify_group *group)
> +{
> +	return group->max_bucket > 0;
> +}
> +
> +static inline unsigned int fsnotify_event_next_bucket(struct fsnotify_event *event)
> +{
> +	return event->next_bucket;
> +}
> +
> +static inline void fsnotify_event_set_next_bucket(struct fsnotify_event *event,
> +						  unsigned int b)
> +{
> +	event->next_bucket = b;
> +}
> +
> +#else
> +static inline bool fsnotify_notify_queue_is_hashed(struct fsnotify_group *group)
> +{
> +	return false;
> +}
> +
> +static inline unsigned int fsnotify_event_next_bucket(struct fsnotify_event *event)
> +{
> +	return 0;
> +}
> +
> +static inline void fsnotify_event_set_next_bucket(struct fsnotify_event *event,
> +						  unsigned int b)
> +{
> +}
> +
> +#endif
> +
>  /* When calling fsnotify tell it if the data is a path or inode */
>  enum fsnotify_data_type {
>  	FSNOTIFY_EVENT_NONE,
> @@ -620,6 +661,7 @@ static inline void fsnotify_init_event(struct fsnotify_event *event,
>  {
>  	INIT_LIST_HEAD(&event->list);
>  	event->key = key;
> +	fsnotify_event_set_next_bucket(event, 0);
>  }
>  
>  #else
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-02 16:20 ` [PATCH 6/7] fanotify: mix event info into merge key hash Amir Goldstein
@ 2021-02-16 15:39   ` Jan Kara
  2021-02-17 10:13     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-16 15:39 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Tue 02-02-21 18:20:09, Amir Goldstein wrote:
> Improve the balance of hashed queue lists by mixing more event info
> relevant for merge.
> 
> For example, all FAN_CREATE name events in the same dir used to have the
> same merge key based on the dir inode.  With this change the created
> file name is mixed into the merge key.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.c | 33 +++++++++++++++++++++++++++------
>  fs/notify/fanotify/fanotify.h | 20 +++++++++++++++++---
>  2 files changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6d3807012851..b19fef1c6f64 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
...
> @@ -476,8 +485,11 @@ static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
>  
>  	ffe->fae.type = FANOTIFY_EVENT_TYPE_FID;
>  	ffe->fsid = *fsid;
> -	fanotify_encode_fh(&ffe->object_fh, id, fanotify_encode_fh_len(id),
> -			   gfp);
> +	fh = &ffe->object_fh;
> +	fanotify_encode_fh(fh, id, fanotify_encode_fh_len(id), gfp);
> +
> +	/* Mix fsid+fid info into event merge key */
> +	ffe->fae.info_hash = full_name_hash(ffe->fskey, fanotify_fh_buf(fh), fh->len);

Is it really sensible to hash FID with full_name_hash()? It would make more
sense to treat it as binary data, not strings...

>  	return &ffe->fae;
>  }
> @@ -517,6 +529,9 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
>  	if (file_name)
>  		fanotify_info_copy_name(info, file_name);
>  
> +	/* Mix fsid+dfid+name+fid info into event merge key */
> +	fne->fae.info_hash = full_name_hash(fne->fskey, info->buf, fanotify_info_len(info));
> +

Similarly here...

>  	pr_debug("%s: ino=%lu size=%u dir_fh_len=%u child_fh_len=%u name_len=%u name='%.*s'\n",
>  		 __func__, id->i_ino, size, dir_fh_len, child_fh_len,
>  		 info->name_len, info->name_len, fanotify_info_name(info));
> @@ -539,6 +554,8 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>  	struct mem_cgroup *old_memcg;
>  	struct inode *child = NULL;
>  	bool name_event = false;
> +	unsigned int hash = 0;
> +	struct pid *pid;
>  
>  	if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
>  		/*
> @@ -606,13 +623,17 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
>  	 * Use the victim inode instead of the watching inode as the id for
>  	 * event queue, so event reported on parent is merged with event
>  	 * reported on child when both directory and child watches exist.
> -	 * Reduce object id to 32bit hash for hashed queue merge.
> +	 * Reduce object id and event info to 32bit hash for hashed queue merge.
>  	 */
> -	fanotify_init_event(event, hash_ptr(id, 32), mask);
> +	hash = event->info_hash ^ hash_ptr(id, 32);
>  	if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> -		event->pid = get_pid(task_pid(current));
> +		pid = get_pid(task_pid(current));
>  	else
> -		event->pid = get_pid(task_tgid(current));
> +		pid = get_pid(task_tgid(current));
> +	/* Mix pid info into event merge key */
> +	hash ^= hash_ptr(pid, 32);

hash_32() here?

> +	fanotify_init_event(event, hash, mask);
> +	event->pid = pid;
>  
>  out:
>  	set_active_memcg(old_memcg);
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 2e856372ffc8..522fb1a68b30 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -115,6 +115,11 @@ static inline void fanotify_info_init(struct fanotify_info *info)
>  	info->name_len = 0;
>  }
>  
> +static inline unsigned int fanotify_info_len(struct fanotify_info *info)
> +{
> +	return info->dir_fh_totlen + info->file_fh_totlen + info->name_len;
> +}
> +
>  static inline void fanotify_info_copy_name(struct fanotify_info *info,
>  					   const struct qstr *name)
>  {
> @@ -138,7 +143,10 @@ enum fanotify_event_type {
>  };
>  
>  struct fanotify_event {
> -	struct fsnotify_event fse;
> +	union {
> +		struct fsnotify_event fse;
> +		unsigned int info_hash;
> +	};
>  	u32 mask;
>  	enum fanotify_event_type type;
>  	struct pid *pid;

How is this ever safe? info_hash will likely overlay with 'list' in
fsnotify_event.

> @@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
>  
>  struct fanotify_fid_event {
>  	struct fanotify_event fae;
> -	__kernel_fsid_t fsid;
> +	union {
> +		__kernel_fsid_t fsid;
> +		void *fskey;	/* 64 or 32 bits of fsid used for salt */
> +	};
>  	struct fanotify_fh object_fh;
>  	/* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
>  	unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
> @@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
>  
>  struct fanotify_name_event {
>  	struct fanotify_event fae;
> -	__kernel_fsid_t fsid;
> +	union {
> +		__kernel_fsid_t fsid;
> +		void *fskey;	/* 64 or 32 bits of fsid used for salt */
> +	};
>  	struct fanotify_info info;
>  };

What games are you playing here with the unions? I presume you can remove
these 'fskey' unions and just use (void *)(event->fsid) at appropriate
places? IMO much more comprehensible...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
                   ` (6 preceding siblings ...)
  2021-02-02 16:20 ` [PATCH 7/7] fsnotify: print some debug stats on hashed queue overflow Amir Goldstein
@ 2021-02-16 16:02 ` Jan Kara
  2021-02-17 10:52   ` Amir Goldstein
  7 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-16 16:02 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

Hi Amir!

Looking at the patches I've got one idea:

Currently you have fsnotify_event like:

struct fsnotify_event {
        struct list_head list;
        unsigned int key;
        unsigned int next_bucket;
};

And 'list' is used for the hashed queue list, next_bucket is used to simulate
a single queue out of all the individual lists. The option I'm considering
is:

struct fsnotify_event {
        struct list_head list;
	struct fsnotify_event *hash_next;
        unsigned int key;
};

So 'list' would still be used for the single queue of events like it was
before your patches. 'hash_next' would be used for list of events in the
hash chain. The advantage of this scheme would be somewhat more obvious
handling, also we can handle removal of permission events (they won't be
hashed so there's no risk of breaking hash-chain in the middle, removal
from global queue is easy as currently). The disadvantage is an increase of
event size by one pointer on 64-bit but I think we can live with that. What
do you think?

								Honza

On Tue 02-02-21 18:20:03, Amir Goldstein wrote:
> Jan,
> 
> fanotify_merge() has been observed [1] to consume a lot of CPU.
> This is not surprising considering that it is implemented as a linear
> search on a practically unbounded size list.
> 
> The following series improves the linear search for an event to merge
> in three different ways:
> 1. Hash events into as much as to 128 lists
> 2. Limit linear search to 128 last list elements
> 3. Use a better key - instead of victim inode ptr, use a hash of all
>    the compared fields
> 
> The end result can be observed in the test run times below.
> The test is an extension of your queue overflow LTP test [2].
> The timing results use are from the 2nd run of -i 2, where files
> are already existing in the test fs.
> 
> With an unlimited queue, queueing of 16385 events on unique objects
> is ~3 times faster than before the change.
> 
> In fact, the run time of queueing 16385 events (~600ms) is almost the
> same as the run time of rejecting 16385 events (~550ms) due to full
> queue, which suggest a very low overhead for merging events.
> 
> The test runs two passes to test event merge, the "create" pass and
> the "open" pass.
> 
> Before the change (v5.11-rc2) 100% of the events of the "open" pass
> are merged (16385 files and 16385 events).
> 
> After the change, only %50 of the events of the "open" pass are
> merged (16385 files and 25462 events).
> 
> This is because 16384 is the maximum number of events that we can
> merge when hash table is fully balanced.
> When reducing the number of unique objects to 8192, all events
> on the "open" pass are merged.
> 
> Thanks,
> Amir.
> 
> v5.11-rc2, run #2 of ./fanotify05 -i 2:
> 
> fanotify05.c:109: TINFO: Test #0: Limited queue
> fanotify05.c:98: TINFO: Created 16385 files in 1653ms
> fanotify05.c:98: TINFO: Opened 16385 files in 543ms
> fanotify05.c:77: TINFO: Got event #0 filename=fname_0
> fanotify05.c:176: TPASS: Got an overflow event: pid=0 fd=-1
> fanotify05.c:182: TINFO: Got 16385 events
> 
> fanotify05.c:109: TINFO: Test #1: Unlimited queue
> fanotify05.c:98: TINFO: Created 16385 files in 1683ms
> fanotify05.c:98: TINFO: Opened 16385 files in 1647ms
> fanotify05.c:77: TINFO: Got event #0 filename=fname_0
> fanotify05.c:138: TPASS: Overflow event not generated!
> fanotify05.c:182: TINFO: Got 16385 events
> 
> fanotify_merge branch, run #2 of ./fanotify05 -i 2:
> 
> fanotify05.c:109: TINFO: Test #0: Limited queue
> fanotify05.c:98: TINFO: Created 16385 files in 616ms
> fanotify05.c:98: TINFO: Opened 16385 files in 549ms
> fanotify05.c:77: TINFO: Got event #0 filename=fname_0
> fanotify05.c:176: TPASS: Got an overflow event: pid=0 fd=-1
> fanotify05.c:182: TINFO: Got 16385 events
> 
> fanotify05.c:109: TINFO: Test #1: Unlimited queue
> fanotify05.c:98: TINFO: Created 16385 files in 614ms
> fanotify05.c:98: TINFO: Opened 16385 files in 599ms
> fanotify05.c:77: TINFO: Got event #0 filename=fname_0
> fanotify05.c:138: TPASS: Overflow event not generated!
> fanotify05.c:182: TINFO: Got 25462 events
> 
> [1] https://lore.kernel.org/linux-fsdevel/20200714025417.A25EB95C0339@us180.sjc.aristanetworks.com/
> [2] https://github.com/amir73il/ltp/commits/fanotify_merge
> 
> Amir Goldstein (7):
>   fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue
>   fsnotify: support hashed notification queue
>   fsnotify: read events from hashed notification queue by order of
>     insertion
>   fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups
>   fanotify: limit number of event merge attempts
>   fanotify: mix event info into merge key hash
>   fsnotify: print some debug stats on hashed queue overflow
> 
>  fs/notify/fanotify/fanotify.c      |  40 ++++++-
>  fs/notify/fanotify/fanotify.h      |  24 +++-
>  fs/notify/fanotify/fanotify_user.c |  55 ++++++---
>  fs/notify/group.c                  |  37 ++++--
>  fs/notify/inotify/inotify_user.c   |  22 ++--
>  fs/notify/notification.c           | 175 +++++++++++++++++++++++++----
>  include/linux/fsnotify_backend.h   | 105 +++++++++++++++--
>  7 files changed, 383 insertions(+), 75 deletions(-)
> 
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-16 15:39   ` Jan Kara
@ 2021-02-17 10:13     ` Amir Goldstein
  2021-02-18 10:46       ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-17 10:13 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Feb 16, 2021 at 5:39 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 02-02-21 18:20:09, Amir Goldstein wrote:
> > Improve the balance of hashed queue lists by mixing more event info
> > relevant for merge.
> >
> > For example, all FAN_CREATE name events in the same dir used to have the
> > same merge key based on the dir inode.  With this change the created
> > file name is mixed into the merge key.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/notify/fanotify/fanotify.c | 33 +++++++++++++++++++++++++++------
> >  fs/notify/fanotify/fanotify.h | 20 +++++++++++++++++---
> >  2 files changed, 44 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 6d3807012851..b19fef1c6f64 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> ...
> > @@ -476,8 +485,11 @@ static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
> >
> >       ffe->fae.type = FANOTIFY_EVENT_TYPE_FID;
> >       ffe->fsid = *fsid;
> > -     fanotify_encode_fh(&ffe->object_fh, id, fanotify_encode_fh_len(id),
> > -                        gfp);
> > +     fh = &ffe->object_fh;
> > +     fanotify_encode_fh(fh, id, fanotify_encode_fh_len(id), gfp);
> > +
> > +     /* Mix fsid+fid info into event merge key */
> > +     ffe->fae.info_hash = full_name_hash(ffe->fskey, fanotify_fh_buf(fh), fh->len);
>
> Is it really sensible to hash FID with full_name_hash()? It would make more
> sense to treat it as binary data, not strings...

See the actual implementation of full_name_hash() on 64-bit:
it treats the string as an array of ulongs, which is a perfect fit for a FID.

>
> >       return &ffe->fae;
> >  }
> > @@ -517,6 +529,9 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
> >       if (file_name)
> >               fanotify_info_copy_name(info, file_name);
> >
> > +     /* Mix fsid+dfid+name+fid info into event merge key */
> > +     fne->fae.info_hash = full_name_hash(fne->fskey, info->buf, fanotify_info_len(info));
> > +
>
> Similarly here...
>
> >       pr_debug("%s: ino=%lu size=%u dir_fh_len=%u child_fh_len=%u name_len=%u name='%.*s'\n",
> >                __func__, id->i_ino, size, dir_fh_len, child_fh_len,
> >                info->name_len, info->name_len, fanotify_info_name(info));
> > @@ -539,6 +554,8 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> >       struct mem_cgroup *old_memcg;
> >       struct inode *child = NULL;
> >       bool name_event = false;
> > +     unsigned int hash = 0;
> > +     struct pid *pid;
> >
> >       if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
> >               /*
> > @@ -606,13 +623,17 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> >        * Use the victim inode instead of the watching inode as the id for
> >        * event queue, so event reported on parent is merged with event
> >        * reported on child when both directory and child watches exist.
> > -      * Reduce object id to 32bit hash for hashed queue merge.
> > +      * Reduce object id and event info to 32bit hash for hashed queue merge.
> >        */
> > -     fanotify_init_event(event, hash_ptr(id, 32), mask);
> > +     hash = event->info_hash ^ hash_ptr(id, 32);
> >       if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> > -             event->pid = get_pid(task_pid(current));
> > +             pid = get_pid(task_pid(current));
> >       else
> > -             event->pid = get_pid(task_tgid(current));
> > +             pid = get_pid(task_tgid(current));
> > +     /* Mix pid info into event merge key */
> > +     hash ^= hash_ptr(pid, 32);
>
> hash_32() here?

I don't think so.
hash_32() loses the high bits of the pointer before mixing them.
hash_ptr(pid, 32) loses the *low* bits, which contain less entropy,
after mixing all 64 bits of the pointer.
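
For reference, simplified from include/linux/hash.h on 64-bit:

	hash_32(val, bits):  ((u32)val * GOLDEN_RATIO_32) >> (32 - bits)
	hash_64(val, bits):  (val * GOLDEN_RATIO_64) >> (64 - bits)
	hash_ptr(ptr, bits): hash_long((unsigned long)ptr, bits)

so hash_32() on a pointer would first truncate it to 32 bits, while
hash_ptr() multiplies all 64 bits and keeps the top bits of the product.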

>
> > +     fanotify_init_event(event, hash, mask);
> > +     event->pid = pid;
> >
> >  out:
> >       set_active_memcg(old_memcg);
> > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> > index 2e856372ffc8..522fb1a68b30 100644
> > --- a/fs/notify/fanotify/fanotify.h
> > +++ b/fs/notify/fanotify/fanotify.h
> > @@ -115,6 +115,11 @@ static inline void fanotify_info_init(struct fanotify_info *info)
> >       info->name_len = 0;
> >  }
> >
> > +static inline unsigned int fanotify_info_len(struct fanotify_info *info)
> > +{
> > +     return info->dir_fh_totlen + info->file_fh_totlen + info->name_len;
> > +}
> > +
> >  static inline void fanotify_info_copy_name(struct fanotify_info *info,
> >                                          const struct qstr *name)
> >  {
> > @@ -138,7 +143,10 @@ enum fanotify_event_type {
> >  };
> >
> >  struct fanotify_event {
> > -     struct fsnotify_event fse;
> > +     union {
> > +             struct fsnotify_event fse;
> > +             unsigned int info_hash;
> > +     };
> >       u32 mask;
> >       enum fanotify_event_type type;
> >       struct pid *pid;
>
> How is this ever safe? info_hash will likely overlay with 'list' in
> fsnotify_event.

Oh yeah. That's an ugly hack. Sorry for that.
I wanted to avoid adding an unsigned int *info_hash argument to all
the fanotify_alloc_*_event() helpers, so I used this uninitialized
space in the event instead.
I'll do it the proper way.
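
I.e. something like this (sketch; hypothetical signature):

	static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
						__kernel_fsid_t *fsid,
						unsigned int *hash,
						gfp_t gfp);

with each helper mixing its own info into *hash.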

>
> > @@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
> >
> >  struct fanotify_fid_event {
> >       struct fanotify_event fae;
> > -     __kernel_fsid_t fsid;
> > +     union {
> > +             __kernel_fsid_t fsid;
> > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > +     };
> >       struct fanotify_fh object_fh;
> >       /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
> >       unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
> > @@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
> >
> >  struct fanotify_name_event {
> >       struct fanotify_event fae;
> > -     __kernel_fsid_t fsid;
> > +     union {
> > +             __kernel_fsid_t fsid;
> > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > +     };
> >       struct fanotify_info info;
> >  };
>
> What games are you playing here with the unions? I presume you can remove
> these 'fskey' unions and just use (void *)(event->fsid) at appropriate
> places? IMO much more comprehensible...

Ok.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-16 16:02 ` [PATCH 0/7] Performance improvement for fanotify merge Jan Kara
@ 2021-02-17 10:52   ` Amir Goldstein
  2021-02-17 11:25     ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-17 10:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
>
> Hi Amir!
>
> Looking at the patches I've got one idea:
>
> Currently you have fsnotify_event like:
>
> struct fsnotify_event {
>         struct list_head list;
>         unsigned int key;
>         unsigned int next_bucket;
> };
>
> And 'list' is used for the hashed queue list, next_bucket is used to simulate
> a single queue out of all the individual lists. The option I'm considering
> is:
>
> struct fsnotify_event {
>         struct list_head list;
>         struct fsnotify_event *hash_next;
>         unsigned int key;
> };
>
> So 'list' would still be used for the single queue of events like it was
> before your patches. 'hash_next' would be used for list of events in the
> hash chain. The advantage of this scheme would be somewhat more obvious
> handling,

I can agree to that.

> also we can handle removal of permission events (they won't be
> hashed so there's no risk of breaking hash-chain in the middle, removal
> from global queue is easy as currently).

Ok, but I do not really see value in hashing non-permission events
for high priority groups, so this is not a strong argument.

> The disadvantage is an increase of
> event size by one pointer on 64-bit but I think we can live with that. What
> do you think?

Given the round sizes of the fixed-size events in v5.10, that would be a shame:

ls -l /sys/kernel/slab/*notify*event
lrwxrwxrwx 1 root root 0 Feb 17 12:23
/sys/kernel/slab/fanotify_fid_event -> :0000064
lrwxrwxrwx 1 root root 0 Feb 17 12:23
/sys/kernel/slab/fanotify_path_event -> :0000056
lrwxrwxrwx 1 root root 0 Feb 17 12:23
/sys/kernel/slab/fanotify_perm_event -> :0000064

Counter proposal:

struct fsnotify_event {
        struct list_head list;
        struct fsnotify_event *hash_next;
        unsigned int key;
        u32 mask;
};

It is quite strange that mask is a member of struct fanotify_event and
struct inotify_event_info to begin with.

Moving the mask member to struct fsnotify_event like that is not going
to change the resulting inotify/fanotify event size.

We can actually squeeze fanotify_event_type into the 2 low bits of the
pid pointer, and reduce the size of all fanotify events by one pointer,
because FANOTIFY_EVENT_TYPE_OVERFLOW is only nice to have.
The overflow event can use FANOTIFY_EVENT_TYPE_PATH with
NULL path values (as early versions of the patch did).
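
For illustration only (hypothetical field; relies on struct pid pointers
being at least 4-byte aligned, so the 2 low bits are free):

	/* pack the pid pointer and the event type into one unsigned long */
	event->pid_and_type = (unsigned long)pid | type;

	pid  = (struct pid *)(event->pid_and_type & ~3UL);
	type = event->pid_and_type & 3UL;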

This is not worth doing with the current round event sizes, IMO.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-17 10:52   ` Amir Goldstein
@ 2021-02-17 11:25     ` Jan Kara
  2021-02-18 10:56       ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-17 11:25 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> >
> > Hi Amir!
> >
> > Looking at the patches I've got one idea:
> >
> > Currently you have fsnotify_event like:
> >
> > struct fsnotify_event {
> >         struct list_head list;
> >         unsigned int key;
> >         unsigned int next_bucket;
> > };
> >
> > And 'list' is used for the hashed queue list, next_bucket is used to simulate
> > a single queue out of all the individual lists. The option I'm considering
> > is:
> >
> > struct fsnotify_event {
> >         struct list_head list;
> >         struct fsnotify_event *hash_next;
> >         unsigned int key;
> > };
> >
> > So 'list' would still be used for the single queue of events like it was
> > before your patches. 'hash_next' would be used for list of events in the
> > hash chain. The advantage of this scheme would be somewhat more obvious
> > handling,
> 
> I can agree to that.
> 
> > also we can handle removal of permission events (they won't be
> > hashed so there's no risk of breaking hash-chain in the middle, removal
> > from global queue is easy as currently).
> 
> Ok, but I do not really see value in hashing non-permission events
> for high priority groups, so this is not a strong argument.

The reason why I thought it is somewhat beneficial is that someone might be
using a higher priority fanotify group just for watching non-permission
events, because so far the group priority makes little difference. And
conceptually it isn't obvious (from a userspace POV) why higher priority
groups should be merging events less efficiently...

> > The disadvantage is an increase of
> > event size by one pointer on 64-bit but I think we can live with that. What
> > do you think?
> 
> Given the round sizes of the fixed-size events in v5.10, that would be a shame:
> 
> ls -l /sys/kernel/slab/*notify*event
> lrwxrwxrwx 1 root root 0 Feb 17 12:23
> /sys/kernel/slab/fanotify_fid_event -> :0000064
> lrwxrwxrwx 1 root root 0 Feb 17 12:23
> /sys/kernel/slab/fanotify_path_event -> :0000056
> lrwxrwxrwx 1 root root 0 Feb 17 12:23
> /sys/kernel/slab/fanotify_perm_event -> :0000064
> 
> Counter proposal:
> 
> struct fsnotify_event {
>         struct list_head list;
>         struct fsnotify_event *hash_next;
>         unsigned int key;
>         u32 mask;
> };

Even better!

> It is quite strange that mask is a member of struct fanotify_event and
> struct inotify_event_info to begin with.

Because they were moved there in the past to improve struct packing ;)

> Moving the mask member to struct fsnotify_event like that is not going
> to change the resulting inotify/fanotify event size.
> 
> We can actually squeeze fanotify_event_type into the 2 low bits of the
> pid pointer, and reduce the size of all fanotify events by one pointer,
> because FANOTIFY_EVENT_TYPE_OVERFLOW is only nice to have.
> The overflow event can use FANOTIFY_EVENT_TYPE_PATH with
> NULL path values (as early versions of the patch did).
> 
> This is not worth doing with the current round event sizes, IMO.

I agree. Not worth it at this point.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-16 15:02   ` Jan Kara
@ 2021-02-17 12:33     ` Amir Goldstein
  2021-02-17 13:48       ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-17 12:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Feb 16, 2021 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 02-02-21 18:20:05, Amir Goldstein wrote:
> > In order to improve event merge performance, support sharding the
> > notification queue to several notification lists, hashed by an event
> > merge key.
> >
> > The fanotify event merge key is the object id reduced to 32bit hash.
> >
> > At the moment, both inotify and fanotify still use a single notification
> > list.
> >
> > At the moment, reading events from the hashed queue is not by event
> > insertion order.  A non-empty list is read until it is empty.
> >
> > The max events limit is accounted for total events in all lists.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> Style nit: Please try to stay within 80 columns...
>
> Also I think this change would be more comprehensible if it was merged with
> the following patch 3/7. Having code with TODOs is kind of awkward and
> makes it more difficult to verify the result is actually correct.
>

Ok.

> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 8ff27d92d32c..dee12d927f8d 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -635,6 +635,7 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
> >  {
> >       struct fsnotify_group *group;
> >       struct fsnotify_event *fsn_event;
> > +     struct list_head *list;
> >       void __user *p;
> >       int ret = -ENOTTY;
> >       size_t send_len = 0;
> > @@ -646,8 +647,15 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
> >       switch (cmd) {
> >       case FIONREAD:
> >               spin_lock(&group->notification_lock);
> > -             list_for_each_entry(fsn_event, &group->notification_list, list)
> > -                     send_len += FAN_EVENT_METADATA_LEN;
> > +             list = fsnotify_first_notification_list(group);
> > +             /*
> > +              * With multi queue, send_len will be a lower bound
> > +              * on total events size.
> > +              */
> > +             if (list) {
> > +                     list_for_each_entry(fsn_event, list, list)
> > +                             send_len += FAN_EVENT_METADATA_LEN;
> > +             }
>
> IMO we should just walk all the lists here. Otherwise the reported length
> would be odd. That being said, the returned value already is odd because we
> don't properly account for different event sizes (an unrelated problem). So
> if we want to keep it simple for now, we can just return group->num_events *
> FAN_EVENT_METADATA_LEN, can't we?

Sure. Makes sense.

>
> > diff --git a/fs/notify/group.c b/fs/notify/group.c
> > index ffd723ffe46d..b41ed68f9ff9 100644
> > --- a/fs/notify/group.c
> > +++ b/fs/notify/group.c
> > @@ -111,12 +116,20 @@ void fsnotify_put_group(struct fsnotify_group *group)
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_put_group);
> >
> > -static struct fsnotify_group *__fsnotify_alloc_group(
> > +static struct fsnotify_group *__fsnotify_alloc_group(unsigned int q_hash_bits,
> >                               const struct fsnotify_ops *ops, gfp_t gfp)
> >  {
> >       struct fsnotify_group *group;
> > +     int i;
> > +
> > +#ifdef FSNOTIFY_HASHED_QUEUE
> > +     if (WARN_ON_ONCE(q_hash_bits > FSNOTIFY_HASHED_QUEUE_MAX_BITS))
> > +             q_hash_bits = FSNOTIFY_HASHED_QUEUE_MAX_BITS;
> > +#else
> > +     q_hash_bits = 0;
> > +#endif
>
> Why the check for q_hash_bits? We have in-kernel users only, so I would not
> be that paranoid :) Maybe I'd just let the group specify whether it wants
> a hashed queue or not and for hashed queues always set the number of
> buckets to 128. So far we don't need anything more flexible and if we ever
> do, it's easy enough to change the in-kernel API...

Ok.

>
> Also, honestly, I find these ifdefs ugly. I'd just leave hashed queues
> unconditionally compiled in.
>

Ok.

> > -     group = kzalloc(sizeof(struct fsnotify_group), gfp);
> > +     group = kzalloc(fsnotify_group_size(q_hash_bits), gfp);
> >       if (!group)
> >               return ERR_PTR(-ENOMEM);
> >
> > @@ -126,8 +139,12 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> >       atomic_set(&group->user_waits, 0);
> >
> >       spin_lock_init(&group->notification_lock);
> > -     INIT_LIST_HEAD(&group->notification_list);
> >       init_waitqueue_head(&group->notification_waitq);
> > +     /* Initialize one or more notification lists */
> > +     group->q_hash_bits = q_hash_bits;
> > +     group->max_bucket = (1 << q_hash_bits) - 1;
> > +     for (i = 0; i <= group->max_bucket; i++)
> > +             INIT_LIST_HEAD(&group->notification_list[i]);
> >       group->max_events = UINT_MAX;
> >
> >       mutex_init(&group->mark_mutex);
> > @@ -139,20 +156,24 @@ static struct fsnotify_group *__fsnotify_alloc_group(
> >  }
> >
> >  /*
> > - * Create a new fsnotify_group and hold a reference for the group returned.
> > + * Create a new fsnotify_group with no events queue.
>
> How come "no events queue"? There will always be at least one queue...
>

Bad phrasing. The backends that use fsnotify_alloc_group() do not
use the event queue.
I considered not allocating any event queue for them but decided it is
not worth the risk of introducing bugs.

> > + * Hold a reference for the group returned.
> >   */
> >  struct fsnotify_group *fsnotify_alloc_group(const struct fsnotify_ops *ops)
> >  {
> > -     return __fsnotify_alloc_group(ops, GFP_KERNEL);
> > +     return __fsnotify_alloc_group(0, ops, GFP_KERNEL);
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_alloc_group);
> >
> >  /*
> > - * Create a new fsnotify_group and hold a reference for the group returned.
> > + * Create a new fsnotify_group with an events queue.
> > + * If @q_hash_bits > 0, the queue is sharded into several notification lists.
> > + * Hold a reference for the group returned.
> >   */
> > -struct fsnotify_group *fsnotify_alloc_user_group(const struct fsnotify_ops *ops)
> > +struct fsnotify_group *fsnotify_alloc_user_group(unsigned int q_hash_bits,
> > +                                              const struct fsnotify_ops *ops)
> >  {
> > -     return __fsnotify_alloc_group(ops, GFP_KERNEL_ACCOUNT);
> > +     return __fsnotify_alloc_group(q_hash_bits, ops, GFP_KERNEL_ACCOUNT);
> >  }
> >  EXPORT_SYMBOL_GPL(fsnotify_alloc_user_group);
> >
> > diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> > index d8830be60a9b..1c476b9485dc 100644
> > --- a/fs/notify/inotify/inotify_user.c
> > +++ b/fs/notify/inotify/inotify_user.c
> > @@ -288,6 +288,7 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> >  {
> >       struct fsnotify_group *group;
> >       struct fsnotify_event *fsn_event;
> > +     struct list_head *list;
> >       void __user *p;
> >       int ret = -ENOTTY;
> >       size_t send_len = 0;
> > @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> >       switch (cmd) {
> >       case FIONREAD:
> >               spin_lock(&group->notification_lock);
> > -             list_for_each_entry(fsn_event, &group->notification_list,
> > -                                 list) {
> > -                     send_len += sizeof(struct inotify_event);
> > -                     send_len += round_event_name_len(fsn_event);
> > +             list = fsnotify_first_notification_list(group);
> > +             /*
> > +              * With multi queue, send_len will be a lower bound
> > +              * on total events size.
> > +              */
> > +             if (list) {
> > +                     list_for_each_entry(fsn_event, list, list) {
> > +                             send_len += sizeof(struct inotify_event);
> > +                             send_len += round_event_name_len(fsn_event);
> > +                     }
>
> As I write below IMO we should enable hashed queues also for inotify (is
> there good reason not to?)

I see your perception of inotify_merge() is the same as mine was
when I wrote a patch to support hashed queues for inotify.
Only after writing it did I realize that inotify_merge() only ever
merges with the last event, so I dropped that patch.
I see no reason to change this long-standing behavior.
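
For reference, inotify_merge() in current mainline is just:

	static int inotify_merge(struct list_head *list,
				 struct fsnotify_event *event)
	{
		struct fsnotify_event *last_event;

		last_event = list_last_entry(list, struct fsnotify_event, list);
		return event_compare(last_event, event);
	}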

> and here we should just walk through all the lists.

Ok.

>
> >               }
> >               spin_unlock(&group->notification_lock);
> >               ret = put_user(send_len, (int __user *) p);
> > @@ -631,7 +638,7 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
> >       struct fsnotify_group *group;
> >       struct inotify_event_info *oevent;
> >
> > -     group = fsnotify_alloc_user_group(&inotify_fsnotify_ops);
> > +     group = fsnotify_alloc_user_group(0, &inotify_fsnotify_ops);
> >       if (IS_ERR(group))
> >               return group;
> >
> > diff --git a/fs/notify/notification.c b/fs/notify/notification.c
> > index fcac2d72cbf5..58c8f6c1be1b 100644
> > --- a/fs/notify/notification.c
> > +++ b/fs/notify/notification.c
> ...
> > @@ -74,10 +67,75 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
> >       group->ops->free_event(event);
> >  }
> >
> > +/* Check and fix inconsistencies in hashed queue */
> > +static void fsnotify_queue_check(struct fsnotify_group *group)
> > +{
> > +#ifdef FSNOTIFY_HASHED_QUEUE
> > +     struct list_head *list;
> > +     int i, nbuckets = 0;
> > +     bool first_empty, last_empty;
> > +
> > +     assert_spin_locked(&group->notification_lock);
> > +
> > +     pr_debug("%s: group=%p events: num=%u max=%u buckets: first=%u last=%u max=%u\n",
> > +              __func__, group, group->num_events, group->max_events,
> > +              group->first_bucket, group->last_bucket, group->max_bucket);
> > +
> > +     if (fsnotify_notify_queue_is_empty(group))
> > +             return;
> > +
> > +     first_empty = list_empty(&group->notification_list[group->first_bucket]);
> > +     last_empty = list_empty(&group->notification_list[group->last_bucket]);
> > +
> > +     list = &group->notification_list[0];
> > +     for (i = 0; i <= group->max_bucket; i++, list++) {
> > +             if (list_empty(list))
> > +                     continue;
> > +             if (nbuckets++)
> > +                     continue;
> > +             if (first_empty)
> > +                     group->first_bucket = i;
> > +             if (last_empty)
> > +                     group->last_bucket = i;
> > +     }
> > +
> > +     pr_debug("%s: %u non-empty buckets\n", __func__, nbuckets);
> > +
> > +     /* All buckets are empty, but non-zero num_events? */
> > +     if (WARN_ON_ONCE(!nbuckets && group->num_events))
> > +             group->num_events = 0;
> > +#endif
>
> Hum, what is this function about? I understand you might want to have this
> around for debugging when developing the patch but is there a legitimate
> reason for this in production?
>
> I understand the current patch uses this function to search for the next
> non-empty list, but after this patch is merged with the next one, is there
> still a need for it?

Only needed for debugging (added by the last patch).
I wanted to make sure that the hash is balanced.
I'll keep this code around only for the debug print on queue overflow.

>
> > @@ -147,17 +228,21 @@ void fsnotify_remove_queued_event(struct fsnotify_group *group,
> >  struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
> >  {
> >       struct fsnotify_event *event;
> > +     struct list_head *list;
> >
> >       assert_spin_locked(&group->notification_lock);
> >
> > -     if (fsnotify_notify_queue_is_empty(group))
> > +     list = fsnotify_first_notification_list(group);
> > +     if (!list)
> >               return NULL;
> >
> > -     pr_debug("%s: group=%p\n", __func__, group);
> > +     pr_debug("%s: group=%p bucket=%u\n", __func__, group, group->first_bucket);
> >
> > -     event = list_first_entry(&group->notification_list,
> > -                              struct fsnotify_event, list);
> > +     event = list_first_entry(list, struct fsnotify_event, list);
>
> Perhaps the above could just reuse fsnotify_peek_first_event()? It is now
> complex enough to be worth it, I guess.
>
> >       fsnotify_remove_queued_event(group, event);
> > +     /*
> > +      * TODO: update group->first_bucket to next_bucket in first event.
> > +      */
> >       return event;
> >  }
> >
> > @@ -167,13 +252,15 @@ struct fsnotify_event *fsnotify_remove_first_event(struct fsnotify_group *group)
> >   */
> >  struct fsnotify_event *fsnotify_peek_first_event(struct fsnotify_group *group)
> >  {
> > +     struct list_head *list;
> > +
> >       assert_spin_locked(&group->notification_lock);
> >
> > -     if (fsnotify_notify_queue_is_empty(group))
> > +     list = fsnotify_first_notification_list(group);
> > +     if (!list)
> >               return NULL;
> >
> > -     return list_first_entry(&group->notification_list,
> > -                             struct fsnotify_event, list);
> > +     return list_first_entry(list, struct fsnotify_event, list);
> >  }
> >
> >  /*
> > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> > index e5409b83e731..b2a80bc00108 100644
> > --- a/include/linux/fsnotify_backend.h
> > +++ b/include/linux/fsnotify_backend.h
> > @@ -160,6 +160,11 @@ struct fsnotify_ops {
> >       void (*free_mark)(struct fsnotify_mark *mark);
> >  };
> >
> > +#ifdef CONFIG_FANOTIFY
> > +#define FSNOTIFY_HASHED_QUEUE
> > +#define FSNOTIFY_HASHED_QUEUE_MAX_BITS 8
> > +#endif
> > +
>
> Would not inotify benefit from this work as well?

See above. inotify is not doing a linear search.

> Also I'd just compile in
> hashed queues unconditionally. The ifdefs are ugly and IMHO not worth it.
>

OK.

> ...
>
> > +static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
> > +{
> > +     return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
> > +}
> > +
> > +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
> > +                                              struct fsnotify_event *event)
> > +{
> > +     /* High bits are better for hash */
> > +     return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
> > +}
>
> Why not use hash_32() here? IMHO better than just stripping bits...

See hash_ptr(). There is a reason to use the highest bits.

>
> > +
> > +static inline struct list_head *fsnotify_event_notification_list(
> > +                                             struct fsnotify_group *group,
> > +                                             struct fsnotify_event *event)
> > +{
> > +     return &group->notification_list[fsnotify_event_bucket(group, event)];
> > +}
> > +
>
> Any reason for this to be in a header? Will it ever have another user than
> fsnotify_add_event()?

I guess not. I went through several alternative patches.
It may have made sense in another variant.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-17 12:33     ` Amir Goldstein
@ 2021-02-17 13:48       ` Jan Kara
  2021-02-17 15:42         ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-17 13:48 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Wed 17-02-21 14:33:46, Amir Goldstein wrote:
> On Tue, Feb 16, 2021 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
> > > @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> > >       switch (cmd) {
> > >       case FIONREAD:
> > >               spin_lock(&group->notification_lock);
> > > -             list_for_each_entry(fsn_event, &group->notification_list,
> > > -                                 list) {
> > > -                     send_len += sizeof(struct inotify_event);
> > > -                     send_len += round_event_name_len(fsn_event);
> > > +             list = fsnotify_first_notification_list(group);
> > > +             /*
> > > +              * With multi queue, send_len will be a lower bound
> > > +              * on total events size.
> > > +              */
> > > +             if (list) {
> > > +                     list_for_each_entry(fsn_event, list, list) {
> > > +                             send_len += sizeof(struct inotify_event);
> > > +                             send_len += round_event_name_len(fsn_event);
> > > +                     }
> >
> > As I write below IMO we should enable hashed queues also for inotify (is
> > there good reason not to?)
> 
> I see your perception of inotify_merge() is the same as mine was
> when I wrote a patch to support hashed queues for inotify.
> Only after writing it did I realize that inotify_merge() only ever
> merges with the last event, so I dropped that patch.
> I see no reason to change this long-standing behavior.

Ah, I even briefly looked at that code but didn't notice it merges only
with the last event. I agree that hashing for inotify doesn't make sense
then.

Hum, if the hashing and merging is specific to fanotify, and as we decided
to keep the event->list for the global event list, we could easily have the
hash table just in the fanotify private group data and a hash->next pointer
in the fanotify private part of the event? Maybe that would even result in
more compact code?
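
I.e. something like this (sketch; hypothetical field names):

	/* in the fanotify private group data */
	struct hlist_head *merge_hash;	/* 128 buckets */

	/* in struct fanotify_event */
	struct hlist_node merge_list;	/* into a merge_hash bucket */
	u32 hash;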

> > > +static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
> > > +{
> > > +     return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
> > > +}
> > > +
> > > +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
> > > +                                              struct fsnotify_event *event)
> > > +{
> > > +     /* High bits are better for hash */
> > > +     return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
> > > +}
> >
> > Why not use hash_32() here? IMHO better than just stripping bits...
> 
> See hash_ptr(). There is a reason to use the highest bits.

Well, but event->key is just a 32-bit number so I don't follow how the high
bits used by hash_ptr() matter?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-17 13:48       ` Jan Kara
@ 2021-02-17 15:42         ` Amir Goldstein
  2021-02-17 16:49           ` Jan Kara
  2021-02-18 10:52           ` Amir Goldstein
  0 siblings, 2 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-17 15:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Feb 17, 2021 at 3:48 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-02-21 14:33:46, Amir Goldstein wrote:
> > On Tue, Feb 16, 2021 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> > > >       switch (cmd) {
> > > >       case FIONREAD:
> > > >               spin_lock(&group->notification_lock);
> > > > -             list_for_each_entry(fsn_event, &group->notification_list,
> > > > -                                 list) {
> > > > -                     send_len += sizeof(struct inotify_event);
> > > > -                     send_len += round_event_name_len(fsn_event);
> > > > +             list = fsnotify_first_notification_list(group);
> > > > +             /*
> > > > +              * With multi queue, send_len will be a lower bound
> > > > +              * on total events size.
> > > > +              */
> > > > +             if (list) {
> > > > +                     list_for_each_entry(fsn_event, list, list) {
> > > > +                             send_len += sizeof(struct inotify_event);
> > > > +                             send_len += round_event_name_len(fsn_event);
> > > > +                     }
> > >
> > > As I write below IMO we should enable hashed queues also for inotify (is
> > > there good reason not to?)
> >
> > I see your perception of inotify_merge() is the same as mine was
> > when I wrote a patch to support hashed queues for inotify.
> > It is only after that I realized that inotify_merge() only ever merges
> > with the last event and I dropped that patch.
> > I see no reason to change this long-standing behavior.
>
> Ah, I even briefly looked at that code but didn't notice it merges only
> with the last event. I agree that hashing for inotify doesn't make sense
> then.
>
> Hum, if the hashing and merging is specific to fanotify and as we decided
> to keep the event->list for the global event list, we could easily have the
> hash table just in fanotify private group data and hash->next pointer in
> fanotify private part of the event? Maybe that would even result in more
> compact code?
>

Maybe, I am not so sure. I will look into it.

> > > > +static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
> > > > +{
> > > > +     return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
> > > > +}
> > > > +
> > > > +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
> > > > +                                              struct fsnotify_event *event)
> > > > +{
> > > > +     /* High bits are better for hash */
> > > > +     return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
> > > > +}
> > >
> > > Why not use hash_32() here? IMHO better than just stripping bits...
> >
> > See hash_ptr(). There is a reason to use the highest bits.
>
> Well, but event->key is just a 32-bit number so I don't follow how the high
> bits used by hash_ptr() matter?

Of course, you are right.
But that 32-bit number was already generated using an xor of several
hash_32() results from hash_ptr() and full_name_hash(), so we do not really
need to mix it anymore to get better entropy in the higher 7 bits.

I do not mind using hash_32() here for clarity.
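
Something like this, purely for illustration (a sketch, not the posted
patch):

static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
                                                 struct fsnotify_event *event)
{
        /* hash_32() re-mixes the key and returns its top q_hash_bits bits */
        return hash_32(event->key, group->q_hash_bits);
}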

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-17 15:42         ` Amir Goldstein
@ 2021-02-17 16:49           ` Jan Kara
  2021-02-18 10:52           ` Amir Goldstein
  1 sibling, 0 replies; 36+ messages in thread
From: Jan Kara @ 2021-02-17 16:49 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Wed 17-02-21 17:42:34, Amir Goldstein wrote:
> On Wed, Feb 17, 2021 at 3:48 PM Jan Kara <jack@suse.cz> wrote:
> > > > > +static inline size_t fsnotify_group_size(unsigned int q_hash_bits)
> > > > > +{
> > > > > +     return sizeof(struct fsnotify_group) + (sizeof(struct list_head) << q_hash_bits);
> > > > > +}
> > > > > +
> > > > > +static inline unsigned int fsnotify_event_bucket(struct fsnotify_group *group,
> > > > > +                                              struct fsnotify_event *event)
> > > > > +{
> > > > > +     /* High bits are better for hash */
> > > > > +     return (event->key >> (32 - group->q_hash_bits)) & group->max_bucket;
> > > > > +}
> > > >
> > > > Why not use hash_32() here? IMHO better than just stripping bits...
> > >
> > > See hash_ptr(). There is a reason to use the highest bits.
> >
> > Well, but event->key is just a 32-bit number so I don't follow how the high
> > bits used by hash_ptr() matter?
> 
> Of course, you are right.
> But that 32-bit number was already generated using a xor of several
> hash_32() results from hash_ptr() and full_name_hash(), so we do not really
> need to mix it anymore to get better entropy in the higher 7 bits.

True. Just masking it with q_hash_bits is fine as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-17 10:13     ` Amir Goldstein
@ 2021-02-18 10:46       ` Amir Goldstein
  2021-02-18 11:11         ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-18 10:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Feb 17, 2021 at 12:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Feb 16, 2021 at 5:39 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 02-02-21 18:20:09, Amir Goldstein wrote:
> > > Improve the balance of hashed queue lists by mixing more event info
> > > relevant for merge.
> > >
> > > For example, all FAN_CREATE name events in the same dir used to have the
> > > same merge key based on the dir inode.  With this change the created
> > > file name is mixed into the merge key.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  fs/notify/fanotify/fanotify.c | 33 +++++++++++++++++++++++++++------
> > >  fs/notify/fanotify/fanotify.h | 20 +++++++++++++++++---
> > >  2 files changed, 44 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > > index 6d3807012851..b19fef1c6f64 100644
> > > --- a/fs/notify/fanotify/fanotify.c
> > > +++ b/fs/notify/fanotify/fanotify.c
> > ...
> > > @@ -476,8 +485,11 @@ static struct fanotify_event *fanotify_alloc_fid_event(struct inode *id,
> > >
> > >       ffe->fae.type = FANOTIFY_EVENT_TYPE_FID;
> > >       ffe->fsid = *fsid;
> > > -     fanotify_encode_fh(&ffe->object_fh, id, fanotify_encode_fh_len(id),
> > > -                        gfp);
> > > +     fh = &ffe->object_fh;
> > > +     fanotify_encode_fh(fh, id, fanotify_encode_fh_len(id), gfp);
> > > +
> > > +     /* Mix fsid+fid info into event merge key */
> > > +     ffe->fae.info_hash = full_name_hash(ffe->fskey, fanotify_fh_buf(fh), fh->len);
> >
> > Is it really sensible to hash FID with full_name_hash()? It would make more
> > sense to treat it as binary data, not strings...
>
> See the actual implementation of full_name_hash() for 64-bit.
> It treats the string as an array of ulong, which is quite perfect for FID.
>
> >
> > >       return &ffe->fae;
> > >  }
> > > @@ -517,6 +529,9 @@ static struct fanotify_event *fanotify_alloc_name_event(struct inode *id,
> > >       if (file_name)
> > >               fanotify_info_copy_name(info, file_name);
> > >
> > > +     /* Mix fsid+dfid+name+fid info into event merge key */
> > > +     fne->fae.info_hash = full_name_hash(fne->fskey, info->buf, fanotify_info_len(info));
> > > +
> >
> > Similarly here...
> >
> > >       pr_debug("%s: ino=%lu size=%u dir_fh_len=%u child_fh_len=%u name_len=%u name='%.*s'\n",
> > >                __func__, id->i_ino, size, dir_fh_len, child_fh_len,
> > >                info->name_len, info->name_len, fanotify_info_name(info));
> > > @@ -539,6 +554,8 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> > >       struct mem_cgroup *old_memcg;
> > >       struct inode *child = NULL;
> > >       bool name_event = false;
> > > +     unsigned int hash = 0;
> > > +     struct pid *pid;
> > >
> > >       if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
> > >               /*
> > > @@ -606,13 +623,17 @@ static struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group,
> > >        * Use the victim inode instead of the watching inode as the id for
> > >        * event queue, so event reported on parent is merged with event
> > >        * reported on child when both directory and child watches exist.
> > > -      * Reduce object id to 32bit hash for hashed queue merge.
> > > +      * Reduce object id and event info to 32bit hash for hashed queue merge.
> > >        */
> > > -     fanotify_init_event(event, hash_ptr(id, 32), mask);
> > > +     hash = event->info_hash ^ hash_ptr(id, 32);
> > >       if (FAN_GROUP_FLAG(group, FAN_REPORT_TID))
> > > -             event->pid = get_pid(task_pid(current));
> > > +             pid = get_pid(task_pid(current));
> > >       else
> > > -             event->pid = get_pid(task_tgid(current));
> > > +             pid = get_pid(task_tgid(current));
> > > +     /* Mix pid info into event merge key */
> > > +     hash ^= hash_ptr(pid, 32);
> >
> > hash_32() here?
>
> I don't think so.
> hash_32() loses the high bits of ptr before mixing them.
> hash_ptr(pid, 32) loses the *low* bits, which contain less entropy
> after mixing all 64 bits of ptr.
>
> >
> > > +     fanotify_init_event(event, hash, mask);
> > > +     event->pid = pid;
> > >
> > >  out:
> > >       set_active_memcg(old_memcg);
> > > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> > > index 2e856372ffc8..522fb1a68b30 100644
> > > --- a/fs/notify/fanotify/fanotify.h
> > > +++ b/fs/notify/fanotify/fanotify.h
> > > @@ -115,6 +115,11 @@ static inline void fanotify_info_init(struct fanotify_info *info)
> > >       info->name_len = 0;
> > >  }
> > >
> > > +static inline unsigned int fanotify_info_len(struct fanotify_info *info)
> > > +{
> > > +     return info->dir_fh_totlen + info->file_fh_totlen + info->name_len;
> > > +}
> > > +
> > >  static inline void fanotify_info_copy_name(struct fanotify_info *info,
> > >                                          const struct qstr *name)
> > >  {
> > > @@ -138,7 +143,10 @@ enum fanotify_event_type {
> > >  };
> > >
> > >  struct fanotify_event {
> > > -     struct fsnotify_event fse;
> > > +     union {
> > > +             struct fsnotify_event fse;
> > > +             unsigned int info_hash;
> > > +     };
> > >       u32 mask;
> > >       enum fanotify_event_type type;
> > >       struct pid *pid;
> >
> > How is this ever safe? info_hash will likely overlay with 'list' in
> > fsnotify_event.
>
> Oh yeah. That's an ugly hack. Sorry for that.
> I wanted to avoid adding an arg unsigned int *info_hash to all
> fanotify_alloc_*_event() helpers, so I used this uninitialized space
> in event instead.
> I'll do it the proper way.
>
> >
> > > @@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
> > >
> > >  struct fanotify_fid_event {
> > >       struct fanotify_event fae;
> > > -     __kernel_fsid_t fsid;
> > > +     union {
> > > +             __kernel_fsid_t fsid;
> > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > +     };
> > >       struct fanotify_fh object_fh;
> > >       /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
> > >       unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
> > > @@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
> > >
> > >  struct fanotify_name_event {
> > >       struct fanotify_event fae;
> > > -     __kernel_fsid_t fsid;
> > > +     union {
> > > +             __kernel_fsid_t fsid;
> > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > +     };
> > >       struct fanotify_info info;
> > >  };
> >
> > What games are you playing here with the unions? I presume you can remove
> > these 'fskey' unions and just use (void *)(event->fsid) at appropriate
> > places? IMO much more comprehensible...
>

FYI, this is what the open coded conversion looks like:

(void *)*(long *)event->fsid.val

Not so comprehensible... but I used the open coded conversion anyway.
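
Spelled out as a helper it would be something like this (a sketch, helper
name made up; note it only picks up val[0] on 32-bit):

static inline void *fanotify_fsid_key(__kernel_fsid_t *fsid)
{
        /* reuse the first pointer-size bytes of fsid as an opaque salt */
        return (void *)*(long *)fsid->val;
}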

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/7] fsnotify: support hashed notification queue
  2021-02-17 15:42         ` Amir Goldstein
  2021-02-17 16:49           ` Jan Kara
@ 2021-02-18 10:52           ` Amir Goldstein
  1 sibling, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-18 10:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Feb 17, 2021 at 5:42 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Feb 17, 2021 at 3:48 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-02-21 14:33:46, Amir Goldstein wrote:
> > > On Tue, Feb 16, 2021 at 5:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > @@ -300,10 +301,16 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
> > > > >       switch (cmd) {
> > > > >       case FIONREAD:
> > > > >               spin_lock(&group->notification_lock);
> > > > > -             list_for_each_entry(fsn_event, &group->notification_list,
> > > > > -                                 list) {
> > > > > -                     send_len += sizeof(struct inotify_event);
> > > > > -                     send_len += round_event_name_len(fsn_event);
> > > > > +             list = fsnotify_first_notification_list(group);
> > > > > +             /*
> > > > > +              * With multi queue, send_len will be a lower bound
> > > > > +              * on total events size.
> > > > > +              */
> > > > > +             if (list) {
> > > > > +                     list_for_each_entry(fsn_event, list, list) {
> > > > > +                             send_len += sizeof(struct inotify_event);
> > > > > +                             send_len += round_event_name_len(fsn_event);
> > > > > +                     }
> > > >
> > > > As I write below IMO we should enable hashed queues also for inotify (is
> > > > there good reason not to?)
> > >
> > > I see your perception of inotify_merge() is the same as mine was
> > > when I wrote a patch to support hashed queues for inotify.
> > > It is only after that I realized that inotify_merge() only ever merges
> > > with the last event and I dropped that patch.
> > > I see no reason to change this long-standing behavior.
> >
> > Ah, I even briefly looked at that code but didn't notice it merges only
> > with the last event. I agree that hashing for inotify doesn't make sense
> > then.
> >
> > Hum, if the hashing and merging is specific to fanotify and as we decided
> > to keep the event->list for the global event list, we could easily have the
> > hash table just in fanotify private group data and hash->next pointer in
> > fanotify private part of the event? Maybe that would even result in more
> > compact code?
> >
>
> Maybe, I am not so sure. I will look into it.
>

I ended up doing something slightly different:
- The hash table and lists remained in fsnotify (and in a prep patch)
- event->key remains in fsnotify_event (and event->mask moved too)
- backend gets a callback insert() from fsnotify_add_event() to do its thing
- event->next is in fanotify_event
- fanotify_insert() callback takes care of chaining all events by timeline
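
In struct form the above looks roughly like this (a sketch, not
necessarily the exact code on the branch):

struct fsnotify_event {
        struct list_head list;          /* global, time-ordered queue */
        u32 mask;
        u32 key;                        /* 32-bit merge key */
};

struct fanotify_event {
        struct fsnotify_event fse;
        struct fanotify_event *next;    /* next event in the same bucket */
        struct pid *pid;
};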

Hope you will like the result (pushed it to fanotify_merge branch).

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-17 11:25     ` Jan Kara
@ 2021-02-18 10:56       ` Amir Goldstein
  2021-02-18 11:15         ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-18 10:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > Hi Amir!
> > >
> > > Looking at the patches I've got one idea:
> > >
> > > Currently you have fsnotify_event like:
> > >
> > > struct fsnotify_event {
> > >         struct list_head list;
> > >         unsigned int key;
> > >         unsigned int next_bucket;
> > > };
> > >
> > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > single queue out of all the individual lists. The option I'm considering
> > > is:
> > >
> > > struct fsnotify_event {
> > >         struct list_head list;
> > >         struct fsnotify_event *hash_next;
> > >         unsigned int key;
> > > };
> > >
> > > So 'list' would stay to be used for the single queue of events like it was
> > > before your patches. 'hash_next' would be used for list of events in the
> > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > handling,
> >
> > I can agree to that.
> >
> > > also we can handle removal of permission events (they won't be
> > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > from global queue is easy as currently).
> >
> > Ok. but I do not really see a value in hashing non-permission events
> > for high priority groups, so this is not a strong argument.
>
> The reason why I thought it is somewhat beneficial is that someone might be
> using higher priority fanotify group just for watching non-permission
> events because so far the group priority makes little difference. And
> conceptually it isn't obvious (from userspace POV) why higher priority
> groups should be merging events less efficiently...
>

So I implemented your suggestion with ->next_event, but it did not
end up being able to remove from the middle of the queue.
The thing is we know that permission events are on list #0, but what
we need to find out when removing a permission event is the previous
event in timeline order and we do not have that information.
So I stayed with hashed queue only for group priority 0.
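
In list terms, the problem looks like this (a sketch; names assumed from
that attempt):

/* Unlinking a permission event from a singly linked timeline needs
 * its predecessor, which we can only find by walking the queue:
 */
struct fsnotify_event *prev = first_event;

while (prev && prev->next_event != event)
        prev = prev->next_event;        /* O(queue length) */
if (prev)
        prev->next_event = event->next_event;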

Pushed partly tested result to fanotify_merge branch.

Will post after testing unless you have reservations.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-18 10:46       ` Amir Goldstein
@ 2021-02-18 11:11         ` Jan Kara
  2021-02-18 12:17           ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-18 11:11 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Thu 18-02-21 12:46:48, Amir Goldstein wrote:
> On Wed, Feb 17, 2021 at 12:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > @@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
> > > >
> > > >  struct fanotify_fid_event {
> > > >       struct fanotify_event fae;
> > > > -     __kernel_fsid_t fsid;
> > > > +     union {
> > > > +             __kernel_fsid_t fsid;
> > > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > > +     };
> > > >       struct fanotify_fh object_fh;
> > > >       /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
> > > >       unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
> > > > @@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
> > > >
> > > >  struct fanotify_name_event {
> > > >       struct fanotify_event fae;
> > > > -     __kernel_fsid_t fsid;
> > > > +     union {
> > > > +             __kernel_fsid_t fsid;
> > > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > > +     };
> > > >       struct fanotify_info info;
> > > >  };
> > >
> > > What games are you playing here with the unions? I presume you can remove
> > > these 'fskey' unions and just use (void *)(event->fsid) at appropriate
> > > places? IMO much more comprehensible...
> >
> 
> FYI, this is what the open coded conversion looks like:
> 
> (void *)*(long *)event->fsid.val

Not great but at least fairly localized. I'd just note that this doesn't quite
work on 32-bit archs (sizeof(long) != sizeof(__kernel_fsid_t) there). Maybe
we could just use

hash_32(event->fsid.val[0]) ^ hash_32(event->fsid.val[1])

for mixing into the 'key' value and thus avoid all these games?
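
With hash_32() taking its explicit bit count, that would be something
like (a sketch, helper name made up; works the same on 32-bit):

static inline u32 fanotify_hash_fsid(__kernel_fsid_t *fsid)
{
        return hash_32(fsid->val[0], 32) ^ hash_32(fsid->val[1], 32);
}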

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-18 10:56       ` Amir Goldstein
@ 2021-02-18 11:15         ` Jan Kara
  2021-02-18 12:35           ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-18 11:15 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > Hi Amir!
> > > >
> > > > Looking at the patches I've got one idea:
> > > >
> > > > Currently you have fsnotify_event like:
> > > >
> > > > struct fsnotify_event {
> > > >         struct list_head list;
> > > >         unsigned int key;
> > > >         unsigned int next_bucket;
> > > > };
> > > >
> > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > single queue out of all the individual lists. The option I'm considering
> > > > is:
> > > >
> > > > struct fsnotify_event {
> > > >         struct list_head list;
> > > >         struct fsnotify_event *hash_next;
> > > >         unsigned int key;
> > > > };
> > > >
> > > > So 'list' would stay to be used for the single queue of events like it was
> > > > before your patches. 'hash_next' would be used for list of events in the
> > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > handling,
> > >
> > > I can agree to that.
> > >
> > > > also we can handle removal of permission events (they won't be
> > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > from global queue is easy as currently).
> > >
> > > Ok. but I do not really see a value in hashing non-permission events
> > > for high priority groups, so this is not a strong argument.
> >
> > The reason why I thought it is somewhat beneficial is that someone might be
> > using higher priority fanotify group just for watching non-permission
> > events because so far the group priority makes little difference. And
> > conceptually it isn't obvious (from userspace POV) why higher priority
> > groups should be merging events less efficiently...
> >
> 
> So I implemented your suggestion with ->next_event, but it did not
> end up with being able to remove from the middle of the queue.
> The thing is we know that permission events are on list #0, but what
> we need to find out when removing a permission event is the previous
> event in timeline order and we do not have that information.

So my idea was that if 'list' is the time ordered list and permission
events are *never inserted into the hash* (we don't need them there as
hashed lists are used only for merging), then removal of permission events
is no problem.

> So I stayed with hashed queue only for group priority 0.
> 
> Pushed partly tested result to fanotify_merge branch.
> 
> Will post after testing unless you have reservations.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/7] fanotify: mix event info into merge key hash
  2021-02-18 11:11         ` Jan Kara
@ 2021-02-18 12:17           ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-02-18 12:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Thu, Feb 18, 2021 at 1:11 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 18-02-21 12:46:48, Amir Goldstein wrote:
> > On Wed, Feb 17, 2021 at 12:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > @@ -154,7 +162,10 @@ static inline void fanotify_init_event(struct fanotify_event *event,
> > > > >
> > > > >  struct fanotify_fid_event {
> > > > >       struct fanotify_event fae;
> > > > > -     __kernel_fsid_t fsid;
> > > > > +     union {
> > > > > +             __kernel_fsid_t fsid;
> > > > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > > > +     };
> > > > >       struct fanotify_fh object_fh;
> > > > >       /* Reserve space in object_fh.buf[] - access with fanotify_fh_buf() */
> > > > >       unsigned char _inline_fh_buf[FANOTIFY_INLINE_FH_LEN];
> > > > > @@ -168,7 +179,10 @@ FANOTIFY_FE(struct fanotify_event *event)
> > > > >
> > > > >  struct fanotify_name_event {
> > > > >       struct fanotify_event fae;
> > > > > -     __kernel_fsid_t fsid;
> > > > > +     union {
> > > > > +             __kernel_fsid_t fsid;
> > > > > +             void *fskey;    /* 64 or 32 bits of fsid used for salt */
> > > > > +     };
> > > > >       struct fanotify_info info;
> > > > >  };
> > > >
> > > > What games are you playing here with the unions? I presume you can remove
> > > > these 'fskey' unions and just use (void *)(event->fsid) at appropriate
> > > > places? IMO much more comprehensible...
> > >
> >
> > FYI, this is what the open coded conversion looks like:
> >
> > (void *)*(long *)event->fsid.val
>
> Not great but at least fairly localized. I'd just note that this doesn't quite
> work on 32-bit archs (sizeof(long) != sizeof(__kernel_fsid_t) there). Maybe
> we could just use
>
> hash_32(event->fsid.val[0]) ^ hash_32(event->fsid.val[1])
>
> for mixing into the 'key' value and thus avoid all these games?
>

Makes sense.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-18 11:15         ` Jan Kara
@ 2021-02-18 12:35           ` Amir Goldstein
  2021-02-19 10:15             ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-18 12:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > Hi Amir!
> > > > >
> > > > > Looking at the patches I've got one idea:
> > > > >
> > > > > Currently you have fsnotify_event like:
> > > > >
> > > > > struct fsnotify_event {
> > > > >         struct list_head list;
> > > > >         unsigned int key;
> > > > >         unsigned int next_bucket;
> > > > > };
> > > > >
> > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > single queue out of all the individual lists. The option I'm considering
> > > > > is:
> > > > >
> > > > > struct fsnotify_event {
> > > > >         struct list_head list;
> > > > >         struct fsnotify_event *hash_next;
> > > > >         unsigned int key;
> > > > > };
> > > > >
> > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > handling,
> > > >
> > > > I can agree to that.
> > > >
> > > > > also we can handle removal of permission events (they won't be
> > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > from global queue is easy as currently).
> > > >
> > > > Ok. but I do not really see a value in hashing non-permission events
> > > > for high priority groups, so this is not a strong argument.
> > >
> > > The reason why I thought it is somewhat beneficial is that someone might be
> > > using higher priority fanotify group just for watching non-permission
> > > events because so far the group priority makes little difference. And
> > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > groups should be merging events less efficiently...
> > >
> >
> > So I implemented your suggestion with ->next_event, but it did not
> > end up being able to remove from the middle of the queue.
> > The thing is we know that permission events are on list #0, but what
> > we need to find out when removing a permission event is the previous
> > event in timeline order and we do not have that information.
>
> So my idea was that if 'list' is the time ordered list and permission
> events are *never inserted into the hash* (we don't need them there as
> hashed lists are used only for merging), then removal of permission events
> is no problem.
>

We are still not talking in the same language.

I think what you mean is use a dedicated list only for permission events
which is not any one of the hash lists.

In that case, get_one_event() will have to look at both the high
priority queue and the hash queue if we want to allow mixing hashed
event with permission events.

It will also mean that permission events always get priority over non-permission
events. While this makes a lot of sense, this is not the current behavior.

So what am I missing?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-18 12:35           ` Amir Goldstein
@ 2021-02-19 10:15             ` Jan Kara
  2021-02-19 10:21               ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-19 10:15 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Thu 18-02-21 14:35:39, Amir Goldstein wrote:
> On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > Hi Amir!
> > > > > >
> > > > > > Looking at the patches I've got one idea:
> > > > > >
> > > > > > Currently you have fsnotify_event like:
> > > > > >
> > > > > > struct fsnotify_event {
> > > > > >         struct list_head list;
> > > > > >         unsigned int key;
> > > > > >         unsigned int next_bucket;
> > > > > > };
> > > > > >
> > > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > > single queue out of all the individual lists. The option I'm considering
> > > > > > is:
> > > > > >
> > > > > > struct fsnotify_event {
> > > > > >         struct list_head list;
> > > > > >         struct fsnotify_event *hash_next;
> > > > > >         unsigned int key;
> > > > > > };
> > > > > >
> > > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > > handling,
> > > > >
> > > > > I can agree to that.
> > > > >
> > > > > > also we can handle removal of permission events (they won't be
> > > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > > from global queue is easy as currently).
> > > > >
> > > > > Ok. but I do not really see a value in hashing non-permission events
> > > > > for high priority groups, so this is not a strong argument.
> > > >
> > > > The reason why I thought it is somewhat beneficial is that someone might be
> > > > using higher priority fanotify group just for watching non-permission
> > > > events because so far the group priority makes little difference. And
> > > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > > groups should be merging events less efficiently...
> > > >
> > >
> > > So I implemented your suggestion with ->next_event, but it did not
> > > end up being able to remove from the middle of the queue.
> > > The thing is we know that permission events are on list #0, but what
> > > we need to find out when removing a permission event is the previous
> > > event in timeline order and we do not have that information.
> >
> > So my idea was that if 'list' is the time ordered list and permission
> > events are *never inserted into the hash* (we don't need them there as
> > hashed lists are used only for merging), then removal of permission events
> > is no problem.
> 
> We are still not talking in the same language.

Yes, I think so :).

> I think what you mean is use a dedicated list only for permission events
> which is not any one of the hash lists.
> 
> In that case, get_one_event() will have to look at both the high
> priority queue and the hash queue if we want to allow mixing hashed
> event with permission events.
> 
> It will also mean that permission events always get priority over non-permission
> events. While this makes a lot of sense, this is not the current behavior.
> 
> So what am I missing?

Let me explain with the pseudocode. fsnotify_add_event() will do:

spin_lock(&group->notification_lock);
...
if (!list_empty(list) && merge) {
	ret = merge(list, event);
	if (ret)
		bail
}
group->q_len++;
list_add_tail(&event->list, &group->notification_list);
if (add_hash) {
	/* Add to merge hash */
	*(group->merge_hash[hash(event->key)]->lastp) = event;
	group->merge_hash[hash(event->key)]->lastp = &(event->hash_next);
}
spin_unlock(&group->notification_lock);

And we set 'add_hash' to true only for non-permission events. The merge()
function can use merge_hash[] to speed up the search for merge candidates.
There will be no changes to fsnotify_peek_first_event() (modulo cleanups)
compared to current upstream. fsnotify_remove_queued_event() needs to
update ->first and ->lastp pointers in merge_hash[]. So something like:

list_del_init(&event->list);
group->q_len--;
group->merge_hash[hash(event->key)]->first = event->hash_next;
if (!event->hash_next) {
	group->merge_hash[hash(event->key)]->lastp =
		&(group->merge_hash[hash(event->key)]->first);
}
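
The per-bucket bookkeeping implied above would be something like (a
sketch):

struct fsnotify_event_bucket {
        struct fsnotify_event *first;   /* oldest hashed event in bucket */
        struct fsnotify_event **lastp;  /* tail link to append to */
};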

Clearer now?
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-19 10:15             ` Jan Kara
@ 2021-02-19 10:21               ` Jan Kara
  2021-02-19 13:38                 ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2021-02-19 10:21 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Fri 19-02-21 11:15:56, Jan Kara wrote:
> On Thu 18-02-21 14:35:39, Amir Goldstein wrote:
> > On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > > > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > >
> > > > > > > Hi Amir!
> > > > > > >
> > > > > > > Looking at the patches I've got one idea:
> > > > > > >
> > > > > > > Currently you have fsnotify_event like:
> > > > > > >
> > > > > > > struct fsnotify_event {
> > > > > > >         struct list_head list;
> > > > > > >         unsigned int key;
> > > > > > >         unsigned int next_bucket;
> > > > > > > };
> > > > > > >
> > > > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > > > single queue out of all the individual lists. The option I'm considering
> > > > > > > is:
> > > > > > >
> > > > > > > struct fsnotify_event {
> > > > > > >         struct list_head list;
> > > > > > >         struct fsnotify_event *hash_next;
> > > > > > >         unsigned int key;
> > > > > > > };
> > > > > > >
> > > > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > > > handling,
> > > > > >
> > > > > > I can agree to that.
> > > > > >
> > > > > > > also we can handle removal of permission events (they won't be
> > > > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > > > from global queue is easy as currently).
> > > > > >
> > > > > > Ok. but I do not really see a value in hashing non-permission events
> > > > > > for high priority groups, so this is not a strong argument.
> > > > >
> > > > > The reason why I thought it is somewhat beneficial is that someone might be
> > > > > using higher priority fanotify group just for watching non-permission
> > > > > events because so far the group priority makes little difference. And
> > > > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > > > groups should be merging events less efficiently...
> > > > >
> > > >
> > > > So I implemented your suggestion with ->next_event, but it did not
> > > > end up being able to remove from the middle of the queue.
> > > > The thing is we know that permission events are on list #0, but what
> > > > we need to find out when removing a permission event is the previous
> > > > event in timeline order and we do not have that information.
> > >
> > > So my idea was that if 'list' is the time ordered list and permission
> > > events are *never inserted into the hash* (we don't need them there as
> > > hashed lists are used only for merging), then removal of permission events
> > > is no problem.
> > 
> > We are still not talking in the same language.
> 
> Yes, I think so :).
> 
> > I think what you mean is use a dedicated list only for permission events
> > which is not any one of the hash lists.
> > 
> > In that case, get_one_event() will have to look at both the high
> > priority queue and the hash queue if we want to allow mixing hashed
> > event with permission events.
> > 
> > It will also mean that permission events always get priority over non-permission
> > events. While this makes a lot of sense, this is not the current behavior.
> > 
> > So what am I missing?
> 
> Let me explain with the pseudocode. fsnotify_add_event() will do:
> 
> spin_lock(&group->notification_lock);
> ...
> if (!list_empty(list) && merge) {
> 	ret = merge(list, event);
> 	if (ret)
> 		bail
> }
> group->q_len++;
> list_add_tail(&event->list, &group->notification_list);
> if (add_hash) {
> 	/* Add to merge hash */
> 	*(group->merge_hash[hash(event->key)]->lastp) = event;
> 	group->merge_hash[hash(event->key)]->lastp = &(event->hash_next);
> }
> spin_unlock(&group->notification_lock);
> 
> And we set 'add_hash' to true only for non-permission events. The merge()
> function can use merge_hash[] to speed up the search for merge candidates.
> There will be no changes to fsnotify_peek_first_event() (modulo cleanups)
> compared to current upstream. fsnotify_remove_queued_event() needs to
> update ->first and ->lastp pointers in merge_hash[]. So something like:
> 
> list_del_init(&event->list);
> group->q_len--;
> group->merge_hash[hash(event->key)]->first = event->hash_next;

Actually we must do hash handling only if the event was added to the hash.
So either fsnotify_remove_queued_event() needs to take an argument saying
whether the event was added to the hash, or we need to somehow identify
that based on ->key having a special value or by checking
  group->merge_hash[hash(event->key)]->first == event

								Honza

> if (!event->hash_next) {
> 	group->merge_hash[hash(event->key)]->lastp =
> 		&(group->merge_hash[hash(event->key)]->first);
> }
> 
> Clearer now?
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-19 10:21               ` Jan Kara
@ 2021-02-19 13:38                 ` Amir Goldstein
  2021-02-21 12:53                   ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-19 13:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Fri, Feb 19, 2021 at 12:21 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 19-02-21 11:15:56, Jan Kara wrote:
> > On Thu 18-02-21 14:35:39, Amir Goldstein wrote:
> > > On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > > > > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > Hi Amir!
> > > > > > > >
> > > > > > > > Looking at the patches I've got one idea:
> > > > > > > >
> > > > > > > > Currently you have fsnotify_event like:
> > > > > > > >
> > > > > > > > struct fsnotify_event {
> > > > > > > >         struct list_head list;
> > > > > > > >         unsigned int key;
> > > > > > > >         unsigned int next_bucket;
> > > > > > > > };
> > > > > > > >
> > > > > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > > > > single queue out of all the individual lists. The option I'm considering
> > > > > > > > is:
> > > > > > > >
> > > > > > > > struct fsnotify_event {
> > > > > > > >         struct list_head list;
> > > > > > > >         struct fsnotify_event *hash_next;
> > > > > > > >         unsigned int key;
> > > > > > > > };
> > > > > > > >
> > > > > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > > > > handling,
> > > > > > >
> > > > > > > I can agree to that.
> > > > > > >
> > > > > > > > also we can handle removal of permission events (they won't be
> > > > > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > > > > from global queue is easy as currently).
> > > > > > >
> > > > > > > Ok. but I do not really see a value in hashing non-permission events
> > > > > > > for high priority groups, so this is not a strong argument.
> > > > > >
> > > > > > The reason why I thought it is somewhat beneficial is that someone might be
> > > > > > using higher priority fanotify group just for watching non-permission
> > > > > > events because so far the group priority makes little difference. And
> > > > > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > > > > groups should be merging events less efficiently...
> > > > > >
> > > > >
> > > > > So I implemented your suggestion with ->next_event, but it did not
> > > > > end up being able to remove from the middle of the queue.
> > > > > The thing is we know that permission events are on list #0, but what
> > > > > we need to find out when removing a permission event is the previous
> > > > > event in timeline order and we do not have that information.
> > > >
> > > > So my idea was that if 'list' is the time ordered list and permission
> > > > events are *never inserted into the hash* (we don't need them there as
> > > > hashed lists are used only for merging), then removal of permission events
> > > > is no problem.
> > >
> > > We are still not talking in the same language.
> >
> > Yes, I think so :).
> >
> > > I think what you mean is use a dedicated list only for permission events
> > > which is not any one of the hash lists.
> > >
> > > In that case, get_one_event() will have to look at both the high
> > > priority queue and the hash queue if we want to allow mixing hashed
> > > event with permission events.
> > >
> > > It will also mean that permission events always get priority over non-permission
> > > events. While this makes a lot of sense, this is not the current behavior.
> > >
> > > So what am I missing?
> >
> > Let me explain with the pseudocode. fsnotify_add_event() will do:
> >
> > spin_lock(&group->notification_lock);
> > ...
> > if (!list_empty(list) && merge) {
> >       ret = merge(list, event);
> >       if (ret)
> >               bail
> > }
> > group->q_len++;
> > list_add_tail(&event->list, &group->notification_list);
> > if (add_hash) {
> >       /* Add to merge hash */
> >       *(group->merge_hash[hash(event->key)]->lastp) = event;
> >       group->merge_hash[hash(event->key)]->lastp = &(event->hash_next);
> > }
> > spin_unlock(&group->notification_lock);
> >
> > And we set 'add_hash' to true only for non-permission events. The merge()
> > function can use merge_hash[] to speed up the search for merge candidates.
> > There will be no changes to fsnotify_peek_first_event() (modulo cleanups)
> > compared to current upstream. fsnotify_remove_queued_event() needs to
> > update ->first and ->lastp pointers in merge_hash[]. So something like:
> >
> > list_del_init(&event->list);
> > group->q_len--;
> > group->merge_hash[hash(event->key)]->first = event->hash_next;
>
> Actually we must do hash handling only if the event was added to the hash.
> So either fsnotify_remove_queued_event() needs to take an argument saying
> whether the event was added to the hash, or we need to somehow identify
> that based on ->key having a special value or by checking
>   group->merge_hash[hash(event->key)]->first == event
>

Not a problem.
Permission events and the overflow event already have zero key.
In the very unlikely event of a random zero hash, that unicorn event
won't get merged - so what.

But anyway, I think we can keep the hash handling confined to fanotify.
With your suggestion, there can be no hashing code left in fsnotify core
and the only hash handling will remain in the fanotify insert() hook, as
in the current fanotify_merge branch.

Because the only case where we care about the hash is when removing the
first event, and fanotify already knows how to identify whether the event
is hashed. In the other case, where events are removed on group cleanup,
the hash chains are not relevant, so fsnotify core doesn't need to care
about them.

>
> > if (!event->hash_next) {
> >       group->merge_hash[hash(event->key)]->lastp =
> >               &(group->merge_hash[hash(event->key)]->first);
> > }
> >
> > Clearer now?

Yes, and much simpler.

Good idea!

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-19 13:38                 ` Amir Goldstein
@ 2021-02-21 12:53                   ` Amir Goldstein
  2021-02-22  9:29                     ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-21 12:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Fri, Feb 19, 2021 at 3:38 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Fri, Feb 19, 2021 at 12:21 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 19-02-21 11:15:56, Jan Kara wrote:
> > > On Thu 18-02-21 14:35:39, Amir Goldstein wrote:
> > > > On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > > > > > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > >
> > > > > > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > > > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > >
> > > > > > > > > Hi Amir!
> > > > > > > > >
> > > > > > > > > Looking at the patches I've got one idea:
> > > > > > > > >
> > > > > > > > > Currently you have fsnotify_event like:
> > > > > > > > >
> > > > > > > > > struct fsnotify_event {
> > > > > > > > >         struct list_head list;
> > > > > > > > >         unsigned int key;
> > > > > > > > >         unsigned int next_bucket;
> > > > > > > > > };
> > > > > > > > >
> > > > > > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > > > > > single queue out of all the individual lists. The option I'm considering
> > > > > > > > > is:
> > > > > > > > >
> > > > > > > > > struct fsnotify_event {
> > > > > > > > >         struct list_head list;
> > > > > > > > >         struct fsnotify_event *hash_next;
> > > > > > > > >         unsigned int key;
> > > > > > > > > };
> > > > > > > > >
> > > > > > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > > > > > handling,
> > > > > > > >
> > > > > > > > I can agree to that.
> > > > > > > >
> > > > > > > > > also we can handle removal of permission events (they won't be
> > > > > > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > > > > > from global queue is easy as currently).
> > > > > > > >
> > > > > > > > Ok. but I do not really see a value in hashing non-permission events
> > > > > > > > for high priority groups, so this is not a strong argument.
> > > > > > >
> > > > > > > The reason why I thought it is somewhat beneficial is that someone might be
> > > > > > > using higher priority fanotify group just for watching non-permission
> > > > > > > events because so far the group priority makes little difference. And
> > > > > > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > > > > > groups should be merging events less efficiently...
> > > > > > >
> > > > > >
> > > > > > So I implemented your suggestion with ->next_event, but it did not
> > > > > > end up being able to remove from the middle of the queue.
> > > > > > The thing is we know that permission events are on list #0, but what
> > > > > > we need to find out when removing a permission event is the previous
> > > > > > event in timeline order and we do not have that information.
> > > > >
> > > > > So my idea was that if 'list' is the time ordered list and permission
> > > > > events are *never inserted into the hash* (we don't need them there as
> > > > > hashed lists are used only for merging), then removal of permission events
> > > > > is no problem.
> > > >
> > > > We are still not talking in the same language.
> > >
> > > Yes, I think so :).
> > >
> > > > I think what you mean is use a dedicated list only for permission events
> > > > which is not any one of the hash lists.
> > > >
> > > > In that case, get_one_event() will have to look at both the high
> > > > priority queue and the hash queue if we want to allow mixing hashed
> > > > event with permission events.
> > > >
> > > > It will also mean that permission events always get priority over non-permission
> > > > events. While this makes a lot of sense, this is not the current behavior.
> > > >
> > > > So what am I missing?
> > >
> > > Let me explain with the pseudocode. fsnotify_add_event() will do:
> > >
> > > spin_lock(&group->notification_lock);
> > > ...
> > > if (!list_empty(list) && merge) {
> > >       ret = merge(list, event);
> > >       if (ret)
> > >               bail
> > > }
> > > group->q_len++;
> > > list_add_tail(&event->list, &group->notification_list);
> > > if (add_hash) {
> > >       /* Add to merge hash */
> > >       *(group->merge_hash[hash(event->key)]->lastp) = event;
> > >       group->merge_hash[hash(event->key)]->lastp = &(event->hash_next);
> > > }
> > > spin_unlock(&group->notification_lock);
> > >
> > > And we set 'add_hash' to true only for non-permission events. The merge()
> > > function can use merge_hash[] to speed up the search for merge candidates.
> > > There will be no changes to fsnotify_peek_first_event() (modulo cleanups)
> > > compared to current upstream. fsnotify_remove_queued_event() needs to
> > > update ->first and ->lastp pointers in merge_hash[]. So something like:
> > >
> > > list_del_init(&event->list);
> > > group->q_len--;
> > > group->merge_hash[hash(event->key)]->first = event->hash_next;
> >
> > Actually we must do hash handling only if the event was added to the hash.
> > So either fsnotify_remove_queued_event() needs to take an argument saying
> > whether the event was added to the hash, or we need to somehow identify
> > that based on ->key having a special value or by checking
> >   group->merge_hash[hash(event->key)]->first == event
> >
>
> Not a problem.
> Permission events and the overflow event already have zero key.
> In the very unlikely event of a random zero hash, that unicorn event
> won't get merged - so what.
>
> But anyway, I think we can keep the hash handling confined to fanotify.
> With your suggestion, there can be no hashing code left in fsnotify core
> and the only hash handling will remain in the fanotify insert() hook, as
> in the current fanotify_merge branch.
>
> Because the only case where we care about the hash is when removing the
> first event, and fanotify already knows how to identify whether the event
> is hashed. In the other case, where events are removed on group cleanup,
> the hash chains are not relevant, so fsnotify core doesn't need to care
> about them.
>
> >
> > > if (!event->hash_next) {
> > >       group->merge_hash[hash(event->key)]->lastp =
> > >               &(group->merge_hash[hash(event->key)]->first);
> > > }
> > >
> > > Clearer now?
>
> Yes, and much simpler.
>

Much simpler, but it doesn't work.
The merge list needs to be ordered from the most recent event for merging,
not from the oldest event.

Anyway, enough with those games. I implemented the hash table using
hlists and obviously the result is much simpler.

The space we lost to the pprev pointer of hlist_node, we won back by
cramming the hash together with the type:

struct fanotify_event {
        struct fsnotify_event fse;
        struct hlist_node merge_list;   /* List for hashed merge */
        u32 mask;
        struct {
                unsigned int type : FANOTIFY_EVENT_TYPE_BITS;
                unsigned int hash : FANOTIFY_EVENT_HASH_BITS;
        };
        struct pid *pid;
};
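
As an illustration, a merge lookup over one bucket then becomes roughly
this (a sketch; field names follow the struct above, and should_merge()
is assumed along the lines of the existing fanotify helper):

static int fanotify_merge_hashed(struct hlist_head *bucket,
                                 struct fanotify_event *new)
{
        struct fanotify_event *old;

        /* Scan only the bucket that 'new' hashes to, not the whole queue */
        hlist_for_each_entry(old, bucket, merge_list) {
                if (should_merge(&old->fse, &new->fse)) {
                        old->mask |= new->mask;
                        return 1;
                }
        }
        return 0;
}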

Anyway, pushed the following branches to my github linux and ltp trees:
* fanotify_merge
* fanotify_limits
* fanotify_unpriv

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/7] Performance improvement for fanotify merge
  2021-02-21 12:53                   ` Amir Goldstein
@ 2021-02-22  9:29                     ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2021-02-22  9:29 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Sun 21-02-21 14:53:46, Amir Goldstein wrote:
> On Fri, Feb 19, 2021 at 3:38 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Fri, Feb 19, 2021 at 12:21 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 19-02-21 11:15:56, Jan Kara wrote:
> > > > On Thu 18-02-21 14:35:39, Amir Goldstein wrote:
> > > > > On Thu, Feb 18, 2021 at 1:15 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Thu 18-02-21 12:56:18, Amir Goldstein wrote:
> > > > > > > On Wed, Feb 17, 2021 at 1:25 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > On Wed 17-02-21 12:52:21, Amir Goldstein wrote:
> > > > > > > > > On Tue, Feb 16, 2021 at 6:02 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Amir!
> > > > > > > > > >
> > > > > > > > > > Looking at the patches I've got one idea:
> > > > > > > > > >
> > > > > > > > > > Currently you have fsnotify_event like:
> > > > > > > > > >
> > > > > > > > > > struct fsnotify_event {
> > > > > > > > > >         struct list_head list;
> > > > > > > > > >         unsigned int key;
> > > > > > > > > >         unsigned int next_bucket;
> > > > > > > > > > };
> > > > > > > > > >
> > > > > > > > > > And 'list' is used for hashed queue list, next_bucket is used to simulate
> > > > > > > > > > single queue out of all the individual lists. The option I'm considering
> > > > > > > > > > is:
> > > > > > > > > >
> > > > > > > > > > struct fsnotify_event {
> > > > > > > > > >         struct list_head list;
> > > > > > > > > >         struct fsnotify_event *hash_next;
> > > > > > > > > >         unsigned int key;
> > > > > > > > > > };
> > > > > > > > > >
> > > > > > > > > > So 'list' would stay to be used for the single queue of events like it was
> > > > > > > > > > before your patches. 'hash_next' would be used for list of events in the
> > > > > > > > > > hash chain. The advantage of this scheme would be somewhat more obvious
> > > > > > > > > > handling,
> > > > > > > > >
> > > > > > > > > I can agree to that.
> > > > > > > > >
> > > > > > > > > > also we can handle removal of permission events (they won't be
> > > > > > > > > > hashed so there's no risk of breaking hash-chain in the middle, removal
> > > > > > > > > > from global queue is easy as currently).
> > > > > > > > >
> > > > > > > > > Ok. but I do not really see a value in hashing non-permission events
> > > > > > > > > for high priority groups, so this is not a strong argument.
> > > > > > > >
> > > > > > > > The reason why I thought it is somewhat beneficial is that someone might be
> > > > > > > > using higher priority fanotify group just for watching non-permission
> > > > > > > > events because so far the group priority makes little difference. And
> > > > > > > > conceptually it isn't obvious (from userspace POV) why higher priority
> > > > > > > > groups should be merging events less efficiently...
> > > > > > > >
> > > > > > >
> > > > > > > So I implemented your suggestion with ->next_event, but it did not
> > > > > > > end up with being able to remove from the middle of the queue.
> > > > > > > The thing is we know that permission events are on list #0, but what
> > > > > > > we need to find out when removing a permission event is the previous
> > > > > > > event in timeline order and we do not have that information.
> > > > > >
> > > > > > So my idea was that if 'list' is the time ordered list and permission
> > > > > > events are *never inserted into the hash* (we don't need them there as
> > > > > > hashed lists are used only for merging), then removal of permission events
> > > > > > is no problem.
> > > > >
> > > > > We are still not talking in the same language.
> > > >
> > > > Yes, I think so :).
> > > >
> > > > > I think what you mean is use a dedicated list only for permission events
> > > > > which is not any one of the hash lists.
> > > > >
> > > > > In that case, get_one_event() will have to look at both the high
> > > > > priority queue and the hash queue if we want to allow mixing hashed
> > > > > event with permission events.
> > > > >
> > > > > It will also mean that permission events always get priority over non-permission
> > > > > events. While this makes a lot of sense, this is not the current behavior.
> > > > >
> > > > > So what am I missing?
> > > >
> > > > Let me explain with the pseudocode. fsnotify_add_event() will do:
> > > >
> > > > spin_lock(&group->notification_lock);
> > > > ...
> > > > if (!list_empty(list) && merge) {
> > > >       ret = merge(list, event);
> > > >       if (ret)
> > > >               bail
> > > > }
> > > > group->q_len++;
> > > > list_add_tail(&event->list, &group->notification_list);
> > > > if (add_hash) {
> > > >       /* Add to merge hash */
> > > >       *(group->merge_hash[hash(event->key)]->lastp) = event;
> > > >       group->merge_hash[hash(event->key)]->lastp = &(event->hash_next);
> > > > }
> > > > spin_unlock(&group->notification_lock);
> > > >
> > > > And we set 'add_hash' to true only for non-permission events. The merge()
> > > > function can use merge_hash[] to speedup the search for merge candidates.
> > > > There will be no changes to fsnotify_peek_first_event() (modulo cleanups)
> > > > compared to current upstream. fsnotify_remove_queued_event() needs to
> > > > update ->first and ->lastp pointers in merge_hash[]. So something like:
> > > >
> > > > list_del_init(&event->list);
> > > > group->q_len--;
> > > > group->merge_hash[hash(event->key)]->first = event->next_hash;
> > >
> > > Actually we must do hash handling only if the event was added to the hash.
> > > So either fsnotify_remove_queued_event() needs to take an argument whether
> > > it should add event to a hash or we need to somehow identify that based on
> > > ->key having special value or checking
> > >   group->merge_hash[hash(event->key)]->first == event
> > >
> >
> > Not a problem.
> > Permission events and the overflow event already have zero key.
> > In the very unlikely event of a random zero hash, that unicorn event
> > won't get merged - so what.
> >
> > But anyway, I think we can keep the hash handling confined to fanotify.
> > With your suggestion, no hashing code needs to remain in fsnotify core,
> > and the only hash handling will be in the fanotify insert() hook, as in
> > the current fanotify_merge branch.
> >
> > That works because the only case where we care about the hash is when
> > removing the first event, and fanotify already knows how to identify
> > whether an event is hashed. In the other case, where events are removed
> > on group cleanup, the hash chains are not relevant, so fsnotify core
> > doesn't need to care about them.
> >
> > >
> > > > if (!event->next_hash) {
> > > >       group->merge_hash[hash(event->key)]->lastp =
> > > >               &(group->merge_hash[hash(event->key)]->first);
> > > > }
> > > >
> > > > Clearer now?
> >
> > Yes, and much simpler.
> >
> 
> Much simpler, but it doesn't work.
> For merging, the merge list needs to be ordered starting from the most
> recent event, not from the oldest one.

Hum, right. It is more efficient, and when we limit the scanning to 128
elements, it is necessary.

> Anyway, enough with those games. I implemented the hash table using
> hlists, and obviously the result is much simpler.

Agreed.

> The space we lost for the pprev pointer of hlist_node, we won back
> by cramming the hash together with the type:
> 
> struct fanotify_event {
>         struct fsnotify_event fse;
>         struct hlist_node merge_list;   /* List for hashed merge */
>         u32 mask;
>         struct {
>                 unsigned int type : FANOTIFY_EVENT_TYPE_BITS;
>                 unsigned int hash : FANOTIFY_EVENT_HASH_BITS;
>         };
>         struct pid *pid;
> };
> 
> Anyway, pushed the following branches to my github linux and ltp trees:
> * fanotify_merge
> * fanotify_limits
> * fanotify_unpriv

I'll check it.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-02-02 16:20 ` [PATCH 5/7] fanotify: limit number of event merge attempts Amir Goldstein
@ 2021-02-27  8:31   ` Amir Goldstein
  2021-03-01 13:08     ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-02-27  8:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Feb 2, 2021 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Event merges are expensive when event queue size is large.
> Limit the linear search to 128 merge tests.
> In combination with 128 hash lists, there is a potential to
> merge with up to 16K events in the hashed queue.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 12df6957e4d8..6d3807012851 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
>         return false;
>  }
>
> +/* Limit event merges to limit CPU overhead per event */
> +#define FANOTIFY_MAX_MERGE_EVENTS 128
> +
>  /* and the list better be locked by something too! */
>  static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
>  {
>         struct fsnotify_event *test_event;
>         struct fanotify_event *new;
> +       int i = 0;
>
>         pr_debug("%s: list=%p event=%p\n", __func__, list, event);
>         new = FANOTIFY_E(event);
> @@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
>                 return 0;
>
>         list_for_each_entry_reverse(test_event, list, list) {
> +               if (++i > FANOTIFY_MAX_MERGE_EVENTS)
> +                       break;
>                 if (fanotify_should_merge(test_event, event)) {
>                         FANOTIFY_E(test_event)->mask |= new->mask;
>                         return 1;
> --
> 2.25.1
>

Jan,

I was thinking that this patch or a variant thereof should be applied to stable
kernels, but not the entire series.

OTOH, I am concerned about regressing existing workloads that depend on
merging events on more than 128 inodes.
I thought of this compromise between performance and functional regressions:

/*
 * Limit event merges to limit CPU overhead per new event.
 * For legacy mode, avoid unlimited CPU overhead, but do not regress the event
 * merge ratio in heavy concurrent workloads with default queue size.
 * For new FAN_REPORT_FID modes, make sure that CPU overhead is low.
 */
#define FANOTIFY_MAX_MERGE_OLD_EVENTS   16384
#define FANOTIFY_MAX_MERGE_FID_EVENTS   128

static inline int fanotify_max_merge_events(struct fsnotify_group *group)
{
        if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS))
                return FANOTIFY_MAX_MERGE_FID_EVENTS;
        else
                return FANOTIFY_MAX_MERGE_OLD_EVENTS;
}

I can start the series with this patch and change that to:

#define FANOTIFY_MAX_MERGE_EVENTS   128

static inline int fanotify_max_merge_events(struct fsnotify_group *group)
{
        return FANOTIFY_MAX_MERGE_EVENTS;
}

At the end of the series.
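
For illustration, either variant would plug into the merge loop roughly
like this (a sketch only - passing the group into fanotify_merge() is my
assumption here, and the pre-checks of the original function are
omitted):

static int fanotify_merge(struct fsnotify_group *group,
                          struct list_head *list,
                          struct fsnotify_event *event)
{
        struct fanotify_event *new = FANOTIFY_E(event);
        struct fsnotify_event *test_event;
        int max_merge = fanotify_max_merge_events(group);
        int i = 0;

        list_for_each_entry_reverse(test_event, list, list) {
                /* Bound the linear scan by the per-group limit */
                if (++i > max_merge)
                        break;
                if (fanotify_should_merge(test_event, event)) {
                        FANOTIFY_E(test_event)->mask |= new->mask;
                        return 1;
                }
        }

        return 0;
}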

What do you think?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-02-27  8:31   ` Amir Goldstein
@ 2021-03-01 13:08     ` Jan Kara
  2021-03-01 13:58       ` Amir Goldstein
  2021-09-15 12:39       ` Amir Goldstein
  0 siblings, 2 replies; 36+ messages in thread
From: Jan Kara @ 2021-03-01 13:08 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Sat 27-02-21 10:31:52, Amir Goldstein wrote:
> On Tue, Feb 2, 2021 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Event merges are expensive when event queue size is large.
> > Limit the linear search to 128 merge tests.
> > In combination with 128 hash lists, there is a potential to
> > merge with up to 16K events in the hashed queue.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/notify/fanotify/fanotify.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > index 12df6957e4d8..6d3807012851 100644
> > --- a/fs/notify/fanotify/fanotify.c
> > +++ b/fs/notify/fanotify/fanotify.c
> > @@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
> >         return false;
> >  }
> >
> > +/* Limit event merges to limit CPU overhead per event */
> > +#define FANOTIFY_MAX_MERGE_EVENTS 128
> > +
> >  /* and the list better be locked by something too! */
> >  static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> >  {
> >         struct fsnotify_event *test_event;
> >         struct fanotify_event *new;
> > +       int i = 0;
> >
> >         pr_debug("%s: list=%p event=%p\n", __func__, list, event);
> >         new = FANOTIFY_E(event);
> > @@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> >                 return 0;
> >
> >         list_for_each_entry_reverse(test_event, list, list) {
> > +               if (++i > FANOTIFY_MAX_MERGE_EVENTS)
> > +                       break;
> >                 if (fanotify_should_merge(test_event, event)) {
> >                         FANOTIFY_E(test_event)->mask |= new->mask;
> >                         return 1;
> > --
> > 2.25.1
> >
> 
> Jan,
> 
> I was thinking that this patch or a variant thereof should be applied to stable
> kernels, but not the entire series.
> 
> OTOH, I am concerned about regressing existing workloads that depend on
> merging events on more than 128 inodes.

Honestly, I don't think pushing anything to stable for this is really worth
it.

1) fanotify() is limited to CAP_SYS_ADMIN (in init namespace), so this is
hardly a security issue.

2) We have cond_resched() in the merge code now, so the kernel doesn't
lock up anymore. So this is only about fanotify becoming slow if you have
lots of events.

3) I haven't heard any complaints since we've added the cond_resched()
patch, so the performance issue seems to be really rare.

If I get complaints from real users about this, we can easily reconsider; it
is not a big deal. But I just don't think preemptive action is warranted...

								Honza

> I thought of this compromise between performance and functional regressions:
> 
> /*
>  * Limit event merges to limit CPU overhead per new event.
>  * For legacy mode, avoid unlimited CPU overhead, but do not regress the event
>  * merge ratio in heavy concurrent workloads with default queue size.
>  * For new FAN_REPORT_FID modes, make sure that CPU overhead is low.
>  */
> #define FANOTIFY_MAX_MERGE_OLD_EVENTS   16384
> #define FANOTIFY_MAX_MERGE_FID_EVENTS   128
> 
> static inline int fanotify_max_merge_events(struct fsnotify_group *group)
> {
>         if (FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS))
>                 return FANOTIFY_MAX_MERGE_FID_EVENTS;
>         else
>                 return FANOTIFY_MAX_MERGE_OLD_EVENTS;
> }
> 
> I can start the series with this patch and change that to:
> 
> #define FANOTIFY_MAX_MERGE_EVENTS   128
>
> static inline int fanotify_max_merge_events(struct fsnotify_group *group)
> {
>         return FANOTIFY_MAX_MERGE_EVENTS;
> }
> 
> At the end of the series.
> 
> What do you think?
> 
> Thanks,
> Amir.
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-03-01 13:08     ` Jan Kara
@ 2021-03-01 13:58       ` Amir Goldstein
  2021-09-15 12:39       ` Amir Goldstein
  1 sibling, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2021-03-01 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Mon, Mar 1, 2021 at 3:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 27-02-21 10:31:52, Amir Goldstein wrote:
> > On Tue, Feb 2, 2021 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Event merges are expensive when event queue size is large.
> > > Limit the linear search to 128 merge tests.
> > > In combination with 128 hash lists, there is a potential to
> > > merge with up to 16K events in the hashed queue.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  fs/notify/fanotify/fanotify.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > > index 12df6957e4d8..6d3807012851 100644
> > > --- a/fs/notify/fanotify/fanotify.c
> > > +++ b/fs/notify/fanotify/fanotify.c
> > > @@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
> > >         return false;
> > >  }
> > >
> > > +/* Limit event merges to limit CPU overhead per event */
> > > +#define FANOTIFY_MAX_MERGE_EVENTS 128
> > > +
> > >  /* and the list better be locked by something too! */
> > >  static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > >  {
> > >         struct fsnotify_event *test_event;
> > >         struct fanotify_event *new;
> > > +       int i = 0;
> > >
> > >         pr_debug("%s: list=%p event=%p\n", __func__, list, event);
> > >         new = FANOTIFY_E(event);
> > > @@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > >                 return 0;
> > >
> > >         list_for_each_entry_reverse(test_event, list, list) {
> > > +               if (++i > FANOTIFY_MAX_MERGE_EVENTS)
> > > +                       break;
> > >                 if (fanotify_should_merge(test_event, event)) {
> > >                         FANOTIFY_E(test_event)->mask |= new->mask;
> > >                         return 1;
> > > --
> > > 2.25.1
> > >
> >
> > Jan,
> >
> > I was thinking that this patch or a variant thereof should be applied to stable
> > kernels, but not the entire series.
> >
> > OTOH, I am concerned about regressing existing workloads that depend on
> > merging events on more than 128 inodes.
>
> Honestly, I don't think pushing anything to stable for this is really worth
> it.
>
> 1) fanotify() is limited to CAP_SYS_ADMIN (in init namespace), so this is
> hardly a security issue.
>
> 2) We have cond_resched() in the merge code now, so the kernel doesn't
> lock up anymore. So this is only about fanotify becoming slow if you have
> lots of events.
>
> 3) I haven't heard any complaints since we've added the cond_resched()
> patch, so the performance issue seems to be really rare.
>
> If I get complaints from real users about this, we can easily reconsider; it
> is not a big deal. But I just don't think preemptive action is warranted...
>

OK. Will post the series without this.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-03-01 13:08     ` Jan Kara
  2021-03-01 13:58       ` Amir Goldstein
@ 2021-09-15 12:39       ` Amir Goldstein
  2021-09-15 16:33         ` Jan Kara
  1 sibling, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2021-09-15 12:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Mon, Mar 1, 2021 at 3:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 27-02-21 10:31:52, Amir Goldstein wrote:
> > On Tue, Feb 2, 2021 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Event merges are expensive when event queue size is large.
> > > Limit the linear search to 128 merge tests.
> > > In combination with 128 hash lists, there is a potential to
> > > merge with up to 16K events in the hashed queue.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >  fs/notify/fanotify/fanotify.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > > index 12df6957e4d8..6d3807012851 100644
> > > --- a/fs/notify/fanotify/fanotify.c
> > > +++ b/fs/notify/fanotify/fanotify.c
> > > @@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
> > >         return false;
> > >  }
> > >
> > > +/* Limit event merges to limit CPU overhead per event */
> > > +#define FANOTIFY_MAX_MERGE_EVENTS 128
> > > +
> > >  /* and the list better be locked by something too! */
> > >  static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > >  {
> > >         struct fsnotify_event *test_event;
> > >         struct fanotify_event *new;
> > > +       int i = 0;
> > >
> > >         pr_debug("%s: list=%p event=%p\n", __func__, list, event);
> > >         new = FANOTIFY_E(event);
> > > @@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > >                 return 0;
> > >
> > >         list_for_each_entry_reverse(test_event, list, list) {
> > > +               if (++i > FANOTIFY_MAX_MERGE_EVENTS)
> > > +                       break;
> > >                 if (fanotify_should_merge(test_event, event)) {
> > >                         FANOTIFY_E(test_event)->mask |= new->mask;
> > >                         return 1;
> > > --
> > > 2.25.1
> > >
> >
> > Jan,
> >
> > I was thinking that this patch or a variant thereof should be applied to stable
> > kernels, but not the entire series.
> >
> > OTOH, I am concerned about regressing existing workloads that depend on
> > merging events on more than 128 inodes.
>
> Honestly, I don't think pushing anything to stable for this is really worth
> it.
>
> 1) fanotify() is limited to CAP_SYS_ADMIN (in init namespace), so this is
> hardly a security issue.
>
> 2) We have cond_resched() in the merge code now, so the kernel doesn't
> lock up anymore. So this is only about fanotify becoming slow if you have
> lots of events.
>
> 3) I haven't heard any complaints since we've added the cond_resched()
> patch, so the performance issue seems to be really rare.
>
> If I get complaints from real users about this, we can easily reconsider; it
> is not a big deal. But I just don't think preemptive action is warranted...
>

Hi Jan,

I know you have some catching up to do, but applying this patch to stable
has become a priority for me.
It was a mistake on my part not to push harder 6 months ago, so I am trying
to rectify that mistake as soon as possible.

To answer your arguments against preemptive action:
1) My application has CAP_SYS_ADMIN and it is not malicious, but it cannot
    do its job without taking up more CPU than it needs to, because bursts of
    events will cause the event queue to grow to thousands of events and
    fanotify_merge() will become a high CPU consumer.
2) It's not only about fanotify becoming slow; it's about fanotify making the
     entire system slow, and as a result it takes a long time for the system
     to recover from this condition.
3) You haven't heard any complaints because nobody was using sb marks.
    We have been using sb marks in production for a few years and carry
    this patch in our kernel, so I can say for certain that an sb mark on a fs
    with a heavy workload disturbs the entire system without this patch.

I don't think that "regressing" the number of merged events is a big issue,
as we never guaranteed any specific merge behavior, and the behavior
was hard to predict anyway, so I don't think applications could have
relied on it.

So, are you ok with me sending this patch to stable as is?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 5/7] fanotify: limit number of event merge attempts
  2021-09-15 12:39       ` Amir Goldstein
@ 2021-09-15 16:33         ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2021-09-15 16:33 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, linux-fsdevel

On Wed 15-09-21 15:39:13, Amir Goldstein wrote:
> On Mon, Mar 1, 2021 at 3:08 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 27-02-21 10:31:52, Amir Goldstein wrote:
> > > On Tue, Feb 2, 2021 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > Event merges are expensive when event queue size is large.
> > > > Limit the linear search to 128 merge tests.
> > > > In combination with 128 hash lists, there is a potential to
> > > > merge with up to 16K events in the hashed queue.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > > >  fs/notify/fanotify/fanotify.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> > > > index 12df6957e4d8..6d3807012851 100644
> > > > --- a/fs/notify/fanotify/fanotify.c
> > > > +++ b/fs/notify/fanotify/fanotify.c
> > > > @@ -129,11 +129,15 @@ static bool fanotify_should_merge(struct fsnotify_event *old_fsn,
> > > >         return false;
> > > >  }
> > > >
> > > > +/* Limit event merges to limit CPU overhead per event */
> > > > +#define FANOTIFY_MAX_MERGE_EVENTS 128
> > > > +
> > > >  /* and the list better be locked by something too! */
> > > >  static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > > >  {
> > > >         struct fsnotify_event *test_event;
> > > >         struct fanotify_event *new;
> > > > +       int i = 0;
> > > >
> > > >         pr_debug("%s: list=%p event=%p\n", __func__, list, event);
> > > >         new = FANOTIFY_E(event);
> > > > @@ -147,6 +151,8 @@ static int fanotify_merge(struct list_head *list, struct fsnotify_event *event)
> > > >                 return 0;
> > > >
> > > >         list_for_each_entry_reverse(test_event, list, list) {
> > > > +               if (++i > FANOTIFY_MAX_MERGE_EVENTS)
> > > > +                       break;
> > > >                 if (fanotify_should_merge(test_event, event)) {
> > > >                         FANOTIFY_E(test_event)->mask |= new->mask;
> > > >                         return 1;
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > Jan,
> > >
> > > I was thinking that this patch or a variant thereof should be applied to stable
> > > kernels, but not the entire series.
> > >
> > > OTOH, I am concerned about regressing existing workloads that depend on
> > > merging events on more than 128 inodes.
> >
> > Honestly, I don't think pushing anything to stable for this is really worth
> > it.
> >
> > 1) fanotify() is limited to CAP_SYS_ADMIN (in init namespace), so this is
> > hardly a security issue.
> >
> > 2) We have cond_resched() in the merge code now, so the kernel doesn't
> > lock up anymore. So this is only about fanotify becoming slow if you have
> > lots of events.
> >
> > 3) I haven't heard any complaints since we've added the cond_resched()
> > patch, so the performance issue seems to be really rare.
> >
> > If I get complaints from real users about this, we can easily reconsider; it
> > is not a big deal. But I just don't think preemptive action is warranted...
> >
> 
> Hi Jan,
> 
> I know you have some catching up to do, but applying this patch to stable
> has become a priority for me.
> It was a mistake on my part not to push harder 6 months ago, so I am trying
> to rectify that mistake as soon as possible.
> 
> To answer your arguments against preemptive action:
> 1) My application has CAP_SYS_ADMIN and it is not malicious, but it cannot
>     do its job without taking up more CPU than it needs to, because bursts of
>     events will cause the event queue to grow to thousands of events and
>     fanotify_merge() will become a high CPU consumer.
> 2) It's not only about fanotify becoming slow; it's about fanotify making the
>      entire system slow, and as a result it takes a long time for the system
>      to recover from this condition.
> 3) You haven't heard any complaints because nobody was using sb marks.
>     We have been using sb marks in production for a few years and carry
>     this patch in our kernel, so I can say for certain that an sb mark on a fs
>     with a heavy workload disturbs the entire system without this patch.
> 
> I don't think that "regressing" the number of merged events is a big issue,
> as we never guaranteed any specific merge behavior, and the behavior
> was hard to predict anyway, so I don't think applications could have
> relied on it.
> 
> So, are you ok with me sending this patch to stable as is?

Sure, go ahead. I was not strongly against pushing this to stable; I just
didn't see a good reason to do it, but your arguments make sense - you count
as the user report I was waiting for ;).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-09-15 16:33 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-02 16:20 [PATCH 0/7] Performance improvement for fanotify merge Amir Goldstein
2021-02-02 16:20 ` [PATCH 1/7] fsnotify: allow fsnotify_{peek,remove}_first_event with empty queue Amir Goldstein
2021-02-02 16:20 ` [PATCH 2/7] fsnotify: support hashed notification queue Amir Goldstein
2021-02-16 15:02   ` Jan Kara
2021-02-17 12:33     ` Amir Goldstein
2021-02-17 13:48       ` Jan Kara
2021-02-17 15:42         ` Amir Goldstein
2021-02-17 16:49           ` Jan Kara
2021-02-18 10:52           ` Amir Goldstein
2021-02-02 16:20 ` [PATCH 3/7] fsnotify: read events from hashed notification queue by order of insertion Amir Goldstein
2021-02-16 15:10   ` Jan Kara
2021-02-02 16:20 ` [PATCH 4/7] fanotify: enable hashed notification queue for FAN_CLASS_NOTIF groups Amir Goldstein
2021-02-02 16:20 ` [PATCH 5/7] fanotify: limit number of event merge attempts Amir Goldstein
2021-02-27  8:31   ` Amir Goldstein
2021-03-01 13:08     ` Jan Kara
2021-03-01 13:58       ` Amir Goldstein
2021-09-15 12:39       ` Amir Goldstein
2021-09-15 16:33         ` Jan Kara
2021-02-02 16:20 ` [PATCH 6/7] fanotify: mix event info into merge key hash Amir Goldstein
2021-02-16 15:39   ` Jan Kara
2021-02-17 10:13     ` Amir Goldstein
2021-02-18 10:46       ` Amir Goldstein
2021-02-18 11:11         ` Jan Kara
2021-02-18 12:17           ` Amir Goldstein
2021-02-02 16:20 ` [PATCH 7/7] fsnotify: print some debug stats on hashed queue overflow Amir Goldstein
2021-02-16 16:02 ` [PATCH 0/7] Performance improvement for fanotify merge Jan Kara
2021-02-17 10:52   ` Amir Goldstein
2021-02-17 11:25     ` Jan Kara
2021-02-18 10:56       ` Amir Goldstein
2021-02-18 11:15         ` Jan Kara
2021-02-18 12:35           ` Amir Goldstein
2021-02-19 10:15             ` Jan Kara
2021-02-19 10:21               ` Jan Kara
2021-02-19 13:38                 ` Amir Goldstein
2021-02-21 12:53                   ` Amir Goldstein
2021-02-22  9:29                     ` Jan Kara
