All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/16] Evictable fanotify marks
@ 2022-03-29  7:48 Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo Amir Goldstein
                   ` (16 more replies)
  0 siblings, 17 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Jan,

Following the discussion on direct reclaim of fsnotify marks [2],
this patch set includes your suggested fixes to core code along with
implementation of fanotify evictable marks (rebrand of volatile marks).

The LTP test I wrote [3] reproduces that deadlock within seconds on my
small test VM if the FSNOTIFY_GROUP_NOFS flag is removed from fanotify.

To be more exact, depending on the value of vfs_cache_pressure set by
the test, either a deadlock or lockdep warning (or both) are reproduced.
I chose a high value of 500, which usually reproduces only the lockdep
warning, but worked better and faster on several systems I tested on.

Thanks,
Amir.

Changes since v1 [1]:
- Fixes for direct reclaim deadlock
- Add ioctl for direct reclaim test
- Rebrand as FAN_MARK_EVICTABLE
- Remove FAN_MARK_CREATE and allow clearing FAN_MARK_EVICTABLE
- Replace connector proxy_iref with HAS_IREF flag
- Take iref in fsnotify_reclac_mark() rather than on add mark to list
- Remove fsnotify_add_mark() allow_dups/flags argument
- Remove pr_debug() prints

[1] https://lore.kernel.org/r/20220307155741.1352405-1-amir73il@gmail.com/
[2] https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
[3] https://github.com/amir73il/ltp/commits/fan_evictable

Amir Goldstein (16):
  inotify: show inotify mask flags in proc fdinfo
  inotify: move control flags from mask to mark flags
  fsnotify: pass flags argument to fsnotify_add_mark() via mark
  fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  fsnotify: fix wrong lockdep annotations
  fsnotify: create helpers for group mark_mutex lock
  inotify: use fsnotify group lock helpers
  audit: use fsnotify group lock helpers
  nfsd: use fsnotify group lock helpers
  dnotify: use fsnotify group lock helpers
  fsnotify: allow adding an inode mark without pinning inode
  fanotify: factor out helper fanotify_mark_update_flags()
  fanotify: implement "evictable" inode marks
  fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing
  fanotify: use fsnotify group lock helpers
  fanotify: enable "evictable" inode marks

 fs/nfsd/filecache.c                  |  12 +--
 fs/notify/dnotify/dnotify.c          |  13 +--
 fs/notify/fanotify/fanotify.c        |   6 +-
 fs/notify/fanotify/fanotify.h        |  12 +++
 fs/notify/fanotify/fanotify_user.c   | 128 +++++++++++++++++++++------
 fs/notify/fdinfo.c                   |  22 ++---
 fs/notify/fsnotify.c                 |   4 +-
 fs/notify/group.c                    |  11 +++
 fs/notify/inotify/inotify.h          |  19 ++++
 fs/notify/inotify/inotify_fsnotify.c |   2 +-
 fs/notify/inotify/inotify_user.c     |  46 ++++++----
 fs/notify/mark.c                     | 126 +++++++++++++++-----------
 include/linux/fanotify.h             |   1 +
 include/linux/fsnotify_backend.h     |  86 ++++++++++++++----
 include/uapi/linux/fanotify.h        |   5 ++
 kernel/audit_fsnotify.c              |   3 +-
 kernel/audit_tree.c                  |  32 +++----
 17 files changed, 370 insertions(+), 158 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-04-05 12:40   ` Jan Kara
  2022-03-29  7:48 ` [PATCH v2 02/16] inotify: move control flags from mask to mark flags Amir Goldstein
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

The inotify mask flags IN_ONESHOT and IN_EXCL_UNLINK are not "internal
to kernel" and should be exposed in procfs fdinfo so CRIU can restore
them.

Fixes: 6933599697c9 ("inotify: hide internal kernel bits from fdinfo")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fdinfo.c               | 11 ++---------
 fs/notify/inotify/inotify.h      | 12 ++++++++++++
 fs/notify/inotify/inotify_user.c |  2 +-
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index 57f0d5d9f934..3451708fd035 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -83,16 +83,9 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
 	inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark);
 	inode = igrab(fsnotify_conn_inode(mark->connector));
 	if (inode) {
-		/*
-		 * IN_ALL_EVENTS represents all of the mask bits
-		 * that we expose to userspace.  There is at
-		 * least one bit (FS_EVENT_ON_CHILD) which is
-		 * used only internally to the kernel.
-		 */
-		u32 mask = mark->mask & IN_ALL_EVENTS;
-		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
+		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:0 ",
 			   inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
-			   mask, mark->ignored_mask);
+			   inotify_mark_user_mask(mark));
 		show_mark_fhandle(m, inode);
 		seq_putc(m, '\n');
 		iput(inode);
diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index 2007e3711916..8f00151eb731 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -22,6 +22,18 @@ static inline struct inotify_event_info *INOTIFY_E(struct fsnotify_event *fse)
 	return container_of(fse, struct inotify_event_info, fse);
 }
 
+/*
+ * INOTIFY_USER_FLAGS represents all of the mask bits that we expose to
+ * userspace.  There is at least one bit (FS_EVENT_ON_CHILD) which is
+ * used only internally to the kernel.
+ */
+#define INOTIFY_USER_MASK (IN_ALL_EVENTS | IN_ONESHOT | IN_EXCL_UNLINK)
+
+static inline __u32 inotify_mark_user_mask(struct fsnotify_mark *fsn_mark)
+{
+	return fsn_mark->mask & INOTIFY_USER_MASK;
+}
+
 extern void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
 					   struct fsnotify_group *group);
 extern int inotify_handle_inode_event(struct fsnotify_mark *inode_mark,
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 54583f62dc44..3ef57db0ec9d 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -110,7 +110,7 @@ static inline __u32 inotify_arg_to_mask(struct inode *inode, u32 arg)
 		mask |= FS_EVENT_ON_CHILD;
 
 	/* mask off the flags used to open the fd */
-	mask |= (arg & (IN_ALL_EVENTS | IN_ONESHOT | IN_EXCL_UNLINK));
+	mask |= (arg & INOTIFY_USER_MASK);
 
 	return mask;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 02/16] inotify: move control flags from mask to mark flags
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 03/16] fsnotify: pass flags argument to fsnotify_add_mark() via mark Amir Goldstein
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

The inotify control flags in the mark mask (e.g. FS_IN_ONE_SHOT) are not
relevant to object interest mask, so move them to the mark flags.

This frees up some bits in the object interest mask.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fsnotify.c                 |  4 +--
 fs/notify/inotify/inotify.h          | 11 ++++++--
 fs/notify/inotify/inotify_fsnotify.c |  2 +-
 fs/notify/inotify/inotify_user.c     | 38 ++++++++++++++++++----------
 include/linux/fsnotify_backend.h     | 15 ++++++-----
 5 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 70a8516b78bc..6eee19d15e8c 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -253,7 +253,7 @@ static int fsnotify_handle_inode_event(struct fsnotify_group *group,
 	if (WARN_ON_ONCE(!inode && !dir))
 		return 0;
 
-	if ((inode_mark->mask & FS_EXCL_UNLINK) &&
+	if ((inode_mark->flags & FSNOTIFY_MARK_FLAG_EXCL_UNLINK) &&
 	    path && d_unlinked(path->dentry))
 		return 0;
 
@@ -581,7 +581,7 @@ static __init int fsnotify_init(void)
 {
 	int ret;
 
-	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 25);
+	BUILD_BUG_ON(HWEIGHT32(ALL_FSNOTIFY_BITS) != 23);
 
 	ret = init_srcu_struct(&fsnotify_mark_srcu);
 	if (ret)
diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index 8f00151eb731..7d5df7a21539 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -27,11 +27,18 @@ static inline struct inotify_event_info *INOTIFY_E(struct fsnotify_event *fse)
  * userspace.  There is at least one bit (FS_EVENT_ON_CHILD) which is
  * used only internally to the kernel.
  */
-#define INOTIFY_USER_MASK (IN_ALL_EVENTS | IN_ONESHOT | IN_EXCL_UNLINK)
+#define INOTIFY_USER_MASK (IN_ALL_EVENTS)
 
 static inline __u32 inotify_mark_user_mask(struct fsnotify_mark *fsn_mark)
 {
-	return fsn_mark->mask & INOTIFY_USER_MASK;
+	__u32 mask = fsn_mark->mask & INOTIFY_USER_MASK;
+
+	if (fsn_mark->flags & FSNOTIFY_MARK_FLAG_EXCL_UNLINK)
+		mask |= IN_EXCL_UNLINK;
+	if (fsn_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
+		mask |= IN_ONESHOT;
+
+	return mask;
 }
 
 extern void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index d92d7b0adc9a..49cfe2ae6d23 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -122,7 +122,7 @@ int inotify_handle_inode_event(struct fsnotify_mark *inode_mark, u32 mask,
 		fsnotify_destroy_event(group, fsn_event);
 	}
 
-	if (inode_mark->mask & IN_ONESHOT)
+	if (inode_mark->flags & FSNOTIFY_MARK_FLAG_IN_ONESHOT)
 		fsnotify_destroy_mark(inode_mark, group);
 
 	return 0;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 3ef57db0ec9d..d8907d32a05b 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -115,6 +115,21 @@ static inline __u32 inotify_arg_to_mask(struct inode *inode, u32 arg)
 	return mask;
 }
 
+#define INOTIFY_MARK_FLAGS \
+	(FSNOTIFY_MARK_FLAG_EXCL_UNLINK | FSNOTIFY_MARK_FLAG_IN_ONESHOT)
+
+static inline unsigned int inotify_arg_to_flags(u32 arg)
+{
+	unsigned int flags = 0;
+
+	if (arg & IN_EXCL_UNLINK)
+		flags |= FSNOTIFY_MARK_FLAG_EXCL_UNLINK;
+	if (arg & IN_ONESHOT)
+		flags |= FSNOTIFY_MARK_FLAG_IN_ONESHOT;
+
+	return flags;
+}
+
 static inline u32 inotify_mask_to_arg(__u32 mask)
 {
 	return mask & (IN_ALL_EVENTS | IN_ISDIR | IN_UNMOUNT | IN_IGNORED |
@@ -526,13 +541,10 @@ static int inotify_update_existing_watch(struct fsnotify_group *group,
 	struct fsnotify_mark *fsn_mark;
 	struct inotify_inode_mark *i_mark;
 	__u32 old_mask, new_mask;
-	__u32 mask;
-	int add = (arg & IN_MASK_ADD);
+	int replace = !(arg & IN_MASK_ADD);
 	int create = (arg & IN_MASK_CREATE);
 	int ret;
 
-	mask = inotify_arg_to_mask(inode, arg);
-
 	fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, group);
 	if (!fsn_mark)
 		return -ENOENT;
@@ -545,10 +557,12 @@ static int inotify_update_existing_watch(struct fsnotify_group *group,
 
 	spin_lock(&fsn_mark->lock);
 	old_mask = fsn_mark->mask;
-	if (add)
-		fsn_mark->mask |= mask;
-	else
-		fsn_mark->mask = mask;
+	if (replace) {
+		fsn_mark->mask = 0;
+		fsn_mark->flags &= ~INOTIFY_MARK_FLAGS;
+	}
+	fsn_mark->mask |= inotify_arg_to_mask(inode, arg);
+	fsn_mark->flags |= inotify_arg_to_flags(arg);
 	new_mask = fsn_mark->mask;
 	spin_unlock(&fsn_mark->lock);
 
@@ -579,19 +593,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
 			     u32 arg)
 {
 	struct inotify_inode_mark *tmp_i_mark;
-	__u32 mask;
 	int ret;
 	struct idr *idr = &group->inotify_data.idr;
 	spinlock_t *idr_lock = &group->inotify_data.idr_lock;
 
-	mask = inotify_arg_to_mask(inode, arg);
-
 	tmp_i_mark = kmem_cache_alloc(inotify_inode_mark_cachep, GFP_KERNEL);
 	if (unlikely(!tmp_i_mark))
 		return -ENOMEM;
 
 	fsnotify_init_mark(&tmp_i_mark->fsn_mark, group);
-	tmp_i_mark->fsn_mark.mask = mask;
+	tmp_i_mark->fsn_mark.mask = inotify_arg_to_mask(inode, arg);
+	tmp_i_mark->fsn_mark.flags = inotify_arg_to_flags(arg);
 	tmp_i_mark->wd = -1;
 
 	ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
@@ -845,9 +857,7 @@ static int __init inotify_user_setup(void)
 	BUILD_BUG_ON(IN_UNMOUNT != FS_UNMOUNT);
 	BUILD_BUG_ON(IN_Q_OVERFLOW != FS_Q_OVERFLOW);
 	BUILD_BUG_ON(IN_IGNORED != FS_IN_IGNORED);
-	BUILD_BUG_ON(IN_EXCL_UNLINK != FS_EXCL_UNLINK);
 	BUILD_BUG_ON(IN_ISDIR != FS_ISDIR);
-	BUILD_BUG_ON(IN_ONESHOT != FS_IN_ONESHOT);
 
 	BUILD_BUG_ON(HWEIGHT32(ALL_INOTIFY_BITS) != 22);
 
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 0805b74cae44..8ecdc1750e67 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -55,7 +55,6 @@
 #define FS_ACCESS_PERM		0x00020000	/* access event in a permissions hook */
 #define FS_OPEN_EXEC_PERM	0x00040000	/* open/exec event in a permission hook */
 
-#define FS_EXCL_UNLINK		0x04000000	/* do not send events if object is unlinked */
 /*
  * Set on inode mark that cares about things that happen to its children.
  * Always set for dnotify and inotify.
@@ -66,7 +65,6 @@
 #define FS_RENAME		0x10000000	/* File was renamed */
 #define FS_DN_MULTISHOT		0x20000000	/* dnotify multishot */
 #define FS_ISDIR		0x40000000	/* event occurred against dir */
-#define FS_IN_ONESHOT		0x80000000	/* only send event once */
 
 #define FS_MOVE			(FS_MOVED_FROM | FS_MOVED_TO)
 
@@ -106,8 +104,7 @@
 			     FS_ERROR)
 
 /* Extra flags that may be reported with event or control handling of events */
-#define ALL_FSNOTIFY_FLAGS  (FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
-			     FS_DN_MULTISHOT | FS_EVENT_ON_CHILD)
+#define ALL_FSNOTIFY_FLAGS  (FS_ISDIR | FS_EVENT_ON_CHILD | FS_DN_MULTISHOT)
 
 #define ALL_FSNOTIFY_BITS   (ALL_FSNOTIFY_EVENTS | ALL_FSNOTIFY_FLAGS)
 
@@ -473,9 +470,13 @@ struct fsnotify_mark {
 	struct fsnotify_mark_connector *connector;
 	/* Events types to ignore [mark->lock, group->mark_mutex] */
 	__u32 ignored_mask;
-#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x01
-#define FSNOTIFY_MARK_FLAG_ALIVE		0x02
-#define FSNOTIFY_MARK_FLAG_ATTACHED		0x04
+	/* Internal fsnotify flags */
+#define FSNOTIFY_MARK_FLAG_ALIVE		0x0001
+#define FSNOTIFY_MARK_FLAG_ATTACHED		0x0002
+	/* Backend controlled flags */
+#define FSNOTIFY_MARK_FLAG_EXCL_UNLINK		0x0010
+#define FSNOTIFY_MARK_FLAG_IN_ONESHOT		0x0020
+#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x0100
 	unsigned int flags;		/* flags [mark->lock] */
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 03/16] fsnotify: pass flags argument to fsnotify_add_mark() via mark
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 02/16] inotify: move control flags from mask to mark flags Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors Amir Goldstein
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Instead of passing the allow_dups argument to fsnotify_add_mark()
as an argument, define the mark flag FSNOTIFY_MARK_FLAG_ALLOW_DUPS
to express the old allow_dups meaning and pass this information on the
mark itself.

We use mark->flags to pass inotify control flags and will pass more
control flags in the future so let's make this the standard.

Although the FSNOTIFY_MARK_FLAG_ALLOW_DUPS flag is not used after the
call to fsnotify_add_mark(), it does not hurt to leave this information
on the mark for future use.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c |  2 +-
 fs/notify/inotify/inotify_user.c   |  4 ++--
 fs/notify/mark.c                   | 13 ++++++-------
 include/linux/fsnotify_backend.h   | 19 ++++++++++---------
 kernel/audit_fsnotify.c            |  3 ++-
 5 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 9b32b76a9c30..0f0db1efa379 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1144,7 +1144,7 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 	}
 
 	fsnotify_init_mark(mark, group);
-	ret = fsnotify_add_mark_locked(mark, connp, obj_type, 0, fsid);
+	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
 	if (ret) {
 		fsnotify_put_mark(mark);
 		goto out_dec_ucounts;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index d8907d32a05b..6fc0f598a7aa 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -603,7 +603,6 @@ static int inotify_new_watch(struct fsnotify_group *group,
 
 	fsnotify_init_mark(&tmp_i_mark->fsn_mark, group);
 	tmp_i_mark->fsn_mark.mask = inotify_arg_to_mask(inode, arg);
-	tmp_i_mark->fsn_mark.flags = inotify_arg_to_flags(arg);
 	tmp_i_mark->wd = -1;
 
 	ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
@@ -618,7 +617,8 @@ static int inotify_new_watch(struct fsnotify_group *group,
 	}
 
 	/* we are on the idr, now get on the inode */
-	ret = fsnotify_add_inode_mark_locked(&tmp_i_mark->fsn_mark, inode, 0);
+	ret = fsnotify_add_inode_mark_locked(&tmp_i_mark->fsn_mark, inode,
+					     inotify_arg_to_flags(arg));
 	if (ret) {
 		/* we failed to get on the inode, get off the idr */
 		inotify_remove_from_idr(group, tmp_i_mark);
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 4853184f7dde..b1443e66ba26 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -574,7 +574,7 @@ static struct fsnotify_mark_connector *fsnotify_grab_connector(
 static int fsnotify_add_mark_list(struct fsnotify_mark *mark,
 				  fsnotify_connp_t *connp,
 				  unsigned int obj_type,
-				  int allow_dups, __kernel_fsid_t *fsid)
+				  __kernel_fsid_t *fsid)
 {
 	struct fsnotify_mark *lmark, *last = NULL;
 	struct fsnotify_mark_connector *conn;
@@ -633,7 +633,7 @@ static int fsnotify_add_mark_list(struct fsnotify_mark *mark,
 
 		if ((lmark->group == mark->group) &&
 		    (lmark->flags & FSNOTIFY_MARK_FLAG_ATTACHED) &&
-		    !allow_dups) {
+		    !(mark->flags & FSNOTIFY_MARK_FLAG_ALLOW_DUPS)) {
 			err = -EEXIST;
 			goto out_err;
 		}
@@ -668,7 +668,7 @@ static int fsnotify_add_mark_list(struct fsnotify_mark *mark,
  */
 int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 			     fsnotify_connp_t *connp, unsigned int obj_type,
-			     int allow_dups, __kernel_fsid_t *fsid)
+			     __kernel_fsid_t *fsid)
 {
 	struct fsnotify_group *group = mark->group;
 	int ret = 0;
@@ -688,7 +688,7 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 	fsnotify_get_mark(mark); /* for g_list */
 	spin_unlock(&mark->lock);
 
-	ret = fsnotify_add_mark_list(mark, connp, obj_type, allow_dups, fsid);
+	ret = fsnotify_add_mark_list(mark, connp, obj_type, fsid);
 	if (ret)
 		goto err;
 
@@ -708,14 +708,13 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 }
 
 int fsnotify_add_mark(struct fsnotify_mark *mark, fsnotify_connp_t *connp,
-		      unsigned int obj_type, int allow_dups,
-		      __kernel_fsid_t *fsid)
+		      unsigned int obj_type, __kernel_fsid_t *fsid)
 {
 	int ret;
 	struct fsnotify_group *group = mark->group;
 
 	mutex_lock(&group->mark_mutex);
-	ret = fsnotify_add_mark_locked(mark, connp, obj_type, allow_dups, fsid);
+	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
 	mutex_unlock(&group->mark_mutex);
 	return ret;
 }
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 8ecdc1750e67..9e8b5b52b9de 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -476,6 +476,7 @@ struct fsnotify_mark {
 	/* Backend controlled flags */
 #define FSNOTIFY_MARK_FLAG_EXCL_UNLINK		0x0010
 #define FSNOTIFY_MARK_FLAG_IN_ONESHOT		0x0020
+#define FSNOTIFY_MARK_FLAG_ALLOW_DUPS		0x0040
 #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x0100
 	unsigned int flags;		/* flags [mark->lock] */
 };
@@ -633,30 +634,30 @@ extern struct fsnotify_mark *fsnotify_find_mark(fsnotify_connp_t *connp,
 /* Get cached fsid of filesystem containing object */
 extern int fsnotify_get_conn_fsid(const struct fsnotify_mark_connector *conn,
 				  __kernel_fsid_t *fsid);
+
 /* attach the mark to the object */
 extern int fsnotify_add_mark(struct fsnotify_mark *mark,
 			     fsnotify_connp_t *connp, unsigned int obj_type,
-			     int allow_dups, __kernel_fsid_t *fsid);
+			     __kernel_fsid_t *fsid);
 extern int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 				    fsnotify_connp_t *connp,
-				    unsigned int obj_type, int allow_dups,
+				    unsigned int obj_type,
 				    __kernel_fsid_t *fsid);
 
 /* attach the mark to the inode */
 static inline int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
-					  struct inode *inode,
-					  int allow_dups)
+					  struct inode *inode, int flags)
 {
+	mark->flags = flags;
 	return fsnotify_add_mark(mark, &inode->i_fsnotify_marks,
-				 FSNOTIFY_OBJ_TYPE_INODE, allow_dups, NULL);
+				 FSNOTIFY_OBJ_TYPE_INODE, NULL);
 }
 static inline int fsnotify_add_inode_mark_locked(struct fsnotify_mark *mark,
-						 struct inode *inode,
-						 int allow_dups)
+						 struct inode *inode, int flags)
 {
+	mark->flags = flags;
 	return fsnotify_add_mark_locked(mark, &inode->i_fsnotify_marks,
-					FSNOTIFY_OBJ_TYPE_INODE, allow_dups,
-					NULL);
+					FSNOTIFY_OBJ_TYPE_INODE, NULL);
 }
 
 /* given a group and a mark, flag mark to be freed when all references are dropped */
diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c
index 02348b48447c..3c35649bc7f5 100644
--- a/kernel/audit_fsnotify.c
+++ b/kernel/audit_fsnotify.c
@@ -100,7 +100,8 @@ struct audit_fsnotify_mark *audit_alloc_mark(struct audit_krule *krule, char *pa
 	audit_update_mark(audit_mark, dentry->d_inode);
 	audit_mark->rule = krule;
 
-	ret = fsnotify_add_inode_mark(&audit_mark->mark, inode, true);
+	ret = fsnotify_add_inode_mark(&audit_mark->mark, inode,
+				      FSNOTIFY_MARK_FLAG_ALLOW_DUPS);
 	if (ret < 0) {
 		fsnotify_put_mark(&audit_mark->mark);
 		audit_mark = ERR_PTR(ret);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (2 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 03/16] fsnotify: pass flags argument to fsnotify_add_mark() via mark Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-04-05 12:54   ` Jan Kara
  2022-03-29  7:48 ` [PATCH v2 05/16] fsnotify: fix wrong lockdep annotations Amir Goldstein
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

s_fsnotify_connectors is elevated for every inode mark in addition to
the refcount already taken by the inode connector.

This is a relic from s_fsnotify_inode_refs pre connector era.
Remove those unneeded recounts.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/mark.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index b1443e66ba26..698ed0a1a47e 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -169,21 +169,6 @@ static void fsnotify_connector_destroy_workfn(struct work_struct *work)
 	}
 }
 
-static void fsnotify_get_inode_ref(struct inode *inode)
-{
-	ihold(inode);
-	atomic_long_inc(&inode->i_sb->s_fsnotify_connectors);
-}
-
-static void fsnotify_put_inode_ref(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	iput(inode);
-	if (atomic_long_dec_and_test(&sb->s_fsnotify_connectors))
-		wake_up_var(&sb->s_fsnotify_connectors);
-}
-
 static void fsnotify_get_sb_connectors(struct fsnotify_mark_connector *conn)
 {
 	struct super_block *sb = fsnotify_connector_sb(conn);
@@ -245,7 +230,7 @@ static void fsnotify_drop_object(unsigned int type, void *objp)
 	/* Currently only inode references are passed to be dropped */
 	if (WARN_ON_ONCE(type != FSNOTIFY_OBJ_TYPE_INODE))
 		return;
-	fsnotify_put_inode_ref(objp);
+	iput(objp);
 }
 
 void fsnotify_put_mark(struct fsnotify_mark *mark)
@@ -519,7 +504,7 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 	}
 	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
 		inode = fsnotify_conn_inode(conn);
-		fsnotify_get_inode_ref(inode);
+		ihold(inode);
 	}
 	fsnotify_get_sb_connectors(conn);
 
@@ -530,7 +515,7 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 	if (cmpxchg(connp, NULL, conn)) {
 		/* Someone else created list structure for us */
 		if (inode)
-			fsnotify_put_inode_ref(inode);
+			iput(inode);
 		fsnotify_put_sb_connectors(conn);
 		kmem_cache_free(fsnotify_mark_connector_cachep, conn);
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 05/16] fsnotify: fix wrong lockdep annotations
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (3 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock Amir Goldstein
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Commit 6960b0d909cd ("fsnotify: change locking order") changed some
of the mark_mutex locks in direct reclaim path to use:
  mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);

This change is explained:
 "...It uses nested locking to avoid deadlock in case we do the final
  iput() on an inode which still holds marks and thus would take the
  mutex again when calling fsnotify_inode_delete() in destroy_inode()."

The problem is that the mutex_lock_nested() is not a nested lock at
all. In fact, it has the opposite effect of preventing lockdep from
warning about a very possible deadlock.

Due to these wrong annotations, a deadlock that was introduced with
nfsd filecache in kernel v5.4 went unnoticed in v5.4.y for over two
years until it was reported recently by Khazhismel Kumykov, only to
find out that the deadlock was already fixed in kernel v5.5.

Fix the wrong lockdep annotations.

Cc: Khazhismel Kumykov <khazhy@google.com>
Fixes: 6960b0d909cd ("fsnotify: change locking order")
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/mark.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 698ed0a1a47e..3faf47def7d8 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -437,7 +437,7 @@ void fsnotify_free_mark(struct fsnotify_mark *mark)
 void fsnotify_destroy_mark(struct fsnotify_mark *mark,
 			   struct fsnotify_group *group)
 {
-	mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);
+	mutex_lock(&group->mark_mutex);
 	fsnotify_detach_mark(mark);
 	mutex_unlock(&group->mark_mutex);
 	fsnotify_free_mark(mark);
@@ -754,7 +754,7 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
 	 * move marks to free to to_free list in one go and then free marks in
 	 * to_free list one by one.
 	 */
-	mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);
+	mutex_lock(&group->mark_mutex);
 	list_for_each_entry_safe(mark, lmark, &group->marks_list, g_list) {
 		if (mark->connector->type == obj_type)
 			list_move(&mark->g_list, &to_free);
@@ -763,7 +763,7 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
 
 clear:
 	while (1) {
-		mutex_lock_nested(&group->mark_mutex, SINGLE_DEPTH_NESTING);
+		mutex_lock(&group->mark_mutex);
 		if (list_empty(head)) {
 			mutex_unlock(&group->mark_mutex);
 			break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (4 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 05/16] fsnotify: fix wrong lockdep annotations Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-04-07 14:35   ` Jan Kara
  2022-03-29  7:48 ` [PATCH v2 07/16] inotify: use fsnotify group lock helpers Amir Goldstein
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Create helpers to take and release the group mark_mutex lock.

Define a flag FSNOTIFY_GROUP_NOFS in fsnotify backend operations struct
that determines if the mark_mutex lock is fs reclaim safe or not.
If not safe, the nofs lock helpers should be used to take the lock and
disable direct fs reclaim.

In that case we annotate the mutex with different lockdep class to
express to lockdep that an allocation of mark of an fs reclaim safe group
may take the group lock of another "NOFS" group to evict inodes.

For now, converted only the callers in common code and no backend
defines the NOFS flag.  It is intended to be set by fanotify for
evictable marks support.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fdinfo.c               |  5 ++--
 fs/notify/group.c                | 11 ++++++++
 fs/notify/mark.c                 | 28 ++++++++++---------
 include/linux/fsnotify_backend.h | 48 ++++++++++++++++++++++++++++++++
 4 files changed, 77 insertions(+), 15 deletions(-)

diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index 3451708fd035..754a546d647d 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -27,14 +27,15 @@ static void show_fdinfo(struct seq_file *m, struct file *f,
 {
 	struct fsnotify_group *group = f->private_data;
 	struct fsnotify_mark *mark;
+	unsigned int nofs;
 
-	mutex_lock(&group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(group);
 	list_for_each_entry(mark, &group->marks_list, g_list) {
 		show(m, mark);
 		if (seq_has_overflowed(m))
 			break;
 	}
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 }
 
 #if defined(CONFIG_EXPORTFS)
diff --git a/fs/notify/group.c b/fs/notify/group.c
index b7d4d64f87c2..0f585febf3d7 100644
--- a/fs/notify/group.c
+++ b/fs/notify/group.c
@@ -114,6 +114,7 @@ EXPORT_SYMBOL_GPL(fsnotify_put_group);
 static struct fsnotify_group *__fsnotify_alloc_group(
 				const struct fsnotify_ops *ops, gfp_t gfp)
 {
+	static struct lock_class_key nofs_marks_lock;
 	struct fsnotify_group *group;
 
 	group = kzalloc(sizeof(struct fsnotify_group), gfp);
@@ -133,6 +134,16 @@ static struct fsnotify_group *__fsnotify_alloc_group(
 	INIT_LIST_HEAD(&group->marks_list);
 
 	group->ops = ops;
+	/*
+	 * For most backends, eviction of inode with a mark is not expected,
+	 * because marks hold a refcount on the inode against eviction.
+	 *
+	 * Use a different lockdep class for groups that support evictable
+	 * inode marks, because with evictable marks, mark_mutex is NOT
+	 * fs-reclaim safe - the mutex is taken when evicting inodes.
+	 */
+	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
+		lockdep_set_class(&group->mark_mutex, &nofs_marks_lock);
 
 	return group;
 }
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 3faf47def7d8..94d53f9d2b5f 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -383,9 +383,7 @@ void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info)
  */
 void fsnotify_detach_mark(struct fsnotify_mark *mark)
 {
-	struct fsnotify_group *group = mark->group;
-
-	WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
+	fsnotify_group_assert_locked(mark->group);
 	WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
 		     refcount_read(&mark->refcnt) < 1 +
 			!!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED));
@@ -437,9 +435,11 @@ void fsnotify_free_mark(struct fsnotify_mark *mark)
 void fsnotify_destroy_mark(struct fsnotify_mark *mark,
 			   struct fsnotify_group *group)
 {
-	mutex_lock(&group->mark_mutex);
+	unsigned int nofs;
+
+	nofs = fsnotify_group_nofs_lock(group);
 	fsnotify_detach_mark(mark);
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 	fsnotify_free_mark(mark);
 }
 EXPORT_SYMBOL_GPL(fsnotify_destroy_mark);
@@ -658,7 +658,7 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 	struct fsnotify_group *group = mark->group;
 	int ret = 0;
 
-	BUG_ON(!mutex_is_locked(&group->mark_mutex));
+	fsnotify_group_assert_locked(group);
 
 	/*
 	 * LOCKING ORDER!!!!
@@ -697,10 +697,11 @@ int fsnotify_add_mark(struct fsnotify_mark *mark, fsnotify_connp_t *connp,
 {
 	int ret;
 	struct fsnotify_group *group = mark->group;
+	unsigned int nofs;
 
-	mutex_lock(&group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(group);
 	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(fsnotify_add_mark);
@@ -739,6 +740,7 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
 	struct fsnotify_mark *lmark, *mark;
 	LIST_HEAD(to_free);
 	struct list_head *head = &to_free;
+	unsigned int nofs;
 
 	/* Skip selection step if we want to clear all marks. */
 	if (obj_type == FSNOTIFY_OBJ_TYPE_ANY) {
@@ -754,24 +756,24 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
 	 * move marks to free to to_free list in one go and then free marks in
 	 * to_free list one by one.
 	 */
-	mutex_lock(&group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(group);
 	list_for_each_entry_safe(mark, lmark, &group->marks_list, g_list) {
 		if (mark->connector->type == obj_type)
 			list_move(&mark->g_list, &to_free);
 	}
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 
 clear:
 	while (1) {
-		mutex_lock(&group->mark_mutex);
+		nofs = fsnotify_group_nofs_lock(group);
 		if (list_empty(head)) {
-			mutex_unlock(&group->mark_mutex);
+			fsnotify_group_nofs_unlock(group, nofs);
 			break;
 		}
 		mark = list_first_entry(head, struct fsnotify_mark, g_list);
 		fsnotify_get_mark(mark);
 		fsnotify_detach_mark(mark);
-		mutex_unlock(&group->mark_mutex);
+		fsnotify_group_nofs_unlock(group, nofs);
 		fsnotify_free_mark(mark);
 		fsnotify_put_mark(mark);
 	}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 9e8b5b52b9de..083333ad451c 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -20,6 +20,7 @@
 #include <linux/user_namespace.h>
 #include <linux/refcount.h>
 #include <linux/mempool.h>
+#include <linux/sched/mm.h>
 
 /*
  * IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
@@ -152,6 +153,10 @@ struct mem_cgroup;
  *		userspace messages that marks have been removed.
  */
 struct fsnotify_ops {
+#define FSNOTIFY_GROUP_NOFS	0x01 /* group lock is not direct reclaim safe */
+#define FSNOTIFY_GROUP_FLAG(group, flag) \
+	((group)->ops->group_flags & FSNOTIFY_GROUP_ ## flag)
+	int group_flags;
 	int (*handle_event)(struct fsnotify_group *group, u32 mask,
 			    const void *data, int data_type, struct inode *dir,
 			    const struct qstr *file_name, u32 cookie,
@@ -250,6 +255,49 @@ struct fsnotify_group {
 	};
 };
 
+/*
+ * Use this from common code to prevent deadlock when reclaiming inodes with
+ * evictable marks of the same group that is allocating a new mark.
+ */
+static inline unsigned int fsnotify_group_nofs_lock(
+						struct fsnotify_group *group)
+{
+	unsigned int nofs = current->flags & PF_MEMALLOC_NOFS;
+
+	mutex_lock(&group->mark_mutex);
+	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
+		nofs = memalloc_nofs_save();
+	return nofs;
+}
+
+static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
+{
+	WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
+	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
+		WARN_ON_ONCE(!(current->flags & PF_MEMALLOC_NOFS));
+}
+
+static inline void fsnotify_group_nofs_unlock(struct fsnotify_group *group,
+					      unsigned int nofs)
+{
+	memalloc_nofs_restore(nofs);
+	mutex_unlock(&group->mark_mutex);
+}
+
+/*
+ * Use this from common code that does not allocate memory or from backends
+ * who are known to be fs reclaim safe (i.e. no evictable inode marks).
+ */
+static inline void fsnotify_group_lock(struct fsnotify_group *group)
+{
+	mutex_lock(&group->mark_mutex);
+}
+
+static inline void fsnotify_group_unlock(struct fsnotify_group *group)
+{
+	mutex_unlock(&group->mark_mutex);
+}
+
 /* When calling fsnotify tell it if the data is a path or inode */
 enum fsnotify_data_type {
 	FSNOTIFY_EVENT_NONE,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 07/16] inotify: use fsnotify group lock helpers
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (5 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 08/16] audit: " Amir Goldstein
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

inotify inode marks pin the inode so there is no need to use the nofs
lock variants.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/inotify/inotify_user.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 6fc0f598a7aa..060621faa762 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -640,13 +640,13 @@ static int inotify_update_watch(struct fsnotify_group *group, struct inode *inod
 {
 	int ret = 0;
 
-	mutex_lock(&group->mark_mutex);
+	fsnotify_group_lock(group);
 	/* try to update and existing watch with the new arg */
 	ret = inotify_update_existing_watch(group, inode, arg);
 	/* no mark present, try to add a new one */
 	if (ret == -ENOENT)
 		ret = inotify_new_watch(group, inode, arg);
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_unlock(group);
 
 	return ret;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 08/16] audit: use fsnotify group lock helpers
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (6 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 07/16] inotify: use fsnotify group lock helpers Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 09/16] nfsd: " Amir Goldstein
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

audit inode marks pin the inode so there is no need to use the nofs
lock variants.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 kernel/audit_tree.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index e7315d487163..eaef9c0f1c10 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -351,7 +351,7 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *mark)
 	struct audit_chunk *new;
 	int size;
 
-	mutex_lock(&audit_tree_group->mark_mutex);
+	fsnotify_group_lock(audit_tree_group);
 	/*
 	 * mark_mutex stabilizes chunk attached to the mark so we can check
 	 * whether it didn't change while we've dropped hash_lock.
@@ -368,7 +368,7 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *mark)
 		replace_mark_chunk(mark, NULL);
 		spin_unlock(&hash_lock);
 		fsnotify_detach_mark(mark);
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		audit_mark_put_chunk(chunk);
 		fsnotify_free_mark(mark);
 		return;
@@ -385,12 +385,12 @@ static void untag_chunk(struct audit_chunk *chunk, struct fsnotify_mark *mark)
 	 */
 	replace_chunk(new, chunk);
 	spin_unlock(&hash_lock);
-	mutex_unlock(&audit_tree_group->mark_mutex);
+	fsnotify_group_unlock(audit_tree_group);
 	audit_mark_put_chunk(chunk);
 	return;
 
 out_mutex:
-	mutex_unlock(&audit_tree_group->mark_mutex);
+	fsnotify_group_unlock(audit_tree_group);
 }
 
 /* Call with group->mark_mutex held, releases it */
@@ -400,19 +400,19 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree)
 	struct audit_chunk *chunk = alloc_chunk(1);
 
 	if (!chunk) {
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		return -ENOMEM;
 	}
 
 	mark = alloc_mark();
 	if (!mark) {
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		kfree(chunk);
 		return -ENOMEM;
 	}
 
 	if (fsnotify_add_inode_mark_locked(mark, inode, 0)) {
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		fsnotify_put_mark(mark);
 		kfree(chunk);
 		return -ENOSPC;
@@ -422,7 +422,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree)
 	if (tree->goner) {
 		spin_unlock(&hash_lock);
 		fsnotify_detach_mark(mark);
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		fsnotify_free_mark(mark);
 		fsnotify_put_mark(mark);
 		kfree(chunk);
@@ -444,7 +444,7 @@ static int create_chunk(struct inode *inode, struct audit_tree *tree)
 	 */
 	insert_hash(chunk);
 	spin_unlock(&hash_lock);
-	mutex_unlock(&audit_tree_group->mark_mutex);
+	fsnotify_group_unlock(audit_tree_group);
 	/*
 	 * Drop our initial reference. When mark we point to is getting freed,
 	 * we get notification through ->freeing_mark callback and cleanup
@@ -462,7 +462,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree)
 	struct audit_node *p;
 	int n;
 
-	mutex_lock(&audit_tree_group->mark_mutex);
+	fsnotify_group_lock(audit_tree_group);
 	mark = fsnotify_find_mark(&inode->i_fsnotify_marks, audit_tree_group);
 	if (!mark)
 		return create_chunk(inode, tree);
@@ -478,7 +478,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree)
 	for (n = 0; n < old->count; n++) {
 		if (old->owners[n].owner == tree) {
 			spin_unlock(&hash_lock);
-			mutex_unlock(&audit_tree_group->mark_mutex);
+			fsnotify_group_unlock(audit_tree_group);
 			fsnotify_put_mark(mark);
 			return 0;
 		}
@@ -487,7 +487,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree)
 
 	chunk = alloc_chunk(old->count + 1);
 	if (!chunk) {
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		fsnotify_put_mark(mark);
 		return -ENOMEM;
 	}
@@ -495,7 +495,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree)
 	spin_lock(&hash_lock);
 	if (tree->goner) {
 		spin_unlock(&hash_lock);
-		mutex_unlock(&audit_tree_group->mark_mutex);
+		fsnotify_group_unlock(audit_tree_group);
 		fsnotify_put_mark(mark);
 		kfree(chunk);
 		return 0;
@@ -515,7 +515,7 @@ static int tag_chunk(struct inode *inode, struct audit_tree *tree)
 	 */
 	replace_chunk(chunk, old);
 	spin_unlock(&hash_lock);
-	mutex_unlock(&audit_tree_group->mark_mutex);
+	fsnotify_group_unlock(audit_tree_group);
 	fsnotify_put_mark(mark); /* pair to fsnotify_find_mark */
 	audit_mark_put_chunk(old);
 
@@ -1044,12 +1044,12 @@ static void audit_tree_freeing_mark(struct fsnotify_mark *mark,
 {
 	struct audit_chunk *chunk;
 
-	mutex_lock(&mark->group->mark_mutex);
+	fsnotify_group_lock(mark->group);
 	spin_lock(&hash_lock);
 	chunk = mark_chunk(mark);
 	replace_mark_chunk(mark, NULL);
 	spin_unlock(&hash_lock);
-	mutex_unlock(&mark->group->mark_mutex);
+	fsnotify_group_unlock(mark->group);
 	if (chunk) {
 		evict_chunk(chunk);
 		audit_mark_put_chunk(chunk);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 09/16] nfsd: use fsnotify group lock helpers
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (7 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 08/16] audit: " Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 10/16] dnotify: " Amir Goldstein
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Before commit 9542e6a643fc6 ("nfsd: Containerise filecache laundrette")
nfsd would close open files in direct reclaim context and that could
cause a deadlock when fsnotify mark allocation went into direct reclaim
and nfsd shrinker tried to free existing fsnotify marks.

To avoid issues like this in future code, set the FSNOTIFY_GROUP_NOFS
flag on nfsd fsnotify group to prevent going into direct reclaim from
fsnotify_add_inode_mark().

The lookup of fsnotify mark does not allocate memory so the nofs
variants of fsnotify group helpers are not needed.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/nfsd/filecache.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 8bc807c5fea4..0a104e530934 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -118,14 +118,14 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 	struct inode *inode = nf->nf_inode;
 
 	do {
-		mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
+		fsnotify_group_lock(nfsd_file_fsnotify_group);
 		mark = fsnotify_find_mark(&inode->i_fsnotify_marks,
-				nfsd_file_fsnotify_group);
+					  nfsd_file_fsnotify_group);
 		if (mark) {
 			nfm = nfsd_file_mark_get(container_of(mark,
 						 struct nfsd_file_mark,
 						 nfm_mark));
-			mutex_unlock(&nfsd_file_fsnotify_group->mark_mutex);
+			fsnotify_group_unlock(nfsd_file_fsnotify_group);
 			if (nfm) {
 				fsnotify_put_mark(mark);
 				break;
@@ -133,8 +133,9 @@ nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 			/* Avoid soft lockup race with nfsd_file_mark_put() */
 			fsnotify_destroy_mark(mark, nfsd_file_fsnotify_group);
 			fsnotify_put_mark(mark);
-		} else
-			mutex_unlock(&nfsd_file_fsnotify_group->mark_mutex);
+		} else {
+			fsnotify_group_unlock(nfsd_file_fsnotify_group);
+		}
 
 		/* allocate a new nfm */
 		new = kmem_cache_alloc(nfsd_file_mark_slab, GFP_KERNEL);
@@ -613,6 +614,7 @@ nfsd_file_fsnotify_handle_event(struct fsnotify_mark *mark, u32 mask,
 
 
 static const struct fsnotify_ops nfsd_file_fsnotify_ops = {
+	.group_flags = FSNOTIFY_GROUP_NOFS,
 	.handle_inode_event = nfsd_file_fsnotify_handle_event,
 	.free_mark = nfsd_file_mark_free,
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 10/16] dnotify: use fsnotify group lock helpers
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (8 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 09/16] nfsd: " Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:48 ` [PATCH v2 11/16] fsnotify: allow adding an inode mark without pinning inode Amir Goldstein
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Before commit 9542e6a643fc6 ("nfsd: Containerise filecache laundrette")
nfsd would close open files in direct reclaim context.  There is no
guarantee that others memory shrinkers don't do the same and no
guarantee that future shrinkers won't do that.

For example, if overlayfs implements inode cache of fscache would
keep open files to cached objects, inode shrinkers could end up closing
open files to underlying fs.

Direct reclaim from dnotify mark allocation context may try to close
open files that have dnotify marks of the same group and hit a deadlock
on mark_mutex.

Set the FSNOTIFY_GROUP_NOFS flag to prevent going into direct reclaim
from allocations under dnotify group lock and use the nofs group lock
helpers.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/dnotify/dnotify.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 829dd4a61b66..b291141104ea 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -139,6 +139,7 @@ static void dnotify_free_mark(struct fsnotify_mark *fsn_mark)
 }
 
 static const struct fsnotify_ops dnotify_fsnotify_ops = {
+	.group_flags = FSNOTIFY_GROUP_NOFS,
 	.handle_inode_event = dnotify_handle_event,
 	.free_mark = dnotify_free_mark,
 };
@@ -158,6 +159,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
 	struct dnotify_struct **prev;
 	struct inode *inode;
 	bool free = false;
+	unsigned int nofs;
 
 	inode = file_inode(filp);
 	if (!S_ISDIR(inode->i_mode))
@@ -168,7 +170,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
 		return;
 	dn_mark = container_of(fsn_mark, struct dnotify_mark, fsn_mark);
 
-	mutex_lock(&dnotify_group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(dnotify_group);
 
 	spin_lock(&fsn_mark->lock);
 	prev = &dn_mark->dn;
@@ -191,7 +193,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
 		free = true;
 	}
 
-	mutex_unlock(&dnotify_group->mark_mutex);
+	fsnotify_group_nofs_unlock(dnotify_group, nofs);
 
 	if (free)
 		fsnotify_free_mark(fsn_mark);
@@ -267,6 +269,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 	fl_owner_t id = current->files;
 	struct file *f;
 	int destroy = 0, error = 0;
+	unsigned int nofs;
 	__u32 mask;
 
 	/* we use these to tell if we need to kfree */
@@ -324,7 +327,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 	new_dn_mark->dn = NULL;
 
 	/* this is needed to prevent the fcntl/close race described below */
-	mutex_lock(&dnotify_group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(dnotify_group);
 
 	/* add the new_fsn_mark or find an old one. */
 	fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
@@ -334,7 +337,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 	} else {
 		error = fsnotify_add_inode_mark_locked(new_fsn_mark, inode, 0);
 		if (error) {
-			mutex_unlock(&dnotify_group->mark_mutex);
+			fsnotify_group_nofs_unlock(dnotify_group, nofs);
 			goto out_err;
 		}
 		spin_lock(&new_fsn_mark->lock);
@@ -383,7 +386,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
 
 	if (destroy)
 		fsnotify_detach_mark(fsn_mark);
-	mutex_unlock(&dnotify_group->mark_mutex);
+	fsnotify_group_nofs_unlock(dnotify_group, nofs);
 	if (destroy)
 		fsnotify_free_mark(fsn_mark);
 	fsnotify_put_mark(fsn_mark);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 11/16] fsnotify: allow adding an inode mark without pinning inode
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (9 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 10/16] dnotify: " Amir Goldstein
@ 2022-03-29  7:48 ` Amir Goldstein
  2022-03-29  7:49 ` [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags() Amir Goldstein
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

fsnotify_add_mark() and variants implicitly take a reference on inode
when attaching a mark to an inode.

Make that behavior opt-out with the mark flag FSNOTIFY_MARK_FLAG_NO_IREF.

Instead of taking the inode reference when attaching connector to inode
and dropping the inode reference when detaching connector from inode,
take the inode reference on attach of the first mark that wants to hold
an inode reference and drop the inode reference on detach of the last
mark that wants to hold an inode reference.

Backends can "upgrade" an existing mark to take an inode reference, but
cannot "downgrade" a mark with inode reference to release the refernce.

This leaves the choice to the backend whether or not to pin the inode
when adding an inode mark.

This is intended to be used when adding a mark with ignored mask that is
used for optimization in cases where group can afford getting unneeded
events and reinstate the mark with ignored mask when inode is accessed
again after being evicted.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/mark.c                 | 68 +++++++++++++++++++++++++-------
 include/linux/fsnotify_backend.h |  2 +
 2 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 94d53f9d2b5f..3e4de16c0593 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -116,20 +116,61 @@ __u32 fsnotify_conn_mask(struct fsnotify_mark_connector *conn)
 	return *fsnotify_conn_mask_p(conn);
 }
 
-static void __fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
+/*
+ * Update the proxy refcount on inode maintainted by connector.
+ *
+ * When it's time to drop the proxy refcount, clear the HAS_IREF flag
+ * and return the inode object.  fsnotify_drop_object() will be resonsible
+ * for doing iput() outside of spinlocks when last mark that wanted iref
+ * is detached.
+ *
+ * Note that the proxy refcount is NOT dropped if backend only sets the
+ * NO_IREF mark flag and does detach the mark!
+ */
+static struct inode *fsnotify_update_iref(struct fsnotify_mark_connector *conn,
+					  bool want_iref)
+{
+	bool has_iref = conn->flags & FSNOTIFY_CONN_FLAG_HAS_IREF;
+	struct inode *inode = NULL;
+
+	if (conn->type != FSNOTIFY_OBJ_TYPE_INODE ||
+	    want_iref == has_iref)
+		return NULL;
+
+	if (want_iref) {
+		/* Pin inode if any mark wants inode refcount held */
+		ihold(fsnotify_conn_inode(conn));
+		conn->flags |= FSNOTIFY_CONN_FLAG_HAS_IREF;
+	} else {
+		/* Unpin inode after detach of last mark that wanted iref */
+		inode = fsnotify_conn_inode(conn);
+		conn->flags &= ~FSNOTIFY_CONN_FLAG_HAS_IREF;
+	}
+
+	return inode;
+}
+
+static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 {
 	u32 new_mask = 0;
+	bool want_iref = false;
 	struct fsnotify_mark *mark;
 
 	assert_spin_locked(&conn->lock);
 	/* We can get detached connector here when inode is getting unlinked. */
 	if (!fsnotify_valid_obj_type(conn->type))
-		return;
+		return NULL;
 	hlist_for_each_entry(mark, &conn->list, obj_list) {
-		if (mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED)
-			new_mask |= fsnotify_calc_mask(mark);
+		if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED))
+			continue;
+		new_mask |= fsnotify_calc_mask(mark);
+		if (conn->type == FSNOTIFY_OBJ_TYPE_INODE &&
+		    !(mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
+			want_iref = true;
 	}
 	*fsnotify_conn_mask_p(conn) = new_mask;
+
+	return fsnotify_update_iref(conn, want_iref);
 }
 
 /*
@@ -198,6 +239,10 @@ static void *fsnotify_detach_connector_from_object(
 	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
 		inode = fsnotify_conn_inode(conn);
 		inode->i_fsnotify_mask = 0;
+
+		/* Unpin inode when detaching from connector */
+		if (!(conn->flags & FSNOTIFY_CONN_FLAG_HAS_IREF))
+			inode = NULL;
 	} else if (conn->type == FSNOTIFY_OBJ_TYPE_VFSMOUNT) {
 		fsnotify_conn_mount(conn)->mnt_fsnotify_mask = 0;
 	} else if (conn->type == FSNOTIFY_OBJ_TYPE_SB) {
@@ -208,6 +253,7 @@ static void *fsnotify_detach_connector_from_object(
 	rcu_assign_pointer(*(conn->obj), NULL);
 	conn->obj = NULL;
 	conn->type = FSNOTIFY_OBJ_TYPE_DETACHED;
+	conn->flags = 0;
 
 	return inode;
 }
@@ -259,7 +305,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
 		objp = fsnotify_detach_connector_from_object(conn, &type);
 		free_conn = true;
 	} else {
-		__fsnotify_recalc_mask(conn);
+		objp = __fsnotify_recalc_mask(conn);
+		type = conn->type;
 	}
 	WRITE_ONCE(mark->connector, NULL);
 	spin_unlock(&conn->lock);
@@ -484,7 +531,6 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 					       unsigned int obj_type,
 					       __kernel_fsid_t *fsid)
 {
-	struct inode *inode = NULL;
 	struct fsnotify_mark_connector *conn;
 
 	conn = kmem_cache_alloc(fsnotify_mark_connector_cachep, GFP_KERNEL);
@@ -492,6 +538,7 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 		return -ENOMEM;
 	spin_lock_init(&conn->lock);
 	INIT_HLIST_HEAD(&conn->list);
+	conn->flags = 0;
 	conn->type = obj_type;
 	conn->obj = connp;
 	/* Cache fsid of filesystem containing the object */
@@ -502,10 +549,6 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 		conn->fsid.val[0] = conn->fsid.val[1] = 0;
 		conn->flags = 0;
 	}
-	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
-		inode = fsnotify_conn_inode(conn);
-		ihold(inode);
-	}
 	fsnotify_get_sb_connectors(conn);
 
 	/*
@@ -514,8 +557,6 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
 	 */
 	if (cmpxchg(connp, NULL, conn)) {
 		/* Someone else created list structure for us */
-		if (inode)
-			iput(inode);
 		fsnotify_put_sb_connectors(conn);
 		kmem_cache_free(fsnotify_mark_connector_cachep, conn);
 	}
@@ -677,8 +718,7 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
 	if (ret)
 		goto err;
 
-	if (mark->mask || mark->ignored_mask)
-		fsnotify_recalc_mask(mark->connector);
+	fsnotify_recalc_mask(mark->connector);
 
 	return ret;
 err:
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 083333ad451c..df58439a86fa 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -472,6 +472,7 @@ struct fsnotify_mark_connector {
 	spinlock_t lock;
 	unsigned short type;	/* Type of object [lock] */
 #define FSNOTIFY_CONN_FLAG_HAS_FSID	0x01
+#define FSNOTIFY_CONN_FLAG_HAS_IREF	0x02
 	unsigned short flags;	/* flags [lock] */
 	__kernel_fsid_t fsid;	/* fsid of filesystem containing object */
 	union {
@@ -526,6 +527,7 @@ struct fsnotify_mark {
 #define FSNOTIFY_MARK_FLAG_IN_ONESHOT		0x0020
 #define FSNOTIFY_MARK_FLAG_ALLOW_DUPS		0x0040
 #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x0100
+#define FSNOTIFY_MARK_FLAG_NO_IREF		0x0200
 	unsigned int flags;		/* flags [mark->lock] */
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags()
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (10 preceding siblings ...)
  2022-03-29  7:48 ` [PATCH v2 11/16] fsnotify: allow adding an inode mark without pinning inode Amir Goldstein
@ 2022-03-29  7:49 ` Amir Goldstein
  2022-04-11 10:52   ` Jan Kara
  2022-03-29  7:49 ` [PATCH v2 13/16] fanotify: implement "evictable" inode marks Amir Goldstein
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Handle FAN_MARK_IGNORED_SURV_MODIFY flag change in a helper that
is called after updating the mark mask.

Move recalc of object mask inside fanotify_mark_add_to_mask() which
makes the code a bit simpler to follow.

Add also helper to translate fsnotify mark flags to user visible
fanotify mark flags.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.h      | 10 ++++++++
 fs/notify/fanotify/fanotify_user.c | 39 +++++++++++++++++-------------
 fs/notify/fdinfo.c                 |  6 ++---
 3 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index a3d5b751cac5..87142bc0131a 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -490,3 +490,13 @@ static inline unsigned int fanotify_event_hash_bucket(
 {
 	return event->hash & FANOTIFY_HTABLE_MASK;
 }
+
+static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
+{
+	unsigned int mflags = 0;
+
+	if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
+		mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
+
+	return mflags;
+}
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 0f0db1efa379..6e78ea12239c 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1081,42 +1081,50 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
 				    flags, umask);
 }
 
-static void fanotify_mark_add_ignored_mask(struct fsnotify_mark *fsn_mark,
-					   __u32 mask, unsigned int flags,
-					   __u32 *removed)
+static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
+				      unsigned int flags, bool *recalc)
 {
-	fsn_mark->ignored_mask |= mask;
-
 	/*
 	 * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
 	 * the removal of the FS_MODIFY bit in calculated mask if it was set
 	 * because of an ignored mask that is now going to survive FS_MODIFY.
 	 */
 	if ((flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
+	    (flags & FAN_MARK_IGNORED_MASK) &&
 	    !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)) {
 		fsn_mark->flags |= FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY;
 		if (!(fsn_mark->mask & FS_MODIFY))
-			*removed = FS_MODIFY;
+			*recalc = true;
 	}
+
+	return 0;
 }
 
-static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
-				       __u32 mask, unsigned int flags,
-				       __u32 *removed)
+static int fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
+				     __u32 mask, unsigned int flags)
 {
-	__u32 oldmask, newmask;
+	__u32 oldmask;
+	bool recalc = false;
+	int ret;
 
 	spin_lock(&fsn_mark->lock);
 	oldmask = fsnotify_calc_mask(fsn_mark);
 	if (!(flags & FAN_MARK_IGNORED_MASK)) {
 		fsn_mark->mask |= mask;
 	} else {
-		fanotify_mark_add_ignored_mask(fsn_mark, mask, flags, removed);
+		fsn_mark->ignored_mask |= mask;
 	}
-	newmask = fsnotify_calc_mask(fsn_mark);
+
+	recalc = fsnotify_calc_mask(fsn_mark) & ~oldmask &
+		~fsnotify_conn_mask(fsn_mark->connector);
+
+	ret = fanotify_mark_update_flags(fsn_mark, flags, &recalc);
 	spin_unlock(&fsn_mark->lock);
 
-	return newmask & ~oldmask;
+	if (recalc)
+		fsnotify_recalc_mask(fsn_mark->connector);
+
+	return ret;
 }
 
 static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
@@ -1174,7 +1182,6 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 			     __kernel_fsid_t *fsid)
 {
 	struct fsnotify_mark *fsn_mark;
-	__u32 added, removed = 0;
 	int ret = 0;
 
 	mutex_lock(&group->mark_mutex);
@@ -1197,9 +1204,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 			goto out;
 	}
 
-	added = fanotify_mark_add_to_mask(fsn_mark, mask, flags, &removed);
-	if (removed || (added & ~fsnotify_conn_mask(fsn_mark->connector)))
-		fsnotify_recalc_mask(fsn_mark->connector);
+	ret = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
 
 out:
 	mutex_unlock(&group->mark_mutex);
diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
index 754a546d647d..9f81adada3c8 100644
--- a/fs/notify/fdinfo.c
+++ b/fs/notify/fdinfo.c
@@ -14,6 +14,7 @@
 #include <linux/exportfs.h>
 
 #include "inotify/inotify.h"
+#include "fanotify/fanotify.h"
 #include "fdinfo.h"
 #include "fsnotify.h"
 
@@ -104,12 +105,9 @@ void inotify_show_fdinfo(struct seq_file *m, struct file *f)
 
 static void fanotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
 {
-	unsigned int mflags = 0;
+	unsigned int mflags = fanotify_mark_user_flags(mark);
 	struct inode *inode;
 
-	if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
-		mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
-
 	if (mark->connector->type == FSNOTIFY_OBJ_TYPE_INODE) {
 		inode = igrab(fsnotify_conn_inode(mark->connector));
 		if (!inode)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 13/16] fanotify: implement "evictable" inode marks
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (11 preceding siblings ...)
  2022-03-29  7:49 ` [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags() Amir Goldstein
@ 2022-03-29  7:49 ` Amir Goldstein
  2022-04-11 11:47   ` Jan Kara
  2022-03-29  7:49 ` [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing Amir Goldstein
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not
pin the marked inode to inode cache, so when inode is evicted from cache
due to memory pressure, the mark will be lost.

When an inode mark with flag FAN_MARK_EVICATBLE is updated without using
this flag, the marked inode is pinned to inode cache.

When an inode mark is updated with flag FAN_MARK_EVICTABLE but an
existing mark already has the inode pinned, the mark update fails with
error EEXIST.

Evictable inode marks can be used to setup inode marks with ignored mask
to suppress events from uninteresting files or directories in a lazy
manner, upon receiving the first event, without having to iterate all
the uninteresting files or directories before hand.

The evictbale inode mark feature allows performing this lazy marks setup
without exhausting the system memory with pinned inodes.

This change does not enable the feature yet.

Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxiRDpuS=2uA6+ZUM7yG9vVU-u212tkunBmSnP_u=mkv=Q@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.h      |  2 ++
 fs/notify/fanotify/fanotify_user.c | 31 +++++++++++++++++++++++++++++-
 include/uapi/linux/fanotify.h      |  1 +
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
index 87142bc0131a..80e0ec95b113 100644
--- a/fs/notify/fanotify/fanotify.h
+++ b/fs/notify/fanotify/fanotify.h
@@ -497,6 +497,8 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
 
 	if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
 		mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
+	if (mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF)
+		mflags |= FAN_MARK_EVICTABLE;
 
 	return mflags;
 }
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 6e78ea12239c..2c65038da4ce 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1084,6 +1084,8 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
 static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
 				      unsigned int flags, bool *recalc)
 {
+	bool want_iref = !(flags & FAN_MARK_EVICTABLE);
+
 	/*
 	 * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
 	 * the removal of the FS_MODIFY bit in calculated mask if it was set
@@ -1097,6 +1099,20 @@ static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
 			*recalc = true;
 	}
 
+	if (fsn_mark->connector->type != FSNOTIFY_OBJ_TYPE_INODE ||
+	    want_iref == !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
+		return 0;
+
+	/*
+	 * NO_IREF may be removed from a mark, but not added.
+	 * When removed, fsnotify_recalc_mask() will take the inode ref.
+	 */
+	if (!want_iref)
+		return -EEXIST;
+
+	fsn_mark->flags &= ~FSNOTIFY_MARK_FLAG_NO_IREF;
+	*recalc = true;
+
 	return 0;
 }
 
@@ -1130,6 +1146,7 @@ static int fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
 static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 						   fsnotify_connp_t *connp,
 						   unsigned int obj_type,
+						   unsigned int fan_flags,
 						   __kernel_fsid_t *fsid)
 {
 	struct ucounts *ucounts = group->fanotify_data.ucounts;
@@ -1152,6 +1169,9 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 	}
 
 	fsnotify_init_mark(mark, group);
+	if (fan_flags & FAN_MARK_EVICTABLE)
+		mark->flags |= FSNOTIFY_MARK_FLAG_NO_IREF;
+
 	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
 	if (ret) {
 		fsnotify_put_mark(mark);
@@ -1187,7 +1207,8 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 	mutex_lock(&group->mark_mutex);
 	fsn_mark = fsnotify_find_mark(connp, group);
 	if (!fsn_mark) {
-		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, fsid);
+		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, flags,
+						 fsid);
 		if (IS_ERR(fsn_mark)) {
 			mutex_unlock(&group->mark_mutex);
 			return PTR_ERR(fsn_mark);
@@ -1602,6 +1623,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
 	    mark_type != FAN_MARK_FILESYSTEM)
 		goto fput_and_out;
 
+	/*
+	 * Evictable is only relevant for inode marks, because only inode object
+	 * can be evicted on memory pressure.
+	 */
+	if (flags & FAN_MARK_EVICTABLE &&
+	     mark_type != FAN_MARK_INODE)
+		goto fput_and_out;
+
 	/*
 	 * Events that do not carry enough information to report
 	 * event->fd require a group that supports reporting fid.  Those
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index e8ac38cc2fd6..f1f89132d60e 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -82,6 +82,7 @@
 #define FAN_MARK_IGNORED_SURV_MODIFY	0x00000040
 #define FAN_MARK_FLUSH		0x00000080
 /* FAN_MARK_FILESYSTEM is	0x00000100 */
+#define FAN_MARK_EVICTABLE	0x00000200
 
 /* These are NOT bitwise flags.  Both bits can be used togther.  */
 #define FAN_MARK_INODE		0x00000000
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (12 preceding siblings ...)
  2022-03-29  7:49 ` [PATCH v2 13/16] fanotify: implement "evictable" inode marks Amir Goldstein
@ 2022-03-29  7:49 ` Amir Goldstein
  2022-04-11 12:53   ` Jan Kara
  2022-03-29  7:49 ` [PATCH v2 15/16] fanotify: use fsnotify group lock helpers Amir Goldstein
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

The ioctl can be used to request allocation of marks with large size
and attach them to an object, even if another mark already exists for
the group on the marked object.

These large marks serve no function other than testing direct reclaim
in the context of mark allocation.

Setting the value to 0 restores normal mark allocation.

FAN_MARK_REMOVE refers to the first mark of the group on an object, so
the number of FAN_MARK_REMOVE calls need to match the number of large
marks on the object in order to remove all marks from the object or use
FAN_MARK_FLUSH to remove all marks of that object type.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      |  5 +++-
 fs/notify/fanotify/fanotify_user.c | 42 +++++++++++++++++++++++++++---
 include/linux/fsnotify_backend.h   |  2 ++
 include/uapi/linux/fanotify.h      |  4 +++
 4 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 985e995d2a39..02990a6b1b65 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -1075,7 +1075,10 @@ static void fanotify_freeing_mark(struct fsnotify_mark *mark,
 
 static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
 {
-	kmem_cache_free(fanotify_mark_cache, fsn_mark);
+	if (fsn_mark->flags & FSNOTIFY_MARK_FLAG_KMALLOC)
+		kfree(fsn_mark);
+	else
+		kmem_cache_free(fanotify_mark_cache, fsn_mark);
 }
 
 const struct fsnotify_ops fanotify_fsnotify_ops = {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 2c65038da4ce..a3539bd8e443 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -928,6 +928,16 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
 		spin_unlock(&group->notification_lock);
 		ret = put_user(send_len, (int __user *) p);
 		break;
+	case FAN_IOC_SET_MARK_PAGE_ORDER:
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+		mutex_lock(&group->mark_mutex);
+		group->fanotify_data.mark_page_order = (unsigned int)arg;
+		pr_info("fanotify: set mark size page order to %u",
+			group->fanotify_data.mark_page_order);
+		ret = 0;
+		mutex_unlock(&group->mark_mutex);
+		break;
 	}
 
 	return ret;
@@ -1150,6 +1160,7 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 						   __kernel_fsid_t *fsid)
 {
 	struct ucounts *ucounts = group->fanotify_data.ucounts;
+	unsigned int order = group->fanotify_data.mark_page_order;
 	struct fsnotify_mark *mark;
 	int ret;
 
@@ -1162,7 +1173,21 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 	    !inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
 		return ERR_PTR(-ENOSPC);
 
-	mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+	/*
+	 * If requested to test direct reclaim in mark allocation context,
+	 * start by trying to allocate requested page order per mark and
+	 * fall back to allocation size that is likely to trigger direct
+	 * reclaim but not too large to trigger compaction.
+	 */
+	if (order) {
+		mark = kmalloc(PAGE_SIZE << order,
+			       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+		if (!mark && order > PAGE_ALLOC_COSTLY_ORDER)
+			mark = kmalloc(PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER,
+				       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
+	} else {
+		mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
+	}
 	if (!mark) {
 		ret = -ENOMEM;
 		goto out_dec_ucounts;
@@ -1171,6 +1196,15 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
 	fsnotify_init_mark(mark, group);
 	if (fan_flags & FAN_MARK_EVICTABLE)
 		mark->flags |= FSNOTIFY_MARK_FLAG_NO_IREF;
+	/*
+	 * Allow adding multiple large marks per object for testing.
+	 * FAN_MARK_REMOVE refers to the first mark of the group, so one
+	 * FAN_MARK_REMOVE is needed for every added large mark (or use
+	 * FAN_MARK_FLUSH to remove all marks).
+	 */
+	if (order)
+		mark->flags |= FSNOTIFY_MARK_FLAG_KMALLOC |
+			       FSNOTIFY_MARK_FLAG_ALLOW_DUPS;
 
 	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
 	if (ret) {
@@ -1201,11 +1235,13 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 			     __u32 mask, unsigned int flags,
 			     __kernel_fsid_t *fsid)
 {
-	struct fsnotify_mark *fsn_mark;
+	struct fsnotify_mark *fsn_mark = NULL;
 	int ret = 0;
 
 	mutex_lock(&group->mark_mutex);
-	fsn_mark = fsnotify_find_mark(connp, group);
+	/* Allow adding multiple large marks per object for testing */
+	if (!group->fanotify_data.mark_page_order)
+		fsn_mark = fsnotify_find_mark(connp, group);
 	if (!fsn_mark) {
 		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, flags,
 						 fsid);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index df58439a86fa..8220cf560c28 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -250,6 +250,7 @@ struct fsnotify_group {
 			int f_flags; /* event_f_flags from fanotify_init() */
 			struct ucounts *ucounts;
 			mempool_t error_events_pool;
+			unsigned int mark_page_order; /* for testing only */
 		} fanotify_data;
 #endif /* CONFIG_FANOTIFY */
 	};
@@ -528,6 +529,7 @@ struct fsnotify_mark {
 #define FSNOTIFY_MARK_FLAG_ALLOW_DUPS		0x0040
 #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x0100
 #define FSNOTIFY_MARK_FLAG_NO_IREF		0x0200
+#define FSNOTIFY_MARK_FLAG_KMALLOC		0x0400
 	unsigned int flags;		/* flags [mark->lock] */
 };
 
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index f1f89132d60e..49cdc9008bf2 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -3,6 +3,7 @@
 #define _UAPI_LINUX_FANOTIFY_H
 
 #include <linux/types.h>
+#include <linux/ioctl.h>
 
 /* the following events that user-space can register for */
 #define FAN_ACCESS		0x00000001	/* File was accessed */
@@ -206,4 +207,7 @@ struct fanotify_response {
 				(long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
 				(long)(meta)->event_len <= (long)(len))
 
+/* Only for testing. Not useful otherwise */
+#define	FAN_IOC_SET_MARK_PAGE_ORDER	_IOW(0xfa, 1, long)
+
 #endif /* _UAPI_LINUX_FANOTIFY_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 15/16] fanotify: use fsnotify group lock helpers
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (13 preceding siblings ...)
  2022-03-29  7:49 ` [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing Amir Goldstein
@ 2022-03-29  7:49 ` Amir Goldstein
  2022-03-29  7:49 ` [PATCH v2 16/16] fanotify: enable "evictable" inode marks Amir Goldstein
  2022-04-09 16:12 ` [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Direct reclaim from fanotify mark allocation context may try to evict
inodes with evictable marks of the same group and hit this deadlock:

[<0>] fsnotify_destroy_mark+0x1f/0x3a
[<0>] fsnotify_destroy_marks+0x71/0xd9
[<0>] __destroy_inode+0x24/0x7e
[<0>] destroy_inode+0x2c/0x67
[<0>] dispose_list+0x49/0x68
[<0>] prune_icache_sb+0x5b/0x79
[<0>] super_cache_scan+0x11c/0x16f
[<0>] shrink_slab.constprop.0+0x23e/0x40f
[<0>] shrink_node+0x218/0x3e7
[<0>] do_try_to_free_pages+0x12a/0x2d2
[<0>] try_to_free_pages+0x166/0x242
[<0>] __alloc_pages_slowpath.constprop.0+0x30c/0x903
[<0>] __alloc_pages+0xeb/0x1c7
[<0>] cache_grow_begin+0x6f/0x31e
[<0>] fallback_alloc+0xe0/0x12d
[<0>] ____cache_alloc_node+0x15a/0x17e
[<0>] kmem_cache_alloc_trace+0xa1/0x143
[<0>] fanotify_add_mark+0xd5/0x2b2
[<0>] do_fanotify_mark+0x566/0x5eb
[<0>] __x64_sys_fanotify_mark+0x21/0x24
[<0>] do_syscall_64+0x6d/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Set the FSNOTIFY_GROUP_NOFS flag to prevent going into direct reclaim
from allocations under fanotify group lock and use the nofs group lock
helpers.

Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify.c      |  1 +
 fs/notify/fanotify/fanotify_user.c | 18 ++++++++++--------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 02990a6b1b65..c7bcada371cb 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -1082,6 +1082,7 @@ static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
 }
 
 const struct fsnotify_ops fanotify_fsnotify_ops = {
+	.group_flags = FSNOTIFY_GROUP_NOFS,
 	.handle_event = fanotify_handle_event,
 	.free_group_priv = fanotify_free_group_priv,
 	.free_event = fanotify_free_event,
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index a3539bd8e443..5c857bfe8c3e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -931,12 +931,12 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
 	case FAN_IOC_SET_MARK_PAGE_ORDER:
 		if (!capable(CAP_SYS_ADMIN))
 			return -EPERM;
-		mutex_lock(&group->mark_mutex);
+		fsnotify_group_lock(group);
 		group->fanotify_data.mark_page_order = (unsigned int)arg;
 		pr_info("fanotify: set mark size page order to %u",
 			group->fanotify_data.mark_page_order);
 		ret = 0;
-		mutex_unlock(&group->mark_mutex);
+		fsnotify_group_unlock(group);
 		break;
 	}
 
@@ -1044,11 +1044,12 @@ static int fanotify_remove_mark(struct fsnotify_group *group,
 	struct fsnotify_mark *fsn_mark = NULL;
 	__u32 removed;
 	int destroy_mark;
+	unsigned int nofs;
 
-	mutex_lock(&group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(group);
 	fsn_mark = fsnotify_find_mark(connp, group);
 	if (!fsn_mark) {
-		mutex_unlock(&group->mark_mutex);
+		fsnotify_group_nofs_unlock(group, nofs);
 		return -ENOENT;
 	}
 
@@ -1058,7 +1059,7 @@ static int fanotify_remove_mark(struct fsnotify_group *group,
 		fsnotify_recalc_mask(fsn_mark->connector);
 	if (destroy_mark)
 		fsnotify_detach_mark(fsn_mark);
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 	if (destroy_mark)
 		fsnotify_free_mark(fsn_mark);
 
@@ -1236,9 +1237,10 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 			     __kernel_fsid_t *fsid)
 {
 	struct fsnotify_mark *fsn_mark = NULL;
+	unsigned int nofs;
 	int ret = 0;
 
-	mutex_lock(&group->mark_mutex);
+	nofs = fsnotify_group_nofs_lock(group);
 	/* Allow adding multiple large marks per object for testing */
 	if (!group->fanotify_data.mark_page_order)
 		fsn_mark = fsnotify_find_mark(connp, group);
@@ -1246,7 +1248,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, flags,
 						 fsid);
 		if (IS_ERR(fsn_mark)) {
-			mutex_unlock(&group->mark_mutex);
+			fsnotify_group_nofs_unlock(group, nofs);
 			return PTR_ERR(fsn_mark);
 		}
 	}
@@ -1264,7 +1266,7 @@ static int fanotify_add_mark(struct fsnotify_group *group,
 	ret = fanotify_mark_add_to_mask(fsn_mark, mask, flags);
 
 out:
-	mutex_unlock(&group->mark_mutex);
+	fsnotify_group_nofs_unlock(group, nofs);
 
 	fsnotify_put_mark(fsn_mark);
 	return ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 16/16] fanotify: enable "evictable" inode marks
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (14 preceding siblings ...)
  2022-03-29  7:49 ` [PATCH v2 15/16] fanotify: use fsnotify group lock helpers Amir Goldstein
@ 2022-03-29  7:49 ` Amir Goldstein
  2022-04-09 16:12 ` [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-03-29  7:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

Now that the direct reclaim path is handled we can enable evictable
inode marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/notify/fanotify/fanotify_user.c | 2 +-
 include/linux/fanotify.h           | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 5c857bfe8c3e..4ea36659addc 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1821,7 +1821,7 @@ static int __init fanotify_user_setup(void)
 
 	BUILD_BUG_ON(FANOTIFY_INIT_FLAGS & FANOTIFY_INTERNAL_GROUP_FLAGS);
 	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 12);
-	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
+	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 10);
 
 	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
 					 SLAB_PANIC|SLAB_ACCOUNT);
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 419cadcd7ff5..edc28555814c 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -66,6 +66,7 @@
 				 FAN_MARK_ONLYDIR | \
 				 FAN_MARK_IGNORED_MASK | \
 				 FAN_MARK_IGNORED_SURV_MODIFY | \
+				 FAN_MARK_EVICTABLE | \
 				 FAN_MARK_FLUSH)
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo
  2022-03-29  7:48 ` [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo Amir Goldstein
@ 2022-04-05 12:40   ` Jan Kara
  2022-04-05 12:43     ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-05 12:40 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:48:49, Amir Goldstein wrote:
> The inotify mask flags IN_ONESHOT and IN_EXCL_UNLINK are not "internal
> to kernel" and should be exposed in procfs fdinfo so CRIU can restore
> them.
> 
> Fixes: 6933599697c9 ("inotify: hide internal kernel bits from fdinfo")
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

Good catch!

> diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
> index 57f0d5d9f934..3451708fd035 100644
> --- a/fs/notify/fdinfo.c
> +++ b/fs/notify/fdinfo.c
> @@ -83,16 +83,9 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
>  	inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark);
>  	inode = igrab(fsnotify_conn_inode(mark->connector));
>  	if (inode) {
> -		/*
> -		 * IN_ALL_EVENTS represents all of the mask bits
> -		 * that we expose to userspace.  There is at
> -		 * least one bit (FS_EVENT_ON_CHILD) which is
> -		 * used only internally to the kernel.
> -		 */
> -		u32 mask = mark->mask & IN_ALL_EVENTS;
> -		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
> +		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:0 ",
>  			   inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
> -			   mask, mark->ignored_mask);
> +			   inotify_mark_user_mask(mark));

I think inotify_mark_user_mask() helper is overdoing it a bit. Just using
INOTIFY_USER_MASK here directly is IMHO fine.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo
  2022-04-05 12:40   ` Jan Kara
@ 2022-04-05 12:43     ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2022-04-05 12:43 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 05-04-22 14:40:45, Jan Kara wrote:
> On Tue 29-03-22 10:48:49, Amir Goldstein wrote:
> > diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
> > index 57f0d5d9f934..3451708fd035 100644
> > --- a/fs/notify/fdinfo.c
> > +++ b/fs/notify/fdinfo.c
> > @@ -83,16 +83,9 @@ static void inotify_fdinfo(struct seq_file *m, struct fsnotify_mark *mark)
> >  	inode_mark = container_of(mark, struct inotify_inode_mark, fsn_mark);
> >  	inode = igrab(fsnotify_conn_inode(mark->connector));
> >  	if (inode) {
> > -		/*
> > -		 * IN_ALL_EVENTS represents all of the mask bits
> > -		 * that we expose to userspace.  There is at
> > -		 * least one bit (FS_EVENT_ON_CHILD) which is
> > -		 * used only internally to the kernel.
> > -		 */
> > -		u32 mask = mark->mask & IN_ALL_EVENTS;
> > -		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:%x ",
> > +		seq_printf(m, "inotify wd:%x ino:%lx sdev:%x mask:%x ignored_mask:0 ",
> >  			   inode_mark->wd, inode->i_ino, inode->i_sb->s_dev,
> > -			   mask, mark->ignored_mask);
> > +			   inotify_mark_user_mask(mark));
> 
> I think inotify_mark_user_mask() helper is overdoing it a bit. Just using
> INOTIFY_USER_MASK here directly is IMHO fine.

Ah, seeing the next patch I take this comment back. Helper is indeed
useful.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  2022-03-29  7:48 ` [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors Amir Goldstein
@ 2022-04-05 12:54   ` Jan Kara
  2022-04-05 13:09     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-05 12:54 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:48:52, Amir Goldstein wrote:
> s_fsnotify_connectors is elevated for every inode mark in addition to
> the refcount already taken by the inode connector.
> 
> This is a relic from s_fsnotify_inode_refs pre connector era.
> Remove those unneeded recounts.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I disagree it is a relict. fsnotify_sb_delete() relies on
s_fsnotify_connectors to wait for all connectors to be properly torn down
on unmount so that we don't get "Busy inodes after unmount" error messages
(and use-after-free issues). Am I missing something?

								Honza

> ---
>  fs/notify/mark.c | 21 +++------------------
>  1 file changed, 3 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index b1443e66ba26..698ed0a1a47e 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -169,21 +169,6 @@ static void fsnotify_connector_destroy_workfn(struct work_struct *work)
>  	}
>  }
>  
> -static void fsnotify_get_inode_ref(struct inode *inode)
> -{
> -	ihold(inode);
> -	atomic_long_inc(&inode->i_sb->s_fsnotify_connectors);
> -}
> -
> -static void fsnotify_put_inode_ref(struct inode *inode)
> -{
> -	struct super_block *sb = inode->i_sb;
> -
> -	iput(inode);
> -	if (atomic_long_dec_and_test(&sb->s_fsnotify_connectors))
> -		wake_up_var(&sb->s_fsnotify_connectors);
> -}
> -
>  static void fsnotify_get_sb_connectors(struct fsnotify_mark_connector *conn)
>  {
>  	struct super_block *sb = fsnotify_connector_sb(conn);
> @@ -245,7 +230,7 @@ static void fsnotify_drop_object(unsigned int type, void *objp)
>  	/* Currently only inode references are passed to be dropped */
>  	if (WARN_ON_ONCE(type != FSNOTIFY_OBJ_TYPE_INODE))
>  		return;
> -	fsnotify_put_inode_ref(objp);
> +	iput(objp);
>  }
>  
>  void fsnotify_put_mark(struct fsnotify_mark *mark)
> @@ -519,7 +504,7 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
>  	}
>  	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
>  		inode = fsnotify_conn_inode(conn);
> -		fsnotify_get_inode_ref(inode);
> +		ihold(inode);
>  	}
>  	fsnotify_get_sb_connectors(conn);
>  
> @@ -530,7 +515,7 @@ static int fsnotify_attach_connector_to_object(fsnotify_connp_t *connp,
>  	if (cmpxchg(connp, NULL, conn)) {
>  		/* Someone else created list structure for us */
>  		if (inode)
> -			fsnotify_put_inode_ref(inode);
> +			iput(inode);
>  		fsnotify_put_sb_connectors(conn);
>  		kmem_cache_free(fsnotify_mark_connector_cachep, conn);
>  	}
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  2022-04-05 12:54   ` Jan Kara
@ 2022-04-05 13:09     ` Amir Goldstein
  2022-04-06 17:47       ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-04-05 13:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Tue, Apr 5, 2022 at 3:54 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 29-03-22 10:48:52, Amir Goldstein wrote:
> > s_fsnotify_connectors is elevated for every inode mark in addition to
> > the refcount already taken by the inode connector.
> >
> > This is a relic from s_fsnotify_inode_refs pre connector era.
> > Remove those unneeded recounts.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I disagree it is a relict. fsnotify_sb_delete() relies on
> s_fsnotify_connectors to wait for all connectors to be properly torn down
> on unmount so that we don't get "Busy inodes after unmount" error messages
> (and use-after-free issues). Am I missing something?
>

I meant it is a relic from the time before s_fsnotify_inode_refs became
s_fsnotify_connectors.

Nowadays, one s_fsnotify_connectors refcount per connector is enough.
No need for one refcount per inode.

Open code the the sequence:
    if (inode)
        fsnotify_put_inode_ref(inode);
    fsnotify_put_sb_connectors(conn);

To see how silly it is.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  2022-04-05 13:09     ` Amir Goldstein
@ 2022-04-06 17:47       ` Jan Kara
  2022-04-06 18:19         ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-06 17:47 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 05-04-22 16:09:00, Amir Goldstein wrote:
> On Tue, Apr 5, 2022 at 3:54 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 29-03-22 10:48:52, Amir Goldstein wrote:
> > > s_fsnotify_connectors is elevated for every inode mark in addition to
> > > the refcount already taken by the inode connector.
> > >
> > > This is a relic from s_fsnotify_inode_refs pre connector era.
> > > Remove those unneeded recounts.
> > >
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > I disagree it is a relict. fsnotify_sb_delete() relies on
> > s_fsnotify_connectors to wait for all connectors to be properly torn down
> > on unmount so that we don't get "Busy inodes after unmount" error messages
> > (and use-after-free issues). Am I missing something?
> >
> 
> I meant it is a relic from the time before s_fsnotify_inode_refs became
> s_fsnotify_connectors.
> 
> Nowadays, one s_fsnotify_connectors refcount per connector is enough.
> No need for one refcount per inode.
> 
> Open code the the sequence:
>     if (inode)
>         fsnotify_put_inode_ref(inode);
>     fsnotify_put_sb_connectors(conn);
> 
> To see how silly it is.

I see your point and I agree the general direction makes sense but
technically I think your patch is buggy. Because notice that we do
fsnotify_put_sb_connectors() in fsnotify_detach_connector_from_object() so
after this call there's nothing blocking umount while we can be still
holding inode references from some marks attached to this connector. So
racing inode mark destruction & umount can lead to "Busy inodes after
umount" messages.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors
  2022-04-06 17:47       ` Jan Kara
@ 2022-04-06 18:19         ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-04-06 18:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Wed, Apr 6, 2022 at 8:47 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 05-04-22 16:09:00, Amir Goldstein wrote:
> > On Tue, Apr 5, 2022 at 3:54 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 29-03-22 10:48:52, Amir Goldstein wrote:
> > > > s_fsnotify_connectors is elevated for every inode mark in addition to
> > > > the refcount already taken by the inode connector.
> > > >
> > > > This is a relic from s_fsnotify_inode_refs pre connector era.
> > > > Remove those unneeded recounts.
> > > >
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > I disagree it is a relict. fsnotify_sb_delete() relies on
> > > s_fsnotify_connectors to wait for all connectors to be properly torn down
> > > on unmount so that we don't get "Busy inodes after unmount" error messages
> > > (and use-after-free issues). Am I missing something?
> > >
> >
> > I meant it is a relic from the time before s_fsnotify_inode_refs became
> > s_fsnotify_connectors.
> >
> > Nowadays, one s_fsnotify_connectors refcount per connector is enough.
> > No need for one refcount per inode.
> >
> > Open code the the sequence:
> >     if (inode)
> >         fsnotify_put_inode_ref(inode);
> >     fsnotify_put_sb_connectors(conn);
> >
> > To see how silly it is.
>
> I see your point and I agree the general direction makes sense but
> technically I think your patch is buggy. Because notice that we do
> fsnotify_put_sb_connectors() in fsnotify_detach_connector_from_object() so
> after this call there's nothing blocking umount while we can be still
> holding inode references from some marks attached to this connector. So
> racing inode mark destruction & umount can lead to "Busy inodes after
> umount" messages.
>

Yes, I see it now.
IIRC, it was just a cleanup patch, I can remove it and amend the rest
of the series.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock
  2022-03-29  7:48 ` [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock Amir Goldstein
@ 2022-04-07 14:35   ` Jan Kara
  2022-04-07 14:53     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-07 14:35 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:48:54, Amir Goldstein wrote:
> Create helpers to take and release the group mark_mutex lock.
> 
> Define a flag FSNOTIFY_GROUP_NOFS in fsnotify backend operations struct
> that determines if the mark_mutex lock is fs reclaim safe or not.
> If not safe, the nofs lock helpers should be used to take the lock and
> disable direct fs reclaim.
> 
> In that case we annotate the mutex with different lockdep class to
> express to lockdep that an allocation of mark of an fs reclaim safe group
> may take the group lock of another "NOFS" group to evict inodes.
> 
> For now, converted only the callers in common code and no backend
> defines the NOFS flag.  It is intended to be set by fanotify for
> evictable marks support.
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

A few design question here:

1) Why do you store the FSNOTIFY_GROUP_NOFS flag in ops? Sure, this
particular flag is probably going to be the same per backend type but it
seems a bit strange to have it in ops instead of in the group itself...

2) Why do we have fsnotify_group_nofs_lock() as well as
fsnotify_group_lock()? We could detect whether we should set nofs based on
group flag anyway. Is that so that callers don't have to bother with passing
around the 'nofs'? Is it too bad? We could also store the old value of
'nofs' inside the group itself after locking it and then unlock can restore
it without the caller needing to care about anything...

								Honza

> ---
>  fs/notify/fdinfo.c               |  5 ++--
>  fs/notify/group.c                | 11 ++++++++
>  fs/notify/mark.c                 | 28 ++++++++++---------
>  include/linux/fsnotify_backend.h | 48 ++++++++++++++++++++++++++++++++
>  4 files changed, 77 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/notify/fdinfo.c b/fs/notify/fdinfo.c
> index 3451708fd035..754a546d647d 100644
> --- a/fs/notify/fdinfo.c
> +++ b/fs/notify/fdinfo.c
> @@ -27,14 +27,15 @@ static void show_fdinfo(struct seq_file *m, struct file *f,
>  {
>  	struct fsnotify_group *group = f->private_data;
>  	struct fsnotify_mark *mark;
> +	unsigned int nofs;
>  
> -	mutex_lock(&group->mark_mutex);
> +	nofs = fsnotify_group_nofs_lock(group);
>  	list_for_each_entry(mark, &group->marks_list, g_list) {
>  		show(m, mark);
>  		if (seq_has_overflowed(m))
>  			break;
>  	}
> -	mutex_unlock(&group->mark_mutex);
> +	fsnotify_group_nofs_unlock(group, nofs);
>  }
>  
>  #if defined(CONFIG_EXPORTFS)
> diff --git a/fs/notify/group.c b/fs/notify/group.c
> index b7d4d64f87c2..0f585febf3d7 100644
> --- a/fs/notify/group.c
> +++ b/fs/notify/group.c
> @@ -114,6 +114,7 @@ EXPORT_SYMBOL_GPL(fsnotify_put_group);
>  static struct fsnotify_group *__fsnotify_alloc_group(
>  				const struct fsnotify_ops *ops, gfp_t gfp)
>  {
> +	static struct lock_class_key nofs_marks_lock;
>  	struct fsnotify_group *group;
>  
>  	group = kzalloc(sizeof(struct fsnotify_group), gfp);
> @@ -133,6 +134,16 @@ static struct fsnotify_group *__fsnotify_alloc_group(
>  	INIT_LIST_HEAD(&group->marks_list);
>  
>  	group->ops = ops;
> +	/*
> +	 * For most backends, eviction of inode with a mark is not expected,
> +	 * because marks hold a refcount on the inode against eviction.
> +	 *
> +	 * Use a different lockdep class for groups that support evictable
> +	 * inode marks, because with evictable marks, mark_mutex is NOT
> +	 * fs-reclaim safe - the mutex is taken when evicting inodes.
> +	 */
> +	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
> +		lockdep_set_class(&group->mark_mutex, &nofs_marks_lock);
>  
>  	return group;
>  }
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index 3faf47def7d8..94d53f9d2b5f 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -383,9 +383,7 @@ void fsnotify_finish_user_wait(struct fsnotify_iter_info *iter_info)
>   */
>  void fsnotify_detach_mark(struct fsnotify_mark *mark)
>  {
> -	struct fsnotify_group *group = mark->group;
> -
> -	WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
> +	fsnotify_group_assert_locked(mark->group);
>  	WARN_ON_ONCE(!srcu_read_lock_held(&fsnotify_mark_srcu) &&
>  		     refcount_read(&mark->refcnt) < 1 +
>  			!!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED));
> @@ -437,9 +435,11 @@ void fsnotify_free_mark(struct fsnotify_mark *mark)
>  void fsnotify_destroy_mark(struct fsnotify_mark *mark,
>  			   struct fsnotify_group *group)
>  {
> -	mutex_lock(&group->mark_mutex);
> +	unsigned int nofs;
> +
> +	nofs = fsnotify_group_nofs_lock(group);
>  	fsnotify_detach_mark(mark);
> -	mutex_unlock(&group->mark_mutex);
> +	fsnotify_group_nofs_unlock(group, nofs);
>  	fsnotify_free_mark(mark);
>  }
>  EXPORT_SYMBOL_GPL(fsnotify_destroy_mark);
> @@ -658,7 +658,7 @@ int fsnotify_add_mark_locked(struct fsnotify_mark *mark,
>  	struct fsnotify_group *group = mark->group;
>  	int ret = 0;
>  
> -	BUG_ON(!mutex_is_locked(&group->mark_mutex));
> +	fsnotify_group_assert_locked(group);
>  
>  	/*
>  	 * LOCKING ORDER!!!!
> @@ -697,10 +697,11 @@ int fsnotify_add_mark(struct fsnotify_mark *mark, fsnotify_connp_t *connp,
>  {
>  	int ret;
>  	struct fsnotify_group *group = mark->group;
> +	unsigned int nofs;
>  
> -	mutex_lock(&group->mark_mutex);
> +	nofs = fsnotify_group_nofs_lock(group);
>  	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
> -	mutex_unlock(&group->mark_mutex);
> +	fsnotify_group_nofs_unlock(group, nofs);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(fsnotify_add_mark);
> @@ -739,6 +740,7 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
>  	struct fsnotify_mark *lmark, *mark;
>  	LIST_HEAD(to_free);
>  	struct list_head *head = &to_free;
> +	unsigned int nofs;
>  
>  	/* Skip selection step if we want to clear all marks. */
>  	if (obj_type == FSNOTIFY_OBJ_TYPE_ANY) {
> @@ -754,24 +756,24 @@ void fsnotify_clear_marks_by_group(struct fsnotify_group *group,
>  	 * move marks to free to to_free list in one go and then free marks in
>  	 * to_free list one by one.
>  	 */
> -	mutex_lock(&group->mark_mutex);
> +	nofs = fsnotify_group_nofs_lock(group);
>  	list_for_each_entry_safe(mark, lmark, &group->marks_list, g_list) {
>  		if (mark->connector->type == obj_type)
>  			list_move(&mark->g_list, &to_free);
>  	}
> -	mutex_unlock(&group->mark_mutex);
> +	fsnotify_group_nofs_unlock(group, nofs);
>  
>  clear:
>  	while (1) {
> -		mutex_lock(&group->mark_mutex);
> +		nofs = fsnotify_group_nofs_lock(group);
>  		if (list_empty(head)) {
> -			mutex_unlock(&group->mark_mutex);
> +			fsnotify_group_nofs_unlock(group, nofs);
>  			break;
>  		}
>  		mark = list_first_entry(head, struct fsnotify_mark, g_list);
>  		fsnotify_get_mark(mark);
>  		fsnotify_detach_mark(mark);
> -		mutex_unlock(&group->mark_mutex);
> +		fsnotify_group_nofs_unlock(group, nofs);
>  		fsnotify_free_mark(mark);
>  		fsnotify_put_mark(mark);
>  	}
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 9e8b5b52b9de..083333ad451c 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -20,6 +20,7 @@
>  #include <linux/user_namespace.h>
>  #include <linux/refcount.h>
>  #include <linux/mempool.h>
> +#include <linux/sched/mm.h>
>  
>  /*
>   * IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
> @@ -152,6 +153,10 @@ struct mem_cgroup;
>   *		userspace messages that marks have been removed.
>   */
>  struct fsnotify_ops {
> +#define FSNOTIFY_GROUP_NOFS	0x01 /* group lock is not direct reclaim safe */
> +#define FSNOTIFY_GROUP_FLAG(group, flag) \
> +	((group)->ops->group_flags & FSNOTIFY_GROUP_ ## flag)
> +	int group_flags;
>  	int (*handle_event)(struct fsnotify_group *group, u32 mask,
>  			    const void *data, int data_type, struct inode *dir,
>  			    const struct qstr *file_name, u32 cookie,
> @@ -250,6 +255,49 @@ struct fsnotify_group {
>  	};
>  };
>  
> +/*
> + * Use this from common code to prevent deadlock when reclaiming inodes with
> + * evictable marks of the same group that is allocating a new mark.
> + */
> +static inline unsigned int fsnotify_group_nofs_lock(
> +						struct fsnotify_group *group)
> +{
> +	unsigned int nofs = current->flags & PF_MEMALLOC_NOFS;
> +
> +	mutex_lock(&group->mark_mutex);
> +	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
> +		nofs = memalloc_nofs_save();
> +	return nofs;
> +}
> +
> +static inline void fsnotify_group_assert_locked(struct fsnotify_group *group)
> +{
> +	WARN_ON_ONCE(!mutex_is_locked(&group->mark_mutex));
> +	if (FSNOTIFY_GROUP_FLAG(group, NOFS))
> +		WARN_ON_ONCE(!(current->flags & PF_MEMALLOC_NOFS));
> +}
> +
> +static inline void fsnotify_group_nofs_unlock(struct fsnotify_group *group,
> +					      unsigned int nofs)
> +{
> +	memalloc_nofs_restore(nofs);
> +	mutex_unlock(&group->mark_mutex);
> +}
> +
> +/*
> + * Use this from common code that does not allocate memory or from backends
> + * who are known to be fs reclaim safe (i.e. no evictable inode marks).
> + */
> +static inline void fsnotify_group_lock(struct fsnotify_group *group)
> +{
> +	mutex_lock(&group->mark_mutex);
> +}
> +
> +static inline void fsnotify_group_unlock(struct fsnotify_group *group)
> +{
> +	mutex_unlock(&group->mark_mutex);
> +}
> +
>  /* When calling fsnotify tell it if the data is a path or inode */
>  enum fsnotify_data_type {
>  	FSNOTIFY_EVENT_NONE,
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock
  2022-04-07 14:35   ` Jan Kara
@ 2022-04-07 14:53     ` Amir Goldstein
  2022-04-08  8:38       ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-04-07 14:53 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Thu, Apr 7, 2022 at 5:35 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 29-03-22 10:48:54, Amir Goldstein wrote:
> > Create helpers to take and release the group mark_mutex lock.
> >
> > Define a flag FSNOTIFY_GROUP_NOFS in fsnotify backend operations struct
> > that determines if the mark_mutex lock is fs reclaim safe or not.
> > If not safe, the nofs lock helpers should be used to take the lock and
> > disable direct fs reclaim.
> >
> > In that case we annotate the mutex with different lockdep class to
> > express to lockdep that an allocation of mark of an fs reclaim safe group
> > may take the group lock of another "NOFS" group to evict inodes.
> >
> > For now, converted only the callers in common code and no backend
> > defines the NOFS flag.  It is intended to be set by fanotify for
> > evictable marks support.
> >
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> A few design question here:
>
> 1) Why do you store the FSNOTIFY_GROUP_NOFS flag in ops? Sure, this
> particular flag is probably going to be the same per backend type but it
> seems a bit strange to have it in ops instead of in the group itself...

I followed the pattern of struct file_system_type.
I didn't think per-group NOFS made much sense,
so it was easier this way.

>
> 2) Why do we have fsnotify_group_nofs_lock() as well as
> fsnotify_group_lock()? We could detect whether we should set nofs based on
> group flag anyway. Is that so that callers don't have to bother with passing
> around the 'nofs'? Is it too bad? We could also store the old value of
> 'nofs' inside the group itself after locking it and then unlock can restore
> it without the caller needing to care about anything...

Yes because it created unneeded code.
storing the local thread state in the group seems odd...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock
  2022-04-07 14:53     ` Amir Goldstein
@ 2022-04-08  8:38       ` Amir Goldstein
  2022-04-11 10:31         ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-04-08  8:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Thu, Apr 7, 2022 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Apr 7, 2022 at 5:35 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 29-03-22 10:48:54, Amir Goldstein wrote:
> > > Create helpers to take and release the group mark_mutex lock.
> > >
> > > Define a flag FSNOTIFY_GROUP_NOFS in fsnotify backend operations struct
> > > that determines if the mark_mutex lock is fs reclaim safe or not.
> > > If not safe, the nofs lock helpers should be used to take the lock and
> > > disable direct fs reclaim.
> > >
> > > In that case we annotate the mutex with different lockdep class to
> > > express to lockdep that an allocation of mark of an fs reclaim safe group
> > > may take the group lock of another "NOFS" group to evict inodes.
> > >
> > > For now, converted only the callers in common code and no backend
> > > defines the NOFS flag.  It is intended to be set by fanotify for
> > > evictable marks support.
> > >
> > > Suggested-by: Jan Kara <jack@suse.cz>
> > > Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> >
> > A few design question here:
> >
> > 1) Why do you store the FSNOTIFY_GROUP_NOFS flag in ops? Sure, this
> > particular flag is probably going to be the same per backend type but it
> > seems a bit strange to have it in ops instead of in the group itself...
>
> I followed the pattern of struct file_system_type.
> I didn't think per-group NOFS made much sense,
> so it was easier this way.
>
> >
> > 2) Why do we have fsnotify_group_nofs_lock() as well as
> > fsnotify_group_lock()? We could detect whether we should set nofs based on
> > group flag anyway. Is that so that callers don't have to bother with passing
> > around the 'nofs'? Is it too bad? We could also store the old value of
> > 'nofs' inside the group itself after locking it and then unlock can restore
> > it without the caller needing to care about anything...
>
> Yes because it created unneeded code.
> storing the local thread state in the group seems odd...
>

I followed your suggestions and the result looks much better.
Pushed result to fan_evictable branch.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 00/16] Evictable fanotify marks
  2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
                   ` (15 preceding siblings ...)
  2022-03-29  7:49 ` [PATCH v2 16/16] fanotify: enable "evictable" inode marks Amir Goldstein
@ 2022-04-09 16:12 ` Amir Goldstein
  16 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-04-09 16:12 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Tue, Mar 29, 2022 at 10:49 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Jan,
>
> Following the discussion on direct reclaim of fsnotify marks [2],
> this patch set includes your suggested fixes to core code along with
> implementation of fanotify evictable marks (rebrand of volatile marks).
>
> The LTP test I wrote [3] reproduces that deadlock within seconds on my
> small test VM if the FSNOTIFY_GROUP_NOFS flag is removed from fanotify.
>
> To be more exact, depending on the value of vfs_cache_pressure set by
> the test, either a deadlock or lockdep warning (or both) are reproduced.
> I chose a high value of 500, which usually reproduces only the lockdep
> warning, but worked better and faster on several systems I tested on.
>
> Thanks,
> Amir.
>
> Changes since v1 [1]:
> - Fixes for direct reclaim deadlock
> - Add ioctl for direct reclaim test
> - Rebrand as FAN_MARK_EVICTABLE
> - Remove FAN_MARK_CREATE and allow clearing FAN_MARK_EVICTABLE
> - Replace connector proxy_iref with HAS_IREF flag
> - Take iref in fsnotify_reclac_mark() rather than on add mark to list
> - Remove fsnotify_add_mark() allow_dups/flags argument
> - Remove pr_debug() prints
>
> [1] https://lore.kernel.org/r/20220307155741.1352405-1-amir73il@gmail.com/
> [2] https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
> [3] https://github.com/amir73il/ltp/commits/fan_evictable

And here is a first man-page draft:

https://github.com/amir73il/man-pages/commits/fan_evictable

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock
  2022-04-08  8:38       ` Amir Goldstein
@ 2022-04-11 10:31         ` Jan Kara
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Kara @ 2022-04-11 10:31 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Fri 08-04-22 11:38:59, Amir Goldstein wrote:
> On Thu, Apr 7, 2022 at 5:53 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Thu, Apr 7, 2022 at 5:35 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 29-03-22 10:48:54, Amir Goldstein wrote:
> > > > Create helpers to take and release the group mark_mutex lock.
> > > >
> > > > Define a flag FSNOTIFY_GROUP_NOFS in fsnotify backend operations struct
> > > > that determines if the mark_mutex lock is fs reclaim safe or not.
> > > > If not safe, the nofs lock helpers should be used to take the lock and
> > > > disable direct fs reclaim.
> > > >
> > > > In that case we annotate the mutex with different lockdep class to
> > > > express to lockdep that an allocation of mark of an fs reclaim safe group
> > > > may take the group lock of another "NOFS" group to evict inodes.
> > > >
> > > > For now, converted only the callers in common code and no backend
> > > > defines the NOFS flag.  It is intended to be set by fanotify for
> > > > evictable marks support.
> > > >
> > > > Suggested-by: Jan Kara <jack@suse.cz>
> > > > Link: https://lore.kernel.org/r/20220321112310.vpr7oxro2xkz5llh@quack3.lan/
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > >
> > > A few design question here:
> > >
> > > 1) Why do you store the FSNOTIFY_GROUP_NOFS flag in ops? Sure, this
> > > particular flag is probably going to be the same per backend type but it
> > > seems a bit strange to have it in ops instead of in the group itself...
> >
> > I followed the pattern of struct file_system_type.
> > I didn't think per-group NOFS made much sense,
> > so it was easier this way.

I see. Yes, if they are unmutable, the having them in ops is probably fine.
But I'd rename them to say 'features' to better explain they are unmutable.

> > > 2) Why do we have fsnotify_group_nofs_lock() as well as
> > > fsnotify_group_lock()? We could detect whether we should set nofs based on
> > > group flag anyway. Is that so that callers don't have to bother with passing
> > > around the 'nofs'? Is it too bad? We could also store the old value of
> > > 'nofs' inside the group itself after locking it and then unlock can restore
> > > it without the caller needing to care about anything...
> >
> > Yes because it created unneeded code.
> > storing the local thread state in the group seems odd...
> >
> 
> I followed your suggestions and the result looks much better.
> Pushed result to fan_evictable branch.

Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags()
  2022-03-29  7:49 ` [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags() Amir Goldstein
@ 2022-04-11 10:52   ` Jan Kara
  2022-04-11 12:00     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-11 10:52 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:49:00, Amir Goldstein wrote:
> Handle FAN_MARK_IGNORED_SURV_MODIFY flag change in a helper that
> is called after updating the mark mask.
> 
> Move recalc of object mask inside fanotify_mark_add_to_mask() which
> makes the code a bit simpler to follow.
> 
> Add also helper to translate fsnotify mark flags to user visible
> fanotify mark flags.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fanotify/fanotify.h      | 10 ++++++++
>  fs/notify/fanotify/fanotify_user.c | 39 +++++++++++++++++-------------
>  fs/notify/fdinfo.c                 |  6 ++---
>  3 files changed, 34 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index a3d5b751cac5..87142bc0131a 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -490,3 +490,13 @@ static inline unsigned int fanotify_event_hash_bucket(
>  {
>  	return event->hash & FANOTIFY_HTABLE_MASK;
>  }
> +
> +static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
> +{
> +	unsigned int mflags = 0;
> +
> +	if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> +		mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
> +
> +	return mflags;
> +}

This, together with fdinfo change should probably be a separate commit. I
don't see a good reason to mix these two changes...

> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 0f0db1efa379..6e78ea12239c 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1081,42 +1081,50 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
>  				    flags, umask);
>  }
>  
> -static void fanotify_mark_add_ignored_mask(struct fsnotify_mark *fsn_mark,
> -					   __u32 mask, unsigned int flags,
> -					   __u32 *removed)
> +static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
> +				      unsigned int flags, bool *recalc)
>  {
> -	fsn_mark->ignored_mask |= mask;
> -
>  	/*
>  	 * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
>  	 * the removal of the FS_MODIFY bit in calculated mask if it was set
>  	 * because of an ignored mask that is now going to survive FS_MODIFY.
>  	 */
>  	if ((flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
> +	    (flags & FAN_MARK_IGNORED_MASK) &&
>  	    !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)) {
>  		fsn_mark->flags |= FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY;
>  		if (!(fsn_mark->mask & FS_MODIFY))
> -			*removed = FS_MODIFY;
> +			*recalc = true;
>  	}
> +
> +	return 0;
>  }
>  
> -static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
> -				       __u32 mask, unsigned int flags,
> -				       __u32 *removed)
> +static int fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
> +				     __u32 mask, unsigned int flags)
>  {
> -	__u32 oldmask, newmask;
> +	__u32 oldmask;
> +	bool recalc = false;
> +	int ret;
>  
>  	spin_lock(&fsn_mark->lock);
>  	oldmask = fsnotify_calc_mask(fsn_mark);
>  	if (!(flags & FAN_MARK_IGNORED_MASK)) {
>  		fsn_mark->mask |= mask;
>  	} else {
> -		fanotify_mark_add_ignored_mask(fsn_mark, mask, flags, removed);
> +		fsn_mark->ignored_mask |= mask;
>  	}
> -	newmask = fsnotify_calc_mask(fsn_mark);
> +
> +	recalc = fsnotify_calc_mask(fsn_mark) & ~oldmask &
> +		~fsnotify_conn_mask(fsn_mark->connector);

oldmask should be a subset of fsnotify_conn_mask() so the above should be
equivalent to:

recalc = fsnotify_calc_mask(fsn_mark) & ~fsnotify_conn_mask(fsn_mark->connector)

shouldn't it?

Otherwise this looks like a nice cleanup!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 13/16] fanotify: implement "evictable" inode marks
  2022-03-29  7:49 ` [PATCH v2 13/16] fanotify: implement "evictable" inode marks Amir Goldstein
@ 2022-04-11 11:47   ` Jan Kara
  2022-04-11 12:57     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-11 11:47 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:49:01, Amir Goldstein wrote:
> When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not
> pin the marked inode to inode cache, so when inode is evicted from cache
> due to memory pressure, the mark will be lost.
> 
> When an inode mark with flag FAN_MARK_EVICATBLE is updated without using
> this flag, the marked inode is pinned to inode cache.
> 
> When an inode mark is updated with flag FAN_MARK_EVICTABLE but an
> existing mark already has the inode pinned, the mark update fails with
> error EEXIST.

I was thinking about this. FAN_MARK_EVICTABLE is effectively a hint to the
kernel - you can drop this if you wish. So does it make sense to return
error when we cannot follow the hint? Doesn't this just add unnecessary
work (determining whether the mark should be evictable or not) to the
userspace application using FAN_MARK_EVICTABLE?

I'd also note that FSNOTIFY_MARK_FLAG_NO_IREF needs to be stored only
because of this error checking behavior. Otherwise it would be enough to
have a flag on the connector (whether it holds iref or not) and
fsnotify_add_mark() would update the connector as needed given the added
mark. No need to mess with fsnotify_recalc_mask(). But this would be just
a small simplification. The API question in the above paragraph is more
important to me.

								Honza

> ---
>  fs/notify/fanotify/fanotify.h      |  2 ++
>  fs/notify/fanotify/fanotify_user.c | 31 +++++++++++++++++++++++++++++-
>  include/uapi/linux/fanotify.h      |  1 +
>  3 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> index 87142bc0131a..80e0ec95b113 100644
> --- a/fs/notify/fanotify/fanotify.h
> +++ b/fs/notify/fanotify/fanotify.h
> @@ -497,6 +497,8 @@ static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
>  
>  	if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
>  		mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
> +	if (mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF)
> +		mflags |= FAN_MARK_EVICTABLE;
>  
>  	return mflags;
>  }
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 6e78ea12239c..2c65038da4ce 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -1084,6 +1084,8 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
>  static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
>  				      unsigned int flags, bool *recalc)
>  {
> +	bool want_iref = !(flags & FAN_MARK_EVICTABLE);
> +
>  	/*
>  	 * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
>  	 * the removal of the FS_MODIFY bit in calculated mask if it was set
> @@ -1097,6 +1099,20 @@ static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
>  			*recalc = true;
>  	}
>  
> +	if (fsn_mark->connector->type != FSNOTIFY_OBJ_TYPE_INODE ||
> +	    want_iref == !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_NO_IREF))
> +		return 0;
> +
> +	/*
> +	 * NO_IREF may be removed from a mark, but not added.
> +	 * When removed, fsnotify_recalc_mask() will take the inode ref.
> +	 */
> +	if (!want_iref)
> +		return -EEXIST;
> +
> +	fsn_mark->flags &= ~FSNOTIFY_MARK_FLAG_NO_IREF;
> +	*recalc = true;
> +
>  	return 0;
>  }
>  
> @@ -1130,6 +1146,7 @@ static int fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
>  static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
>  						   fsnotify_connp_t *connp,
>  						   unsigned int obj_type,
> +						   unsigned int fan_flags,
>  						   __kernel_fsid_t *fsid)
>  {
>  	struct ucounts *ucounts = group->fanotify_data.ucounts;
> @@ -1152,6 +1169,9 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
>  	}
>  
>  	fsnotify_init_mark(mark, group);
> +	if (fan_flags & FAN_MARK_EVICTABLE)
> +		mark->flags |= FSNOTIFY_MARK_FLAG_NO_IREF;
> +
>  	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
>  	if (ret) {
>  		fsnotify_put_mark(mark);
> @@ -1187,7 +1207,8 @@ static int fanotify_add_mark(struct fsnotify_group *group,
>  	mutex_lock(&group->mark_mutex);
>  	fsn_mark = fsnotify_find_mark(connp, group);
>  	if (!fsn_mark) {
> -		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, fsid);
> +		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, flags,
> +						 fsid);
>  		if (IS_ERR(fsn_mark)) {
>  			mutex_unlock(&group->mark_mutex);
>  			return PTR_ERR(fsn_mark);
> @@ -1602,6 +1623,14 @@ static int do_fanotify_mark(int fanotify_fd, unsigned int flags, __u64 mask,
>  	    mark_type != FAN_MARK_FILESYSTEM)
>  		goto fput_and_out;
>  
> +	/*
> +	 * Evictable is only relevant for inode marks, because only inode object
> +	 * can be evicted on memory pressure.
> +	 */
> +	if (flags & FAN_MARK_EVICTABLE &&
> +	     mark_type != FAN_MARK_INODE)
> +		goto fput_and_out;
> +
>  	/*
>  	 * Events that do not carry enough information to report
>  	 * event->fd require a group that supports reporting fid.  Those
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index e8ac38cc2fd6..f1f89132d60e 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -82,6 +82,7 @@
>  #define FAN_MARK_IGNORED_SURV_MODIFY	0x00000040
>  #define FAN_MARK_FLUSH		0x00000080
>  /* FAN_MARK_FILESYSTEM is	0x00000100 */
> +#define FAN_MARK_EVICTABLE	0x00000200
>  
>  /* These are NOT bitwise flags.  Both bits can be used togther.  */
>  #define FAN_MARK_INODE		0x00000000
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags()
  2022-04-11 10:52   ` Jan Kara
@ 2022-04-11 12:00     ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-04-11 12:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Apr 11, 2022 at 1:53 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 29-03-22 10:49:00, Amir Goldstein wrote:
> > Handle FAN_MARK_IGNORED_SURV_MODIFY flag change in a helper that
> > is called after updating the mark mask.
> >
> > Move recalc of object mask inside fanotify_mark_add_to_mask() which
> > makes the code a bit simpler to follow.
> >
> > Add also helper to translate fsnotify mark flags to user visible
> > fanotify mark flags.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >  fs/notify/fanotify/fanotify.h      | 10 ++++++++
> >  fs/notify/fanotify/fanotify_user.c | 39 +++++++++++++++++-------------
> >  fs/notify/fdinfo.c                 |  6 ++---
> >  3 files changed, 34 insertions(+), 21 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify.h b/fs/notify/fanotify/fanotify.h
> > index a3d5b751cac5..87142bc0131a 100644
> > --- a/fs/notify/fanotify/fanotify.h
> > +++ b/fs/notify/fanotify/fanotify.h
> > @@ -490,3 +490,13 @@ static inline unsigned int fanotify_event_hash_bucket(
> >  {
> >       return event->hash & FANOTIFY_HTABLE_MASK;
> >  }
> > +
> > +static inline unsigned int fanotify_mark_user_flags(struct fsnotify_mark *mark)
> > +{
> > +     unsigned int mflags = 0;
> > +
> > +     if (mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)
> > +             mflags |= FAN_MARK_IGNORED_SURV_MODIFY;
> > +
> > +     return mflags;
> > +}
>
> This, together with fdinfo change should probably be a separate commit. I
> don't see a good reason to mix these two changes...
>

True.

> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 0f0db1efa379..6e78ea12239c 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -1081,42 +1081,50 @@ static int fanotify_remove_inode_mark(struct fsnotify_group *group,
> >                                   flags, umask);
> >  }
> >
> > -static void fanotify_mark_add_ignored_mask(struct fsnotify_mark *fsn_mark,
> > -                                        __u32 mask, unsigned int flags,
> > -                                        __u32 *removed)
> > +static int fanotify_mark_update_flags(struct fsnotify_mark *fsn_mark,
> > +                                   unsigned int flags, bool *recalc)
> >  {
> > -     fsn_mark->ignored_mask |= mask;
> > -
> >       /*
> >        * Setting FAN_MARK_IGNORED_SURV_MODIFY for the first time may lead to
> >        * the removal of the FS_MODIFY bit in calculated mask if it was set
> >        * because of an ignored mask that is now going to survive FS_MODIFY.
> >        */
> >       if ((flags & FAN_MARK_IGNORED_SURV_MODIFY) &&
> > +         (flags & FAN_MARK_IGNORED_MASK) &&
> >           !(fsn_mark->flags & FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY)) {
> >               fsn_mark->flags |= FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY;
> >               if (!(fsn_mark->mask & FS_MODIFY))
> > -                     *removed = FS_MODIFY;
> > +                     *recalc = true;
> >       }
> > +
> > +     return 0;
> >  }
> >
> > -static __u32 fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
> > -                                    __u32 mask, unsigned int flags,
> > -                                    __u32 *removed)
> > +static int fanotify_mark_add_to_mask(struct fsnotify_mark *fsn_mark,
> > +                                  __u32 mask, unsigned int flags)
> >  {
> > -     __u32 oldmask, newmask;
> > +     __u32 oldmask;
> > +     bool recalc = false;
> > +     int ret;
> >
> >       spin_lock(&fsn_mark->lock);
> >       oldmask = fsnotify_calc_mask(fsn_mark);
> >       if (!(flags & FAN_MARK_IGNORED_MASK)) {
> >               fsn_mark->mask |= mask;
> >       } else {
> > -             fanotify_mark_add_ignored_mask(fsn_mark, mask, flags, removed);
> > +             fsn_mark->ignored_mask |= mask;
> >       }
> > -     newmask = fsnotify_calc_mask(fsn_mark);
> > +
> > +     recalc = fsnotify_calc_mask(fsn_mark) & ~oldmask &
> > +             ~fsnotify_conn_mask(fsn_mark->connector);
>
> oldmask should be a subset of fsnotify_conn_mask() so the above should be
> equivalent to:
>
> recalc = fsnotify_calc_mask(fsn_mark) & ~fsnotify_conn_mask(fsn_mark->connector)
>
> shouldn't it?

I just translated the old expression of 'added',
but I guess there is no reason to look at the oldmask
only the newmask matters, so I can drop oldmask variable altogether.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing
  2022-03-29  7:49 ` [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing Amir Goldstein
@ 2022-04-11 12:53   ` Jan Kara
  2022-04-11 13:07     ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-11 12:53 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Tue 29-03-22 10:49:02, Amir Goldstein wrote:
> The ioctl can be used to request allocation of marks with large size
> and attach them to an object, even if another mark already exists for
> the group on the marked object.
> 
> These large marks serve no function other than testing direct reclaim
> in the context of mark allocation.
> 
> Setting the value to 0 restores normal mark allocation.
> 
> FAN_MARK_REMOVE refers to the first mark of the group on an object, so
> the number of FAN_MARK_REMOVE calls need to match the number of large
> marks on the object in order to remove all marks from the object or use
> FAN_MARK_FLUSH to remove all marks of that object type.
> 
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I understand this is useful as a debugging patch but I'm not sure we want
this permanently in the kernel. I'm wondering if generally more useful
approach would not be to improve error injection for page allocations to
allow easier stressing of direct reclaim...

								Honza

> ---
>  fs/notify/fanotify/fanotify.c      |  5 +++-
>  fs/notify/fanotify/fanotify_user.c | 42 +++++++++++++++++++++++++++---
>  include/linux/fsnotify_backend.h   |  2 ++
>  include/uapi/linux/fanotify.h      |  4 +++
>  4 files changed, 49 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 985e995d2a39..02990a6b1b65 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -1075,7 +1075,10 @@ static void fanotify_freeing_mark(struct fsnotify_mark *mark,
>  
>  static void fanotify_free_mark(struct fsnotify_mark *fsn_mark)
>  {
> -	kmem_cache_free(fanotify_mark_cache, fsn_mark);
> +	if (fsn_mark->flags & FSNOTIFY_MARK_FLAG_KMALLOC)
> +		kfree(fsn_mark);
> +	else
> +		kmem_cache_free(fanotify_mark_cache, fsn_mark);
>  }
>  
>  const struct fsnotify_ops fanotify_fsnotify_ops = {
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 2c65038da4ce..a3539bd8e443 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -928,6 +928,16 @@ static long fanotify_ioctl(struct file *file, unsigned int cmd, unsigned long ar
>  		spin_unlock(&group->notification_lock);
>  		ret = put_user(send_len, (int __user *) p);
>  		break;
> +	case FAN_IOC_SET_MARK_PAGE_ORDER:
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +		mutex_lock(&group->mark_mutex);
> +		group->fanotify_data.mark_page_order = (unsigned int)arg;
> +		pr_info("fanotify: set mark size page order to %u",
> +			group->fanotify_data.mark_page_order);
> +		ret = 0;
> +		mutex_unlock(&group->mark_mutex);
> +		break;
>  	}
>  
>  	return ret;
> @@ -1150,6 +1160,7 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
>  						   __kernel_fsid_t *fsid)
>  {
>  	struct ucounts *ucounts = group->fanotify_data.ucounts;
> +	unsigned int order = group->fanotify_data.mark_page_order;
>  	struct fsnotify_mark *mark;
>  	int ret;
>  
> @@ -1162,7 +1173,21 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
>  	    !inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_FANOTIFY_MARKS))
>  		return ERR_PTR(-ENOSPC);
>  
> -	mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
> +	/*
> +	 * If requested to test direct reclaim in mark allocation context,
> +	 * start by trying to allocate requested page order per mark and
> +	 * fall back to allocation size that is likely to trigger direct
> +	 * reclaim but not too large to trigger compaction.
> +	 */
> +	if (order) {
> +		mark = kmalloc(PAGE_SIZE << order,
> +			       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
> +		if (!mark && order > PAGE_ALLOC_COSTLY_ORDER)
> +			mark = kmalloc(PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER,
> +				       GFP_KERNEL_ACCOUNT | __GFP_NOWARN);
> +	} else {
> +		mark = kmem_cache_alloc(fanotify_mark_cache, GFP_KERNEL);
> +	}
>  	if (!mark) {
>  		ret = -ENOMEM;
>  		goto out_dec_ucounts;
> @@ -1171,6 +1196,15 @@ static struct fsnotify_mark *fanotify_add_new_mark(struct fsnotify_group *group,
>  	fsnotify_init_mark(mark, group);
>  	if (fan_flags & FAN_MARK_EVICTABLE)
>  		mark->flags |= FSNOTIFY_MARK_FLAG_NO_IREF;
> +	/*
> +	 * Allow adding multiple large marks per object for testing.
> +	 * FAN_MARK_REMOVE refers to the first mark of the group, so one
> +	 * FAN_MARK_REMOVE is needed for every added large mark (or use
> +	 * FAN_MARK_FLUSH to remove all marks).
> +	 */
> +	if (order)
> +		mark->flags |= FSNOTIFY_MARK_FLAG_KMALLOC |
> +			       FSNOTIFY_MARK_FLAG_ALLOW_DUPS;
>  
>  	ret = fsnotify_add_mark_locked(mark, connp, obj_type, fsid);
>  	if (ret) {
> @@ -1201,11 +1235,13 @@ static int fanotify_add_mark(struct fsnotify_group *group,
>  			     __u32 mask, unsigned int flags,
>  			     __kernel_fsid_t *fsid)
>  {
> -	struct fsnotify_mark *fsn_mark;
> +	struct fsnotify_mark *fsn_mark = NULL;
>  	int ret = 0;
>  
>  	mutex_lock(&group->mark_mutex);
> -	fsn_mark = fsnotify_find_mark(connp, group);
> +	/* Allow adding multiple large marks per object for testing */
> +	if (!group->fanotify_data.mark_page_order)
> +		fsn_mark = fsnotify_find_mark(connp, group);
>  	if (!fsn_mark) {
>  		fsn_mark = fanotify_add_new_mark(group, connp, obj_type, flags,
>  						 fsid);
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index df58439a86fa..8220cf560c28 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -250,6 +250,7 @@ struct fsnotify_group {
>  			int f_flags; /* event_f_flags from fanotify_init() */
>  			struct ucounts *ucounts;
>  			mempool_t error_events_pool;
> +			unsigned int mark_page_order; /* for testing only */
>  		} fanotify_data;
>  #endif /* CONFIG_FANOTIFY */
>  	};
> @@ -528,6 +529,7 @@ struct fsnotify_mark {
>  #define FSNOTIFY_MARK_FLAG_ALLOW_DUPS		0x0040
>  #define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x0100
>  #define FSNOTIFY_MARK_FLAG_NO_IREF		0x0200
> +#define FSNOTIFY_MARK_FLAG_KMALLOC		0x0400
>  	unsigned int flags;		/* flags [mark->lock] */
>  };
>  
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index f1f89132d60e..49cdc9008bf2 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -3,6 +3,7 @@
>  #define _UAPI_LINUX_FANOTIFY_H
>  
>  #include <linux/types.h>
> +#include <linux/ioctl.h>
>  
>  /* the following events that user-space can register for */
>  #define FAN_ACCESS		0x00000001	/* File was accessed */
> @@ -206,4 +207,7 @@ struct fanotify_response {
>  				(long)(meta)->event_len >= (long)FAN_EVENT_METADATA_LEN && \
>  				(long)(meta)->event_len <= (long)(len))
>  
> +/* Only for testing. Not useful otherwise */
> +#define	FAN_IOC_SET_MARK_PAGE_ORDER	_IOW(0xfa, 1, long)
> +
>  #endif /* _UAPI_LINUX_FANOTIFY_H */
> -- 
> 2.25.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 13/16] fanotify: implement "evictable" inode marks
  2022-04-11 11:47   ` Jan Kara
@ 2022-04-11 12:57     ` Amir Goldstein
  2022-04-11 14:19       ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Amir Goldstein @ 2022-04-11 12:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Apr 11, 2022 at 2:47 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 29-03-22 10:49:01, Amir Goldstein wrote:
> > When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not
> > pin the marked inode to inode cache, so when inode is evicted from cache
> > due to memory pressure, the mark will be lost.
> >
> > When an inode mark with flag FAN_MARK_EVICATBLE is updated without using
> > this flag, the marked inode is pinned to inode cache.
> >
> > When an inode mark is updated with flag FAN_MARK_EVICTABLE but an
> > existing mark already has the inode pinned, the mark update fails with
> > error EEXIST.
>
> I was thinking about this. FAN_MARK_EVICTABLE is effectively a hint to the
> kernel - you can drop this if you wish. So does it make sense to return
> error when we cannot follow the hint? Doesn't this just add unnecessary
> work (determining whether the mark should be evictable or not) to the
> userspace application using FAN_MARK_EVICTABLE?

I do not fully agree about your definition of  "hint to the kernel".
Yes, for a single inode it may be a hint, but for a million inodes it is pretty
much a directive that setting a very large number of evictable marks
CANNOT be used to choke the system.

It's true that the application should be able to avoid shooting its own
foot and we do not need to be the ones providing this protection, but
I rather prefer to keep the API more strict and safe than being sorry later.
After all, I don't think this complicates the implementation nor documentation
too much. Is it? see:

https://github.com/amir73il/man-pages/commit/b52eb7d1a8478cbd1456f4d9463902bbc4e80f0d

>
> I'd also note that FSNOTIFY_MARK_FLAG_NO_IREF needs to be stored only
> because of this error checking behavior. Otherwise it would be enough to
> have a flag on the connector (whether it holds iref or not) and
> fsnotify_add_mark() would update the connector as needed given the added

I am not sure I agree to that.
Maybe I am missing something, but the way fsnotify_recalc_mask() works now
is by checking if there is any mark without FSNOTIFY_MARK_FLAG_NO_IREF
attached to the object, so fsnotify_put_mark() knows to drop the inode when the
last non-evictable mark is removed.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing
  2022-04-11 12:53   ` Jan Kara
@ 2022-04-11 13:07     ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-04-11 13:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Apr 11, 2022 at 3:53 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 29-03-22 10:49:02, Amir Goldstein wrote:
> > The ioctl can be used to request allocation of marks with large size
> > and attach them to an object, even if another mark already exists for
> > the group on the marked object.
> >
> > These large marks serve no function other than testing direct reclaim
> > in the context of mark allocation.
> >
> > Setting the value to 0 restores normal mark allocation.
> >
> > FAN_MARK_REMOVE refers to the first mark of the group on an object, so
> > the number of FAN_MARK_REMOVE calls need to match the number of large
> > marks on the object in order to remove all marks from the object or use
> > FAN_MARK_FLUSH to remove all marks of that object type.
> >
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>
> I understand this is useful as a debugging patch but I'm not sure we want
> this permanently in the kernel. I'm wondering if generally more useful
> approach would not be to improve error injection for page allocations to
> allow easier stressing of direct reclaim...

I think you are probably right, but I had to stay within time budget
to create a reproducer and it took me a lot of time even with this hack
so I don't see myself investing more time on a reproducer with improved
error injections.

So for me, it is an adequate solution to carry this patch out-of-tree
along with a matching out-of tree patch for LTP test:

https://github.com/amir73il/ltp/commit/383db59959c44bb27dbec81e74d1d9caba45b0f2

For the community, we could rely on lockdep reports users now that
we sorted out the lockdep annotations.

BTW, before resorting to this ioctl I also started going down the path of
running the test inside a restricted memcg, only to figure out that it
is the inodes
that need to be evicted not the marks and inodes are not accounted to memcg.
I think there may have been some work in the direction of somehow accounting
inode cache to memcg, but not sure where this stands.

I am open for other suggestions.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 13/16] fanotify: implement "evictable" inode marks
  2022-04-11 12:57     ` Amir Goldstein
@ 2022-04-11 14:19       ` Jan Kara
  2022-04-12  8:07         ` Amir Goldstein
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2022-04-11 14:19 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Matthew Bobrowski, linux-fsdevel

On Mon 11-04-22 15:57:30, Amir Goldstein wrote:
> On Mon, Apr 11, 2022 at 2:47 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 29-03-22 10:49:01, Amir Goldstein wrote:
> > > When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not
> > > pin the marked inode to inode cache, so when inode is evicted from cache
> > > due to memory pressure, the mark will be lost.
> > >
> > > When an inode mark with flag FAN_MARK_EVICATBLE is updated without using
> > > this flag, the marked inode is pinned to inode cache.
> > >
> > > When an inode mark is updated with flag FAN_MARK_EVICTABLE but an
> > > existing mark already has the inode pinned, the mark update fails with
> > > error EEXIST.
> >
> > I was thinking about this. FAN_MARK_EVICTABLE is effectively a hint to the
> > kernel - you can drop this if you wish. So does it make sense to return
> > error when we cannot follow the hint? Doesn't this just add unnecessary
> > work (determining whether the mark should be evictable or not) to the
> > userspace application using FAN_MARK_EVICTABLE?
> 
> I do not fully agree about your definition of  "hint to the kernel".
> Yes, for a single inode it may be a hint, but for a million inodes it is pretty
> much a directive that setting a very large number of evictable marks
> CANNOT be used to choke the system.
> 
> It's true that the application should be able to avoid shooting its own
> foot and we do not need to be the ones providing this protection, but
> I rather prefer to keep the API more strict and safe than being sorry later.
> After all, I don't think this complicates the implementation nor documentation
> too much. Is it? see:
> 
> https://github.com/amir73il/man-pages/commit/b52eb7d1a8478cbd1456f4d9463902bbc4e80f0d

No, it is not complicating things too much and you're probably right that
having things stricter now may pay off in the future. I was just thinking
that app adding ignore mark now needs to remember whether it has already
added something else for the inode or not to know whether it can use
FAN_MARK_EVICTABLE. Which seemed like unnecessary complication.

> > I'd also note that FSNOTIFY_MARK_FLAG_NO_IREF needs to be stored only
> > because of this error checking behavior. Otherwise it would be enough to
> > have a flag on the connector (whether it holds iref or not) and
> > fsnotify_add_mark() would update the connector as needed given the added
> 
> I am not sure I agree to that.
> Maybe I am missing something, but the way fsnotify_recalc_mask() works now
> is by checking if there is any mark without FSNOTIFY_MARK_FLAG_NO_IREF
> attached to the object, so fsnotify_put_mark() knows to drop the inode when the
> last non-evictable mark is removed.

Right, I was confused and somehow thought that once connector has iref, it
will drop it only when the mark list gets empty which is obviously not
true.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 13/16] fanotify: implement "evictable" inode marks
  2022-04-11 14:19       ` Jan Kara
@ 2022-04-12  8:07         ` Amir Goldstein
  0 siblings, 0 replies; 36+ messages in thread
From: Amir Goldstein @ 2022-04-12  8:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, linux-fsdevel

On Mon, Apr 11, 2022 at 5:19 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 11-04-22 15:57:30, Amir Goldstein wrote:
> > On Mon, Apr 11, 2022 at 2:47 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 29-03-22 10:49:01, Amir Goldstein wrote:
> > > > When an inode mark is created with flag FAN_MARK_EVICTABLE, it will not
> > > > pin the marked inode to inode cache, so when inode is evicted from cache
> > > > due to memory pressure, the mark will be lost.
> > > >
> > > > When an inode mark with flag FAN_MARK_EVICATBLE is updated without using
> > > > this flag, the marked inode is pinned to inode cache.
> > > >
> > > > When an inode mark is updated with flag FAN_MARK_EVICTABLE but an
> > > > existing mark already has the inode pinned, the mark update fails with
> > > > error EEXIST.
> > >
> > > I was thinking about this. FAN_MARK_EVICTABLE is effectively a hint to the
> > > kernel - you can drop this if you wish. So does it make sense to return
> > > error when we cannot follow the hint? Doesn't this just add unnecessary
> > > work (determining whether the mark should be evictable or not) to the
> > > userspace application using FAN_MARK_EVICTABLE?
> >
> > I do not fully agree about your definition of  "hint to the kernel".
> > Yes, for a single inode it may be a hint, but for a million inodes it is pretty
> > much a directive that setting a very large number of evictable marks
> > CANNOT be used to choke the system.
> >
> > It's true that the application should be able to avoid shooting its own
> > foot and we do not need to be the ones providing this protection, but
> > I rather prefer to keep the API more strict and safe than being sorry later.
> > After all, I don't think this complicates the implementation nor documentation
> > too much. Is it? see:
> >
> > https://github.com/amir73il/man-pages/commit/b52eb7d1a8478cbd1456f4d9463902bbc4e80f0d
>
> No, it is not complicating things too much and you're probably right that
> having things stricter now may pay off in the future. I was just thinking
> that app adding ignore mark now needs to remember whether it has already
> added something else for the inode or not to know whether it can use
> FAN_MARK_EVICTABLE. Which seemed like unnecessary complication.
>

Well the way my app does it, it has a hardcoded list of "must exclude" dirs
which is sets on startup and then evictable ignore masks are added lazily
when events in non-interesting path show up.

If app would somehow get an event with path that was in the "must exclude"
set, then adding evictable mark would fail and nothing to it because it is
an optimization anyway. There is no tracking involved.

Anyway, I see there is a bug in EEXIST case, the error is returned *after*
updating the mask - I will fix it and re-post v3 with the rest of the fixes.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2022-04-12 10:12 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-29  7:48 [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 01/16] inotify: show inotify mask flags in proc fdinfo Amir Goldstein
2022-04-05 12:40   ` Jan Kara
2022-04-05 12:43     ` Jan Kara
2022-03-29  7:48 ` [PATCH v2 02/16] inotify: move control flags from mask to mark flags Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 03/16] fsnotify: pass flags argument to fsnotify_add_mark() via mark Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 04/16] fsnotify: remove unneeded refcounts of s_fsnotify_connectors Amir Goldstein
2022-04-05 12:54   ` Jan Kara
2022-04-05 13:09     ` Amir Goldstein
2022-04-06 17:47       ` Jan Kara
2022-04-06 18:19         ` Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 05/16] fsnotify: fix wrong lockdep annotations Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 06/16] fsnotify: create helpers for group mark_mutex lock Amir Goldstein
2022-04-07 14:35   ` Jan Kara
2022-04-07 14:53     ` Amir Goldstein
2022-04-08  8:38       ` Amir Goldstein
2022-04-11 10:31         ` Jan Kara
2022-03-29  7:48 ` [PATCH v2 07/16] inotify: use fsnotify group lock helpers Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 08/16] audit: " Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 09/16] nfsd: " Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 10/16] dnotify: " Amir Goldstein
2022-03-29  7:48 ` [PATCH v2 11/16] fsnotify: allow adding an inode mark without pinning inode Amir Goldstein
2022-03-29  7:49 ` [PATCH v2 12/16] fanotify: factor out helper fanotify_mark_update_flags() Amir Goldstein
2022-04-11 10:52   ` Jan Kara
2022-04-11 12:00     ` Amir Goldstein
2022-03-29  7:49 ` [PATCH v2 13/16] fanotify: implement "evictable" inode marks Amir Goldstein
2022-04-11 11:47   ` Jan Kara
2022-04-11 12:57     ` Amir Goldstein
2022-04-11 14:19       ` Jan Kara
2022-04-12  8:07         ` Amir Goldstein
2022-03-29  7:49 ` [PATCH v2 14/16] fanotify: add FAN_IOC_SET_MARK_PAGE_ORDER ioctl for testing Amir Goldstein
2022-04-11 12:53   ` Jan Kara
2022-04-11 13:07     ` Amir Goldstein
2022-03-29  7:49 ` [PATCH v2 15/16] fanotify: use fsnotify group lock helpers Amir Goldstein
2022-03-29  7:49 ` [PATCH v2 16/16] fanotify: enable "evictable" inode marks Amir Goldstein
2022-04-09 16:12 ` [PATCH v2 00/16] Evictable fanotify marks Amir Goldstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.