All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Add pidfd support to the fanotify API
@ 2021-05-20  2:09 Matthew Bobrowski
  2021-05-20  2:10 ` [PATCH 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:09 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

Hey Jan/Amir/Christian,

This is the updated patch series for adding pidfd support to the
fanotify API. It incorporates all the suggestions that had come out of
the initial RFC patch series [0].

The main difference with this patch series is that FAN_REPORT_PIDFD
results in an additional info record object supplied alongside the
generic event metadata object instead of overloading metadata->pid. If
any of the fid flavoured init flags are specified, then the pidfd info
record object will follow any fid info record objects.

[0] https://www.spinics.net/lists/linux-fsdevel/msg193296.html

Matthew Bobrowski (5):
  kernel/pid.c: remove static qualifier from pidfd_create()
  kernel/pid.c: implement checks on parameters passed to pidfd_create()
  fanotify_user.c: minor cosmetic adjustments to fid labels
  fanotify/fanotify_user.c: introduce a generic info record copying
    function
  fanotify: Add pidfd info record support to the fanotify API

 fs/notify/fanotify/fanotify_user.c | 216 +++++++++++++++++++----------
 include/linux/fanotify.h           |   3 +
 include/linux/pid.h                |   1 +
 include/uapi/linux/fanotify.h      |  12 ++
 kernel/pid.c                       |  15 +-
 5 files changed, 170 insertions(+), 77 deletions(-)

-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/5] kernel/pid.c: remove static qualifier from pidfd_create()
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
@ 2021-05-20  2:10 ` Matthew Bobrowski
  2021-05-20  2:10 ` [PATCH 2/5] kernel/pid.c: implement checks on parameters passed to pidfd_create() Matthew Bobrowski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:10 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

With the idea of returning pidfds from the fanotify API, there needs
to be mechanism exposed for creating pidfds. To do this, the static
qualifier is dropped from pidfd_create() and a declaration is added to
linux/pid.h so that this function can be called from other kernel
subsystems i.e. fanotify in this case.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 include/linux/pid.h | 1 +
 kernel/pid.c        | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index fa10acb8d6a4..af308e15f174 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -78,6 +78,7 @@ struct file;
 
 extern struct pid *pidfd_pid(const struct file *file);
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags);
+int pidfd_create(struct pid *pid, unsigned int flags);
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index ebdf9c60cd0b..d3cd95b8b080 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -550,10 +550,12 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  * Note, that this function can only be called after the fd table has
  * been unshared to avoid leaking the pidfd to the new process.
  *
+ * This symbol should not be explicitly exported to loadable modules.
+ *
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
  */
-static int pidfd_create(struct pid *pid, unsigned int flags)
+int pidfd_create(struct pid *pid, unsigned int flags)
 {
 	int fd;
 
-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/5] kernel/pid.c: implement checks on parameters passed to pidfd_create()
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
  2021-05-20  2:10 ` [PATCH 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
@ 2021-05-20  2:10 ` Matthew Bobrowski
  2021-05-20  2:11 ` [PATCH 3/5] fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:10 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

By adding the pidfd_create() declaration to linux/pid.h, it
effectively exposes this function to the rest of the kernel. As the
function currently stands, any part of the kernel could call
pidfd_create() and the pidfd creation would not be subject to the same
parameter constraints/checks as calling pidfd_open() would force upon
a caller. This could lead to unintended behavior and/or misuse of the
pidfd API.

To mitigate this, the pid_has_task() check is rolled up into
pidfd_create() and flags that are passed as parameter are checked for
validity.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 kernel/pid.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index d3cd95b8b080..1805cb3d74b7 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -559,6 +559,12 @@ int pidfd_create(struct pid *pid, unsigned int flags)
 {
 	int fd;
 
+	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+		return -EINVAL;
+
+	if (flags & ~(O_NONBLOCK | O_CLOEXEC | O_RDWR))
+		return -EINVAL;
+
 	fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
 			      flags | O_RDWR | O_CLOEXEC);
 	if (fd < 0)
@@ -598,10 +604,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	if (!p)
 		return -ESRCH;
 
-	if (pid_has_task(p, PIDTYPE_TGID))
-		fd = pidfd_create(p, flags);
-	else
-		fd = -EINVAL;
+	fd = pidfd_create(p, flags);
 
 	put_pid(p);
 	return fd;
-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/5] fanotify_user.c: minor cosmetic adjustments to fid labels
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
  2021-05-20  2:10 ` [PATCH 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
  2021-05-20  2:10 ` [PATCH 2/5] kernel/pid.c: implement checks on parameters passed to pidfd_create() Matthew Bobrowski
@ 2021-05-20  2:11 ` Matthew Bobrowski
  2021-05-20  2:11 ` [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function Matthew Bobrowski
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:11 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

With the idea to support additional info record types in the future
i.e. fanotify_event_info_pidfd, it's a good idea to rename some of the
labels assigned to some of the existing fid related functions,
parameters, etc which more accurately represent the intent behind
their usage.

For example, copy_info_to_user() was defined with a generic function
label, which arguably reads as being supportive of different info
record types, however the parameter list for this function is
explicitly tailored towards the creation and copying of the
fanotify_event_info_fid records. This same point applies to the macro
defined as FANOTIFY_INFO_HDR_LEN.

With fanotify_event_info_len(), we change the parameter label so that
the function implies that it can be extended to calculate the length
for additional info record types.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 fs/notify/fanotify/fanotify_user.c | 32 +++++++++++++++++-------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 71fefb30e015..914ff8cf30df 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -104,7 +104,7 @@ struct kmem_cache *fanotify_path_event_cachep __read_mostly;
 struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
 
 #define FANOTIFY_EVENT_ALIGN 4
-#define FANOTIFY_INFO_HDR_LEN \
+#define FANOTIFY_FID_INFO_HDR_LEN \
 	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
 
 static int fanotify_fid_info_len(int fh_len, int name_len)
@@ -114,10 +114,11 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
 	if (name_len)
 		info_len += name_len + 1;
 
-	return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
+	return roundup(FANOTIFY_FID_INFO_HDR_LEN + info_len,
+		       FANOTIFY_EVENT_ALIGN);
 }
 
-static int fanotify_event_info_len(unsigned int fid_mode,
+static int fanotify_event_info_len(unsigned int info_mode,
 				   struct fanotify_event *event)
 {
 	struct fanotify_info *info = fanotify_event_info(event);
@@ -128,7 +129,8 @@ static int fanotify_event_info_len(unsigned int fid_mode,
 
 	if (dir_fh_len) {
 		info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
-	} else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
+	} else if ((info_mode & FAN_REPORT_NAME) &&
+		   (event->mask & FAN_ONDIR)) {
 		/*
 		 * With group flag FAN_REPORT_NAME, if name was not recorded in
 		 * event on a directory, we will report the name ".".
@@ -303,9 +305,10 @@ static int process_access_response(struct fsnotify_group *group,
 	return -ENOENT;
 }
 
-static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
-			     int info_type, const char *name, size_t name_len,
-			     char __user *buf, size_t count)
+static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
+				 struct fanotify_fh *fh, int info_type,
+				 const char *name, size_t name_len,
+				 char __user *buf, size_t count)
 {
 	struct fanotify_event_info_fid info = { };
 	struct file_handle handle = { };
@@ -459,10 +462,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	if (fanotify_event_dir_fh_len(event)) {
 		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
 					     FAN_EVENT_INFO_TYPE_DFID;
-		ret = copy_info_to_user(fanotify_event_fsid(event),
-					fanotify_info_dir_fh(info),
-					info_type, fanotify_info_name(info),
-					info->name_len, buf, count);
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_info_dir_fh(info),
+					    info_type, fanotify_info_name(info),
+					    info->name_len, buf, count);
 		if (ret < 0)
 			return ret;
 
@@ -508,9 +511,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 			info_type = FAN_EVENT_INFO_TYPE_FID;
 		}
 
-		ret = copy_info_to_user(fanotify_event_fsid(event),
-					fanotify_event_object_fh(event),
-					info_type, dot, dot_len, buf, count);
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_event_object_fh(event),
+					    info_type, dot, dot_len,
+					    buf, count);
 		if (ret < 0)
 			return ret;
 
-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (2 preceding siblings ...)
  2021-05-20  2:11 ` [PATCH 3/5] fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
@ 2021-05-20  2:11 ` Matthew Bobrowski
  2021-05-20 13:59   ` Amir Goldstein
  2021-05-20  2:11 ` [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API Matthew Bobrowski
  2021-05-20 13:55 ` [PATCH 0/5] Add pidfd " Jan Kara
  5 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:11 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

The copy_info_to_user() function has been repurposed with the idea to
handle the copying of different info record types. This helper allows
for the separation of info record copying routines/conditionals from
copy_event_to_user(), which reduces the overall clutter within this
function.

The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
record types that have been enabled for a specific notification
group. This becomes useful when FAN_REPORT_PIDFD is introduced.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 fs/notify/fanotify/fanotify_user.c | 145 ++++++++++++++++-------------
 include/linux/fanotify.h           |   2 +
 2 files changed, 81 insertions(+), 66 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 914ff8cf30df..1e15f3222eb2 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -173,7 +173,7 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 	size_t event_size = FAN_EVENT_METADATA_LEN;
 	struct fanotify_event *event = NULL;
 	struct fsnotify_event *fsn_event;
-	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
+	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
 
 	pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
 
@@ -183,8 +183,8 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 		goto out;
 
 	event = FANOTIFY_E(fsn_event);
-	if (fid_mode)
-		event_size += fanotify_event_info_len(fid_mode, event);
+	if (info_mode)
+		event_size += fanotify_event_info_len(info_mode, event);
 
 	if (event_size > count) {
 		event = ERR_PTR(-EINVAL);
@@ -401,6 +401,78 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
 	return info_len;
 }
 
+static int copy_info_to_user(struct fanotify_event *event,
+			     struct fanotify_info *info,
+			     unsigned int info_mode,
+			     char __user *buf, size_t count)
+{
+	int ret, info_type = 0;
+	unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
+
+	/*
+	 * Event info records order is as follows: dir fid + name, child fid.
+	 */
+	if (fanotify_event_dir_fh_len(event)) {
+		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
+					     FAN_EVENT_INFO_TYPE_DFID;
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_info_dir_fh(info),
+					    info_type,
+					    fanotify_info_name(info),
+					    info->name_len, buf, count);
+		if (ret < 0)
+			return ret;
+
+		buf += ret;
+		count -= ret;
+	}
+
+	if (fanotify_event_object_fh_len(event)) {
+		const char *dot = NULL;
+		int dot_len = 0;
+
+		if (fid_mode == FAN_REPORT_FID || info_type) {
+			/*
+			 * With only group flag FAN_REPORT_FID only type FID is
+			 * reported. Second info record type is always FID.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_FID;
+		} else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
+			/*
+			 * With group flag FAN_REPORT_NAME, if name was not
+			 * recorded in an event on a directory, report the name
+			 * "." with info type DFID_NAME.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
+			dot = ".";
+			dot_len = 1;
+		} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
+			   (event->mask & FAN_ONDIR)) {
+			/*
+			 * With group flag FAN_REPORT_DIR_FID, a single info
+			 * record has type DFID for directory entry
+			 * modification event and for event on a directory.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_DFID;
+		} else {
+			/*
+			 * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
+			 * a single info record has type FID for event on a
+			 * non-directory, when there is no directory to report.
+			 * For example, on FAN_DELETE_SELF event.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_FID;
+		}
+
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_event_object_fh(event),
+					    info_type, dot, dot_len,
+					    buf, count);
+	}
+
+	return ret;
+}
+
 static ssize_t copy_event_to_user(struct fsnotify_group *group,
 				  struct fanotify_event *event,
 				  char __user *buf, size_t count)
@@ -408,15 +480,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	struct fanotify_event_metadata metadata;
 	struct path *path = fanotify_event_path(event);
 	struct fanotify_info *info = fanotify_event_info(event);
-	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
+	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
 	struct file *f = NULL;
 	int ret, fd = FAN_NOFD;
-	int info_type = 0;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
 	metadata.event_len = FAN_EVENT_METADATA_LEN +
-				fanotify_event_info_len(fid_mode, event);
+				fanotify_event_info_len(info_mode, event);
 	metadata.metadata_len = FAN_EVENT_METADATA_LEN;
 	metadata.vers = FANOTIFY_METADATA_VERSION;
 	metadata.reserved = 0;
@@ -458,68 +529,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	if (f)
 		fd_install(fd, f);
 
-	/* Event info records order is: dir fid + name, child fid */
-	if (fanotify_event_dir_fh_len(event)) {
-		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
-					     FAN_EVENT_INFO_TYPE_DFID;
-		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
-					    fanotify_info_dir_fh(info),
-					    info_type, fanotify_info_name(info),
-					    info->name_len, buf, count);
+	if (info_mode) {
+		ret = copy_info_to_user(event, info, info_mode, buf, count);
 		if (ret < 0)
 			return ret;
-
-		buf += ret;
-		count -= ret;
-	}
-
-	if (fanotify_event_object_fh_len(event)) {
-		const char *dot = NULL;
-		int dot_len = 0;
-
-		if (fid_mode == FAN_REPORT_FID || info_type) {
-			/*
-			 * With only group flag FAN_REPORT_FID only type FID is
-			 * reported. Second info record type is always FID.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_FID;
-		} else if ((fid_mode & FAN_REPORT_NAME) &&
-			   (event->mask & FAN_ONDIR)) {
-			/*
-			 * With group flag FAN_REPORT_NAME, if name was not
-			 * recorded in an event on a directory, report the
-			 * name "." with info type DFID_NAME.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
-			dot = ".";
-			dot_len = 1;
-		} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
-			   (event->mask & FAN_ONDIR)) {
-			/*
-			 * With group flag FAN_REPORT_DIR_FID, a single info
-			 * record has type DFID for directory entry modification
-			 * event and for event on a directory.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_DFID;
-		} else {
-			/*
-			 * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
-			 * a single info record has type FID for event on a
-			 * non-directory, when there is no directory to report.
-			 * For example, on FAN_DELETE_SELF event.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_FID;
-		}
-
-		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
-					    fanotify_event_object_fh(event),
-					    info_type, dot, dot_len,
-					    buf, count);
-		if (ret < 0)
-			return ret;
-
-		buf += ret;
-		count -= ret;
 	}
 
 	return metadata.event_len;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index bad41bcb25df..f76c7635efc8 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -27,6 +27,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 #define FANOTIFY_FID_BITS	(FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
 
+#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS)
+
 /*
  * fanotify_init() flags that require CAP_SYS_ADMIN.
  * We do not allow unprivileged groups to request permission events.
-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (3 preceding siblings ...)
  2021-05-20  2:11 ` [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function Matthew Bobrowski
@ 2021-05-20  2:11 ` Matthew Bobrowski
  2021-05-20  8:17   ` Christian Brauner
  2021-05-20 13:55 ` [PATCH 0/5] Add pidfd " Jan Kara
  5 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-20  2:11 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
allows userspace applications to control whether a pidfd info record
containing a pidfd is to be returned with each event.

If FAN_REPORT_PIDFD is enabled for a notification group, an additional
struct fanotify_event_info_pidfd object will be supplied alongside the
generic struct fanotify_event_metadata within a single event. This
functionality is analogous to that of FAN_REPORT_FID in terms of how
the event structure is supplied to the userspace application. Usage of
FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
permitted, and in this case a struct fanotify_event_info_pidfd object
will follow any struct fanotify_event_info_fid object.

Usage of FAN_REPORT_TID is not permitted with FAN_REPORT_PIDFD as the
pidfd API only supports the creation of pidfds for thread-group
leaders. Attempting to do so will result with a -EINVAL being returned
when calling fanotify_init(2).

If pidfd creation fails via pidfd_create(), the pidfd field within
struct fanotify_event_info_pidfd is set to FAN_NOPIDFD.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 fs/notify/fanotify/fanotify_user.c | 65 +++++++++++++++++++++++++++---
 include/linux/fanotify.h           |  3 +-
 include/uapi/linux/fanotify.h      | 12 ++++++
 3 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 1e15f3222eb2..bba61988f4a0 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -106,6 +106,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
 #define FANOTIFY_EVENT_ALIGN 4
 #define FANOTIFY_FID_INFO_HDR_LEN \
 	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
+#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+	sizeof(struct fanotify_event_info_pidfd)
 
 static int fanotify_fid_info_len(int fh_len, int name_len)
 {
@@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
 	if (fh_len)
 		info_len += fanotify_fid_info_len(fh_len, dot_len);
 
+	if (info_mode & FAN_REPORT_PIDFD)
+		info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
+
 	return info_len;
 }
 
@@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
 	return info_len;
 }
 
+static int copy_pidfd_info_to_user(struct pid *pid,
+				   char __user *buf,
+				   size_t count)
+{
+	struct fanotify_event_info_pidfd info = { };
+	size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+
+	if (WARN_ON_ONCE(info_len > count))
+		return -EFAULT;
+
+	info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
+	info.hdr.len = info_len;
+
+	info.pidfd = pidfd_create(pid, 0);
+	if (info.pidfd < 0)
+		info.pidfd = FAN_NOPIDFD;
+
+	if (copy_to_user(buf, &info, info_len))
+		return -EFAULT;
+
+	return info_len;
+}
+
 static int copy_info_to_user(struct fanotify_event *event,
 			     struct fanotify_info *info,
 			     unsigned int info_mode,
@@ -408,9 +436,12 @@ static int copy_info_to_user(struct fanotify_event *event,
 {
 	int ret, info_type = 0;
 	unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
+	unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
 
 	/*
 	 * Event info records order is as follows: dir fid + name, child fid.
+	 * If FAN_REPORT_PIDFD has been specified, then a pidfd info record
+	 * will follow the fid info records.
 	 */
 	if (fanotify_event_dir_fh_len(event)) {
 		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
@@ -465,10 +496,18 @@ static int copy_info_to_user(struct fanotify_event *event,
 		}
 
 		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
-					    fanotify_event_object_fh(event),
-					    info_type, dot, dot_len,
-					    buf, count);
-	}
+					    fanotify_event_object_fh(event),
+					    info_type, dot, dot_len,
+					    buf, count);
+		if (ret < 0)
+			return ret;
+
+		buf += ret;
+		count -= ret;
+	}
+
+	if (pidfd_mode)
+		return copy_pidfd_info_to_user(event->pid, buf, count);
 
 	return ret;
 }
@@ -530,6 +569,15 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 		fd_install(fd, f);
 
 	if (info_mode) {
+		/*
+		 * Complain if FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
+		 * exclusion is ever lifted. At the time of implementing
+		 * FAN_REPORT_PIDFD, the pidfd API only supported the creation
+		 * of pidfds on thread-group leaders.
+		 */
+		WARN_ON_ONCE((info_mode & FAN_REPORT_PIDFD) &&
+			     FAN_GROUP_FLAG(group, FAN_REPORT_TID));
+
 		ret = copy_info_to_user(event, info, info_mode, buf, count);
 		if (ret < 0)
 			return ret;
@@ -1079,6 +1127,13 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 #endif
 		return -EINVAL;
 
+	/*
+	 * A pidfd can only be returned for a thread-group leader; thus
+	 * FAN_REPORT_TID and FAN_REPORT_PIDFD need to be mutually exclusive.
+	 */
+	if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
+		return -EINVAL;
+
 	if (event_f_flags & ~FANOTIFY_INIT_ALL_EVENT_F_BITS)
 		return -EINVAL;
 
@@ -1477,7 +1532,7 @@ static int __init fanotify_user_setup(void)
 	max_marks = clamp(max_marks, FANOTIFY_OLD_DEFAULT_MAX_MARKS,
 				     FANOTIFY_DEFAULT_MAX_USER_MARKS);
 
-	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
+	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
 	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
 
 	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index f76c7635efc8..bb2898240e5a 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -27,7 +27,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 #define FANOTIFY_FID_BITS	(FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
 
-#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS)
+#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS | FAN_REPORT_PIDFD)
 
 /*
  * fanotify_init() flags that require CAP_SYS_ADMIN.
@@ -37,6 +37,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
  */
 #define FANOTIFY_ADMIN_INIT_FLAGS	(FANOTIFY_PERM_CLASSES | \
 					 FAN_REPORT_TID | \
+					 FAN_REPORT_PIDFD | \
 					 FAN_UNLIMITED_QUEUE | \
 					 FAN_UNLIMITED_MARKS)
 
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index fbf9c5c7dd59..36c3bddcf690 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -55,6 +55,7 @@
 #define FAN_REPORT_FID		0x00000200	/* Report unique file id */
 #define FAN_REPORT_DIR_FID	0x00000400	/* Report unique directory id */
 #define FAN_REPORT_NAME		0x00000800	/* Report events with name */
+#define FAN_REPORT_PIDFD	0x00001000	/* Report pidfd for event->pid */
 
 /* Convenience macro - FAN_REPORT_NAME requires FAN_REPORT_DIR_FID */
 #define FAN_REPORT_DFID_NAME	(FAN_REPORT_DIR_FID | FAN_REPORT_NAME)
@@ -123,6 +124,7 @@ struct fanotify_event_metadata {
 #define FAN_EVENT_INFO_TYPE_FID		1
 #define FAN_EVENT_INFO_TYPE_DFID_NAME	2
 #define FAN_EVENT_INFO_TYPE_DFID	3
+#define FAN_EVENT_INFO_TYPE_PIDFD	4
 
 /* Variable length info record following event metadata */
 struct fanotify_event_info_header {
@@ -148,6 +150,15 @@ struct fanotify_event_info_fid {
 	unsigned char handle[0];
 };
 
+/*
+ * This structure is used for info records of type FAN_EVENT_INFO_TYPE_PIDFD.
+ * It holds a pidfd for the pid responsible for generating an event.
+ */
+struct fanotify_event_info_pidfd {
+	struct fanotify_event_info_header hdr;
+	__s32 pidfd;
+};
+
 struct fanotify_response {
 	__s32 fd;
 	__u32 response;
@@ -160,6 +171,7 @@ struct fanotify_response {
 
 /* No fd set in event */
 #define FAN_NOFD	-1
+#define FAN_NOPIDFD	FAN_NOFD
 
 /* Helper functions to deal with fanotify_event_metadata buffers */
 #define FAN_EVENT_METADATA_LEN (sizeof(struct fanotify_event_metadata))
-- 
2.31.1.751.gd2f1c929bd-goog

/M

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-20  2:11 ` [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API Matthew Bobrowski
@ 2021-05-20  8:17   ` Christian Brauner
  2021-05-20 13:43     ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2021-05-20  8:17 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: jack, amir73il, linux-fsdevel, linux-api

On Thu, May 20, 2021 at 12:11:51PM +1000, Matthew Bobrowski wrote:
> Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> allows userspace applications to control whether a pidfd info record
> containing a pidfd is to be returned with each event.
> 
> If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> struct fanotify_event_info_pidfd object will be supplied alongside the
> generic struct fanotify_event_metadata within a single event. This
> functionality is analogous to that of FAN_REPORT_FID in terms of how
> the event structure is supplied to the userspace application. Usage of
> FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> permitted, and in this case a struct fanotify_event_info_pidfd object
> will follow any struct fanotify_event_info_fid object.
> 
> Usage of FAN_REPORT_TID is not permitted with FAN_REPORT_PIDFD as the
> pidfd API only supports the creation of pidfds for thread-group
> leaders. Attempting to do so will result with a -EINVAL being returned
> when calling fanotify_init(2).
> 
> If pidfd creation fails via pidfd_create(), the pidfd field within
> struct fanotify_event_info_pidfd is set to FAN_NOPIDFD.
> 
> Signed-off-by: Matthew Bobrowski <repnop@google.com>
> ---
>  fs/notify/fanotify/fanotify_user.c | 65 +++++++++++++++++++++++++++---
>  include/linux/fanotify.h           |  3 +-
>  include/uapi/linux/fanotify.h      | 12 ++++++
>  3 files changed, 74 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 1e15f3222eb2..bba61988f4a0 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -106,6 +106,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
>  #define FANOTIFY_EVENT_ALIGN 4
>  #define FANOTIFY_FID_INFO_HDR_LEN \
>  	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
> +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> +	sizeof(struct fanotify_event_info_pidfd)
>  
>  static int fanotify_fid_info_len(int fh_len, int name_len)
>  {
> @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
>  	if (fh_len)
>  		info_len += fanotify_fid_info_len(fh_len, dot_len);
>  
> +	if (info_mode & FAN_REPORT_PIDFD)
> +		info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> +
>  	return info_len;
>  }
>  
> @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
>  	return info_len;
>  }
>  
> +static int copy_pidfd_info_to_user(struct pid *pid,
> +				   char __user *buf,
> +				   size_t count)
> +{
> +	struct fanotify_event_info_pidfd info = { };
> +	size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> +
> +	if (WARN_ON_ONCE(info_len > count))
> +		return -EFAULT;
> +
> +	info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> +	info.hdr.len = info_len;
> +
> +	info.pidfd = pidfd_create(pid, 0);
> +	if (info.pidfd < 0)
> +		info.pidfd = FAN_NOPIDFD;
> +
> +	if (copy_to_user(buf, &info, info_len))
> +		return -EFAULT;

Hm, well this kinda sucks. The caller can end up with a pidfd in their
fd table and when the copy_to_user() failed they won't know what fd it
is. I think this might be better served by sm like (see what I did in
kernel/fork.c): 

	pidfd = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
	if (pidfd < 0)
		/* error handling */
	
	pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, O_RDWR | O_CLOEXEC);
	if (IS_ERR(pidfile)) {
		put_unused_fd(pidfd);
		pidfd = PTR_ERR(pidfile);
		/* error handling */
	}
	get_pid(pid);	/* held by pidfile now */

	info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
	info.hdr.len = info_len;
	info.pidfd = pidfd;

	if (copy_to_user(buf, &info, info_len)) {
		fput(pidfile);
		put_unused_fd(pidfd);
		return -EFAULT;
	}

	fd_install(pidfd, pidfile);
	return 0;

> +
> +	return info_len;
> +}
> +
>  static int copy_info_to_user(struct fanotify_event *event,
>  			     struct fanotify_info *info,
>  			     unsigned int info_mode,
> @@ -408,9 +436,12 @@ static int copy_info_to_user(struct fanotify_event *event,
>  {
>  	int ret, info_type = 0;
>  	unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
> +	unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
>  
>  	/*
>  	 * Event info records order is as follows: dir fid + name, child fid.
> +	 * If FAN_REPORT_PIDFD has been specified, then a pidfd info record
> +	 * will follow the fid info records.
>  	 */
>  	if (fanotify_event_dir_fh_len(event)) {
>  		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> @@ -465,10 +496,18 @@ static int copy_info_to_user(struct fanotify_event *event,
>  		}
>  
>  		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> -					    fanotify_event_object_fh(event),
> -					    info_type, dot, dot_len,
> -					    buf, count);
> -	}
> +					    fanotify_event_object_fh(event),
> +					    info_type, dot, dot_len,
> +					    buf, count);
> +		if (ret < 0)
> +			return ret;
> +
> +		buf += ret;
> +		count -= ret;
> +	}
> +
> +	if (pidfd_mode)
> +		return copy_pidfd_info_to_user(event->pid, buf, count);
>  
>  	return ret;
>  }
> @@ -530,6 +569,15 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>  		fd_install(fd, f);
>  
>  	if (info_mode) {
> +		/*
> +		 * Complain if FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> +		 * exclusion is ever lifted. At the time of implementing
> +		 * FAN_REPORT_PIDFD, the pidfd API only supported the creation
> +		 * of pidfds on thread-group leaders.
> +		 */
> +		WARN_ON_ONCE((info_mode & FAN_REPORT_PIDFD) &&
> +			     FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> +
>  		ret = copy_info_to_user(event, info, info_mode, buf, count);
>  		if (ret < 0)
>  			return ret;
> @@ -1079,6 +1127,13 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
>  #endif
>  		return -EINVAL;
>  
> +	/*
> +	 * A pidfd can only be returned for a thread-group leader; thus
> +	 * FAN_REPORT_TID and FAN_REPORT_PIDFD need to be mutually exclusive.
> +	 */
> +	if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
> +		return -EINVAL;
> +
>  	if (event_f_flags & ~FANOTIFY_INIT_ALL_EVENT_F_BITS)
>  		return -EINVAL;
>  
> @@ -1477,7 +1532,7 @@ static int __init fanotify_user_setup(void)
>  	max_marks = clamp(max_marks, FANOTIFY_OLD_DEFAULT_MAX_MARKS,
>  				     FANOTIFY_DEFAULT_MAX_USER_MARKS);
>  
> -	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
> +	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
>  	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
>  
>  	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index f76c7635efc8..bb2898240e5a 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -27,7 +27,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>  
>  #define FANOTIFY_FID_BITS	(FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
>  
> -#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS)
> +#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS | FAN_REPORT_PIDFD)
>  
>  /*
>   * fanotify_init() flags that require CAP_SYS_ADMIN.
> @@ -37,6 +37,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>   */
>  #define FANOTIFY_ADMIN_INIT_FLAGS	(FANOTIFY_PERM_CLASSES | \
>  					 FAN_REPORT_TID | \
> +					 FAN_REPORT_PIDFD | \
>  					 FAN_UNLIMITED_QUEUE | \
>  					 FAN_UNLIMITED_MARKS)
>  
> diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
> index fbf9c5c7dd59..36c3bddcf690 100644
> --- a/include/uapi/linux/fanotify.h
> +++ b/include/uapi/linux/fanotify.h
> @@ -55,6 +55,7 @@
>  #define FAN_REPORT_FID		0x00000200	/* Report unique file id */
>  #define FAN_REPORT_DIR_FID	0x00000400	/* Report unique directory id */
>  #define FAN_REPORT_NAME		0x00000800	/* Report events with name */
> +#define FAN_REPORT_PIDFD	0x00001000	/* Report pidfd for event->pid */
>  
>  /* Convenience macro - FAN_REPORT_NAME requires FAN_REPORT_DIR_FID */
>  #define FAN_REPORT_DFID_NAME	(FAN_REPORT_DIR_FID | FAN_REPORT_NAME)
> @@ -123,6 +124,7 @@ struct fanotify_event_metadata {
>  #define FAN_EVENT_INFO_TYPE_FID		1
>  #define FAN_EVENT_INFO_TYPE_DFID_NAME	2
>  #define FAN_EVENT_INFO_TYPE_DFID	3
> +#define FAN_EVENT_INFO_TYPE_PIDFD	4
>  
>  /* Variable length info record following event metadata */
>  struct fanotify_event_info_header {
> @@ -148,6 +150,15 @@ struct fanotify_event_info_fid {
>  	unsigned char handle[0];
>  };
>  
> +/*
> + * This structure is used for info records of type FAN_EVENT_INFO_TYPE_PIDFD.
> + * It holds a pidfd for the pid responsible for generating an event.
> + */
> +struct fanotify_event_info_pidfd {
> +	struct fanotify_event_info_header hdr;
> +	__s32 pidfd;
> +};
> +
>  struct fanotify_response {
>  	__s32 fd;
>  	__u32 response;
> @@ -160,6 +171,7 @@ struct fanotify_response {
>  
>  /* No fd set in event */
>  #define FAN_NOFD	-1
> +#define FAN_NOPIDFD	FAN_NOFD
>  
>  /* Helper functions to deal with fanotify_event_metadata buffers */
>  #define FAN_EVENT_METADATA_LEN (sizeof(struct fanotify_event_metadata))
> -- 
> 2.31.1.751.gd2f1c929bd-goog
> 
> /M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-20  8:17   ` Christian Brauner
@ 2021-05-20 13:43     ` Amir Goldstein
  2021-05-21  9:21       ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-05-20 13:43 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Matthew Bobrowski, Jan Kara, linux-fsdevel, Linux API

On Thu, May 20, 2021 at 11:17 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Thu, May 20, 2021 at 12:11:51PM +1000, Matthew Bobrowski wrote:
> > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > allows userspace applications to control whether a pidfd info record
> > containing a pidfd is to be returned with each event.
> >
> > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > struct fanotify_event_info_pidfd object will be supplied alongside the
> > generic struct fanotify_event_metadata within a single event. This
> > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > the event structure is supplied to the userspace application. Usage of
> > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > permitted, and in this case a struct fanotify_event_info_pidfd object
> > will follow any struct fanotify_event_info_fid object.
> >
> > Usage of FAN_REPORT_TID is not permitted with FAN_REPORT_PIDFD as the
> > pidfd API only supports the creation of pidfds for thread-group
> > leaders. Attempting to do so will result with a -EINVAL being returned
> > when calling fanotify_init(2).
> >
> > If pidfd creation fails via pidfd_create(), the pidfd field within
> > struct fanotify_event_info_pidfd is set to FAN_NOPIDFD.
> >
> > Signed-off-by: Matthew Bobrowski <repnop@google.com>
> > ---
> >  fs/notify/fanotify/fanotify_user.c | 65 +++++++++++++++++++++++++++---
> >  include/linux/fanotify.h           |  3 +-
> >  include/uapi/linux/fanotify.h      | 12 ++++++
> >  3 files changed, 74 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 1e15f3222eb2..bba61988f4a0 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -106,6 +106,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
> >  #define FANOTIFY_EVENT_ALIGN 4
> >  #define FANOTIFY_FID_INFO_HDR_LEN \
> >       (sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
> > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > +     sizeof(struct fanotify_event_info_pidfd)
> >
> >  static int fanotify_fid_info_len(int fh_len, int name_len)
> >  {
> > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> >       if (fh_len)
> >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> >
> > +     if (info_mode & FAN_REPORT_PIDFD)
> > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > +
> >       return info_len;
> >  }
> >
> > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> >       return info_len;
> >  }
> >
> > +static int copy_pidfd_info_to_user(struct pid *pid,
> > +                                char __user *buf,
> > +                                size_t count)
> > +{
> > +     struct fanotify_event_info_pidfd info = { };
> > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > +
> > +     if (WARN_ON_ONCE(info_len > count))
> > +             return -EFAULT;
> > +
> > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > +     info.hdr.len = info_len;
> > +
> > +     info.pidfd = pidfd_create(pid, 0);
> > +     if (info.pidfd < 0)
> > +             info.pidfd = FAN_NOPIDFD;
> > +
> > +     if (copy_to_user(buf, &info, info_len))
> > +             return -EFAULT;
>
> Hm, well this kinda sucks. The caller can end up with a pidfd in their
> fd table and when the copy_to_user() failed they won't know what fd it

Good catch!
But I prefer to solve it differently, because moving fd_install() to the
end of this function does not guarantee that copy_event_to_user()
won't return an error one day with dangling pidfd in fd table.

It might be simpler to do pidfd_create() next to create_fd() in
copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
pidfd can be closed on error along with fd on out_close_fd label.

You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
(even though fanotify_init() does check for that).

Anyway, something like:

        if (!capable(CAP_SYS_ADMIN) &&
            task_tgid(current) != event->pid)
                metadata.pid = 0;
+      else if (pidfd_mode)
+              pidfd = pidfd_create(pid, 0);

[...]

+       if (pidfd_mode)
+               ret = copy_pidfd_info_to_user(pidfd, buf, count);

        return metadata.event_len;

out_close_fd:
+        if (pidfd != FAN_NOPIDFD) {
...


And in any case, it wrong to call copy_pidfd_info_to_user() from
copy_info_to_user(). It needs to be called once from copy_event_to_user()
because copy_pidfd_info_to_user() may be called twice to report both
FAN_EVENT_INFO_TYPE_FID and FAN_EVENT_INFO_TYPE_DFID
records for the same event.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (4 preceding siblings ...)
  2021-05-20  2:11 ` [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API Matthew Bobrowski
@ 2021-05-20 13:55 ` Jan Kara
  2021-05-21 10:15   ` Matthew Bobrowski
  5 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-20 13:55 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: jack, amir73il, christian.brauner, linux-fsdevel, linux-api

On Thu 20-05-21 12:09:45, Matthew Bobrowski wrote:
> Hey Jan/Amir/Christian,
> 
> This is the updated patch series for adding pidfd support to the
> fanotify API. It incorporates all the suggestions that had come out of
> the initial RFC patch series [0].
> 
> The main difference with this patch series is that FAN_REPORT_PIDFD
> results in an additional info record object supplied alongside the
> generic event metadata object instead of overloading metadata->pid. If
> any of the fid flavoured init flags are specified, then the pidfd info
> record object will follow any fid info record objects.
> 
> [0] https://www.spinics.net/lists/linux-fsdevel/msg193296.html

Overall the series looks fine to me - modulo the problems Christian & Amir
found. Do you have any tests for this? Preferably for LTP so that we can
extend the coverage there?

								Honza

> 
> Matthew Bobrowski (5):
>   kernel/pid.c: remove static qualifier from pidfd_create()
>   kernel/pid.c: implement checks on parameters passed to pidfd_create()
>   fanotify_user.c: minor cosmetic adjustments to fid labels
>   fanotify/fanotify_user.c: introduce a generic info record copying
>     function
>   fanotify: Add pidfd info record support to the fanotify API
> 
>  fs/notify/fanotify/fanotify_user.c | 216 +++++++++++++++++++----------
>  include/linux/fanotify.h           |   3 +
>  include/linux/pid.h                |   1 +
>  include/uapi/linux/fanotify.h      |  12 ++
>  kernel/pid.c                       |  15 +-
>  5 files changed, 170 insertions(+), 77 deletions(-)
> 
> -- 
> 2.31.1.751.gd2f1c929bd-goog
> 
> /M
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function
  2021-05-20  2:11 ` [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function Matthew Bobrowski
@ 2021-05-20 13:59   ` Amir Goldstein
  2021-05-21  9:26     ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-05-20 13:59 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Thu, May 20, 2021 at 5:11 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> The copy_info_to_user() function has been repurposed with the idea to

Sorry I gave comment on patch 5 before I knew you repurposed
copy_info_to_user().
Perhaps it would be better to give a non ambiguous name like
copy_info_records_to_user() or
copy_fid_info_records_to_user() and leave pidfd out of this helper.
If you do wish to keep pidfd copy inside this helper and follow my
suggestion on patch 5 you will have to pass in pidfd as another argument.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-20 13:43     ` Amir Goldstein
@ 2021-05-21  9:21       ` Matthew Bobrowski
  2021-05-21  9:41         ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-21  9:21 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

Hey Amir/Christian,

On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > +     sizeof(struct fanotify_event_info_pidfd)
> > >
> > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > >  {
> > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > >       if (fh_len)
> > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > >
> > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > +
> > >       return info_len;
> > >  }
> > >
> > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > >       return info_len;
> > >  }
> > >
> > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > +                                char __user *buf,
> > > +                                size_t count)
> > > +{
> > > +     struct fanotify_event_info_pidfd info = { };
> > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > +
> > > +     if (WARN_ON_ONCE(info_len > count))
> > > +             return -EFAULT;
> > > +
> > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > +     info.hdr.len = info_len;
> > > +
> > > +     info.pidfd = pidfd_create(pid, 0);
> > > +     if (info.pidfd < 0)
> > > +             info.pidfd = FAN_NOPIDFD;
> > > +
> > > +     if (copy_to_user(buf, &info, info_len))
> > > +             return -EFAULT;
> >
> > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > fd table and when the copy_to_user() failed they won't know what fd it
> 
> Good catch!

Super awesome catch Christian, thanks pulling this up!

> But I prefer to solve it differently, because moving fd_install() to the
> end of this function does not guarantee that copy_event_to_user()
> won't return an error one day with dangling pidfd in fd table.

I can see the angle you're approaching this from...

> It might be simpler to do pidfd_create() next to create_fd() in
> copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> pidfd can be closed on error along with fd on out_close_fd label.
> 
> You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> (even though fanotify_init() does check for that).

I didn't really understand the need for this check here given that the
administrative bits are already being checked for in fanotify_init()
i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
thus never walking any of the pidfd_mode paths. Is this just a defense
in depth approach here, or is it something else that I'm missing?

> Anyway, something like:
> 
>         if (!capable(CAP_SYS_ADMIN) &&
>             task_tgid(current) != event->pid)
>                 metadata.pid = 0;
> +      else if (pidfd_mode)
> +              pidfd = pidfd_create(pid, 0);
> 
> [...]
> 
> +       if (pidfd_mode)
> +               ret = copy_pidfd_info_to_user(pidfd, buf, count);
> 
>         return metadata.event_len;
> 
> out_close_fd:
> +        if (pidfd != FAN_NOPIDFD) {
> ...

The early call to pidfd_create() and clean up in copy_event_to_user()
makes most sense to me.

> And in any case, it wrong to call copy_pidfd_info_to_user() from
> copy_info_to_user(). It needs to be called once from copy_event_to_user()
> because copy_pidfd_info_to_user() may be called twice to report both
> FAN_EVENT_INFO_TYPE_FID and FAN_EVENT_INFO_TYPE_DFID
> records for the same event.

Right, as mentioned in patch 4 of this series, copy_info_to_user() has
been repurposed to account for the double call into
copy_fid_info_to_user() when reporting FAN_EVENT_INFO_TYPE_FID and
FAN_EVENT_INFO_TYPE_DFID.

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function
  2021-05-20 13:59   ` Amir Goldstein
@ 2021-05-21  9:26     ` Matthew Bobrowski
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-21  9:26 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Thu, May 20, 2021 at 04:59:23PM +0300, Amir Goldstein wrote:
> On Thu, May 20, 2021 at 5:11 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > The copy_info_to_user() function has been repurposed with the idea to
> 
> Sorry I gave comment on patch 5 before I knew you repurposed
> copy_info_to_user().
> Perhaps it would be better to give a non ambiguous name like
> copy_info_records_to_user() or

^ I like this.

> copy_fid_info_records_to_user() and leave pidfd out of this helper.
> If you do wish to keep pidfd copy inside this helper and follow my
> suggestion on patch 5 you will have to pass in pidfd as another argument.

I'd like to keep all the info record related copying branching out
from a single helper, which in this case will now be
copy_info_records_to_user().

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21  9:21       ` Matthew Bobrowski
@ 2021-05-21  9:41         ` Amir Goldstein
  2021-05-21 10:24           ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-05-21  9:41 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
>
> Hey Amir/Christian,
>
> On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > >
> > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > >  {
> > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > >       if (fh_len)
> > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > >
> > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > +
> > > >       return info_len;
> > > >  }
> > > >
> > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > >       return info_len;
> > > >  }
> > > >
> > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > +                                char __user *buf,
> > > > +                                size_t count)
> > > > +{
> > > > +     struct fanotify_event_info_pidfd info = { };
> > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > +
> > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > +             return -EFAULT;
> > > > +
> > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > +     info.hdr.len = info_len;
> > > > +
> > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > +     if (info.pidfd < 0)
> > > > +             info.pidfd = FAN_NOPIDFD;
> > > > +
> > > > +     if (copy_to_user(buf, &info, info_len))
> > > > +             return -EFAULT;
> > >
> > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > fd table and when the copy_to_user() failed they won't know what fd it
> >
> > Good catch!
>
> Super awesome catch Christian, thanks pulling this up!
>
> > But I prefer to solve it differently, because moving fd_install() to the
> > end of this function does not guarantee that copy_event_to_user()
> > won't return an error one day with dangling pidfd in fd table.
>
> I can see the angle you're approaching this from...
>
> > It might be simpler to do pidfd_create() next to create_fd() in
> > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > pidfd can be closed on error along with fd on out_close_fd label.
> >
> > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > (even though fanotify_init() does check for that).
>
> I didn't really understand the need for this check here given that the
> administrative bits are already being checked for in fanotify_init()
> i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> thus never walking any of the pidfd_mode paths. Is this just a defense
> in depth approach here, or is it something else that I'm missing?
>

We want to be extra careful not to create privilege escalations,
so even if the fanotify fd is leaked or intentionally passed to a less
privileged user, it cannot get an open pidfd.

IOW, it is *much* easier to be defensive in this case than to prove
that the change cannot introduce any privilege escalations.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-20 13:55 ` [PATCH 0/5] Add pidfd " Jan Kara
@ 2021-05-21 10:15   ` Matthew Bobrowski
  2021-05-21 10:40     ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-21 10:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: amir73il, christian.brauner, linux-fsdevel, linux-api

Hey Jan!

On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> On Thu 20-05-21 12:09:45, Matthew Bobrowski wrote:
> > Hey Jan/Amir/Christian,
> > 
> > This is the updated patch series for adding pidfd support to the
> > fanotify API. It incorporates all the suggestions that had come out of
> > the initial RFC patch series [0].
> > 
> > The main difference with this patch series is that FAN_REPORT_PIDFD
> > results in an additional info record object supplied alongside the
> > generic event metadata object instead of overloading metadata->pid. If
> > any of the fid flavoured init flags are specified, then the pidfd info
> > record object will follow any fid info record objects.
> > 
> > [0] https://www.spinics.net/lists/linux-fsdevel/msg193296.html
> 
> Overall the series looks fine to me - modulo the problems Christian & Amir
> found. Do you have any tests for this? Preferably for LTP so that we can
> extend the coverage there?

Cool and thanks for glancing over this series.

I've written some simple programs to verify this functionality works in
FID and non-FID modes. I definitely plan on writing LTP tests,
although it's something I'll do once we've agreed on the approach and
I've received an ACK from yourself, Amir and Christian. This series
passes all current LTP regressions. Also, I guess I'll need to write
some patches for man-pages given this is an ABI change.

There's one thing that I'd like to mention, and it's something in
regards to the overall approach we've taken that I'm not particularly
happy about and I'd like to hear all your thoughts. Basically, with
this approach the pidfd creation is done only once an event has been
queued and the notification worker wakes up and picks up the event
from the queue processes it. There's a subtle latency introduced when
taking such an approach which at times leads to pidfd creation
failures. As in, by the time pidfd_create() is called the struct pid
has already been reaped, which then results in FAN_NOPIDFD being
returned in the pidfd info record.

Having said that, I'm wondering what the thoughts are on doing pidfd
creation earlier on i.e. in the event allocation stages? This way, the
struct pid is pinned earlier on and rather than FAN_NOPIDFD being
returned in the pidfd info record because the struct pid has been
already reaped, userspace application will atleast receive a valid
pidfd which can be used to check whether the process still exists or
not. I think it'll just set the expectation better from an API
perspective.

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21  9:41         ` Amir Goldstein
@ 2021-05-21 10:24           ` Jan Kara
  2021-05-21 11:10             ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-21 10:24 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Matthew Bobrowski, Christian Brauner, Jan Kara, linux-fsdevel, Linux API

On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > Hey Amir/Christian,
> >
> > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > >
> > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > >  {
> > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > >       if (fh_len)
> > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > >
> > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > +
> > > > >       return info_len;
> > > > >  }
> > > > >
> > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > >       return info_len;
> > > > >  }
> > > > >
> > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > +                                char __user *buf,
> > > > > +                                size_t count)
> > > > > +{
> > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > +
> > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > +             return -EFAULT;
> > > > > +
> > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > +     info.hdr.len = info_len;
> > > > > +
> > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > +     if (info.pidfd < 0)
> > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > +
> > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > +             return -EFAULT;
> > > >
> > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > fd table and when the copy_to_user() failed they won't know what fd it
> > >
> > > Good catch!
> >
> > Super awesome catch Christian, thanks pulling this up!
> >
> > > But I prefer to solve it differently, because moving fd_install() to the
> > > end of this function does not guarantee that copy_event_to_user()
> > > won't return an error one day with dangling pidfd in fd table.
> >
> > I can see the angle you're approaching this from...
> >
> > > It might be simpler to do pidfd_create() next to create_fd() in
> > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > pidfd can be closed on error along with fd on out_close_fd label.
> > >
> > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > (even though fanotify_init() does check for that).
> >
> > I didn't really understand the need for this check here given that the
> > administrative bits are already being checked for in fanotify_init()
> > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > thus never walking any of the pidfd_mode paths. Is this just a defense
> > in depth approach here, or is it something else that I'm missing?
> >
> 
> We want to be extra careful not to create privilege escalations,
> so even if the fanotify fd is leaked or intentionally passed to a less
> privileged user, it cannot get an open pidfd.
> 
> IOW, it is *much* easier to be defensive in this case than to prove
> that the change cannot introduce any privilege escalations.

I have no problems with being more defensive (it's certainly better than
being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
process, that process is also free to update all the passwords.
Traditionally permission checks in Unix are performed on open and then who
has fd can do whatever that fd allows... I've tried to follow similar
philosophy with fanotify as well and e.g. open happening as a result of
fanotify path events does not check permissions either.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-21 10:15   ` Matthew Bobrowski
@ 2021-05-21 10:40     ` Jan Kara
  2021-05-21 23:32       ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-21 10:40 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jan Kara, amir73il, christian.brauner, linux-fsdevel, linux-api

On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> Hey Jan!
> 
> On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > On Thu 20-05-21 12:09:45, Matthew Bobrowski wrote:
> > > Hey Jan/Amir/Christian,
> > > 
> > > This is the updated patch series for adding pidfd support to the
> > > fanotify API. It incorporates all the suggestions that had come out of
> > > the initial RFC patch series [0].
> > > 
> > > The main difference with this patch series is that FAN_REPORT_PIDFD
> > > results in an additional info record object supplied alongside the
> > > generic event metadata object instead of overloading metadata->pid. If
> > > any of the fid flavoured init flags are specified, then the pidfd info
> > > record object will follow any fid info record objects.
> > > 
> > > [0] https://www.spinics.net/lists/linux-fsdevel/msg193296.html
> > 
> > Overall the series looks fine to me - modulo the problems Christian & Amir
> > found. Do you have any tests for this? Preferably for LTP so that we can
> > extend the coverage there?
> 
> Cool and thanks for glancing over this series.
> 
> I've written some simple programs to verify this functionality works in
> FID and non-FID modes. I definitely plan on writing LTP tests,
> although it's something I'll do once we've agreed on the approach and
> I've received an ACK from yourself, Amir and Christian. This series
> passes all current LTP regressions. Also, I guess I'll need to write
> some patches for man-pages given this is an ABI change.

Yes, manpage update will be necessary as well.

> There's one thing that I'd like to mention, and it's something in
> regards to the overall approach we've taken that I'm not particularly
> happy about and I'd like to hear all your thoughts. Basically, with
> this approach the pidfd creation is done only once an event has been
> queued and the notification worker wakes up and picks up the event
> from the queue processes it. There's a subtle latency introduced when
> taking such an approach which at times leads to pidfd creation
> failures. As in, by the time pidfd_create() is called the struct pid
> has already been reaped, which then results in FAN_NOPIDFD being
> returned in the pidfd info record.
> 
> Having said that, I'm wondering what the thoughts are on doing pidfd
> creation earlier on i.e. in the event allocation stages? This way, the
> struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> returned in the pidfd info record because the struct pid has been
> already reaped, userspace application will atleast receive a valid
> pidfd which can be used to check whether the process still exists or
> not. I think it'll just set the expectation better from an API
> perspective.

Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
be sure the original process doesn't exist anymore. So is it useful to
still receive pidfd of the dead process? Also opening pidfd in the context
of event generation is problematic for two reasons:

1) Technically, the context under which events are generated can be rather
constrained (various locks held etc.). Adding relatively complex operations
such as pidfd creation is going to introduce strange lock dependencies,
possibly deadlocks.

2) Doing pidfd generation in the context of the process generating event is
problematic - you don't know in which fd_table the fd will live. Also that
process is unfairly penalized (performance wise) because someone else is
listening. We try to keep overhead of event generation as low as possible
for this reason.

								Honza

> 
> /M
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21 10:24           ` Jan Kara
@ 2021-05-21 11:10             ` Amir Goldstein
  2021-05-21 13:19               ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-05-21 11:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, Christian Brauner, linux-fsdevel, Linux API

On Fri, May 21, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> > On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> > >
> > > Hey Amir/Christian,
> > >
> > > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > > <christian.brauner@ubuntu.com> wrote:
> > > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > > >
> > > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > > >  {
> > > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > > >       if (fh_len)
> > > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > > >
> > > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > +
> > > > > >       return info_len;
> > > > > >  }
> > > > > >
> > > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > > >       return info_len;
> > > > > >  }
> > > > > >
> > > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > > +                                char __user *buf,
> > > > > > +                                size_t count)
> > > > > > +{
> > > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > +
> > > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > > +             return -EFAULT;
> > > > > > +
> > > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > > +     info.hdr.len = info_len;
> > > > > > +
> > > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > > +     if (info.pidfd < 0)
> > > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > > +
> > > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > > +             return -EFAULT;
> > > > >
> > > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > > fd table and when the copy_to_user() failed they won't know what fd it
> > > >
> > > > Good catch!
> > >
> > > Super awesome catch Christian, thanks pulling this up!
> > >
> > > > But I prefer to solve it differently, because moving fd_install() to the
> > > > end of this function does not guarantee that copy_event_to_user()
> > > > won't return an error one day with dangling pidfd in fd table.
> > >
> > > I can see the angle you're approaching this from...
> > >
> > > > It might be simpler to do pidfd_create() next to create_fd() in
> > > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > > pidfd can be closed on error along with fd on out_close_fd label.
> > > >
> > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > (even though fanotify_init() does check for that).
> > >
> > > I didn't really understand the need for this check here given that the
> > > administrative bits are already being checked for in fanotify_init()
> > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > in depth approach here, or is it something else that I'm missing?
> > >
> >
> > We want to be extra careful not to create privilege escalations,
> > so even if the fanotify fd is leaked or intentionally passed to a less
> > privileged user, it cannot get an open pidfd.
> >
> > IOW, it is *much* easier to be defensive in this case than to prove
> > that the change cannot introduce any privilege escalations.
>
> I have no problems with being more defensive (it's certainly better than
> being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> process, that process is also free to update all the passwords.
> Traditionally permission checks in Unix are performed on open and then who
> has fd can do whatever that fd allows... I've tried to follow similar
> philosophy with fanotify as well and e.g. open happening as a result of
> fanotify path events does not check permissions either.
>

Agreed.

However, because we had this issue with no explicit FAN_REPORT_PID
we added the CAP_SYS_ADMIN check for reporting event->pid as next
best thing. So now that becomes weird if priv process created fanotify fd
and passes it to unpriv process, then unpriv process gets events with
pidfd but without event->pid.

We can change the code to:

        if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
            task_tgid(current) != event->pid)
                metadata.pid = 0;

So the case I decscribed above ends up reporting both pidfd
and event->pid to unpriv user, but that is a bit inconsistent...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21 11:10             ` Amir Goldstein
@ 2021-05-21 13:19               ` Jan Kara
  2021-05-21 13:52                 ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-21 13:19 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Matthew Bobrowski, Christian Brauner, linux-fsdevel, Linux API

On Fri 21-05-21 14:10:32, Amir Goldstein wrote:
> On Fri, May 21, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> > > On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> > > >
> > > > Hey Amir/Christian,
> > > >
> > > > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > > > <christian.brauner@ubuntu.com> wrote:
> > > > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > > > >
> > > > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > > > >  {
> > > > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > > > >       if (fh_len)
> > > > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > > > >
> > > > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > +
> > > > > > >       return info_len;
> > > > > > >  }
> > > > > > >
> > > > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > > > >       return info_len;
> > > > > > >  }
> > > > > > >
> > > > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > > > +                                char __user *buf,
> > > > > > > +                                size_t count)
> > > > > > > +{
> > > > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > +
> > > > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > > > +             return -EFAULT;
> > > > > > > +
> > > > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > > > +     info.hdr.len = info_len;
> > > > > > > +
> > > > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > > > +     if (info.pidfd < 0)
> > > > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > > > +
> > > > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > > > +             return -EFAULT;
> > > > > >
> > > > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > > > fd table and when the copy_to_user() failed they won't know what fd it
> > > > >
> > > > > Good catch!
> > > >
> > > > Super awesome catch Christian, thanks pulling this up!
> > > >
> > > > > But I prefer to solve it differently, because moving fd_install() to the
> > > > > end of this function does not guarantee that copy_event_to_user()
> > > > > won't return an error one day with dangling pidfd in fd table.
> > > >
> > > > I can see the angle you're approaching this from...
> > > >
> > > > > It might be simpler to do pidfd_create() next to create_fd() in
> > > > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > > > pidfd can be closed on error along with fd on out_close_fd label.
> > > > >
> > > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > > (even though fanotify_init() does check for that).
> > > >
> > > > I didn't really understand the need for this check here given that the
> > > > administrative bits are already being checked for in fanotify_init()
> > > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > > in depth approach here, or is it something else that I'm missing?
> > > >
> > >
> > > We want to be extra careful not to create privilege escalations,
> > > so even if the fanotify fd is leaked or intentionally passed to a less
> > > privileged user, it cannot get an open pidfd.
> > >
> > > IOW, it is *much* easier to be defensive in this case than to prove
> > > that the change cannot introduce any privilege escalations.
> >
> > I have no problems with being more defensive (it's certainly better than
> > being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> > task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> > process, that process is also free to update all the passwords.
> > Traditionally permission checks in Unix are performed on open and then who
> > has fd can do whatever that fd allows... I've tried to follow similar
> > philosophy with fanotify as well and e.g. open happening as a result of
> > fanotify path events does not check permissions either.
> >
> 
> Agreed.
> 
> However, because we had this issue with no explicit FAN_REPORT_PID
> we added the CAP_SYS_ADMIN check for reporting event->pid as next
> best thing. So now that becomes weird if priv process created fanotify fd
> and passes it to unpriv process, then unpriv process gets events with
> pidfd but without event->pid.
> 
> We can change the code to:
> 
>         if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
>             task_tgid(current) != event->pid)
>                 metadata.pid = 0;
> 
> So the case I decscribed above ends up reporting both pidfd
> and event->pid to unpriv user, but that is a bit inconsistent...

Oh, now I see where you are coming from :) Thanks for explanation. And
remind me please, cannot we just have internal FAN_REPORT_PID flag that
gets set on notification group when priviledged process creates it and then
test for that instead of CAP_SYS_ADMIN in copy_event_to_user()? It is
mostly equivalent but I guess more in the spirit of how fanotify
traditionally does things. Also FAN_REPORT_PIDFD could then behave in the
same way...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21 13:19               ` Jan Kara
@ 2021-05-21 13:52                 ` Amir Goldstein
  2021-05-21 15:14                   ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-05-21 13:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, Christian Brauner, linux-fsdevel, Linux API

On Fri, May 21, 2021 at 4:19 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 21-05-21 14:10:32, Amir Goldstein wrote:
> > On Fri, May 21, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> > > > On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> > > > >
> > > > > Hey Amir/Christian,
> > > > >
> > > > > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > > > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > > > > <christian.brauner@ubuntu.com> wrote:
> > > > > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > > > > >
> > > > > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > > > > >  {
> > > > > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > > > > >       if (fh_len)
> > > > > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > > > > >
> > > > > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > +
> > > > > > > >       return info_len;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > > > > >       return info_len;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > > > > +                                char __user *buf,
> > > > > > > > +                                size_t count)
> > > > > > > > +{
> > > > > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > +
> > > > > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > > > > +             return -EFAULT;
> > > > > > > > +
> > > > > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > > > > +     info.hdr.len = info_len;
> > > > > > > > +
> > > > > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > > > > +     if (info.pidfd < 0)
> > > > > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > > > > +
> > > > > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > > > > +             return -EFAULT;
> > > > > > >
> > > > > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > > > > fd table and when the copy_to_user() failed they won't know what fd it
> > > > > >
> > > > > > Good catch!
> > > > >
> > > > > Super awesome catch Christian, thanks pulling this up!
> > > > >
> > > > > > But I prefer to solve it differently, because moving fd_install() to the
> > > > > > end of this function does not guarantee that copy_event_to_user()
> > > > > > won't return an error one day with dangling pidfd in fd table.
> > > > >
> > > > > I can see the angle you're approaching this from...
> > > > >
> > > > > > It might be simpler to do pidfd_create() next to create_fd() in
> > > > > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > > > > pidfd can be closed on error along with fd on out_close_fd label.
> > > > > >
> > > > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > > > (even though fanotify_init() does check for that).
> > > > >
> > > > > I didn't really understand the need for this check here given that the
> > > > > administrative bits are already being checked for in fanotify_init()
> > > > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > > > in depth approach here, or is it something else that I'm missing?
> > > > >
> > > >
> > > > We want to be extra careful not to create privilege escalations,
> > > > so even if the fanotify fd is leaked or intentionally passed to a less
> > > > privileged user, it cannot get an open pidfd.
> > > >
> > > > IOW, it is *much* easier to be defensive in this case than to prove
> > > > that the change cannot introduce any privilege escalations.
> > >
> > > I have no problems with being more defensive (it's certainly better than
> > > being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> > > task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> > > process, that process is also free to update all the passwords.
> > > Traditionally permission checks in Unix are performed on open and then who
> > > has fd can do whatever that fd allows... I've tried to follow similar
> > > philosophy with fanotify as well and e.g. open happening as a result of
> > > fanotify path events does not check permissions either.
> > >
> >
> > Agreed.
> >
> > However, because we had this issue with no explicit FAN_REPORT_PID
> > we added the CAP_SYS_ADMIN check for reporting event->pid as next
> > best thing. So now that becomes weird if priv process created fanotify fd
> > and passes it to unpriv process, then unpriv process gets events with
> > pidfd but without event->pid.
> >
> > We can change the code to:
> >
> >         if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
> >             task_tgid(current) != event->pid)
> >                 metadata.pid = 0;
> >
> > So the case I decscribed above ends up reporting both pidfd
> > and event->pid to unpriv user, but that is a bit inconsistent...
>
> Oh, now I see where you are coming from :) Thanks for explanation. And
> remind me please, cannot we just have internal FAN_REPORT_PID flag that
> gets set on notification group when priviledged process creates it and then
> test for that instead of CAP_SYS_ADMIN in copy_event_to_user()? It is
> mostly equivalent but I guess more in the spirit of how fanotify
> traditionally does things. Also FAN_REPORT_PIDFD could then behave in the
> same way...

Yes, we can. In fact, we should call the internal flag FANOTIFY_UNPRIV
as it described the situation better than FAN_REPORT_PID.
This happens to be how I implemented it in the initial RFC [1].

It's not easy to follow our entire discussion on this thread, but I think
we can resurrect the FANOTIFY_UNPRIV internal flag and use it
in this case instead of CAP_SYS_ADMIN.

Thanks,
Amir.

https://lore.kernel.org/linux-fsdevel/20210124184204.899729-3-amir73il@gmail.com/

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21 13:52                 ` Amir Goldstein
@ 2021-05-21 15:14                   ` Jan Kara
  2021-05-22  0:41                     ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-21 15:14 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Matthew Bobrowski, Christian Brauner, linux-fsdevel, Linux API

On Fri 21-05-21 16:52:08, Amir Goldstein wrote:
> On Fri, May 21, 2021 at 4:19 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 21-05-21 14:10:32, Amir Goldstein wrote:
> > > On Fri, May 21, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> > > > > On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> > > > > >
> > > > > > Hey Amir/Christian,
> > > > > >
> > > > > > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > > > > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > > > > > <christian.brauner@ubuntu.com> wrote:
> > > > > > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > > > > > >
> > > > > > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > > > > > >  {
> > > > > > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > > > > > >       if (fh_len)
> > > > > > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > > > > > >
> > > > > > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > > +
> > > > > > > > >       return info_len;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > > > > > >       return info_len;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > > > > > +                                char __user *buf,
> > > > > > > > > +                                size_t count)
> > > > > > > > > +{
> > > > > > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > > +
> > > > > > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > > > > > +             return -EFAULT;
> > > > > > > > > +
> > > > > > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > > > > > +     info.hdr.len = info_len;
> > > > > > > > > +
> > > > > > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > > > > > +     if (info.pidfd < 0)
> > > > > > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > > > > > +
> > > > > > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > > > > > +             return -EFAULT;
> > > > > > > >
> > > > > > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > > > > > fd table and when the copy_to_user() failed they won't know what fd it
> > > > > > >
> > > > > > > Good catch!
> > > > > >
> > > > > > Super awesome catch Christian, thanks pulling this up!
> > > > > >
> > > > > > > But I prefer to solve it differently, because moving fd_install() to the
> > > > > > > end of this function does not guarantee that copy_event_to_user()
> > > > > > > won't return an error one day with dangling pidfd in fd table.
> > > > > >
> > > > > > I can see the angle you're approaching this from...
> > > > > >
> > > > > > > It might be simpler to do pidfd_create() next to create_fd() in
> > > > > > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > > > > > pidfd can be closed on error along with fd on out_close_fd label.
> > > > > > >
> > > > > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > > > > (even though fanotify_init() does check for that).
> > > > > >
> > > > > > I didn't really understand the need for this check here given that the
> > > > > > administrative bits are already being checked for in fanotify_init()
> > > > > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > > > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > > > > in depth approach here, or is it something else that I'm missing?
> > > > > >
> > > > >
> > > > > We want to be extra careful not to create privilege escalations,
> > > > > so even if the fanotify fd is leaked or intentionally passed to a less
> > > > > privileged user, it cannot get an open pidfd.
> > > > >
> > > > > IOW, it is *much* easier to be defensive in this case than to prove
> > > > > that the change cannot introduce any privilege escalations.
> > > >
> > > > I have no problems with being more defensive (it's certainly better than
> > > > being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> > > > task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> > > > process, that process is also free to update all the passwords.
> > > > Traditionally permission checks in Unix are performed on open and then who
> > > > has fd can do whatever that fd allows... I've tried to follow similar
> > > > philosophy with fanotify as well and e.g. open happening as a result of
> > > > fanotify path events does not check permissions either.
> > > >
> > >
> > > Agreed.
> > >
> > > However, because we had this issue with no explicit FAN_REPORT_PID
> > > we added the CAP_SYS_ADMIN check for reporting event->pid as next
> > > best thing. So now that becomes weird if priv process created fanotify fd
> > > and passes it to unpriv process, then unpriv process gets events with
> > > pidfd but without event->pid.
> > >
> > > We can change the code to:
> > >
> > >         if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
> > >             task_tgid(current) != event->pid)
> > >                 metadata.pid = 0;
> > >
> > > So the case I decscribed above ends up reporting both pidfd
> > > and event->pid to unpriv user, but that is a bit inconsistent...
> >
> > Oh, now I see where you are coming from :) Thanks for explanation. And
> > remind me please, cannot we just have internal FAN_REPORT_PID flag that
> > gets set on notification group when priviledged process creates it and then
> > test for that instead of CAP_SYS_ADMIN in copy_event_to_user()? It is
> > mostly equivalent but I guess more in the spirit of how fanotify
> > traditionally does things. Also FAN_REPORT_PIDFD could then behave in the
> > same way...
> 
> Yes, we can. In fact, we should call the internal flag FANOTIFY_UNPRIV
> as it described the situation better than FAN_REPORT_PID.
> This happens to be how I implemented it in the initial RFC [1].
> 
> It's not easy to follow our entire discussion on this thread, but I think
> we can resurrect the FANOTIFY_UNPRIV internal flag and use it
> in this case instead of CAP_SYS_ADMIN.

I think at that time we were discussing how to handle opening of fds and
we decided to not depend on FANOTIFY_UNPRIV and then I didn't see a value
of that flag because I forgot about pids... Anyway now I agree to go for
that flag. :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-21 10:40     ` Jan Kara
@ 2021-05-21 23:32       ` Matthew Bobrowski
  2021-05-24  8:47         ` Jan Kara
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-21 23:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: amir73il, christian.brauner, linux-fsdevel, linux-api

On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > There's one thing that I'd like to mention, and it's something in
> > regards to the overall approach we've taken that I'm not particularly
> > happy about and I'd like to hear all your thoughts. Basically, with
> > this approach the pidfd creation is done only once an event has been
> > queued and the notification worker wakes up and picks up the event
> > from the queue processes it. There's a subtle latency introduced when
> > taking such an approach which at times leads to pidfd creation
> > failures. As in, by the time pidfd_create() is called the struct pid
> > has already been reaped, which then results in FAN_NOPIDFD being
> > returned in the pidfd info record.
> > 
> > Having said that, I'm wondering what the thoughts are on doing pidfd
> > creation earlier on i.e. in the event allocation stages? This way, the
> > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > returned in the pidfd info record because the struct pid has been
> > already reaped, userspace application will atleast receive a valid
> > pidfd which can be used to check whether the process still exists or
> > not. I think it'll just set the expectation better from an API
> > perspective.
> 
> Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> be sure the original process doesn't exist anymore. So is it useful to
> still receive pidfd of the dead process?

Well, you're absolutely right. However, FWIW I was approaching this
from two different angles:

1) I wanted to keep the pattern in which the listener checks for the
   existence/recycling of the process consistent. As in, the listener
   would receive the pidfd, then send the pidfd a signal via
   pidfd_send_signal() and check for -ESRCH which clearly indicates
   that the target process has terminated.

2) I didn't want to mask failed pidfd creation because of early
   process termination and other possible failures behind a single
   FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
   listener can take clear corrective branches as what's to be done
   next if a race is to have been detected, whereas simply returning
   FAN_NOPIDFD at this stage can mean multiple things.

Now that I've written the above and keeping in mind that we'd like to
refrain from doing anything in the event allocation stages, perhaps we
could introduce a different error code for detecting early process
termination while attempting to construct the info record. WDYT?

> Also opening pidfd in the context of event generation is problematic
> for two reasons:
>
> 1) Technically, the context under which events are generated can be rather
> constrained (various locks held etc.). Adding relatively complex operations
> such as pidfd creation is going to introduce strange lock dependencies,
> possibly deadlocks.
> 
> 2) Doing pidfd generation in the context of the process generating event is
> problematic - you don't know in which fd_table the fd will live. Also that
> process is unfairly penalized (performance wise) because someone else is
> listening. We try to keep overhead of event generation as low as possible
> for this reason.

Fair points, thanks for sharing your view. :)

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-21 15:14                   ` Jan Kara
@ 2021-05-22  0:41                     ` Matthew Bobrowski
  2021-05-22  9:01                       ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-22  0:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Amir Goldstein, Christian Brauner, linux-fsdevel, Linux API

On Fri, May 21, 2021 at 05:14:15PM +0200, Jan Kara wrote:
> On Fri 21-05-21 16:52:08, Amir Goldstein wrote:
> > On Fri, May 21, 2021 at 4:19 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 21-05-21 14:10:32, Amir Goldstein wrote:
> > > > On Fri, May 21, 2021 at 1:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 21-05-21 12:41:51, Amir Goldstein wrote:
> > > > > > On Fri, May 21, 2021 at 12:22 PM Matthew Bobrowski <repnop@google.com> wrote:
> > > > > > >
> > > > > > > Hey Amir/Christian,
> > > > > > >
> > > > > > > On Thu, May 20, 2021 at 04:43:48PM +0300, Amir Goldstein wrote:
> > > > > > > > On Thu, May 20, 2021 at 11:17 AM Christian Brauner
> > > > > > > > <christian.brauner@ubuntu.com> wrote:
> > > > > > > > > > +#define FANOTIFY_PIDFD_INFO_HDR_LEN \
> > > > > > > > > > +     sizeof(struct fanotify_event_info_pidfd)
> > > > > > > > > >
> > > > > > > > > >  static int fanotify_fid_info_len(int fh_len, int name_len)
> > > > > > > > > >  {
> > > > > > > > > > @@ -141,6 +143,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
> > > > > > > > > >       if (fh_len)
> > > > > > > > > >               info_len += fanotify_fid_info_len(fh_len, dot_len);
> > > > > > > > > >
> > > > > > > > > > +     if (info_mode & FAN_REPORT_PIDFD)
> > > > > > > > > > +             info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > > > +
> > > > > > > > > >       return info_len;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > @@ -401,6 +406,29 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid,
> > > > > > > > > >       return info_len;
> > > > > > > > > >  }
> > > > > > > > > >
> > > > > > > > > > +static int copy_pidfd_info_to_user(struct pid *pid,
> > > > > > > > > > +                                char __user *buf,
> > > > > > > > > > +                                size_t count)
> > > > > > > > > > +{
> > > > > > > > > > +     struct fanotify_event_info_pidfd info = { };
> > > > > > > > > > +     size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
> > > > > > > > > > +
> > > > > > > > > > +     if (WARN_ON_ONCE(info_len > count))
> > > > > > > > > > +             return -EFAULT;
> > > > > > > > > > +
> > > > > > > > > > +     info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
> > > > > > > > > > +     info.hdr.len = info_len;
> > > > > > > > > > +
> > > > > > > > > > +     info.pidfd = pidfd_create(pid, 0);
> > > > > > > > > > +     if (info.pidfd < 0)
> > > > > > > > > > +             info.pidfd = FAN_NOPIDFD;
> > > > > > > > > > +
> > > > > > > > > > +     if (copy_to_user(buf, &info, info_len))
> > > > > > > > > > +             return -EFAULT;
> > > > > > > > >
> > > > > > > > > Hm, well this kinda sucks. The caller can end up with a pidfd in their
> > > > > > > > > fd table and when the copy_to_user() failed they won't know what fd it
> > > > > > > >
> > > > > > > > Good catch!
> > > > > > >
> > > > > > > Super awesome catch Christian, thanks pulling this up!
> > > > > > >
> > > > > > > > But I prefer to solve it differently, because moving fd_install() to the
> > > > > > > > end of this function does not guarantee that copy_event_to_user()
> > > > > > > > won't return an error one day with dangling pidfd in fd table.
> > > > > > >
> > > > > > > I can see the angle you're approaching this from...
> > > > > > >
> > > > > > > > It might be simpler to do pidfd_create() next to create_fd() in
> > > > > > > > copy_event_to_user() and pass pidfd to copy_pidfd_info_to_user().
> > > > > > > > pidfd can be closed on error along with fd on out_close_fd label.
> > > > > > > >
> > > > > > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > > > > > (even though fanotify_init() does check for that).
> > > > > > >
> > > > > > > I didn't really understand the need for this check here given that the
> > > > > > > administrative bits are already being checked for in fanotify_init()
> > > > > > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > > > > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > > > > > in depth approach here, or is it something else that I'm missing?
> > > > > > >
> > > > > >
> > > > > > We want to be extra careful not to create privilege escalations,
> > > > > > so even if the fanotify fd is leaked or intentionally passed to a less
> > > > > > privileged user, it cannot get an open pidfd.
> > > > > >
> > > > > > IOW, it is *much* easier to be defensive in this case than to prove
> > > > > > that the change cannot introduce any privilege escalations.
> > > > >
> > > > > I have no problems with being more defensive (it's certainly better than
> > > > > being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> > > > > task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> > > > > process, that process is also free to update all the passwords.
> > > > > Traditionally permission checks in Unix are performed on open and then who
> > > > > has fd can do whatever that fd allows... I've tried to follow similar
> > > > > philosophy with fanotify as well and e.g. open happening as a result of
> > > > > fanotify path events does not check permissions either.
> > > > >
> > > >
> > > > Agreed.
> > > >
> > > > However, because we had this issue with no explicit FAN_REPORT_PID
> > > > we added the CAP_SYS_ADMIN check for reporting event->pid as next
> > > > best thing. So now that becomes weird if priv process created fanotify fd
> > > > and passes it to unpriv process, then unpriv process gets events with
> > > > pidfd but without event->pid.
> > > >
> > > > We can change the code to:
> > > >
> > > >         if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
> > > >             task_tgid(current) != event->pid)
> > > >                 metadata.pid = 0;
> > > >
> > > > So the case I decscribed above ends up reporting both pidfd
> > > > and event->pid to unpriv user, but that is a bit inconsistent...
> > >
> > > Oh, now I see where you are coming from :) Thanks for explanation. And
> > > remind me please, cannot we just have internal FAN_REPORT_PID flag that
> > > gets set on notification group when priviledged process creates it and then
> > > test for that instead of CAP_SYS_ADMIN in copy_event_to_user()? It is
> > > mostly equivalent but I guess more in the spirit of how fanotify
> > > traditionally does things. Also FAN_REPORT_PIDFD could then behave in the
> > > same way...
> > 
> > Yes, we can. In fact, we should call the internal flag FANOTIFY_UNPRIV
> > as it described the situation better than FAN_REPORT_PID.
> > This happens to be how I implemented it in the initial RFC [1].
> > 
> > It's not easy to follow our entire discussion on this thread, but I think
> > we can resurrect the FANOTIFY_UNPRIV internal flag and use it
> > in this case instead of CAP_SYS_ADMIN.
> 
> I think at that time we were discussing how to handle opening of fds and
> we decided to not depend on FANOTIFY_UNPRIV and then I didn't see a value
> of that flag because I forgot about pids... Anyway now I agree to go for
> that flag. :)

Resurrection of this flag SGTM! However, it also sounds like we need
to land that series before this PIDFD series or simply incorporate the
UNPRIV flag into this one.

Will chat with Amir to get this done.

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API
  2021-05-22  0:41                     ` Matthew Bobrowski
@ 2021-05-22  9:01                       ` Amir Goldstein
  0 siblings, 0 replies; 38+ messages in thread
From: Amir Goldstein @ 2021-05-22  9:01 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

> > > > > > > > > You also forgot to add CAP_SYS_ADMIN check before pidfd_create()
> > > > > > > > > (even though fanotify_init() does check for that).
> > > > > > > >
> > > > > > > > I didn't really understand the need for this check here given that the
> > > > > > > > administrative bits are already being checked for in fanotify_init()
> > > > > > > > i.e. FAN_REPORT_PIDFD can never be set for an unprivileged listener;
> > > > > > > > thus never walking any of the pidfd_mode paths. Is this just a defense
> > > > > > > > in depth approach here, or is it something else that I'm missing?
> > > > > > > >
> > > > > > >
> > > > > > > We want to be extra careful not to create privilege escalations,
> > > > > > > so even if the fanotify fd is leaked or intentionally passed to a less
> > > > > > > privileged user, it cannot get an open pidfd.
> > > > > > >
> > > > > > > IOW, it is *much* easier to be defensive in this case than to prove
> > > > > > > that the change cannot introduce any privilege escalations.
> > > > > >
> > > > > > I have no problems with being more defensive (it's certainly better than
> > > > > > being too lax) but does it really make sence here? I mean if CAP_SYS_ADMIN
> > > > > > task opens O_RDWR /etc/passwd and then passes this fd to unpriviledged
> > > > > > process, that process is also free to update all the passwords.
> > > > > > Traditionally permission checks in Unix are performed on open and then who
> > > > > > has fd can do whatever that fd allows... I've tried to follow similar
> > > > > > philosophy with fanotify as well and e.g. open happening as a result of
> > > > > > fanotify path events does not check permissions either.
> > > > > >
> > > > >
> > > > > Agreed.
> > > > >
> > > > > However, because we had this issue with no explicit FAN_REPORT_PID
> > > > > we added the CAP_SYS_ADMIN check for reporting event->pid as next
> > > > > best thing. So now that becomes weird if priv process created fanotify fd
> > > > > and passes it to unpriv process, then unpriv process gets events with
> > > > > pidfd but without event->pid.
> > > > >
> > > > > We can change the code to:
> > > > >
> > > > >         if (!capable(CAP_SYS_ADMIN) && !pidfd_mode &&
> > > > >             task_tgid(current) != event->pid)
> > > > >                 metadata.pid = 0;
> > > > >
> > > > > So the case I decscribed above ends up reporting both pidfd
> > > > > and event->pid to unpriv user, but that is a bit inconsistent...
> > > >
> > > > Oh, now I see where you are coming from :) Thanks for explanation. And
> > > > remind me please, cannot we just have internal FAN_REPORT_PID flag that
> > > > gets set on notification group when priviledged process creates it and then
> > > > test for that instead of CAP_SYS_ADMIN in copy_event_to_user()? It is
> > > > mostly equivalent but I guess more in the spirit of how fanotify
> > > > traditionally does things. Also FAN_REPORT_PIDFD could then behave in the
> > > > same way...
> > >
> > > Yes, we can. In fact, we should call the internal flag FANOTIFY_UNPRIV
> > > as it described the situation better than FAN_REPORT_PID.
> > > This happens to be how I implemented it in the initial RFC [1].
> > >
> > > It's not easy to follow our entire discussion on this thread, but I think
> > > we can resurrect the FANOTIFY_UNPRIV internal flag and use it
> > > in this case instead of CAP_SYS_ADMIN.
> >
> > I think at that time we were discussing how to handle opening of fds and
> > we decided to not depend on FANOTIFY_UNPRIV and then I didn't see a value
> > of that flag because I forgot about pids... Anyway now I agree to go for
> > that flag. :)
>
> Resurrection of this flag SGTM! However, it also sounds like we need
> to land that series before this PIDFD series or simply incorporate the
> UNPRIV flag into this one.
>
> Will chat with Amir to get this done.

Let me post this patch as a fix patch to unprivileged group.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-21 23:32       ` Matthew Bobrowski
@ 2021-05-24  8:47         ` Jan Kara
  2021-05-25 10:31           ` Christian Brauner
  0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2021-05-24  8:47 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jan Kara, amir73il, christian.brauner, linux-fsdevel, linux-api

On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > There's one thing that I'd like to mention, and it's something in
> > > regards to the overall approach we've taken that I'm not particularly
> > > happy about and I'd like to hear all your thoughts. Basically, with
> > > this approach the pidfd creation is done only once an event has been
> > > queued and the notification worker wakes up and picks up the event
> > > from the queue processes it. There's a subtle latency introduced when
> > > taking such an approach which at times leads to pidfd creation
> > > failures. As in, by the time pidfd_create() is called the struct pid
> > > has already been reaped, which then results in FAN_NOPIDFD being
> > > returned in the pidfd info record.
> > > 
> > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > creation earlier on i.e. in the event allocation stages? This way, the
> > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > returned in the pidfd info record because the struct pid has been
> > > already reaped, userspace application will atleast receive a valid
> > > pidfd which can be used to check whether the process still exists or
> > > not. I think it'll just set the expectation better from an API
> > > perspective.
> > 
> > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > be sure the original process doesn't exist anymore. So is it useful to
> > still receive pidfd of the dead process?
> 
> Well, you're absolutely right. However, FWIW I was approaching this
> from two different angles:
> 
> 1) I wanted to keep the pattern in which the listener checks for the
>    existence/recycling of the process consistent. As in, the listener
>    would receive the pidfd, then send the pidfd a signal via
>    pidfd_send_signal() and check for -ESRCH which clearly indicates
>    that the target process has terminated.
> 
> 2) I didn't want to mask failed pidfd creation because of early
>    process termination and other possible failures behind a single
>    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
>    listener can take clear corrective branches as what's to be done
>    next if a race is to have been detected, whereas simply returning
>    FAN_NOPIDFD at this stage can mean multiple things.
> 
> Now that I've written the above and keeping in mind that we'd like to
> refrain from doing anything in the event allocation stages, perhaps we
> could introduce a different error code for detecting early process
> termination while attempting to construct the info record. WDYT?

Sure, I wouldn't like to overengineer it but having one special fd value for
"process doesn't exist anymore" and another for general "creating pidfd
failed" looks OK to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-24  8:47         ` Jan Kara
@ 2021-05-25 10:31           ` Christian Brauner
  2021-05-25 23:20             ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2021-05-25 10:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: Matthew Bobrowski, amir73il, linux-fsdevel, linux-api

On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > There's one thing that I'd like to mention, and it's something in
> > > > regards to the overall approach we've taken that I'm not particularly
> > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > this approach the pidfd creation is done only once an event has been
> > > > queued and the notification worker wakes up and picks up the event
> > > > from the queue processes it. There's a subtle latency introduced when
> > > > taking such an approach which at times leads to pidfd creation
> > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > returned in the pidfd info record.
> > > > 
> > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > returned in the pidfd info record because the struct pid has been
> > > > already reaped, userspace application will atleast receive a valid
> > > > pidfd which can be used to check whether the process still exists or
> > > > not. I think it'll just set the expectation better from an API
> > > > perspective.
> > > 
> > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > be sure the original process doesn't exist anymore. So is it useful to
> > > still receive pidfd of the dead process?
> > 
> > Well, you're absolutely right. However, FWIW I was approaching this
> > from two different angles:
> > 
> > 1) I wanted to keep the pattern in which the listener checks for the
> >    existence/recycling of the process consistent. As in, the listener
> >    would receive the pidfd, then send the pidfd a signal via
> >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> >    that the target process has terminated.
> > 
> > 2) I didn't want to mask failed pidfd creation because of early
> >    process termination and other possible failures behind a single
> >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> >    listener can take clear corrective branches as what's to be done
> >    next if a race is to have been detected, whereas simply returning
> >    FAN_NOPIDFD at this stage can mean multiple things.
> > 
> > Now that I've written the above and keeping in mind that we'd like to
> > refrain from doing anything in the event allocation stages, perhaps we
> > could introduce a different error code for detecting early process
> > termination while attempting to construct the info record. WDYT?
> 
> Sure, I wouldn't like to overengineer it but having one special fd value for
> "process doesn't exist anymore" and another for general "creating pidfd
> failed" looks OK to me.

FAN_EPIDFD -> "creation failed"
FAN_NOPIDFD -> "no such process"

?

Christian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-25 10:31           ` Christian Brauner
@ 2021-05-25 23:20             ` Matthew Bobrowski
  2021-05-26 18:05               ` Christian Brauner
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-25 23:20 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > There's one thing that I'd like to mention, and it's something in
> > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > this approach the pidfd creation is done only once an event has been
> > > > > queued and the notification worker wakes up and picks up the event
> > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > taking such an approach which at times leads to pidfd creation
> > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > returned in the pidfd info record.
> > > > > 
> > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > returned in the pidfd info record because the struct pid has been
> > > > > already reaped, userspace application will atleast receive a valid
> > > > > pidfd which can be used to check whether the process still exists or
> > > > > not. I think it'll just set the expectation better from an API
> > > > > perspective.
> > > > 
> > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > still receive pidfd of the dead process?
> > > 
> > > Well, you're absolutely right. However, FWIW I was approaching this
> > > from two different angles:
> > > 
> > > 1) I wanted to keep the pattern in which the listener checks for the
> > >    existence/recycling of the process consistent. As in, the listener
> > >    would receive the pidfd, then send the pidfd a signal via
> > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > >    that the target process has terminated.
> > > 
> > > 2) I didn't want to mask failed pidfd creation because of early
> > >    process termination and other possible failures behind a single
> > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > >    listener can take clear corrective branches as what's to be done
> > >    next if a race is to have been detected, whereas simply returning
> > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > 
> > > Now that I've written the above and keeping in mind that we'd like to
> > > refrain from doing anything in the event allocation stages, perhaps we
> > > could introduce a different error code for detecting early process
> > > termination while attempting to construct the info record. WDYT?
> > 
> > Sure, I wouldn't like to overengineer it but having one special fd value for
> > "process doesn't exist anymore" and another for general "creating pidfd
> > failed" looks OK to me.
> 
> FAN_EPIDFD -> "creation failed"
> FAN_NOPIDFD -> "no such process"

Yes, I was thinking something along the lines of this...

With the approach that I've proposed in this series, the pidfd
creation failure trips up in pidfd_create() at the following
condition:

	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
	   	 return -EINVAL;

Specifically, the following check:
	!pid_has_task(pid, PIDTYPE_TGID)

In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
userspace, AFAIK I'll have to do one of either two things to better
distinguish between why the pidfd creation had failed:

1) Implement an additional check in pidfd_create() that effectively
   checks whether provided pid still holds reference to a struct pid
   that isn't in the process of being cleaned up. If it is being
   cleaned up, then return something like -ESRCH instead of -EINVAL so
   that the caller, in this case fanotify, can check and set
   FAN_NOPIDFD if -ESRCH is returned from pidfd_create(). I definitely
   don't feel as though returning -ESRCH from the !pid_has_task(pid,
   PIDTYPE_TGID) would be appropriate. In saying that, I'm not aware
   of a helper by which would allow us to perform such an in-flight
   check? Perhaps something needs to be introduced here, IDK...

2) Refrain from performing any further changes to pidfd_create()
   i.e. as proposed in option 1), and manually perform the pidfd
   creation from some kind of new fanotify helper, as suggested by you
   here [0]. However, I'm not convinved that I like this approach as
   we may end up slowly drifting away from pidfd creation semantics
   over time.

[0] https://www.spinics.net/lists/linux-fsdevel/msg195556.html 

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-25 23:20             ` Matthew Bobrowski
@ 2021-05-26 18:05               ` Christian Brauner
  2021-05-26 22:54                 ` Matthew Bobrowski
  2021-06-01 11:03                 ` Matthew Bobrowski
  0 siblings, 2 replies; 38+ messages in thread
From: Christian Brauner @ 2021-05-26 18:05 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > returned in the pidfd info record.
> > > > > > 
> > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > not. I think it'll just set the expectation better from an API
> > > > > > perspective.
> > > > > 
> > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > still receive pidfd of the dead process?
> > > > 
> > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > from two different angles:
> > > > 
> > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > >    existence/recycling of the process consistent. As in, the listener
> > > >    would receive the pidfd, then send the pidfd a signal via
> > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > >    that the target process has terminated.
> > > > 
> > > > 2) I didn't want to mask failed pidfd creation because of early
> > > >    process termination and other possible failures behind a single
> > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > >    listener can take clear corrective branches as what's to be done
> > > >    next if a race is to have been detected, whereas simply returning
> > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > 
> > > > Now that I've written the above and keeping in mind that we'd like to
> > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > could introduce a different error code for detecting early process
> > > > termination while attempting to construct the info record. WDYT?
> > > 
> > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > "process doesn't exist anymore" and another for general "creating pidfd
> > > failed" looks OK to me.
> > 
> > FAN_EPIDFD -> "creation failed"
> > FAN_NOPIDFD -> "no such process"
> 
> Yes, I was thinking something along the lines of this...
> 
> With the approach that I've proposed in this series, the pidfd
> creation failure trips up in pidfd_create() at the following
> condition:
> 
> 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> 	   	 return -EINVAL;
> 
> Specifically, the following check:
> 	!pid_has_task(pid, PIDTYPE_TGID)
> 
> In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> userspace, AFAIK I'll have to do one of either two things to better
> distinguish between why the pidfd creation had failed:

Ok, I see. You already do have a reference to a struct pid and in that
case we should just always return a pidfd to the caller. For
pidfd_open() for example we only report an error when
find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
case here you already have a struct pid so I think we should just keep
this simple and always return a pidfd to the caller and in fact do
burden them with figuring out that the process is gone via
pidfd_send_signal() instead of complicating our lives here.

(I think if would be interesting to see perf numbers and how high you
need to bump the number of open fds sysctl limit to make this useable on
systems with a lot of events. I'd be interested in that just in case you
have something there.)

> 
> 1) Implement an additional check in pidfd_create() that effectively
>    checks whether provided pid still holds reference to a struct pid
>    that isn't in the process of being cleaned up. If it is being
>    cleaned up, then return something like -ESRCH instead of -EINVAL so
>    that the caller, in this case fanotify, can check and set
>    FAN_NOPIDFD if -ESRCH is returned from pidfd_create(). I definitely
>    don't feel as though returning -ESRCH from the !pid_has_task(pid,
>    PIDTYPE_TGID) would be appropriate. In saying that, I'm not aware
>    of a helper by which would allow us to perform such an in-flight
>    check? Perhaps something needs to be introduced here, IDK...
> 
> 2) Refrain from performing any further changes to pidfd_create()
>    i.e. as proposed in option 1), and manually perform the pidfd
>    creation from some kind of new fanotify helper, as suggested by you
>    here [0]. However, I'm not convinved that I like this approach as
>    we may end up slowly drifting away from pidfd creation semantics
>    over time.
> 
> [0] https://www.spinics.net/lists/linux-fsdevel/msg195556.html 
> 
> /M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-26 18:05               ` Christian Brauner
@ 2021-05-26 22:54                 ` Matthew Bobrowski
  2021-06-01 11:03                 ` Matthew Bobrowski
  1 sibling, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-05-26 22:54 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > returned in the pidfd info record.
> > > > > > > 
> > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > perspective.
> > > > > > 
> > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > still receive pidfd of the dead process?
> > > > > 
> > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > from two different angles:
> > > > > 
> > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > >    existence/recycling of the process consistent. As in, the listener
> > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > >    that the target process has terminated.
> > > > > 
> > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > >    process termination and other possible failures behind a single
> > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > >    listener can take clear corrective branches as what's to be done
> > > > >    next if a race is to have been detected, whereas simply returning
> > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > 
> > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > could introduce a different error code for detecting early process
> > > > > termination while attempting to construct the info record. WDYT?
> > > > 
> > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > failed" looks OK to me.
> > > 
> > > FAN_EPIDFD -> "creation failed"
> > > FAN_NOPIDFD -> "no such process"
> > 
> > Yes, I was thinking something along the lines of this...
> > 
> > With the approach that I've proposed in this series, the pidfd
> > creation failure trips up in pidfd_create() at the following
> > condition:
> > 
> > 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > 	   	 return -EINVAL;
> > 
> > Specifically, the following check:
> > 	!pid_has_task(pid, PIDTYPE_TGID)
> > 
> > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > userspace, AFAIK I'll have to do one of either two things to better
> > distinguish between why the pidfd creation had failed:
> 
> Ok, I see. You already do have a reference to a struct pid and in that
> case we should just always return a pidfd to the caller. For
> pidfd_open() for example we only report an error when
> find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> case here you already have a struct pid so I think we should just keep
> this simple and always return a pidfd to the caller and in fact do
> burden them with figuring out that the process is gone via
> pidfd_send_signal() instead of complicating our lives here.

Right, that makes sense to me. In that case let's keep it simple and
lift the pid_has_task() check from pidfd_create() and leave it at
that. As you mentioned, this way we'll still end up returning a pidfd
in the event and FAN_NOPIDFD will resemble other pidfd creation
errors.

> (I think if would be interesting to see perf numbers and how high you
> need to bump the number of open fds sysctl limit to make this useable on
> systems with a lot of events. I'd be interested in that just in case you
> have something there.)

Not yet, but perhaps it's something that I'll be able to provide a
little later on. Jan, Amir and I recently have been discussing how to
go about conducting performance related regressions against fanotify,
so perhaps while at it I can think about this case too. I aim to start
on this after I've wrapped up with this series.

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-05-26 18:05               ` Christian Brauner
  2021-05-26 22:54                 ` Matthew Bobrowski
@ 2021-06-01 11:03                 ` Matthew Bobrowski
  2021-06-01 11:46                   ` Christian Brauner
  1 sibling, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-06-01 11:03 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > returned in the pidfd info record.
> > > > > > > 
> > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > perspective.
> > > > > > 
> > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > still receive pidfd of the dead process?
> > > > > 
> > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > from two different angles:
> > > > > 
> > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > >    existence/recycling of the process consistent. As in, the listener
> > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > >    that the target process has terminated.
> > > > > 
> > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > >    process termination and other possible failures behind a single
> > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > >    listener can take clear corrective branches as what's to be done
> > > > >    next if a race is to have been detected, whereas simply returning
> > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > 
> > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > could introduce a different error code for detecting early process
> > > > > termination while attempting to construct the info record. WDYT?
> > > > 
> > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > failed" looks OK to me.
> > > 
> > > FAN_EPIDFD -> "creation failed"
> > > FAN_NOPIDFD -> "no such process"
> > 
> > Yes, I was thinking something along the lines of this...
> > 
> > With the approach that I've proposed in this series, the pidfd
> > creation failure trips up in pidfd_create() at the following
> > condition:
> > 
> > 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > 	   	 return -EINVAL;
> > 
> > Specifically, the following check:
> > 	!pid_has_task(pid, PIDTYPE_TGID)
> > 
> > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > userspace, AFAIK I'll have to do one of either two things to better
> > distinguish between why the pidfd creation had failed:
> 
> Ok, I see. You already do have a reference to a struct pid and in that
> case we should just always return a pidfd to the caller. For
> pidfd_open() for example we only report an error when
> find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> case here you already have a struct pid so I think we should just keep
> this simple and always return a pidfd to the caller and in fact do
> burden them with figuring out that the process is gone via
> pidfd_send_signal() instead of complicating our lives here.

Ah, actually Christian... Before, I go ahead and send through the updated
series. Given what you've mentioned above I'm working with the assumption
that you're OK with dropping the pid_has_task() check from pidfd_create()
[0]. Is that right?

If so, I don't know how I feel about this given that pidfd_create() is now
to be exposed to the rest of the kernel and the pidfd API, as it stands,
doesn't support the creation of pidfds for non-thread-group leaders. I
suppose what I don't want is other kernel subsystems, if any, thinking it's
OK to call pidfd_create() with an arbitrary struct pid and setting the
expectation that a fully functional pidfd will be returned.

The way I see it, I think we've got two options here:

1) Leave the pid_has_task() check within pidfd_create() and perform another
   explicit pid_has_task() check from the fanotify code before calling
   pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
   indicating to userspace that there's no such process when the event was
   created.

2) Scrap using pidfd_create() all together and implement a fanotify
   specific pidfd creation wrapper which would allow for more
   control. Something along the lines of what you've done in kernel/fork.c
   [1]. Not the biggest fan of this idea just yet given the possibility of
   it leading to an API drift over time.

WDYT?

[0] https://www.spinics.net/lists/linux-api/msg48568.html
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n2171  

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-01 11:03                 ` Matthew Bobrowski
@ 2021-06-01 11:46                   ` Christian Brauner
  2021-06-02  6:30                     ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2021-06-01 11:46 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > returned in the pidfd info record.
> > > > > > > > 
> > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > perspective.
> > > > > > > 
> > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > still receive pidfd of the dead process?
> > > > > > 
> > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > from two different angles:
> > > > > > 
> > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > >    that the target process has terminated.
> > > > > > 
> > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > >    process termination and other possible failures behind a single
> > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > >    listener can take clear corrective branches as what's to be done
> > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > 
> > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > could introduce a different error code for detecting early process
> > > > > > termination while attempting to construct the info record. WDYT?
> > > > > 
> > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > failed" looks OK to me.
> > > > 
> > > > FAN_EPIDFD -> "creation failed"
> > > > FAN_NOPIDFD -> "no such process"
> > > 
> > > Yes, I was thinking something along the lines of this...
> > > 
> > > With the approach that I've proposed in this series, the pidfd
> > > creation failure trips up in pidfd_create() at the following
> > > condition:
> > > 
> > > 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > 	   	 return -EINVAL;
> > > 
> > > Specifically, the following check:
> > > 	!pid_has_task(pid, PIDTYPE_TGID)
> > > 
> > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > userspace, AFAIK I'll have to do one of either two things to better
> > > distinguish between why the pidfd creation had failed:
> > 
> > Ok, I see. You already do have a reference to a struct pid and in that
> > case we should just always return a pidfd to the caller. For
> > pidfd_open() for example we only report an error when
> > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > case here you already have a struct pid so I think we should just keep
> > this simple and always return a pidfd to the caller and in fact do
> > burden them with figuring out that the process is gone via
> > pidfd_send_signal() instead of complicating our lives here.
> 
> Ah, actually Christian... Before, I go ahead and send through the updated
> series. Given what you've mentioned above I'm working with the assumption
> that you're OK with dropping the pid_has_task() check from pidfd_create()
> [0]. Is that right?
> 
> If so, I don't know how I feel about this given that pidfd_create() is now
> to be exposed to the rest of the kernel and the pidfd API, as it stands,
> doesn't support the creation of pidfds for non-thread-group leaders. I
> suppose what I don't want is other kernel subsystems, if any, thinking it's
> OK to call pidfd_create() with an arbitrary struct pid and setting the
> expectation that a fully functional pidfd will be returned.
> 
> The way I see it, I think we've got two options here:
> 
> 1) Leave the pid_has_task() check within pidfd_create() and perform another
>    explicit pid_has_task() check from the fanotify code before calling
>    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
>    indicating to userspace that there's no such process when the event was
>    created.
> 
> 2) Scrap using pidfd_create() all together and implement a fanotify
>    specific pidfd creation wrapper which would allow for more
>    control. Something along the lines of what you've done in kernel/fork.c
>    [1]. Not the biggest fan of this idea just yet given the possibility of
>    it leading to an API drift over time.
> 
> WDYT?

Hm, why would you have to drop the pid_has_task() check again?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-01 11:46                   ` Christian Brauner
@ 2021-06-02  6:30                     ` Matthew Bobrowski
  2021-06-02  7:18                       ` Amir Goldstein
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-06-02  6:30 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Jan Kara, amir73il, linux-fsdevel, linux-api

On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > returned in the pidfd info record.
> > > > > > > > > 
> > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > perspective.
> > > > > > > > 
> > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > still receive pidfd of the dead process?
> > > > > > > 
> > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > from two different angles:
> > > > > > > 
> > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > >    that the target process has terminated.
> > > > > > > 
> > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > >    process termination and other possible failures behind a single
> > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > 
> > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > could introduce a different error code for detecting early process
> > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > 
> > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > failed" looks OK to me.
> > > > > 
> > > > > FAN_EPIDFD -> "creation failed"
> > > > > FAN_NOPIDFD -> "no such process"
> > > > 
> > > > Yes, I was thinking something along the lines of this...
> > > > 
> > > > With the approach that I've proposed in this series, the pidfd
> > > > creation failure trips up in pidfd_create() at the following
> > > > condition:
> > > > 
> > > > 	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > 	   	 return -EINVAL;
> > > > 
> > > > Specifically, the following check:
> > > > 	!pid_has_task(pid, PIDTYPE_TGID)
> > > > 
> > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > distinguish between why the pidfd creation had failed:
> > > 
> > > Ok, I see. You already do have a reference to a struct pid and in that
> > > case we should just always return a pidfd to the caller. For
> > > pidfd_open() for example we only report an error when
> > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > case here you already have a struct pid so I think we should just keep
> > > this simple and always return a pidfd to the caller and in fact do
> > > burden them with figuring out that the process is gone via
> > > pidfd_send_signal() instead of complicating our lives here.
> > 
> > Ah, actually Christian... Before, I go ahead and send through the updated
> > series. Given what you've mentioned above I'm working with the assumption
> > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > [0]. Is that right?
> > 
> > If so, I don't know how I feel about this given that pidfd_create() is now
> > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > doesn't support the creation of pidfds for non-thread-group leaders. I
> > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > expectation that a fully functional pidfd will be returned.
> > 
> > The way I see it, I think we've got two options here:
> > 
> > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> >    explicit pid_has_task() check from the fanotify code before calling
> >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> >    indicating to userspace that there's no such process when the event was
> >    created.
> > 
> > 2) Scrap using pidfd_create() all together and implement a fanotify
> >    specific pidfd creation wrapper which would allow for more
> >    control. Something along the lines of what you've done in kernel/fork.c
> >    [1]. Not the biggest fan of this idea just yet given the possibility of
> >    it leading to an API drift over time.
> > 
> > WDYT?
> 
> Hm, why would you have to drop the pid_has_task() check again?

Because of the race that I brielfy decscribed here [0]. The race exists
because we perform the pidfd creation during the notification queue
processing and not in the event allocation stages (for reasons that Jan has
already covered here [1]). So, tl;dr there is the case where the fanotify
calls pidfd_create() and the check for pid_has_task() fails because the
struct pid that we're hanging onto within an event no longer contains a
task of type PIDTYPE_TGID...

[0] https://www.spinics.net/lists/linux-api/msg48630.html
[1] https://www.spinics.net/lists/linux-api/msg48632.html

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02  6:30                     ` Matthew Bobrowski
@ 2021-06-02  7:18                       ` Amir Goldstein
  2021-06-02  8:48                         ` Christian Brauner
  2021-06-02 10:43                         ` Matthew Bobrowski
  0 siblings, 2 replies; 38+ messages in thread
From: Amir Goldstein @ 2021-06-02  7:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 2, 2021 at 9:30 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> > On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > > returned in the pidfd info record.
> > > > > > > > > >
> > > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > > perspective.
> > > > > > > > >
> > > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > > still receive pidfd of the dead process?
> > > > > > > >
> > > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > > from two different angles:
> > > > > > > >
> > > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > > >    that the target process has terminated.
> > > > > > > >
> > > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > > >    process termination and other possible failures behind a single
> > > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > >
> > > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > > could introduce a different error code for detecting early process
> > > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > >
> > > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > > failed" looks OK to me.
> > > > > >
> > > > > > FAN_EPIDFD -> "creation failed"
> > > > > > FAN_NOPIDFD -> "no such process"
> > > > >
> > > > > Yes, I was thinking something along the lines of this...
> > > > >
> > > > > With the approach that I've proposed in this series, the pidfd
> > > > > creation failure trips up in pidfd_create() at the following
> > > > > condition:
> > > > >
> > > > >         if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > >                  return -EINVAL;
> > > > >
> > > > > Specifically, the following check:
> > > > >         !pid_has_task(pid, PIDTYPE_TGID)
> > > > >
> > > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > > distinguish between why the pidfd creation had failed:
> > > >
> > > > Ok, I see. You already do have a reference to a struct pid and in that
> > > > case we should just always return a pidfd to the caller. For
> > > > pidfd_open() for example we only report an error when
> > > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > > case here you already have a struct pid so I think we should just keep
> > > > this simple and always return a pidfd to the caller and in fact do
> > > > burden them with figuring out that the process is gone via
> > > > pidfd_send_signal() instead of complicating our lives here.
> > >
> > > Ah, actually Christian... Before, I go ahead and send through the updated
> > > series. Given what you've mentioned above I'm working with the assumption
> > > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > > [0]. Is that right?
> > >
> > > If so, I don't know how I feel about this given that pidfd_create() is now
> > > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > > doesn't support the creation of pidfds for non-thread-group leaders. I
> > > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > > expectation that a fully functional pidfd will be returned.
> > >
> > > The way I see it, I think we've got two options here:
> > >
> > > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> > >    explicit pid_has_task() check from the fanotify code before calling
> > >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> > >    indicating to userspace that there's no such process when the event was
> > >    created.
> > >
> > > 2) Scrap using pidfd_create() all together and implement a fanotify
> > >    specific pidfd creation wrapper which would allow for more
> > >    control. Something along the lines of what you've done in kernel/fork.c
> > >    [1]. Not the biggest fan of this idea just yet given the possibility of
> > >    it leading to an API drift over time.
> > >
> > > WDYT?
> >
> > Hm, why would you have to drop the pid_has_task() check again?
>
> Because of the race that I brielfy decscribed here [0]. The race exists

Sorry for being thich. I still don't understand what's racy about this.
Won't the event reader get a valid pidfd?
Can't the event reader verify that the pidfd points to a dead process?
I don't mind returning FAN_NOPIDFD for convenience, but user
will have to check the pidfd that it got anyway, because process
can die at any time between reading the event and acting on the
pidfd.

> because we perform the pidfd creation during the notification queue
> processing and not in the event allocation stages (for reasons that Jan has
> already covered here [1]). So, tl;dr there is the case where the fanotify
> calls pidfd_create() and the check for pid_has_task() fails because the
> struct pid that we're hanging onto within an event no longer contains a
> task of type PIDTYPE_TGID...
>
> [0] https://www.spinics.net/lists/linux-api/msg48630.html
> [1] https://www.spinics.net/lists/linux-api/msg48632.html

I warmly recommend that you use lore.kernel.org for archive links.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02  7:18                       ` Amir Goldstein
@ 2021-06-02  8:48                         ` Christian Brauner
  2021-06-02 10:56                           ` Matthew Bobrowski
  2021-06-02 10:43                         ` Matthew Bobrowski
  1 sibling, 1 reply; 38+ messages in thread
From: Christian Brauner @ 2021-06-02  8:48 UTC (permalink / raw)
  To: Amir Goldstein, Matthew Bobrowski; +Cc: Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 02, 2021 at 10:18:36AM +0300, Amir Goldstein wrote:
> On Wed, Jun 2, 2021 at 9:30 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> > > On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > > > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > > > returned in the pidfd info record.
> > > > > > > > > > >
> > > > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > > > perspective.
> > > > > > > > > >
> > > > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > > > still receive pidfd of the dead process?
> > > > > > > > >
> > > > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > > > from two different angles:
> > > > > > > > >
> > > > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > > > >    that the target process has terminated.
> > > > > > > > >
> > > > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > > > >    process termination and other possible failures behind a single
> > > > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > > >
> > > > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > > > could introduce a different error code for detecting early process
> > > > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > > >
> > > > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > > > failed" looks OK to me.
> > > > > > >
> > > > > > > FAN_EPIDFD -> "creation failed"
> > > > > > > FAN_NOPIDFD -> "no such process"
> > > > > >
> > > > > > Yes, I was thinking something along the lines of this...
> > > > > >
> > > > > > With the approach that I've proposed in this series, the pidfd
> > > > > > creation failure trips up in pidfd_create() at the following
> > > > > > condition:
> > > > > >
> > > > > >         if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > > >                  return -EINVAL;
> > > > > >
> > > > > > Specifically, the following check:
> > > > > >         !pid_has_task(pid, PIDTYPE_TGID)
> > > > > >
> > > > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > > > distinguish between why the pidfd creation had failed:
> > > > >
> > > > > Ok, I see. You already do have a reference to a struct pid and in that
> > > > > case we should just always return a pidfd to the caller. For
> > > > > pidfd_open() for example we only report an error when
> > > > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > > > case here you already have a struct pid so I think we should just keep
> > > > > this simple and always return a pidfd to the caller and in fact do
> > > > > burden them with figuring out that the process is gone via
> > > > > pidfd_send_signal() instead of complicating our lives here.
> > > >
> > > > Ah, actually Christian... Before, I go ahead and send through the updated
> > > > series. Given what you've mentioned above I'm working with the assumption
> > > > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > > > [0]. Is that right?
> > > >
> > > > If so, I don't know how I feel about this given that pidfd_create() is now
> > > > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > > > doesn't support the creation of pidfds for non-thread-group leaders. I
> > > > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > > > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > > > expectation that a fully functional pidfd will be returned.
> > > >
> > > > The way I see it, I think we've got two options here:
> > > >
> > > > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> > > >    explicit pid_has_task() check from the fanotify code before calling
> > > >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> > > >    indicating to userspace that there's no such process when the event was
> > > >    created.
> > > >
> > > > 2) Scrap using pidfd_create() all together and implement a fanotify
> > > >    specific pidfd creation wrapper which would allow for more
> > > >    control. Something along the lines of what you've done in kernel/fork.c
> > > >    [1]. Not the biggest fan of this idea just yet given the possibility of
> > > >    it leading to an API drift over time.
> > > >
> > > > WDYT?
> > >
> > > Hm, why would you have to drop the pid_has_task() check again?
> >
> > Because of the race that I brielfy decscribed here [0]. The race exists
> 
> Sorry for being thich. I still don't understand what's racy about this.
> Won't the event reader get a valid pidfd?
> Can't the event reader verify that the pidfd points to a dead process?
> I don't mind returning FAN_NOPIDFD for convenience, but user
> will have to check the pidfd that it got anyway, because process
> can die at any time between reading the event and acting on the
> pidfd.

(Replying to this part of the thread so we don't have to many parallel
replies.)

Hm, so quoting from link [0] Matthew posted so we all have some context
here:
"Basically, with this approach the pidfd creation is done only once an
event has been queued and the notification worker wakes up and picks up
the event from the queue processes it. There's a subtle latency
introduced when taking such an approach which at times leads to pidfd
creation failures. As in, by the time pidfd_create() is called the
struct pid has already been reaped, which then results in FAN_NOPIDFD
being returned in the pidfd info record."

I don't think that's a race and even if I don't think that it's a
meaningful one. So when the event is queued the process is still alive
but when the notification is actually delivered the process is dead.

And your point, Matthew, seems to be that the caller should always get a
pidfd back even if the process has already exited _and_ been reaped,
i.e. is dead because it was alive when the event was generated.

I think that's no how it needs to work and I have a hard time seeing
this as a good argument. What's problematic about just returning
FAN_NOPIDFD in that case? After all the process is gone. All the caller
can do with such a pidfd is to send it a signal and immediately realize
that the process is gone, i.e. -ESRCH anyway.

> 
> > because we perform the pidfd creation during the notification queue
> > processing and not in the event allocation stages (for reasons that Jan has
> > already covered here [1]). So, tl;dr there is the case where the fanotify
> > calls pidfd_create() and the check for pid_has_task() fails because the
> > struct pid that we're hanging onto within an event no longer contains a
> > task of type PIDTYPE_TGID...
> >
> > [0] https://www.spinics.net/lists/linux-api/msg48630.html
> > [1] https://www.spinics.net/lists/linux-api/msg48632.html
> 
> I warmly recommend that you use lore.kernel.org for archive links.
> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02  7:18                       ` Amir Goldstein
  2021-06-02  8:48                         ` Christian Brauner
@ 2021-06-02 10:43                         ` Matthew Bobrowski
  2021-06-02 12:18                           ` Amir Goldstein
  1 sibling, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-06-02 10:43 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 02, 2021 at 10:18:36AM +0300, Amir Goldstein wrote:
> On Wed, Jun 2, 2021 at 9:30 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> > > On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > > > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > > > returned in the pidfd info record.
> > > > > > > > > > >
> > > > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > > > perspective.
> > > > > > > > > >
> > > > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > > > still receive pidfd of the dead process?
> > > > > > > > >
> > > > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > > > from two different angles:
> > > > > > > > >
> > > > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > > > >    that the target process has terminated.
> > > > > > > > >
> > > > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > > > >    process termination and other possible failures behind a single
> > > > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > > >
> > > > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > > > could introduce a different error code for detecting early process
> > > > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > > >
> > > > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > > > failed" looks OK to me.
> > > > > > >
> > > > > > > FAN_EPIDFD -> "creation failed"
> > > > > > > FAN_NOPIDFD -> "no such process"
> > > > > >
> > > > > > Yes, I was thinking something along the lines of this...
> > > > > >
> > > > > > With the approach that I've proposed in this series, the pidfd
> > > > > > creation failure trips up in pidfd_create() at the following
> > > > > > condition:
> > > > > >
> > > > > >         if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > > >                  return -EINVAL;
> > > > > >
> > > > > > Specifically, the following check:
> > > > > >         !pid_has_task(pid, PIDTYPE_TGID)
> > > > > >
> > > > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > > > distinguish between why the pidfd creation had failed:
> > > > >
> > > > > Ok, I see. You already do have a reference to a struct pid and in that
> > > > > case we should just always return a pidfd to the caller. For
> > > > > pidfd_open() for example we only report an error when
> > > > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > > > case here you already have a struct pid so I think we should just keep
> > > > > this simple and always return a pidfd to the caller and in fact do
> > > > > burden them with figuring out that the process is gone via
> > > > > pidfd_send_signal() instead of complicating our lives here.
> > > >
> > > > Ah, actually Christian... Before, I go ahead and send through the updated
> > > > series. Given what you've mentioned above I'm working with the assumption
> > > > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > > > [0]. Is that right?
> > > >
> > > > If so, I don't know how I feel about this given that pidfd_create() is now
> > > > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > > > doesn't support the creation of pidfds for non-thread-group leaders. I
> > > > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > > > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > > > expectation that a fully functional pidfd will be returned.
> > > >
> > > > The way I see it, I think we've got two options here:
> > > >
> > > > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> > > >    explicit pid_has_task() check from the fanotify code before calling
> > > >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> > > >    indicating to userspace that there's no such process when the event was
> > > >    created.
> > > >
> > > > 2) Scrap using pidfd_create() all together and implement a fanotify
> > > >    specific pidfd creation wrapper which would allow for more
> > > >    control. Something along the lines of what you've done in kernel/fork.c
> > > >    [1]. Not the biggest fan of this idea just yet given the possibility of
> > > >    it leading to an API drift over time.
> > > >
> > > > WDYT?
> > >
> > > Hm, why would you have to drop the pid_has_task() check again?
> >
> > Because of the race that I brielfy decscribed here [0]. The race exists
> 
> Sorry for being thich.

You're not being thick at all Amir, and perhaps this is my fault for not
articulating the problem at hand correctly.

> I still don't understand what's racy about this. Won't the event reader
> get a valid pidfd?

I guess this depends, right?

As the logic/implementation currently stands in this specific patch series,
pidfd_create() will _NOT_ return a valid pidfd unless the struct pid still
holds reference to a task of type PIDTYPE_TGID. This is thanks to the extra
pid_hash_task() check that I thought was appropriate to incorporate into
pidfd_create() seeing as though:

 - With the pidfd_create() declaration now being added to linux/pid.h, we
   effectively are giving the implicit OK for it to be called from other
   kernel subsystems, and hence why the caller should be subject to the
   same restrictions/verifications imposed by the API specification
   i.e. "Currently, the process identified by @pid must be a thread-group
   leader...". Not enforcing the pid_has_task() check in pidfd_create()
   effectively says that the pidfd creation can be done for any struct pid
   types i.e. PIDTYPE_PID, PIDTYPE_PGID, etc. This leads to assumptions
   being made by the callers, which effectively then could lead to
   undefined/unexpected behavior.

There definitely can be cases whereby the underlying task(s) associated
with a struct pid have been freed as a result of process being killed
early. As in, the process is killed before the pid_has_task() check is
performed from within pidfd_create() when called from fanotify. This is
precisely the race that I'm referring to here, and in such cases as the
code currently stands, the event listener will _NOT_ receive a valid pidfd.

> Can't the event reader verify that the pidfd points to a dead process?

This was the idea, as in, the burden of checking whether a process has been
killed before the event listener receives the event should be on the event
listener. However, we're trying to come up with a good way to effectively
communicate that the above race I've attempted to articulate has actually
occurred. As in, do we:

a) Drop the pid_has_task() check in pidfd_create() so that a pidfd can be
   returned for all passed struct pids? That way, even if the above race is
   experienced the caller will still receive a pidfd and the event listener
   can do whatever it needs to with it.

b) Before calling into pidfd_create(), perform an explicit pid_has_task()
   check for PIDTYPE_TGID and if that returns false, then set FAN_NOPIDFD
   and save ourselves from calling into pidfd_create() all together. The
   event listener in this case doesn't have to perform the signal check to
   determine whether the process has already been killed.

c) Scrap calling into pidfd_create() all together and write a simple
   fanotify wrapper that contains the pidfd creation logic we need.

> I don't mind returning FAN_NOPIDFD for convenience, but user
> will have to check the pidfd that it got anyway, because process
> can die at any time between reading the event and acting on the
> pidfd.

Well sort of, as it depends on the approach that we decide to go ahead with
to report such early process termination cases. If we return FAN_NOPIDFD,
which signifies that the process died before a pidfd could be created, then
there's no point for the listener to step into checking the pidfd because
the pidfd already == FAN_NOPIDFD. If we simply return a pidfd regardless of
the early termination of the process, then sure the event listener will
need to check each pidfd that is supplied.

> > because we perform the pidfd creation during the notification queue
> > processing and not in the event allocation stages (for reasons that Jan has
> > already covered here [1]). So, tl;dr there is the case where the fanotify
> > calls pidfd_create() and the check for pid_has_task() fails because the
> > struct pid that we're hanging onto within an event no longer contains a
> > task of type PIDTYPE_TGID...
> >
> > [0] https://www.spinics.net/lists/linux-api/msg48630.html
> > [1] https://www.spinics.net/lists/linux-api/msg48632.html

Maybe I'm going down a rabbit hole and overthinking this whole thing,
IDK... :(

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02  8:48                         ` Christian Brauner
@ 2021-06-02 10:56                           ` Matthew Bobrowski
  2021-06-02 12:46                             ` Christian Brauner
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Bobrowski @ 2021-06-02 10:56 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 02, 2021 at 10:48:54AM +0200, Christian Brauner wrote:
> On Wed, Jun 02, 2021 at 10:18:36AM +0300, Amir Goldstein wrote:
> > On Wed, Jun 2, 2021 at 9:30 AM Matthew Bobrowski <repnop@google.com> wrote:
> > >
> > > On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> > > > On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > > > > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > > > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > > > > returned in the pidfd info record.
> > > > > > > > > > > >
> > > > > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > > > > perspective.
> > > > > > > > > > >
> > > > > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > > > > still receive pidfd of the dead process?
> > > > > > > > > >
> > > > > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > > > > from two different angles:
> > > > > > > > > >
> > > > > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > > > > >    that the target process has terminated.
> > > > > > > > > >
> > > > > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > > > > >    process termination and other possible failures behind a single
> > > > > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > > > >
> > > > > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > > > > could introduce a different error code for detecting early process
> > > > > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > > > >
> > > > > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > > > > failed" looks OK to me.
> > > > > > > >
> > > > > > > > FAN_EPIDFD -> "creation failed"
> > > > > > > > FAN_NOPIDFD -> "no such process"
> > > > > > >
> > > > > > > Yes, I was thinking something along the lines of this...
> > > > > > >
> > > > > > > With the approach that I've proposed in this series, the pidfd
> > > > > > > creation failure trips up in pidfd_create() at the following
> > > > > > > condition:
> > > > > > >
> > > > > > >         if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > > > >                  return -EINVAL;
> > > > > > >
> > > > > > > Specifically, the following check:
> > > > > > >         !pid_has_task(pid, PIDTYPE_TGID)
> > > > > > >
> > > > > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > > > > distinguish between why the pidfd creation had failed:
> > > > > >
> > > > > > Ok, I see. You already do have a reference to a struct pid and in that
> > > > > > case we should just always return a pidfd to the caller. For
> > > > > > pidfd_open() for example we only report an error when
> > > > > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > > > > case here you already have a struct pid so I think we should just keep
> > > > > > this simple and always return a pidfd to the caller and in fact do
> > > > > > burden them with figuring out that the process is gone via
> > > > > > pidfd_send_signal() instead of complicating our lives here.
> > > > >
> > > > > Ah, actually Christian... Before, I go ahead and send through the updated
> > > > > series. Given what you've mentioned above I'm working with the assumption
> > > > > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > > > > [0]. Is that right?
> > > > >
> > > > > If so, I don't know how I feel about this given that pidfd_create() is now
> > > > > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > > > > doesn't support the creation of pidfds for non-thread-group leaders. I
> > > > > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > > > > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > > > > expectation that a fully functional pidfd will be returned.
> > > > >
> > > > > The way I see it, I think we've got two options here:
> > > > >
> > > > > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> > > > >    explicit pid_has_task() check from the fanotify code before calling
> > > > >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> > > > >    indicating to userspace that there's no such process when the event was
> > > > >    created.
> > > > >
> > > > > 2) Scrap using pidfd_create() all together and implement a fanotify
> > > > >    specific pidfd creation wrapper which would allow for more
> > > > >    control. Something along the lines of what you've done in kernel/fork.c
> > > > >    [1]. Not the biggest fan of this idea just yet given the possibility of
> > > > >    it leading to an API drift over time.
> > > > >
> > > > > WDYT?
> > > >
> > > > Hm, why would you have to drop the pid_has_task() check again?
> > >
> > > Because of the race that I brielfy decscribed here [0]. The race exists
> > 
> > Sorry for being thich. I still don't understand what's racy about this.
> > Won't the event reader get a valid pidfd?
> > Can't the event reader verify that the pidfd points to a dead process?
> > I don't mind returning FAN_NOPIDFD for convenience, but user
> > will have to check the pidfd that it got anyway, because process
> > can die at any time between reading the event and acting on the
> > pidfd.
> 
> (Replying to this part of the thread so we don't have to many parallel
> replies.)
> 
> Hm, so quoting from link [0] Matthew posted so we all have some context
> here:
> "Basically, with this approach the pidfd creation is done only once an
> event has been queued and the notification worker wakes up and picks up
> the event from the queue processes it. There's a subtle latency
> introduced when taking such an approach which at times leads to pidfd
> creation failures. As in, by the time pidfd_create() is called the
> struct pid has already been reaped, which then results in FAN_NOPIDFD
> being returned in the pidfd info record."
> 
> I don't think that's a race and even if I don't think that it's a
> meaningful one. So when the event is queued the process is still alive
> but when the notification is actually delivered the process is dead.
> 
> And your point, Matthew, seems to be that the caller should always get a
> pidfd back even if the process has already exited _and_ been reaped,
> i.e. is dead because it was alive when the event was generated.
> 
> I think that's no how it needs to work and I have a hard time seeing
> this as a good argument. What's problematic about just returning
> FAN_NOPIDFD in that case? After all the process is gone. All the caller
> can do with such a pidfd is to send it a signal and immediately realize
> that the process is gone, i.e. -ESRCH anyway.

To get things straight, there's no argument here. There's a discussion
about what the best approach is for communicating to the event listener
that a process has been killed prior to a pidfd being created by/from
fanotify. I have no issues with communicating FAN_NOPIDFD to the event
listener in such cases. I just want to make sure everyone else is OK with
it.

> > > because we perform the pidfd creation during the notification queue
> > > processing and not in the event allocation stages (for reasons that Jan has
> > > already covered here [1]). So, tl;dr there is the case where the fanotify
> > > calls pidfd_create() and the check for pid_has_task() fails because the
> > > struct pid that we're hanging onto within an event no longer contains a
> > > task of type PIDTYPE_TGID...
> > >
> > > [0] https://www.spinics.net/lists/linux-api/msg48630.html
> > > [1] https://www.spinics.net/lists/linux-api/msg48632.html

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02 10:43                         ` Matthew Bobrowski
@ 2021-06-02 12:18                           ` Amir Goldstein
  2021-06-03  1:24                             ` Matthew Bobrowski
  0 siblings, 1 reply; 38+ messages in thread
From: Amir Goldstein @ 2021-06-02 12:18 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

> > I still don't understand what's racy about this. Won't the event reader
> > get a valid pidfd?
>
> I guess this depends, right?
>
> As the logic/implementation currently stands in this specific patch series,
> pidfd_create() will _NOT_ return a valid pidfd unless the struct pid still
> holds reference to a task of type PIDTYPE_TGID. This is thanks to the extra
> pid_hash_task() check that I thought was appropriate to incorporate into
> pidfd_create() seeing as though:
>
>  - With the pidfd_create() declaration now being added to linux/pid.h, we
>    effectively are giving the implicit OK for it to be called from other
>    kernel subsystems, and hence why the caller should be subject to the
>    same restrictions/verifications imposed by the API specification
>    i.e. "Currently, the process identified by @pid must be a thread-group
>    leader...". Not enforcing the pid_has_task() check in pidfd_create()
>    effectively says that the pidfd creation can be done for any struct pid
>    types i.e. PIDTYPE_PID, PIDTYPE_PGID, etc. This leads to assumptions
>    being made by the callers, which effectively then could lead to
>    undefined/unexpected behavior.
>
> There definitely can be cases whereby the underlying task(s) associated
> with a struct pid have been freed as a result of process being killed
> early. As in, the process is killed before the pid_has_task() check is
> performed from within pidfd_create() when called from fanotify. This is
> precisely the race that I'm referring to here, and in such cases as the
> code currently stands, the event listener will _NOT_ receive a valid pidfd.
>
> > Can't the event reader verify that the pidfd points to a dead process?
>
> This was the idea, as in, the burden of checking whether a process has been
> killed before the event listener receives the event should be on the event
> listener. However, we're trying to come up with a good way to effectively
> communicate that the above race I've attempted to articulate has actually
> occurred. As in, do we:
>
> a) Drop the pid_has_task() check in pidfd_create() so that a pidfd can be
>    returned for all passed struct pids? That way, even if the above race is
>    experienced the caller will still receive a pidfd and the event listener
>    can do whatever it needs to with it.
>
> b) Before calling into pidfd_create(), perform an explicit pid_has_task()
>    check for PIDTYPE_TGID and if that returns false, then set FAN_NOPIDFD
>    and save ourselves from calling into pidfd_create() all together. The
>    event listener in this case doesn't have to perform the signal check to
>    determine whether the process has already been killed.
>
> c) Scrap calling into pidfd_create() all together and write a simple
>    fanotify wrapper that contains the pidfd creation logic we need.
>
> > I don't mind returning FAN_NOPIDFD for convenience, but user
> > will have to check the pidfd that it got anyway, because process
> > can die at any time between reading the event and acting on the
> > pidfd.
>
> Well sort of, as it depends on the approach that we decide to go ahead with
> to report such early process termination cases. If we return FAN_NOPIDFD,
> which signifies that the process died before a pidfd could be created, then
> there's no point for the listener to step into checking the pidfd because
> the pidfd already == FAN_NOPIDFD. If we simply return a pidfd regardless of
> the early termination of the process, then sure the event listener will
> need to check each pidfd that is supplied.
>

I don't see any problem with the fact that the listener would have to check the
reported pidfd. I don't see how or why that should be avoided.
If we already know there is no process, we can be nice and return NOPIDFD,
because that doesn't add any complexity.

Not to mention that if pid_vnr() returns 0 (process is outside of
pidns of caller)
then I think you MUST either report FAN_NOPIDFD or no pid info record,
because getting 0 event->pid  with a valid pidfd is weird IMO and could be
considered as a security breach.

> > > because we perform the pidfd creation during the notification queue
> > > processing and not in the event allocation stages (for reasons that Jan has
> > > already covered here [1]). So, tl;dr there is the case where the fanotify
> > > calls pidfd_create() and the check for pid_has_task() fails because the
> > > struct pid that we're hanging onto within an event no longer contains a
> > > task of type PIDTYPE_TGID...
> > >
> > > [0] https://www.spinics.net/lists/linux-api/msg48630.html
> > > [1] https://www.spinics.net/lists/linux-api/msg48632.html
>
> Maybe I'm going down a rabbit hole and overthinking this whole thing,
> IDK... :(
>

That is the feeling I get as well.
Suggestion: write the man page - that will make it clear to yourself
and to code reviewers if the API is sane and if it is going to end up
being confusing to end users.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02 10:56                           ` Matthew Bobrowski
@ 2021-06-02 12:46                             ` Christian Brauner
  0 siblings, 0 replies; 38+ messages in thread
From: Christian Brauner @ 2021-06-02 12:46 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Amir Goldstein, Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 02, 2021 at 08:56:54PM +1000, Matthew Bobrowski wrote:
> On Wed, Jun 02, 2021 at 10:48:54AM +0200, Christian Brauner wrote:
> > On Wed, Jun 02, 2021 at 10:18:36AM +0300, Amir Goldstein wrote:
> > > On Wed, Jun 2, 2021 at 9:30 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > >
> > > > On Tue, Jun 01, 2021 at 01:46:28PM +0200, Christian Brauner wrote:
> > > > > On Tue, Jun 01, 2021 at 09:03:26PM +1000, Matthew Bobrowski wrote:
> > > > > > On Wed, May 26, 2021 at 08:05:29PM +0200, Christian Brauner wrote:
> > > > > > > On Wed, May 26, 2021 at 09:20:55AM +1000, Matthew Bobrowski wrote:
> > > > > > > > On Tue, May 25, 2021 at 12:31:33PM +0200, Christian Brauner wrote:
> > > > > > > > > On Mon, May 24, 2021 at 10:47:46AM +0200, Jan Kara wrote:
> > > > > > > > > > On Sat 22-05-21 09:32:36, Matthew Bobrowski wrote:
> > > > > > > > > > > On Fri, May 21, 2021 at 12:40:56PM +0200, Jan Kara wrote:
> > > > > > > > > > > > On Fri 21-05-21 20:15:35, Matthew Bobrowski wrote:
> > > > > > > > > > > > > On Thu, May 20, 2021 at 03:55:27PM +0200, Jan Kara wrote:
> > > > > > > > > > > > > There's one thing that I'd like to mention, and it's something in
> > > > > > > > > > > > > regards to the overall approach we've taken that I'm not particularly
> > > > > > > > > > > > > happy about and I'd like to hear all your thoughts. Basically, with
> > > > > > > > > > > > > this approach the pidfd creation is done only once an event has been
> > > > > > > > > > > > > queued and the notification worker wakes up and picks up the event
> > > > > > > > > > > > > from the queue processes it. There's a subtle latency introduced when
> > > > > > > > > > > > > taking such an approach which at times leads to pidfd creation
> > > > > > > > > > > > > failures. As in, by the time pidfd_create() is called the struct pid
> > > > > > > > > > > > > has already been reaped, which then results in FAN_NOPIDFD being
> > > > > > > > > > > > > returned in the pidfd info record.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Having said that, I'm wondering what the thoughts are on doing pidfd
> > > > > > > > > > > > > creation earlier on i.e. in the event allocation stages? This way, the
> > > > > > > > > > > > > struct pid is pinned earlier on and rather than FAN_NOPIDFD being
> > > > > > > > > > > > > returned in the pidfd info record because the struct pid has been
> > > > > > > > > > > > > already reaped, userspace application will atleast receive a valid
> > > > > > > > > > > > > pidfd which can be used to check whether the process still exists or
> > > > > > > > > > > > > not. I think it'll just set the expectation better from an API
> > > > > > > > > > > > > perspective.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, there's this race. OTOH if FAN_NOPIDFD is returned, the listener can
> > > > > > > > > > > > be sure the original process doesn't exist anymore. So is it useful to
> > > > > > > > > > > > still receive pidfd of the dead process?
> > > > > > > > > > >
> > > > > > > > > > > Well, you're absolutely right. However, FWIW I was approaching this
> > > > > > > > > > > from two different angles:
> > > > > > > > > > >
> > > > > > > > > > > 1) I wanted to keep the pattern in which the listener checks for the
> > > > > > > > > > >    existence/recycling of the process consistent. As in, the listener
> > > > > > > > > > >    would receive the pidfd, then send the pidfd a signal via
> > > > > > > > > > >    pidfd_send_signal() and check for -ESRCH which clearly indicates
> > > > > > > > > > >    that the target process has terminated.
> > > > > > > > > > >
> > > > > > > > > > > 2) I didn't want to mask failed pidfd creation because of early
> > > > > > > > > > >    process termination and other possible failures behind a single
> > > > > > > > > > >    FAN_NOPIDFD. IOW, if we take the -ESRCH approach above, the
> > > > > > > > > > >    listener can take clear corrective branches as what's to be done
> > > > > > > > > > >    next if a race is to have been detected, whereas simply returning
> > > > > > > > > > >    FAN_NOPIDFD at this stage can mean multiple things.
> > > > > > > > > > >
> > > > > > > > > > > Now that I've written the above and keeping in mind that we'd like to
> > > > > > > > > > > refrain from doing anything in the event allocation stages, perhaps we
> > > > > > > > > > > could introduce a different error code for detecting early process
> > > > > > > > > > > termination while attempting to construct the info record. WDYT?
> > > > > > > > > >
> > > > > > > > > > Sure, I wouldn't like to overengineer it but having one special fd value for
> > > > > > > > > > "process doesn't exist anymore" and another for general "creating pidfd
> > > > > > > > > > failed" looks OK to me.
> > > > > > > > >
> > > > > > > > > FAN_EPIDFD -> "creation failed"
> > > > > > > > > FAN_NOPIDFD -> "no such process"
> > > > > > > >
> > > > > > > > Yes, I was thinking something along the lines of this...
> > > > > > > >
> > > > > > > > With the approach that I've proposed in this series, the pidfd
> > > > > > > > creation failure trips up in pidfd_create() at the following
> > > > > > > > condition:
> > > > > > > >
> > > > > > > >         if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
> > > > > > > >                  return -EINVAL;
> > > > > > > >
> > > > > > > > Specifically, the following check:
> > > > > > > >         !pid_has_task(pid, PIDTYPE_TGID)
> > > > > > > >
> > > > > > > > In order to properly report either FAN_NOPIDFD/FAN_EPIDFD to
> > > > > > > > userspace, AFAIK I'll have to do one of either two things to better
> > > > > > > > distinguish between why the pidfd creation had failed:
> > > > > > >
> > > > > > > Ok, I see. You already do have a reference to a struct pid and in that
> > > > > > > case we should just always return a pidfd to the caller. For
> > > > > > > pidfd_open() for example we only report an error when
> > > > > > > find_get_pid(<pidnr>) doesn't find a struct pid to refer to. But in your
> > > > > > > case here you already have a struct pid so I think we should just keep
> > > > > > > this simple and always return a pidfd to the caller and in fact do
> > > > > > > burden them with figuring out that the process is gone via
> > > > > > > pidfd_send_signal() instead of complicating our lives here.
> > > > > >
> > > > > > Ah, actually Christian... Before, I go ahead and send through the updated
> > > > > > series. Given what you've mentioned above I'm working with the assumption
> > > > > > that you're OK with dropping the pid_has_task() check from pidfd_create()
> > > > > > [0]. Is that right?
> > > > > >
> > > > > > If so, I don't know how I feel about this given that pidfd_create() is now
> > > > > > to be exposed to the rest of the kernel and the pidfd API, as it stands,
> > > > > > doesn't support the creation of pidfds for non-thread-group leaders. I
> > > > > > suppose what I don't want is other kernel subsystems, if any, thinking it's
> > > > > > OK to call pidfd_create() with an arbitrary struct pid and setting the
> > > > > > expectation that a fully functional pidfd will be returned.
> > > > > >
> > > > > > The way I see it, I think we've got two options here:
> > > > > >
> > > > > > 1) Leave the pid_has_task() check within pidfd_create() and perform another
> > > > > >    explicit pid_has_task() check from the fanotify code before calling
> > > > > >    pidfd_create(). If it returns false, we set something like FAN_NOPIDFD
> > > > > >    indicating to userspace that there's no such process when the event was
> > > > > >    created.
> > > > > >
> > > > > > 2) Scrap using pidfd_create() all together and implement a fanotify
> > > > > >    specific pidfd creation wrapper which would allow for more
> > > > > >    control. Something along the lines of what you've done in kernel/fork.c
> > > > > >    [1]. Not the biggest fan of this idea just yet given the possibility of
> > > > > >    it leading to an API drift over time.
> > > > > >
> > > > > > WDYT?
> > > > >
> > > > > Hm, why would you have to drop the pid_has_task() check again?
> > > >
> > > > Because of the race that I brielfy decscribed here [0]. The race exists
> > > 
> > > Sorry for being thich. I still don't understand what's racy about this.
> > > Won't the event reader get a valid pidfd?
> > > Can't the event reader verify that the pidfd points to a dead process?
> > > I don't mind returning FAN_NOPIDFD for convenience, but user
> > > will have to check the pidfd that it got anyway, because process
> > > can die at any time between reading the event and acting on the
> > > pidfd.
> > 
> > (Replying to this part of the thread so we don't have to many parallel
> > replies.)
> > 
> > Hm, so quoting from link [0] Matthew posted so we all have some context
> > here:
> > "Basically, with this approach the pidfd creation is done only once an
> > event has been queued and the notification worker wakes up and picks up
> > the event from the queue processes it. There's a subtle latency
> > introduced when taking such an approach which at times leads to pidfd
> > creation failures. As in, by the time pidfd_create() is called the
> > struct pid has already been reaped, which then results in FAN_NOPIDFD
> > being returned in the pidfd info record."
> > 
> > I don't think that's a race and even if I don't think that it's a
> > meaningful one. So when the event is queued the process is still alive
> > but when the notification is actually delivered the process is dead.
> > 
> > And your point, Matthew, seems to be that the caller should always get a
> > pidfd back even if the process has already exited _and_ been reaped,
> > i.e. is dead because it was alive when the event was generated.
> > 
> > I think that's no how it needs to work and I have a hard time seeing
> > this as a good argument. What's problematic about just returning
> > FAN_NOPIDFD in that case? After all the process is gone. All the caller
> > can do with such a pidfd is to send it a signal and immediately realize
> > that the process is gone, i.e. -ESRCH anyway.
> 
> To get things straight, there's no argument here. There's a discussion

Ok, I read it as an argument for dropping that check. :)

> about what the best approach is for communicating to the event listener
> that a process has been killed prior to a pidfd being created by/from
> fanotify. I have no issues with communicating FAN_NOPIDFD to the event
> listener in such cases. I just want to make sure everyone else is OK with
> it.

I'm ok with it.

Christian

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Add pidfd support to the fanotify API
  2021-06-02 12:18                           ` Amir Goldstein
@ 2021-06-03  1:24                             ` Matthew Bobrowski
  0 siblings, 0 replies; 38+ messages in thread
From: Matthew Bobrowski @ 2021-06-03  1:24 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, Linux API

On Wed, Jun 02, 2021 at 03:18:03PM +0300, Amir Goldstein wrote:
> > > I still don't understand what's racy about this. Won't the event reader
> > > get a valid pidfd?
> >
> > I guess this depends, right?
> >
> > As the logic/implementation currently stands in this specific patch series,
> > pidfd_create() will _NOT_ return a valid pidfd unless the struct pid still
> > holds reference to a task of type PIDTYPE_TGID. This is thanks to the extra
> > pid_hash_task() check that I thought was appropriate to incorporate into
> > pidfd_create() seeing as though:
> >
> >  - With the pidfd_create() declaration now being added to linux/pid.h, we
> >    effectively are giving the implicit OK for it to be called from other
> >    kernel subsystems, and hence why the caller should be subject to the
> >    same restrictions/verifications imposed by the API specification
> >    i.e. "Currently, the process identified by @pid must be a thread-group
> >    leader...". Not enforcing the pid_has_task() check in pidfd_create()
> >    effectively says that the pidfd creation can be done for any struct pid
> >    types i.e. PIDTYPE_PID, PIDTYPE_PGID, etc. This leads to assumptions
> >    being made by the callers, which effectively then could lead to
> >    undefined/unexpected behavior.
> >
> > There definitely can be cases whereby the underlying task(s) associated
> > with a struct pid have been freed as a result of process being killed
> > early. As in, the process is killed before the pid_has_task() check is
> > performed from within pidfd_create() when called from fanotify. This is
> > precisely the race that I'm referring to here, and in such cases as the
> > code currently stands, the event listener will _NOT_ receive a valid pidfd.
> >
> > > Can't the event reader verify that the pidfd points to a dead process?
> >
> > This was the idea, as in, the burden of checking whether a process has been
> > killed before the event listener receives the event should be on the event
> > listener. However, we're trying to come up with a good way to effectively
> > communicate that the above race I've attempted to articulate has actually
> > occurred. As in, do we:
> >
> > a) Drop the pid_has_task() check in pidfd_create() so that a pidfd can be
> >    returned for all passed struct pids? That way, even if the above race is
> >    experienced the caller will still receive a pidfd and the event listener
> >    can do whatever it needs to with it.
> >
> > b) Before calling into pidfd_create(), perform an explicit pid_has_task()
> >    check for PIDTYPE_TGID and if that returns false, then set FAN_NOPIDFD
> >    and save ourselves from calling into pidfd_create() all together. The
> >    event listener in this case doesn't have to perform the signal check to
> >    determine whether the process has already been killed.
> >
> > c) Scrap calling into pidfd_create() all together and write a simple
> >    fanotify wrapper that contains the pidfd creation logic we need.
> >
> > > I don't mind returning FAN_NOPIDFD for convenience, but user
> > > will have to check the pidfd that it got anyway, because process
> > > can die at any time between reading the event and acting on the
> > > pidfd.
> >
> > Well sort of, as it depends on the approach that we decide to go ahead with
> > to report such early process termination cases. If we return FAN_NOPIDFD,
> > which signifies that the process died before a pidfd could be created, then
> > there's no point for the listener to step into checking the pidfd because
> > the pidfd already == FAN_NOPIDFD. If we simply return a pidfd regardless of
> > the early termination of the process, then sure the event listener will
> > need to check each pidfd that is supplied.
> >
> 
> I don't see any problem with the fact that the listener would have to check the
> reported pidfd. I don't see how or why that should be avoided.
> If we already know there is no process, we can be nice and return NOPIDFD,
> because that doesn't add any complexity.

Yes, agree. Going ahead with returning FAN_NOPIDFD for early process
termination i.e. before fanotify calls into pidfd creation code. All other
pidfd creation errors will be reported as FAN_EPIDFD.

> Not to mention that if pid_vnr() returns 0 (process is outside of
> pidns of caller)
> then I think you MUST either report FAN_NOPIDFD or no pid info record,
> because getting 0 event->pid  with a valid pidfd is weird IMO and could be
> considered as a security breach.

Good point. I feel as though reporting FAN_NOPIDFD in such cases would be
more appropriate, rather than leaving off a pidfd info record completely.

> > > > because we perform the pidfd creation during the notification queue
> > > > processing and not in the event allocation stages (for reasons that Jan has
> > > > already covered here [1]). So, tl;dr there is the case where the fanotify
> > > > calls pidfd_create() and the check for pid_has_task() fails because the
> > > > struct pid that we're hanging onto within an event no longer contains a
> > > > task of type PIDTYPE_TGID...
> > > >
> > > > [0] https://www.spinics.net/lists/linux-api/msg48630.html
> > > > [1] https://www.spinics.net/lists/linux-api/msg48632.html
> >
> > Maybe I'm going down a rabbit hole and overthinking this whole thing,
> > IDK... :(
> >
> 
> That is the feeling I get as well.
> Suggestion: write the man page - that will make it clear to yourself
> and to code reviewers if the API is sane and if it is going to end up
> being confusing to end users.

Yeah, that's a good approach. I'll consider doing this for future API
changes.

/M

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-06-03  1:25 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-20  2:09 [PATCH 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
2021-05-20  2:10 ` [PATCH 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
2021-05-20  2:10 ` [PATCH 2/5] kernel/pid.c: implement checks on parameters passed to pidfd_create() Matthew Bobrowski
2021-05-20  2:11 ` [PATCH 3/5] fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
2021-05-20  2:11 ` [PATCH 4/5] fanotify/fanotify_user.c: introduce a generic info record copying function Matthew Bobrowski
2021-05-20 13:59   ` Amir Goldstein
2021-05-21  9:26     ` Matthew Bobrowski
2021-05-20  2:11 ` [PATCH 5/5] fanotify: Add pidfd info record support to the fanotify API Matthew Bobrowski
2021-05-20  8:17   ` Christian Brauner
2021-05-20 13:43     ` Amir Goldstein
2021-05-21  9:21       ` Matthew Bobrowski
2021-05-21  9:41         ` Amir Goldstein
2021-05-21 10:24           ` Jan Kara
2021-05-21 11:10             ` Amir Goldstein
2021-05-21 13:19               ` Jan Kara
2021-05-21 13:52                 ` Amir Goldstein
2021-05-21 15:14                   ` Jan Kara
2021-05-22  0:41                     ` Matthew Bobrowski
2021-05-22  9:01                       ` Amir Goldstein
2021-05-20 13:55 ` [PATCH 0/5] Add pidfd " Jan Kara
2021-05-21 10:15   ` Matthew Bobrowski
2021-05-21 10:40     ` Jan Kara
2021-05-21 23:32       ` Matthew Bobrowski
2021-05-24  8:47         ` Jan Kara
2021-05-25 10:31           ` Christian Brauner
2021-05-25 23:20             ` Matthew Bobrowski
2021-05-26 18:05               ` Christian Brauner
2021-05-26 22:54                 ` Matthew Bobrowski
2021-06-01 11:03                 ` Matthew Bobrowski
2021-06-01 11:46                   ` Christian Brauner
2021-06-02  6:30                     ` Matthew Bobrowski
2021-06-02  7:18                       ` Amir Goldstein
2021-06-02  8:48                         ` Christian Brauner
2021-06-02 10:56                           ` Matthew Bobrowski
2021-06-02 12:46                             ` Christian Brauner
2021-06-02 10:43                         ` Matthew Bobrowski
2021-06-02 12:18                           ` Amir Goldstein
2021-06-03  1:24                             ` Matthew Bobrowski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.