All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/5] Add pidfd support to the fanotify API
@ 2021-07-21  6:17 Matthew Bobrowski
  2021-07-21  6:17 ` [PATCH v3 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:17 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

Hey Jan/Amir/Christian,

This is an updated version of the FAN_REPORT_PIDFD series which contains
the addressed nits from the previous review [0]. As per request, you can
also find the draft LTP tests here [1] and man-pages update for this new
API change here [2].

[0] https://lore.kernel.org/linux-fsdevel/cover.1623282854.git.repnop@google.com/
[1] https://github.com/matthewbobrowski/ltp/commits/fanotify_pidfd_v2
[2] https://github.com/matthewbobrowski/man-pages/commits/fanotify_pidfd_v1

Matthew Bobrowski (5):
  kernel/pid.c: remove static qualifier from pidfd_create()
  kernel/pid.c: implement additional checks upon pidfd_create()
    parameters
  fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels
  fanotify/fanotify_user.c: introduce a generic info record copying
    helper
  fanotify: add pidfd support to the fanotify API

 fs/notify/fanotify/fanotify_user.c | 252 ++++++++++++++++++++---------
 include/linux/fanotify.h           |   3 +
 include/linux/pid.h                |   1 +
 include/uapi/linux/fanotify.h      |  13 ++
 kernel/pid.c                       |  15 +-
 5 files changed, 205 insertions(+), 79 deletions(-)

-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 1/5] kernel/pid.c: remove static qualifier from pidfd_create()
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
@ 2021-07-21  6:17 ` Matthew Bobrowski
  2021-07-21  6:17 ` [PATCH v3 2/5] kernel/pid.c: implement additional checks upon pidfd_create() parameters Matthew Bobrowski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:17 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

With the idea of returning pidfds from the fanotify API, we need to
expose a mechanism for creating pidfds. We drop the static qualifier
from pidfd_create() and add its declaration to linux/pid.h so that the
pidfd_create() helper can be called from other kernel subsystems
i.e. fanotify.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 include/linux/pid.h | 1 +
 kernel/pid.c        | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index fa10acb8d6a4..af308e15f174 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -78,6 +78,7 @@ struct file;
 
 extern struct pid *pidfd_pid(const struct file *file);
 struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags);
+int pidfd_create(struct pid *pid, unsigned int flags);
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index ebdf9c60cd0b..d3cd95b8b080 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -550,10 +550,12 @@ struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
  * Note, that this function can only be called after the fd table has
  * been unshared to avoid leaking the pidfd to the new process.
  *
+ * This symbol should not be explicitly exported to loadable modules.
+ *
  * Return: On success, a cloexec pidfd is returned.
  *         On error, a negative errno number will be returned.
  */
-static int pidfd_create(struct pid *pid, unsigned int flags)
+int pidfd_create(struct pid *pid, unsigned int flags)
 {
 	int fd;
 
-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 2/5] kernel/pid.c: implement additional checks upon pidfd_create() parameters
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
  2021-07-21  6:17 ` [PATCH v3 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
@ 2021-07-21  6:17 ` Matthew Bobrowski
  2021-07-21  6:18 ` [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:17 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

By adding the pidfd_create() declaration to linux/pid.h, we
effectively expose this function to the rest of the kernel. In order
to avoid any unintended behaviour, or set false expectations upon this
function, ensure that constraints are forced upon each of the passed
parameters. This includes the checking of whether the passed struct
pid is a thread-group leader as pidfd creation is currently limited to
such pid types.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 kernel/pid.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index d3cd95b8b080..efe87db44683 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -559,6 +559,12 @@ int pidfd_create(struct pid *pid, unsigned int flags)
 {
 	int fd;
 
+	if (!pid || !pid_has_task(pid, PIDTYPE_TGID))
+		return -EINVAL;
+
+	if (flags & ~(O_NONBLOCK | O_RDWR | O_CLOEXEC))
+		return -EINVAL;
+
 	fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid),
 			      flags | O_RDWR | O_CLOEXEC);
 	if (fd < 0)
@@ -598,10 +604,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	if (!p)
 		return -ESRCH;
 
-	if (pid_has_task(p, PIDTYPE_TGID))
-		fd = pidfd_create(p, flags);
-	else
-		fd = -EINVAL;
+	fd = pidfd_create(p, flags);
 
 	put_pid(p);
 	return fd;
-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
  2021-07-21  6:17 ` [PATCH v3 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
  2021-07-21  6:17 ` [PATCH v3 2/5] kernel/pid.c: implement additional checks upon pidfd_create() parameters Matthew Bobrowski
@ 2021-07-21  6:18 ` Matthew Bobrowski
  2021-07-21  6:34   ` Amir Goldstein
  2021-07-21  6:18 ` [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper Matthew Bobrowski
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:18 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

With the idea to support additional info record types in the future
i.e. fanotify_event_info_pidfd, it's a good idea to rename some of the
labels assigned to some of the existing fid related functions,
parameters, etc which more accurately represent the intent behind
their usage.

For example, copy_info_to_user() was defined with a generic function
label, which arguably reads as being supportive of different info
record types, however the parameter list for this function is
explicitly tailored towards the creation and copying of the
fanotify_event_info_fid records. This same point applies to the macro
defined as FANOTIFY_INFO_HDR_LEN.

With fanotify_event_info_len(), we change the parameter label so that
the function implies that it can be extended to calculate the length
for additional info record types.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 fs/notify/fanotify/fanotify_user.c | 33 +++++++++++++++++-------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 64864fb40b40..182fea255376 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -104,7 +104,7 @@ struct kmem_cache *fanotify_path_event_cachep __read_mostly;
 struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
 
 #define FANOTIFY_EVENT_ALIGN 4
-#define FANOTIFY_INFO_HDR_LEN \
+#define FANOTIFY_FID_INFO_HDR_LEN \
 	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
 
 static int fanotify_fid_info_len(int fh_len, int name_len)
@@ -114,10 +114,11 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
 	if (name_len)
 		info_len += name_len + 1;
 
-	return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
+	return roundup(FANOTIFY_FID_INFO_HDR_LEN + info_len,
+		       FANOTIFY_EVENT_ALIGN);
 }
 
-static int fanotify_event_info_len(unsigned int fid_mode,
+static int fanotify_event_info_len(unsigned int info_mode,
 				   struct fanotify_event *event)
 {
 	struct fanotify_info *info = fanotify_event_info(event);
@@ -128,7 +129,8 @@ static int fanotify_event_info_len(unsigned int fid_mode,
 
 	if (dir_fh_len) {
 		info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
-	} else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
+	} else if ((info_mode & FAN_REPORT_NAME) &&
+		   (event->mask & FAN_ONDIR)) {
 		/*
 		 * With group flag FAN_REPORT_NAME, if name was not recorded in
 		 * event on a directory, we will report the name ".".
@@ -303,9 +305,10 @@ static int process_access_response(struct fsnotify_group *group,
 	return -ENOENT;
 }
 
-static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
-			     int info_type, const char *name, size_t name_len,
-			     char __user *buf, size_t count)
+static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
+				 int info_type, const char *name,
+				 size_t name_len,
+				 char __user *buf, size_t count)
 {
 	struct fanotify_event_info_fid info = { };
 	struct file_handle handle = { };
@@ -466,10 +469,11 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	if (fanotify_event_dir_fh_len(event)) {
 		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
 					     FAN_EVENT_INFO_TYPE_DFID;
-		ret = copy_info_to_user(fanotify_event_fsid(event),
-					fanotify_info_dir_fh(info),
-					info_type, fanotify_info_name(info),
-					info->name_len, buf, count);
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_info_dir_fh(info),
+					    info_type,
+					    fanotify_info_name(info),
+					    info->name_len, buf, count);
 		if (ret < 0)
 			goto out_close_fd;
 
@@ -515,9 +519,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 			info_type = FAN_EVENT_INFO_TYPE_FID;
 		}
 
-		ret = copy_info_to_user(fanotify_event_fsid(event),
-					fanotify_event_object_fh(event),
-					info_type, dot, dot_len, buf, count);
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_event_object_fh(event),
+					    info_type, dot, dot_len,
+					    buf, count);
 		if (ret < 0)
 			goto out_close_fd;
 
-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (2 preceding siblings ...)
  2021-07-21  6:18 ` [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
@ 2021-07-21  6:18 ` Matthew Bobrowski
  2021-07-21  6:35   ` Amir Goldstein
  2021-07-21  6:19 ` [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API Matthew Bobrowski
  2021-07-21  7:06 ` [PATCH v3 0/5] Add " Amir Goldstein
  5 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:18 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

The copy_info_records_to_user() helper allows for the separation of
info record copying routines/conditionals from copy_event_to_user(),
which reduces the overall clutter within this function. This becomes
especially true as we start introducing additional info records in the
future i.e. struct fanotify_event_info_pidfd. On success, this helper
returns the total amount of bytes that have been copied into the user
supplied buffer and on error, a negative value is returned to the
caller.

The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
record types that have been enabled for a specific notification
group. This macro becomes useful in the subsequent patch when the
FAN_REPORT_PIDFD initialization flag is introduced.

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---
 fs/notify/fanotify/fanotify_user.c | 155 ++++++++++++++++-------------
 include/linux/fanotify.h           |   2 +
 2 files changed, 90 insertions(+), 67 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 182fea255376..d19f70b2c24c 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -173,7 +173,7 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 	size_t event_size = FAN_EVENT_METADATA_LEN;
 	struct fanotify_event *event = NULL;
 	struct fsnotify_event *fsn_event;
-	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
+	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
 
 	pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
 
@@ -183,8 +183,8 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
 		goto out;
 
 	event = FANOTIFY_E(fsn_event);
-	if (fid_mode)
-		event_size += fanotify_event_info_len(fid_mode, event);
+	if (info_mode)
+		event_size += fanotify_event_info_len(info_mode, event);
 
 	if (event_size > count) {
 		event = ERR_PTR(-EINVAL);
@@ -401,6 +401,86 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
 	return info_len;
 }
 
+static int copy_info_records_to_user(struct fanotify_event *event,
+				     struct fanotify_info *info,
+				     unsigned int info_mode,
+				     char __user *buf, size_t count)
+{
+	int ret, total_bytes = 0, info_type = 0;
+	unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
+
+	/*
+	 * Event info records order is as follows: dir fid + name, child fid.
+	 */
+	if (fanotify_event_dir_fh_len(event)) {
+		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
+					     FAN_EVENT_INFO_TYPE_DFID;
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_info_dir_fh(info),
+					    info_type,
+					    fanotify_info_name(info),
+					    info->name_len, buf, count);
+		if (ret < 0)
+			return ret;
+
+		buf += ret;
+		count -= ret;
+		total_bytes += ret;
+	}
+
+	if (fanotify_event_object_fh_len(event)) {
+		const char *dot = NULL;
+		int dot_len = 0;
+
+		if (fid_mode == FAN_REPORT_FID || info_type) {
+			/*
+			 * With only group flag FAN_REPORT_FID only type FID is
+			 * reported. Second info record type is always FID.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_FID;
+		} else if ((fid_mode & FAN_REPORT_NAME) &&
+			   (event->mask & FAN_ONDIR)) {
+			/*
+			 * With group flag FAN_REPORT_NAME, if name was not
+			 * recorded in an event on a directory, report the name
+			 * "." with info type DFID_NAME.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
+			dot = ".";
+			dot_len = 1;
+		} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
+			   (event->mask & FAN_ONDIR)) {
+			/*
+			 * With group flag FAN_REPORT_DIR_FID, a single info
+			 * record has type DFID for directory entry
+			 * modificatio\ n event and for event on a directory.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_DFID;
+		} else {
+			/*
+			 * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
+			 * a single info record has type FID for event on a
+			 * non-directory, when there is no directory to report.
+			 * For example, on FAN_DELETE_SELF event.
+			 */
+			info_type = FAN_EVENT_INFO_TYPE_FID;
+		}
+
+		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
+					    fanotify_event_object_fh(event),
+					    info_type, dot, dot_len,
+					    buf, count);
+		if (ret < 0)
+			return ret;
+
+		buf += ret;
+		count -= ret;
+		total_bytes += ret;
+	}
+
+	return total_bytes;
+}
+
 static ssize_t copy_event_to_user(struct fsnotify_group *group,
 				  struct fanotify_event *event,
 				  char __user *buf, size_t count)
@@ -408,15 +488,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	struct fanotify_event_metadata metadata;
 	struct path *path = fanotify_event_path(event);
 	struct fanotify_info *info = fanotify_event_info(event);
-	unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
+	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
 	struct file *f = NULL;
 	int ret, fd = FAN_NOFD;
-	int info_type = 0;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
 	metadata.event_len = FAN_EVENT_METADATA_LEN +
-				fanotify_event_info_len(fid_mode, event);
+				fanotify_event_info_len(info_mode, event);
 	metadata.metadata_len = FAN_EVENT_METADATA_LEN;
 	metadata.vers = FANOTIFY_METADATA_VERSION;
 	metadata.reserved = 0;
@@ -465,69 +544,11 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	if (f)
 		fd_install(fd, f);
 
-	/* Event info records order is: dir fid + name, child fid */
-	if (fanotify_event_dir_fh_len(event)) {
-		info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
-					     FAN_EVENT_INFO_TYPE_DFID;
-		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
-					    fanotify_info_dir_fh(info),
-					    info_type,
-					    fanotify_info_name(info),
-					    info->name_len, buf, count);
-		if (ret < 0)
-			goto out_close_fd;
-
-		buf += ret;
-		count -= ret;
-	}
-
-	if (fanotify_event_object_fh_len(event)) {
-		const char *dot = NULL;
-		int dot_len = 0;
-
-		if (fid_mode == FAN_REPORT_FID || info_type) {
-			/*
-			 * With only group flag FAN_REPORT_FID only type FID is
-			 * reported. Second info record type is always FID.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_FID;
-		} else if ((fid_mode & FAN_REPORT_NAME) &&
-			   (event->mask & FAN_ONDIR)) {
-			/*
-			 * With group flag FAN_REPORT_NAME, if name was not
-			 * recorded in an event on a directory, report the
-			 * name "." with info type DFID_NAME.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
-			dot = ".";
-			dot_len = 1;
-		} else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
-			   (event->mask & FAN_ONDIR)) {
-			/*
-			 * With group flag FAN_REPORT_DIR_FID, a single info
-			 * record has type DFID for directory entry modification
-			 * event and for event on a directory.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_DFID;
-		} else {
-			/*
-			 * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
-			 * a single info record has type FID for event on a
-			 * non-directory, when there is no directory to report.
-			 * For example, on FAN_DELETE_SELF event.
-			 */
-			info_type = FAN_EVENT_INFO_TYPE_FID;
-		}
-
-		ret = copy_fid_info_to_user(fanotify_event_fsid(event),
-					    fanotify_event_object_fh(event),
-					    info_type, dot, dot_len,
-					    buf, count);
+	if (info_mode) {
+		ret = copy_info_records_to_user(event, info, info_mode,
+						buf, count);
 		if (ret < 0)
 			goto out_close_fd;
-
-		buf += ret;
-		count -= ret;
 	}
 
 	return metadata.event_len;
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index a16dbeced152..10a7e26ddba6 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -27,6 +27,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 #define FANOTIFY_FID_BITS	(FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
 
+#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS)
+
 /*
  * fanotify_init() flags that require CAP_SYS_ADMIN.
  * We do not allow unprivileged groups to request permission events.
-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (3 preceding siblings ...)
  2021-07-21  6:18 ` [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper Matthew Bobrowski
@ 2021-07-21  6:19 ` Matthew Bobrowski
  2021-07-21  7:05   ` Amir Goldstein
  2021-07-27  0:23   ` Jann Horn
  2021-07-21  7:06 ` [PATCH v3 0/5] Add " Amir Goldstein
  5 siblings, 2 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-21  6:19 UTC (permalink / raw)
  To: jack, amir73il, christian.brauner; +Cc: linux-fsdevel, linux-api

Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
allows userspace applications to control whether a pidfd info record
containing a pidfd is to be returned with each event.

If FAN_REPORT_PIDFD is enabled for a notification group, an additional
struct fanotify_event_info_pidfd object will be supplied alongside the
generic struct fanotify_event_metadata within a single event. This
functionality is analogous to that of FAN_REPORT_FID in terms of how
the event structure is supplied to the userspace application. Usage of
FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
permitted, and in this case a struct fanotify_event_info_pidfd object
will follow any struct fanotify_event_info_fid object.

Currently, the usage of FAN_REPORT_TID is not permitted along with
FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
limited to privileged processes only i.e. listeners that are running
with the CAP_SYS_ADMIN capability. Attempting to supply either of
these initialization flags with FAN_REPORT_PIDFD will result with
EINVAL being returned to the caller.

In the event of a pidfd creation error, there are two types of error
values that can be reported back to the listener. There is
FAN_NOPIDFD, which will be reported in cases where the process
responsible for generating the event has terminated prior to fanotify
being able to create pidfd for event->pid via pidfd_create(). The
there is FAN_EPIDFD, which will be reported if a more generic pidfd
creation error occurred when calling pidfd_create().

Signed-off-by: Matthew Bobrowski <repnop@google.com>
---

Changes since v2:

 * The FAN_REPORT_PIDFD flag value has been changed from 0x00001000 to
   0x00000080. This was so that future FID related initialization flags
   could be grouped nicely.

* Fixed pidfd clean up at out_close_fd label in
  copy_event_to_user(). Reversed the conditional and it now uses the
  close_fd() helper instead of put_unused_fd() as we also need to close the
  backing file, not just just mark the pidfd free in the fdtable.

* Shuffled around the WARN_ON_ONCE(FAN_REPORT_TID) within
  copy_event_to_user() so that it's inside the if (pidfd_mode) branch. It
  makes more sense to be as close to pidfd creation as possible.

* Fixed up the comment block within the if (pidfd_mode) branch.

 fs/notify/fanotify/fanotify_user.c | 88 ++++++++++++++++++++++++++++--
 include/linux/fanotify.h           |  3 +-
 include/uapi/linux/fanotify.h      | 13 +++++
 3 files changed, 98 insertions(+), 6 deletions(-)

diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index d19f70b2c24c..bcc375c258ce 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/fanotify.h>
 #include <linux/fcntl.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/anon_inodes.h>
@@ -106,6 +107,8 @@ struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
 #define FANOTIFY_EVENT_ALIGN 4
 #define FANOTIFY_FID_INFO_HDR_LEN \
 	(sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
+#define FANOTIFY_PIDFD_INFO_HDR_LEN \
+	sizeof(struct fanotify_event_info_pidfd)
 
 static int fanotify_fid_info_len(int fh_len, int name_len)
 {
@@ -138,6 +141,9 @@ static int fanotify_event_info_len(unsigned int info_mode,
 		dot_len = 1;
 	}
 
+	if (info_mode & FAN_REPORT_PIDFD)
+		info_len += FANOTIFY_PIDFD_INFO_HDR_LEN;
+
 	if (fh_len)
 		info_len += fanotify_fid_info_len(fh_len, dot_len);
 
@@ -401,13 +407,34 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
 	return info_len;
 }
 
+static int copy_pidfd_info_to_user(int pidfd,
+				   char __user *buf,
+				   size_t count)
+{
+	struct fanotify_event_info_pidfd info = { };
+	size_t info_len = FANOTIFY_PIDFD_INFO_HDR_LEN;
+
+	if (WARN_ON_ONCE(info_len > count))
+		return -EFAULT;
+
+	info.hdr.info_type = FAN_EVENT_INFO_TYPE_PIDFD;
+	info.hdr.len = info_len;
+	info.pidfd = pidfd;
+
+	if (copy_to_user(buf, &info, info_len))
+		return -EFAULT;
+
+	return info_len;
+}
+
 static int copy_info_records_to_user(struct fanotify_event *event,
 				     struct fanotify_info *info,
-				     unsigned int info_mode,
+				     unsigned int info_mode, int pidfd,
 				     char __user *buf, size_t count)
 {
 	int ret, total_bytes = 0, info_type = 0;
 	unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
+	unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
 
 	/*
 	 * Event info records order is as follows: dir fid + name, child fid.
@@ -478,6 +505,16 @@ static int copy_info_records_to_user(struct fanotify_event *event,
 		total_bytes += ret;
 	}
 
+	if (pidfd_mode) {
+		ret = copy_pidfd_info_to_user(pidfd, buf, count);
+		if (ret < 0)
+			return ret;
+
+		buf += ret;
+		count -= ret;
+		total_bytes += ret;
+	}
+
 	return total_bytes;
 }
 
@@ -489,8 +526,9 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	struct path *path = fanotify_event_path(event);
 	struct fanotify_info *info = fanotify_event_info(event);
 	unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
+	unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
 	struct file *f = NULL;
-	int ret, fd = FAN_NOFD;
+	int ret, pidfd = FAN_NOPIDFD, fd = FAN_NOFD;
 
 	pr_debug("%s: group=%p event=%p\n", __func__, group, event);
 
@@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 	}
 	metadata.fd = fd;
 
+	if (pidfd_mode) {
+		/*
+		 * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
+		 * exclusion is ever lifted. At the time of incoporating pidfd
+		 * support within fanotify, the pidfd API only supported the
+		 * creation of pidfds for thread-group leaders.
+		 */
+		WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
+
+		/*
+		 * The PIDTYPE_TGID check for an event->pid is performed
+		 * preemptively in attempt to catch those rare instances where
+		 * the process responsible for generating the event has
+		 * terminated prior to calling into pidfd_create() and acquiring
+		 * a valid pidfd. Report FAN_NOPIDFD to the listener in those
+		 * cases. All other pidfd creation errors are represented as
+		 * FAN_EPIDFD.
+		 */
+		if (metadata.pid == 0 ||
+		    !pid_has_task(event->pid, PIDTYPE_TGID)) {
+			pidfd = FAN_NOPIDFD;
+		} else {
+			pidfd = pidfd_create(event->pid, 0);
+			if (pidfd < 0)
+				pidfd = FAN_EPIDFD;
+		}
+	}
+
 	ret = -EFAULT;
 	/*
 	 * Sanity check copy size in case get_one_event() and
@@ -545,8 +611,8 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 		fd_install(fd, f);
 
 	if (info_mode) {
-		ret = copy_info_records_to_user(event, info, info_mode,
-						buf, count);
+		ret = copy_info_records_to_user(event, info, info_mode, pidfd,
+						buf, count);
 		if (ret < 0)
 			goto out_close_fd;
 	}
@@ -558,6 +624,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
 		put_unused_fd(fd);
 		fput(f);
 	}
+
+	if (pidfd >= 0)
+		close_fd(pidfd);
+
 	return ret;
 }
 
@@ -1103,6 +1173,14 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
 #endif
 		return -EINVAL;
 
+	/*
+	 * A pidfd can only be returned for a thread-group leader; thus
+	 * FAN_REPORT_PIDFD and FAN_REPORT_TID need to remain mutually
+	 * exclusive.
+	 */
+	if ((flags & FAN_REPORT_PIDFD) && (flags & FAN_REPORT_TID))
+		return -EINVAL;
+
 	if (event_f_flags & ~FANOTIFY_INIT_ALL_EVENT_F_BITS)
 		return -EINVAL;
 
@@ -1504,7 +1582,7 @@ static int __init fanotify_user_setup(void)
 				     FANOTIFY_DEFAULT_MAX_USER_MARKS);
 
 	BUILD_BUG_ON(FANOTIFY_INIT_FLAGS & FANOTIFY_INTERNAL_GROUP_FLAGS);
-	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 10);
+	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_INIT_FLAGS) != 11);
 	BUILD_BUG_ON(HWEIGHT32(FANOTIFY_MARK_FLAGS) != 9);
 
 	fanotify_mark_cache = KMEM_CACHE(fsnotify_mark,
diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
index 10a7e26ddba6..eec3b7c40811 100644
--- a/include/linux/fanotify.h
+++ b/include/linux/fanotify.h
@@ -27,7 +27,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
 
 #define FANOTIFY_FID_BITS	(FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
 
-#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS)
+#define FANOTIFY_INFO_MODES	(FANOTIFY_FID_BITS | FAN_REPORT_PIDFD)
 
 /*
  * fanotify_init() flags that require CAP_SYS_ADMIN.
@@ -37,6 +37,7 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
  */
 #define FANOTIFY_ADMIN_INIT_FLAGS	(FANOTIFY_PERM_CLASSES | \
 					 FAN_REPORT_TID | \
+					 FAN_REPORT_PIDFD | \
 					 FAN_UNLIMITED_QUEUE | \
 					 FAN_UNLIMITED_MARKS)
 
diff --git a/include/uapi/linux/fanotify.h b/include/uapi/linux/fanotify.h
index fbf9c5c7dd59..64553df9d735 100644
--- a/include/uapi/linux/fanotify.h
+++ b/include/uapi/linux/fanotify.h
@@ -51,6 +51,7 @@
 #define FAN_ENABLE_AUDIT	0x00000040
 
 /* Flags to determine fanotify event format */
+#define FAN_REPORT_PIDFD	0x00000080	/* Report pidfd for event->pid */
 #define FAN_REPORT_TID		0x00000100	/* event->pid is thread id */
 #define FAN_REPORT_FID		0x00000200	/* Report unique file id */
 #define FAN_REPORT_DIR_FID	0x00000400	/* Report unique directory id */
@@ -123,6 +124,7 @@ struct fanotify_event_metadata {
 #define FAN_EVENT_INFO_TYPE_FID		1
 #define FAN_EVENT_INFO_TYPE_DFID_NAME	2
 #define FAN_EVENT_INFO_TYPE_DFID	3
+#define FAN_EVENT_INFO_TYPE_PIDFD	4
 
 /* Variable length info record following event metadata */
 struct fanotify_event_info_header {
@@ -148,6 +150,15 @@ struct fanotify_event_info_fid {
 	unsigned char handle[0];
 };
 
+/*
+ * This structure is used for info records of type FAN_EVENT_INFO_TYPE_PIDFD.
+ * It holds a pidfd for the pid that was responsible for generating an event.
+ */
+struct fanotify_event_info_pidfd {
+	struct fanotify_event_info_header hdr;
+	__s32 pidfd;
+};
+
 struct fanotify_response {
 	__s32 fd;
 	__u32 response;
@@ -160,6 +171,8 @@ struct fanotify_response {
 
 /* No fd set in event */
 #define FAN_NOFD	-1
+#define FAN_NOPIDFD	FAN_NOFD
+#define FAN_EPIDFD	-2
 
 /* Helper functions to deal with fanotify_event_metadata buffers */
 #define FAN_EVENT_METADATA_LEN (sizeof(struct fanotify_event_metadata))
-- 
2.32.0.432.gabb21c7263-goog

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels
  2021-07-21  6:18 ` [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
@ 2021-07-21  6:34   ` Amir Goldstein
  0 siblings, 0 replies; 41+ messages in thread
From: Amir Goldstein @ 2021-07-21  6:34 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 9:18 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> With the idea to support additional info record types in the future
> i.e. fanotify_event_info_pidfd, it's a good idea to rename some of the
> labels assigned to some of the existing fid related functions,
> parameters, etc which more accurately represent the intent behind
> their usage.
>
> For example, copy_info_to_user() was defined with a generic function
> label, which arguably reads as being supportive of different info
> record types, however the parameter list for this function is
> explicitly tailored towards the creation and copying of the
> fanotify_event_info_fid records. This same point applies to the macro
> defined as FANOTIFY_INFO_HDR_LEN.
>
> With fanotify_event_info_len(), we change the parameter label so that
> the function implies that it can be extended to calculate the length
> for additional info record types.
>
> Signed-off-by: Matthew Bobrowski <repnop@google.com>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/notify/fanotify/fanotify_user.c | 33 +++++++++++++++++-------------
>  1 file changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 64864fb40b40..182fea255376 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -104,7 +104,7 @@ struct kmem_cache *fanotify_path_event_cachep __read_mostly;
>  struct kmem_cache *fanotify_perm_event_cachep __read_mostly;
>
>  #define FANOTIFY_EVENT_ALIGN 4
> -#define FANOTIFY_INFO_HDR_LEN \
> +#define FANOTIFY_FID_INFO_HDR_LEN \
>         (sizeof(struct fanotify_event_info_fid) + sizeof(struct file_handle))
>
>  static int fanotify_fid_info_len(int fh_len, int name_len)
> @@ -114,10 +114,11 @@ static int fanotify_fid_info_len(int fh_len, int name_len)
>         if (name_len)
>                 info_len += name_len + 1;
>
> -       return roundup(FANOTIFY_INFO_HDR_LEN + info_len, FANOTIFY_EVENT_ALIGN);
> +       return roundup(FANOTIFY_FID_INFO_HDR_LEN + info_len,
> +                      FANOTIFY_EVENT_ALIGN);
>  }
>
> -static int fanotify_event_info_len(unsigned int fid_mode,
> +static int fanotify_event_info_len(unsigned int info_mode,
>                                    struct fanotify_event *event)
>  {
>         struct fanotify_info *info = fanotify_event_info(event);
> @@ -128,7 +129,8 @@ static int fanotify_event_info_len(unsigned int fid_mode,
>
>         if (dir_fh_len) {
>                 info_len += fanotify_fid_info_len(dir_fh_len, info->name_len);
> -       } else if ((fid_mode & FAN_REPORT_NAME) && (event->mask & FAN_ONDIR)) {
> +       } else if ((info_mode & FAN_REPORT_NAME) &&
> +                  (event->mask & FAN_ONDIR)) {
>                 /*
>                  * With group flag FAN_REPORT_NAME, if name was not recorded in
>                  * event on a directory, we will report the name ".".
> @@ -303,9 +305,10 @@ static int process_access_response(struct fsnotify_group *group,
>         return -ENOENT;
>  }
>
> -static int copy_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> -                            int info_type, const char *name, size_t name_len,
> -                            char __user *buf, size_t count)
> +static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> +                                int info_type, const char *name,
> +                                size_t name_len,
> +                                char __user *buf, size_t count)
>  {
>         struct fanotify_event_info_fid info = { };
>         struct file_handle handle = { };
> @@ -466,10 +469,11 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         if (fanotify_event_dir_fh_len(event)) {
>                 info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
>                                              FAN_EVENT_INFO_TYPE_DFID;
> -               ret = copy_info_to_user(fanotify_event_fsid(event),
> -                                       fanotify_info_dir_fh(info),
> -                                       info_type, fanotify_info_name(info),
> -                                       info->name_len, buf, count);
> +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> +                                           fanotify_info_dir_fh(info),
> +                                           info_type,
> +                                           fanotify_info_name(info),
> +                                           info->name_len, buf, count);
>                 if (ret < 0)
>                         goto out_close_fd;
>
> @@ -515,9 +519,10 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>                         info_type = FAN_EVENT_INFO_TYPE_FID;
>                 }
>
> -               ret = copy_info_to_user(fanotify_event_fsid(event),
> -                                       fanotify_event_object_fh(event),
> -                                       info_type, dot, dot_len, buf, count);
> +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> +                                           fanotify_event_object_fh(event),
> +                                           info_type, dot, dot_len,
> +                                           buf, count);
>                 if (ret < 0)
>                         goto out_close_fd;
>
> --
> 2.32.0.432.gabb21c7263-goog
>
> /M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper
  2021-07-21  6:18 ` [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper Matthew Bobrowski
@ 2021-07-21  6:35   ` Amir Goldstein
  2021-07-27  8:16     ` Amir Goldstein
  0 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-21  6:35 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 9:18 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> The copy_info_records_to_user() helper allows for the separation of
> info record copying routines/conditionals from copy_event_to_user(),
> which reduces the overall clutter within this function. This becomes
> especially true as we start introducing additional info records in the
> future i.e. struct fanotify_event_info_pidfd. On success, this helper
> returns the total amount of bytes that have been copied into the user
> supplied buffer and on error, a negative value is returned to the
> caller.
>
> The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
> record types that have been enabled for a specific notification
> group. This macro becomes useful in the subsequent patch when the
> FAN_REPORT_PIDFD initialization flag is introduced.
>
> Signed-off-by: Matthew Bobrowski <repnop@google.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

> ---
>  fs/notify/fanotify/fanotify_user.c | 155 ++++++++++++++++-------------
>  include/linux/fanotify.h           |   2 +
>  2 files changed, 90 insertions(+), 67 deletions(-)
>
> diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> index 182fea255376..d19f70b2c24c 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -173,7 +173,7 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
>         size_t event_size = FAN_EVENT_METADATA_LEN;
>         struct fanotify_event *event = NULL;
>         struct fsnotify_event *fsn_event;
> -       unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> +       unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
>
>         pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
>
> @@ -183,8 +183,8 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
>                 goto out;
>
>         event = FANOTIFY_E(fsn_event);
> -       if (fid_mode)
> -               event_size += fanotify_event_info_len(fid_mode, event);
> +       if (info_mode)
> +               event_size += fanotify_event_info_len(info_mode, event);
>
>         if (event_size > count) {
>                 event = ERR_PTR(-EINVAL);
> @@ -401,6 +401,86 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
>         return info_len;
>  }
>
> +static int copy_info_records_to_user(struct fanotify_event *event,
> +                                    struct fanotify_info *info,
> +                                    unsigned int info_mode,
> +                                    char __user *buf, size_t count)
> +{
> +       int ret, total_bytes = 0, info_type = 0;
> +       unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
> +
> +       /*
> +        * Event info records order is as follows: dir fid + name, child fid.
> +        */
> +       if (fanotify_event_dir_fh_len(event)) {
> +               info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> +                                            FAN_EVENT_INFO_TYPE_DFID;
> +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> +                                           fanotify_info_dir_fh(info),
> +                                           info_type,
> +                                           fanotify_info_name(info),
> +                                           info->name_len, buf, count);
> +               if (ret < 0)
> +                       return ret;
> +
> +               buf += ret;
> +               count -= ret;
> +               total_bytes += ret;
> +       }
> +
> +       if (fanotify_event_object_fh_len(event)) {
> +               const char *dot = NULL;
> +               int dot_len = 0;
> +
> +               if (fid_mode == FAN_REPORT_FID || info_type) {
> +                       /*
> +                        * With only group flag FAN_REPORT_FID only type FID is
> +                        * reported. Second info record type is always FID.
> +                        */
> +                       info_type = FAN_EVENT_INFO_TYPE_FID;
> +               } else if ((fid_mode & FAN_REPORT_NAME) &&
> +                          (event->mask & FAN_ONDIR)) {
> +                       /*
> +                        * With group flag FAN_REPORT_NAME, if name was not
> +                        * recorded in an event on a directory, report the name
> +                        * "." with info type DFID_NAME.
> +                        */
> +                       info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
> +                       dot = ".";
> +                       dot_len = 1;
> +               } else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
> +                          (event->mask & FAN_ONDIR)) {
> +                       /*
> +                        * With group flag FAN_REPORT_DIR_FID, a single info
> +                        * record has type DFID for directory entry
> +                        * modificatio\ n event and for event on a directory.
> +                        */
> +                       info_type = FAN_EVENT_INFO_TYPE_DFID;
> +               } else {
> +                       /*
> +                        * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
> +                        * a single info record has type FID for event on a
> +                        * non-directory, when there is no directory to report.
> +                        * For example, on FAN_DELETE_SELF event.
> +                        */
> +                       info_type = FAN_EVENT_INFO_TYPE_FID;
> +               }
> +
> +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> +                                           fanotify_event_object_fh(event),
> +                                           info_type, dot, dot_len,
> +                                           buf, count);
> +               if (ret < 0)
> +                       return ret;
> +
> +               buf += ret;
> +               count -= ret;
> +               total_bytes += ret;
> +       }
> +
> +       return total_bytes;
> +}
> +
>  static ssize_t copy_event_to_user(struct fsnotify_group *group,
>                                   struct fanotify_event *event,
>                                   char __user *buf, size_t count)
> @@ -408,15 +488,14 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         struct fanotify_event_metadata metadata;
>         struct path *path = fanotify_event_path(event);
>         struct fanotify_info *info = fanotify_event_info(event);
> -       unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> +       unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
>         struct file *f = NULL;
>         int ret, fd = FAN_NOFD;
> -       int info_type = 0;
>
>         pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>
>         metadata.event_len = FAN_EVENT_METADATA_LEN +
> -                               fanotify_event_info_len(fid_mode, event);
> +                               fanotify_event_info_len(info_mode, event);
>         metadata.metadata_len = FAN_EVENT_METADATA_LEN;
>         metadata.vers = FANOTIFY_METADATA_VERSION;
>         metadata.reserved = 0;
> @@ -465,69 +544,11 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         if (f)
>                 fd_install(fd, f);
>
> -       /* Event info records order is: dir fid + name, child fid */
> -       if (fanotify_event_dir_fh_len(event)) {
> -               info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> -                                            FAN_EVENT_INFO_TYPE_DFID;
> -               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> -                                           fanotify_info_dir_fh(info),
> -                                           info_type,
> -                                           fanotify_info_name(info),
> -                                           info->name_len, buf, count);
> -               if (ret < 0)
> -                       goto out_close_fd;
> -
> -               buf += ret;
> -               count -= ret;
> -       }
> -
> -       if (fanotify_event_object_fh_len(event)) {
> -               const char *dot = NULL;
> -               int dot_len = 0;
> -
> -               if (fid_mode == FAN_REPORT_FID || info_type) {
> -                       /*
> -                        * With only group flag FAN_REPORT_FID only type FID is
> -                        * reported. Second info record type is always FID.
> -                        */
> -                       info_type = FAN_EVENT_INFO_TYPE_FID;
> -               } else if ((fid_mode & FAN_REPORT_NAME) &&
> -                          (event->mask & FAN_ONDIR)) {
> -                       /*
> -                        * With group flag FAN_REPORT_NAME, if name was not
> -                        * recorded in an event on a directory, report the
> -                        * name "." with info type DFID_NAME.
> -                        */
> -                       info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
> -                       dot = ".";
> -                       dot_len = 1;
> -               } else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
> -                          (event->mask & FAN_ONDIR)) {
> -                       /*
> -                        * With group flag FAN_REPORT_DIR_FID, a single info
> -                        * record has type DFID for directory entry modification
> -                        * event and for event on a directory.
> -                        */
> -                       info_type = FAN_EVENT_INFO_TYPE_DFID;
> -               } else {
> -                       /*
> -                        * With group flags FAN_REPORT_DIR_FID|FAN_REPORT_FID,
> -                        * a single info record has type FID for event on a
> -                        * non-directory, when there is no directory to report.
> -                        * For example, on FAN_DELETE_SELF event.
> -                        */
> -                       info_type = FAN_EVENT_INFO_TYPE_FID;
> -               }
> -
> -               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> -                                           fanotify_event_object_fh(event),
> -                                           info_type, dot, dot_len,
> -                                           buf, count);
> +       if (info_mode) {
> +               ret = copy_info_records_to_user(event, info, info_mode,
> +                                               buf, count);
>                 if (ret < 0)
>                         goto out_close_fd;
> -
> -               buf += ret;
> -               count -= ret;
>         }
>
>         return metadata.event_len;
> diff --git a/include/linux/fanotify.h b/include/linux/fanotify.h
> index a16dbeced152..10a7e26ddba6 100644
> --- a/include/linux/fanotify.h
> +++ b/include/linux/fanotify.h
> @@ -27,6 +27,8 @@ extern struct ctl_table fanotify_table[]; /* for sysctl */
>
>  #define FANOTIFY_FID_BITS      (FAN_REPORT_FID | FAN_REPORT_DFID_NAME)
>
> +#define FANOTIFY_INFO_MODES    (FANOTIFY_FID_BITS)
> +
>  /*
>   * fanotify_init() flags that require CAP_SYS_ADMIN.
>   * We do not allow unprivileged groups to request permission events.
> --
> 2.32.0.432.gabb21c7263-goog
>
> /M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-21  6:19 ` [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API Matthew Bobrowski
@ 2021-07-21  7:05   ` Amir Goldstein
  2021-07-26 23:04     ` Matthew Bobrowski
  2021-07-27  0:23   ` Jann Horn
  1 sibling, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-21  7:05 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 9:19 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> allows userspace applications to control whether a pidfd info record
> containing a pidfd is to be returned with each event.
>
> If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> struct fanotify_event_info_pidfd object will be supplied alongside the
> generic struct fanotify_event_metadata within a single event. This
> functionality is analogous to that of FAN_REPORT_FID in terms of how
> the event structure is supplied to the userspace application. Usage of
> FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> permitted, and in this case a struct fanotify_event_info_pidfd object
> will follow any struct fanotify_event_info_fid object.
>
> Currently, the usage of FAN_REPORT_TID is not permitted along with
> FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> limited to privileged processes only i.e. listeners that are running
> with the CAP_SYS_ADMIN capability. Attempting to supply either of
> these initialization flags with FAN_REPORT_PIDFD will result with
> EINVAL being returned to the caller.
>
> In the event of a pidfd creation error, there are two types of error
> values that can be reported back to the listener. There is
> FAN_NOPIDFD, which will be reported in cases where the process
> responsible for generating the event has terminated prior to fanotify
> being able to create pidfd for event->pid via pidfd_create(). The

I think that "...prior to event listener reading the event..." is a more
accurate description of the situation.

> there is FAN_EPIDFD, which will be reported if a more generic pidfd
> creation error occurred when calling pidfd_create().
>
> Signed-off-by: Matthew Bobrowski <repnop@google.com>
> ---
>
> Changes since v2:
>
>  * The FAN_REPORT_PIDFD flag value has been changed from 0x00001000 to
>    0x00000080. This was so that future FID related initialization flags
>    could be grouped nicely.
>
> * Fixed pidfd clean up at out_close_fd label in
>   copy_event_to_user(). Reversed the conditional and it now uses the
>   close_fd() helper instead of put_unused_fd() as we also need to close the
>   backing file, not just just mark the pidfd free in the fdtable.
>
> * Shuffled around the WARN_ON_ONCE(FAN_REPORT_TID) within
>   copy_event_to_user() so that it's inside the if (pidfd_mode) branch. It
>   makes more sense to be as close to pidfd creation as possible.
>
> * Fixed up the comment block within the if (pidfd_mode) branch.
>
>  fs/notify/fanotify/fanotify_user.c | 88 ++++++++++++++++++++++++++++--
>  include/linux/fanotify.h           |  3 +-
>  include/uapi/linux/fanotify.h      | 13 +++++
>  3 files changed, 98 insertions(+), 6 deletions(-)
>

[...]

>
> @@ -489,8 +526,9 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         struct path *path = fanotify_event_path(event);
>         struct fanotify_info *info = fanotify_event_info(event);
>         unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
> +       unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
>         struct file *f = NULL;
> -       int ret, fd = FAN_NOFD;
> +       int ret, pidfd = FAN_NOPIDFD, fd = FAN_NOFD;
>
>         pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>
> @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         }
>         metadata.fd = fd;
>
> +       if (pidfd_mode) {
> +               /*
> +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> +                * exclusion is ever lifted. At the time of incoporating pidfd
> +                * support within fanotify, the pidfd API only supported the
> +                * creation of pidfds for thread-group leaders.
> +                */
> +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> +
> +               /*
> +                * The PIDTYPE_TGID check for an event->pid is performed
> +                * preemptively in attempt to catch those rare instances where
> +                * the process responsible for generating the event has
> +                * terminated prior to calling into pidfd_create() and acquiring

I find the description above to be "over dramatic".
An event listener reading events after generating process has terminated
could be quite common in case of one shot tools like mv,touch,etc.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 0/5] Add pidfd support to the fanotify API
  2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
                   ` (4 preceding siblings ...)
  2021-07-21  6:19 ` [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API Matthew Bobrowski
@ 2021-07-21  7:06 ` Amir Goldstein
  2021-07-26 23:07   ` Matthew Bobrowski
  5 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-21  7:06 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 9:17 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> Hey Jan/Amir/Christian,
>
> This is an updated version of the FAN_REPORT_PIDFD series which contains
> the addressed nits from the previous review [0]. As per request, you can
> also find the draft LTP tests here [1] and man-pages update for this new
> API change here [2].
>
> [0] https://lore.kernel.org/linux-fsdevel/cover.1623282854.git.repnop@google.com/
> [1] https://github.com/matthewbobrowski/ltp/commits/fanotify_pidfd_v2
> [2] https://github.com/matthewbobrowski/man-pages/commits/fanotify_pidfd_v1

FWIW, those test and man page drafts look good to me :)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-21  7:05   ` Amir Goldstein
@ 2021-07-26 23:04     ` Matthew Bobrowski
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-26 23:04 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 10:05:17AM +0300, Amir Goldstein wrote:
> On Wed, Jul 21, 2021 at 9:19 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > allows userspace applications to control whether a pidfd info record
> > containing a pidfd is to be returned with each event.
> >
> > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > struct fanotify_event_info_pidfd object will be supplied alongside the
> > generic struct fanotify_event_metadata within a single event. This
> > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > the event structure is supplied to the userspace application. Usage of
> > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > permitted, and in this case a struct fanotify_event_info_pidfd object
> > will follow any struct fanotify_event_info_fid object.
> >
> > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > limited to privileged processes only i.e. listeners that are running
> > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > these initialization flags with FAN_REPORT_PIDFD will result with
> > EINVAL being returned to the caller.
> >
> > In the event of a pidfd creation error, there are two types of error
> > values that can be reported back to the listener. There is
> > FAN_NOPIDFD, which will be reported in cases where the process
> > responsible for generating the event has terminated prior to fanotify
> > being able to create pidfd for event->pid via pidfd_create(). The
> 
> I think that "...prior to event listener reading the event..." is a more
> accurate description of the situation.

Yes, and to be fair, I actually forgot to update this specific commit
message and comments within the commit after making these exact adjustments
to the man-pages.

> > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > creation error occurred when calling pidfd_create().
> >
> > Signed-off-by: Matthew Bobrowski <repnop@google.com>
> > ---
> >
> > Changes since v2:
> >
> >  * The FAN_REPORT_PIDFD flag value has been changed from 0x00001000 to
> >    0x00000080. This was so that future FID related initialization flags
> >    could be grouped nicely.
> >
> > * Fixed pidfd clean up at out_close_fd label in
> >   copy_event_to_user(). Reversed the conditional and it now uses the
> >   close_fd() helper instead of put_unused_fd() as we also need to close the
> >   backing file, not just just mark the pidfd free in the fdtable.
> >
> > * Shuffled around the WARN_ON_ONCE(FAN_REPORT_TID) within
> >   copy_event_to_user() so that it's inside the if (pidfd_mode) branch. It
> >   makes more sense to be as close to pidfd creation as possible.
> >
> > * Fixed up the comment block within the if (pidfd_mode) branch.
> >
> >  fs/notify/fanotify/fanotify_user.c | 88 ++++++++++++++++++++++++++++--
> >  include/linux/fanotify.h           |  3 +-
> >  include/uapi/linux/fanotify.h      | 13 +++++
> >  3 files changed, 98 insertions(+), 6 deletions(-)
> >
> 
> [...]
> 
> >
> > @@ -489,8 +526,9 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> >         struct path *path = fanotify_event_path(event);
> >         struct fanotify_info *info = fanotify_event_info(event);
> >         unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
> > +       unsigned int pidfd_mode = info_mode & FAN_REPORT_PIDFD;
> >         struct file *f = NULL;
> > -       int ret, fd = FAN_NOFD;
> > +       int ret, pidfd = FAN_NOPIDFD, fd = FAN_NOFD;
> >
> >         pr_debug("%s: group=%p event=%p\n", __func__, group, event);
> >
> > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> >         }
> >         metadata.fd = fd;
> >
> > +       if (pidfd_mode) {
> > +               /*
> > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > +                * support within fanotify, the pidfd API only supported the
> > +                * creation of pidfds for thread-group leaders.
> > +                */
> > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > +
> > +               /*
> > +                * The PIDTYPE_TGID check for an event->pid is performed
> > +                * preemptively in attempt to catch those rare instances where
> > +                * the process responsible for generating the event has
> > +                * terminated prior to calling into pidfd_create() and acquiring
> 
> I find the description above to be "over dramatic".
> An event listener reading events after generating process has terminated
> could be quite common in case of one shot tools like mv,touch,etc.

Agree, will adjust.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 0/5] Add pidfd support to the fanotify API
  2021-07-21  7:06 ` [PATCH v3 0/5] Add " Amir Goldstein
@ 2021-07-26 23:07   ` Matthew Bobrowski
  2021-07-27  0:16     ` Matthew Bobrowski
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-26 23:07 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 10:06:56AM +0300, Amir Goldstein wrote:
> On Wed, Jul 21, 2021 at 9:17 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > Hey Jan/Amir/Christian,
> >
> > This is an updated version of the FAN_REPORT_PIDFD series which contains
> > the addressed nits from the previous review [0]. As per request, you can
> > also find the draft LTP tests here [1] and man-pages update for this new
> > API change here [2].
> >
> > [0] https://lore.kernel.org/linux-fsdevel/cover.1623282854.git.repnop@google.com/
> > [1] https://github.com/matthewbobrowski/ltp/commits/fanotify_pidfd_v2
> > [2] https://github.com/matthewbobrowski/man-pages/commits/fanotify_pidfd_v1
> 
> FWIW, those test and man page drafts look good to me :)

Fantastic, thanks for the review!

I will adjust the minor comments/documentation on patch 5/5 and send
through an updated series.
/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 0/5] Add pidfd support to the fanotify API
  2021-07-26 23:07   ` Matthew Bobrowski
@ 2021-07-27  0:16     ` Matthew Bobrowski
  2021-07-29 13:40       ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-27  0:16 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Tue, Jul 27, 2021 at 09:07:28AM +1000, Matthew Bobrowski wrote:
> On Wed, Jul 21, 2021 at 10:06:56AM +0300, Amir Goldstein wrote:
> > On Wed, Jul 21, 2021 at 9:17 AM Matthew Bobrowski <repnop@google.com> wrote:
> > >
> > > Hey Jan/Amir/Christian,
> > >
> > > This is an updated version of the FAN_REPORT_PIDFD series which contains
> > > the addressed nits from the previous review [0]. As per request, you can
> > > also find the draft LTP tests here [1] and man-pages update for this new
> > > API change here [2].
> > >
> > > [0] https://lore.kernel.org/linux-fsdevel/cover.1623282854.git.repnop@google.com/
> > > [1] https://github.com/matthewbobrowski/ltp/commits/fanotify_pidfd_v2
> > > [2] https://github.com/matthewbobrowski/man-pages/commits/fanotify_pidfd_v1
> > 
> > FWIW, those test and man page drafts look good to me :)
> 
> Fantastic, thanks for the review!
> 
> I will adjust the minor comments/documentation on patch 5/5 and send
> through an updated series.

Alright, so I've fixed up the git commit message and comment in the source
code so that it's more accurate in terms of when/why FAN_NOPIDFD is
reported.

I'm going to hold with sending through v4 until I have Jan also look peek
and poke at v3 as I want to avoid doing any unnecessary round trips.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-21  6:19 ` [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API Matthew Bobrowski
  2021-07-21  7:05   ` Amir Goldstein
@ 2021-07-27  0:23   ` Jann Horn
  2021-07-27  4:19     ` Amir Goldstein
  2021-07-27 12:54     ` Matthew Bobrowski
  1 sibling, 2 replies; 41+ messages in thread
From: Jann Horn @ 2021-07-27  0:23 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: jack, amir73il, christian.brauner, linux-fsdevel, linux-api,
	Andy Lutomirski

On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> allows userspace applications to control whether a pidfd info record
> containing a pidfd is to be returned with each event.
>
> If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> struct fanotify_event_info_pidfd object will be supplied alongside the
> generic struct fanotify_event_metadata within a single event. This
> functionality is analogous to that of FAN_REPORT_FID in terms of how
> the event structure is supplied to the userspace application. Usage of
> FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> permitted, and in this case a struct fanotify_event_info_pidfd object
> will follow any struct fanotify_event_info_fid object.
>
> Currently, the usage of FAN_REPORT_TID is not permitted along with
> FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> limited to privileged processes only i.e. listeners that are running
> with the CAP_SYS_ADMIN capability. Attempting to supply either of
> these initialization flags with FAN_REPORT_PIDFD will result with
> EINVAL being returned to the caller.
>
> In the event of a pidfd creation error, there are two types of error
> values that can be reported back to the listener. There is
> FAN_NOPIDFD, which will be reported in cases where the process
> responsible for generating the event has terminated prior to fanotify
> being able to create pidfd for event->pid via pidfd_create(). The
> there is FAN_EPIDFD, which will be reported if a more generic pidfd
> creation error occurred when calling pidfd_create().
[...]
> @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
>         }
>         metadata.fd = fd;
>
> +       if (pidfd_mode) {
> +               /*
> +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> +                * exclusion is ever lifted. At the time of incoporating pidfd
> +                * support within fanotify, the pidfd API only supported the
> +                * creation of pidfds for thread-group leaders.
> +                */
> +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> +
> +               /*
> +                * The PIDTYPE_TGID check for an event->pid is performed
> +                * preemptively in attempt to catch those rare instances where
> +                * the process responsible for generating the event has
> +                * terminated prior to calling into pidfd_create() and acquiring
> +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> +                * cases. All other pidfd creation errors are represented as
> +                * FAN_EPIDFD.
> +                */
> +               if (metadata.pid == 0 ||
> +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> +                       pidfd = FAN_NOPIDFD;
> +               } else {
> +                       pidfd = pidfd_create(event->pid, 0);
> +                       if (pidfd < 0)
> +                               pidfd = FAN_EPIDFD;
> +               }
> +       }
> +

As a general rule, f_op->read callbacks aren't allowed to mess with
the file descriptor table of the calling process. A process should be
able to receive a file descriptor from an untrusted source and call
functions like read() on it without worrying about affecting its own
file descriptor table state with that.

I realize that existing fanotify code appears to be violating that
rule already, and that you're limiting creation of fanotify file
descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
think fanotify_read() probably ought to be an ioctl, or something
along those lines, instead of an f_op->read handler if it messes with
the caller's fd table?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  0:23   ` Jann Horn
@ 2021-07-27  4:19     ` Amir Goldstein
  2021-07-27  5:10       ` Matthew Bobrowski
  2021-07-29 13:39       ` Jan Kara
  2021-07-27 12:54     ` Matthew Bobrowski
  1 sibling, 2 replies; 41+ messages in thread
From: Amir Goldstein @ 2021-07-27  4:19 UTC (permalink / raw)
  To: Jann Horn
  Cc: Matthew Bobrowski, Jan Kara, Christian Brauner, linux-fsdevel,
	Linux API, Andy Lutomirski

On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > allows userspace applications to control whether a pidfd info record
> > containing a pidfd is to be returned with each event.
> >
> > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > struct fanotify_event_info_pidfd object will be supplied alongside the
> > generic struct fanotify_event_metadata within a single event. This
> > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > the event structure is supplied to the userspace application. Usage of
> > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > permitted, and in this case a struct fanotify_event_info_pidfd object
> > will follow any struct fanotify_event_info_fid object.
> >
> > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > limited to privileged processes only i.e. listeners that are running
> > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > these initialization flags with FAN_REPORT_PIDFD will result with
> > EINVAL being returned to the caller.
> >
> > In the event of a pidfd creation error, there are two types of error
> > values that can be reported back to the listener. There is
> > FAN_NOPIDFD, which will be reported in cases where the process
> > responsible for generating the event has terminated prior to fanotify
> > being able to create pidfd for event->pid via pidfd_create(). The
> > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > creation error occurred when calling pidfd_create().
> [...]
> > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> >         }
> >         metadata.fd = fd;
> >
> > +       if (pidfd_mode) {
> > +               /*
> > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > +                * support within fanotify, the pidfd API only supported the
> > +                * creation of pidfds for thread-group leaders.
> > +                */
> > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > +
> > +               /*
> > +                * The PIDTYPE_TGID check for an event->pid is performed
> > +                * preemptively in attempt to catch those rare instances where
> > +                * the process responsible for generating the event has
> > +                * terminated prior to calling into pidfd_create() and acquiring
> > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > +                * cases. All other pidfd creation errors are represented as
> > +                * FAN_EPIDFD.
> > +                */
> > +               if (metadata.pid == 0 ||
> > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > +                       pidfd = FAN_NOPIDFD;
> > +               } else {
> > +                       pidfd = pidfd_create(event->pid, 0);
> > +                       if (pidfd < 0)
> > +                               pidfd = FAN_EPIDFD;
> > +               }
> > +       }
> > +
>
> As a general rule, f_op->read callbacks aren't allowed to mess with
> the file descriptor table of the calling process. A process should be
> able to receive a file descriptor from an untrusted source and call
> functions like read() on it without worrying about affecting its own
> file descriptor table state with that.
>

Interesting. I've never considered this interface flaw.
Thanks for bringing this up!

> I realize that existing fanotify code appears to be violating that
> rule already, and that you're limiting creation of fanotify file
> descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> think fanotify_read() probably ought to be an ioctl, or something
> along those lines, instead of an f_op->read handler if it messes with
> the caller's fd table?

Naturally, we cannot change the legacy interface.
However, since fanotify has a modern FAN_REPORT_FID interface
which does not mess with fd table maybe this is an opportunity not
to repeat the same mistake for the FAN_REPORT_FID interface.

Matthew, can you explain what is the use case of the consumer
application of pidfd. I am guessing this is for an audit user case?
because if it were for permission events, event->pid would have been
sufficient.

If that is the case, then I presume that the application does not really
need to operate on the pidfd, it only need to avoid reporting wrong
process details after pid wraparound?

If that is the case, then maybe a model similar to inode generation
can be used to report a "pid generation" in addition to event->pid
and export pid generation in /proc/<pid>/status?

Or am I completely misunderstanding the use case?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  4:19     ` Amir Goldstein
@ 2021-07-27  5:10       ` Matthew Bobrowski
  2021-07-27  7:03         ` Amir Goldstein
  2021-07-29 13:39       ` Jan Kara
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-27  5:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jann Horn, Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Jul 27, 2021 at 07:19:43AM +0300, Amir Goldstein wrote:
> On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > allows userspace applications to control whether a pidfd info record
> > > containing a pidfd is to be returned with each event.
> > >
> > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > generic struct fanotify_event_metadata within a single event. This
> > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > the event structure is supplied to the userspace application. Usage of
> > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > will follow any struct fanotify_event_info_fid object.
> > >
> > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > limited to privileged processes only i.e. listeners that are running
> > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > EINVAL being returned to the caller.
> > >
> > > In the event of a pidfd creation error, there are two types of error
> > > values that can be reported back to the listener. There is
> > > FAN_NOPIDFD, which will be reported in cases where the process
> > > responsible for generating the event has terminated prior to fanotify
> > > being able to create pidfd for event->pid via pidfd_create(). The
> > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > creation error occurred when calling pidfd_create().
> > [...]
> > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > >         }
> > >         metadata.fd = fd;
> > >
> > > +       if (pidfd_mode) {
> > > +               /*
> > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > +                * support within fanotify, the pidfd API only supported the
> > > +                * creation of pidfds for thread-group leaders.
> > > +                */
> > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > +
> > > +               /*
> > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > +                * preemptively in attempt to catch those rare instances where
> > > +                * the process responsible for generating the event has
> > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > +                * cases. All other pidfd creation errors are represented as
> > > +                * FAN_EPIDFD.
> > > +                */
> > > +               if (metadata.pid == 0 ||
> > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > +                       pidfd = FAN_NOPIDFD;
> > > +               } else {
> > > +                       pidfd = pidfd_create(event->pid, 0);
> > > +                       if (pidfd < 0)
> > > +                               pidfd = FAN_EPIDFD;
> > > +               }
> > > +       }
> > > +
> >
> > As a general rule, f_op->read callbacks aren't allowed to mess with
> > the file descriptor table of the calling process. A process should be
> > able to receive a file descriptor from an untrusted source and call
> > functions like read() on it without worrying about affecting its own
> > file descriptor table state with that.
> >
> 
> Interesting. I've never considered this interface flaw.
> Thanks for bringing this up!
> 
> > I realize that existing fanotify code appears to be violating that
> > rule already, and that you're limiting creation of fanotify file
> > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > think fanotify_read() probably ought to be an ioctl, or something
> > along those lines, instead of an f_op->read handler if it messes with
> > the caller's fd table?
> 
> Naturally, we cannot change the legacy interface.
> However, since fanotify has a modern FAN_REPORT_FID interface
> which does not mess with fd table maybe this is an opportunity not
> to repeat the same mistake for the FAN_REPORT_FID interface.

You mean the FAN_REPORT_PIDFD interface, right?

> Matthew, can you explain what is the use case of the consumer
> application of pidfd. I am guessing this is for an audit user case?
> because if it were for permission events, event->pid would have been
> sufficient.

Yes, the primary use case would be for reliable auditing i.e. what actual
process had accessed what filesystem object of interest. Generally, finding
what process is a little unreliable at the moment given that the reporting
event->pid and crawling through /proc based on that has been observed to
lead to certain inaccuracy in the past i.e. reporting an access that was in
fact not performed by event->pid.

The permission model doesn't work in this case given that it takes the
"blocking" approach and not it's not something that can always be
afforded...

> If that is the case, then I presume that the application does not really
> need to operate on the pidfd, it only need to avoid reporting wrong
> process details after pid wraparound?

The idea is that the event listener, or receiver will use the
pidfd_send_signal(2) and specify event->info->pidfd as one of its arguments
in order to _reliably_ determine whether the process that generated the
event is still around. If so, it can freely ascertain further contextual
information from /proc reliably.

> If that is the case, then maybe a model similar to inode generation
> can be used to report a "pid generation" in addition to event->pid
> and export pid generation in /proc/<pid>/status?

TBH, I don't fully understand what you mean by this model...

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  5:10       ` Matthew Bobrowski
@ 2021-07-27  7:03         ` Amir Goldstein
  2021-07-27  8:22           ` Christian Brauner
  0 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-27  7:03 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jann Horn, Jan Kara, Christian Brauner, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Jul 27, 2021 at 8:10 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> On Tue, Jul 27, 2021 at 07:19:43AM +0300, Amir Goldstein wrote:
> > On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > > allows userspace applications to control whether a pidfd info record
> > > > containing a pidfd is to be returned with each event.
> > > >
> > > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > > generic struct fanotify_event_metadata within a single event. This
> > > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > > the event structure is supplied to the userspace application. Usage of
> > > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > > will follow any struct fanotify_event_info_fid object.
> > > >
> > > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > > limited to privileged processes only i.e. listeners that are running
> > > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > > EINVAL being returned to the caller.
> > > >
> > > > In the event of a pidfd creation error, there are two types of error
> > > > values that can be reported back to the listener. There is
> > > > FAN_NOPIDFD, which will be reported in cases where the process
> > > > responsible for generating the event has terminated prior to fanotify
> > > > being able to create pidfd for event->pid via pidfd_create(). The
> > > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > > creation error occurred when calling pidfd_create().
> > > [...]
> > > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > > >         }
> > > >         metadata.fd = fd;
> > > >
> > > > +       if (pidfd_mode) {
> > > > +               /*
> > > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > > +                * support within fanotify, the pidfd API only supported the
> > > > +                * creation of pidfds for thread-group leaders.
> > > > +                */
> > > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > > +
> > > > +               /*
> > > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > > +                * preemptively in attempt to catch those rare instances where
> > > > +                * the process responsible for generating the event has
> > > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > > +                * cases. All other pidfd creation errors are represented as
> > > > +                * FAN_EPIDFD.
> > > > +                */
> > > > +               if (metadata.pid == 0 ||
> > > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > > +                       pidfd = FAN_NOPIDFD;
> > > > +               } else {
> > > > +                       pidfd = pidfd_create(event->pid, 0);
> > > > +                       if (pidfd < 0)
> > > > +                               pidfd = FAN_EPIDFD;
> > > > +               }
> > > > +       }
> > > > +
> > >
> > > As a general rule, f_op->read callbacks aren't allowed to mess with
> > > the file descriptor table of the calling process. A process should be
> > > able to receive a file descriptor from an untrusted source and call
> > > functions like read() on it without worrying about affecting its own
> > > file descriptor table state with that.
> > >
> >
> > Interesting. I've never considered this interface flaw.
> > Thanks for bringing this up!
> >
> > > I realize that existing fanotify code appears to be violating that
> > > rule already, and that you're limiting creation of fanotify file
> > > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > > think fanotify_read() probably ought to be an ioctl, or something
> > > along those lines, instead of an f_op->read handler if it messes with
> > > the caller's fd table?
> >
> > Naturally, we cannot change the legacy interface.
> > However, since fanotify has a modern FAN_REPORT_FID interface
> > which does not mess with fd table maybe this is an opportunity not
> > to repeat the same mistake for the FAN_REPORT_FID interface.
>
> You mean the FAN_REPORT_PIDFD interface, right?

No, I mean FAN_REPORT_FID.
We have a new interface that does not pollute reader process fd table
with fds of event->fd, so maybe let's try to avoiding regressing this
use case by polluting the reader process fd table with pidfds.

>
> > Matthew, can you explain what is the use case of the consumer
> > application of pidfd. I am guessing this is for an audit user case?
> > because if it were for permission events, event->pid would have been
> > sufficient.
>
> Yes, the primary use case would be for reliable auditing i.e. what actual
> process had accessed what filesystem object of interest. Generally, finding
> what process is a little unreliable at the moment given that the reporting
> event->pid and crawling through /proc based on that has been observed to
> lead to certain inaccuracy in the past i.e. reporting an access that was in
> fact not performed by event->pid.
>
> The permission model doesn't work in this case given that it takes the
> "blocking" approach and not it's not something that can always be
> afforded...
>
> > If that is the case, then I presume that the application does not really
> > need to operate on the pidfd, it only need to avoid reporting wrong
> > process details after pid wraparound?
>
> The idea is that the event listener, or receiver will use the
> pidfd_send_signal(2) and specify event->info->pidfd as one of its arguments
> in order to _reliably_ determine whether the process that generated the
> event is still around. If so, it can freely ascertain further contextual
> information from /proc reliably.
>
> > If that is the case, then maybe a model similar to inode generation
> > can be used to report a "pid generation" in addition to event->pid
> > and export pid generation in /proc/<pid>/stat
>
> TBH, I don't fully understand what you mean by this model...
>

The model is this:

FAN_REPORT_UPID (or something) will report an info record
with a unique identifier of the generating process or thread, because
there is no restriction imposed by pidfd to support only group leaders.

That unique identifier may be obtained from /proc, e.g.:
$ cat /proc/self/upid
633733.0

In this case .0 represents generation 0.
If pid numbers would wrap around in that pid namespace
generation would be bumped and next process to get pid
633733 would have a unique id 633733.1.

There are probably more pid namespace considerations of how
that /proc API will be designed exactly.

IIUC the process generation needs to be stored in struct upid,
because generation is per pid namespace.

Essentially, that is the same concept embodied by FAN_REPORT_FID -
fanotify makes a record of the inode unique identifier and does not keep
any live references on the inode.

The event reader is able to perform a check to determine if the event
happened on the inode in question or not by comparing the FID reported
in the event with the FID that the listener obtains from the suscept target
inode.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper
  2021-07-21  6:35   ` Amir Goldstein
@ 2021-07-27  8:16     ` Amir Goldstein
  2021-07-27 12:57       ` Matthew Bobrowski
  0 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-27  8:16 UTC (permalink / raw)
  To: Matthew Bobrowski; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Wed, Jul 21, 2021 at 9:35 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jul 21, 2021 at 9:18 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > The copy_info_records_to_user() helper allows for the separation of
> > info record copying routines/conditionals from copy_event_to_user(),
> > which reduces the overall clutter within this function. This becomes
> > especially true as we start introducing additional info records in the
> > future i.e. struct fanotify_event_info_pidfd. On success, this helper
> > returns the total amount of bytes that have been copied into the user
> > supplied buffer and on error, a negative value is returned to the
> > caller.
> >
> > The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
> > record types that have been enabled for a specific notification
> > group. This macro becomes useful in the subsequent patch when the
> > FAN_REPORT_PIDFD initialization flag is introduced.
> >
> > Signed-off-by: Matthew Bobrowski <repnop@google.com>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> > ---
> >  fs/notify/fanotify/fanotify_user.c | 155 ++++++++++++++++-------------
> >  include/linux/fanotify.h           |   2 +
> >  2 files changed, 90 insertions(+), 67 deletions(-)
> >
> > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > index 182fea255376..d19f70b2c24c 100644
> > --- a/fs/notify/fanotify/fanotify_user.c
> > +++ b/fs/notify/fanotify/fanotify_user.c
> > @@ -173,7 +173,7 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> >         size_t event_size = FAN_EVENT_METADATA_LEN;
> >         struct fanotify_event *event = NULL;
> >         struct fsnotify_event *fsn_event;
> > -       unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> > +       unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
> >
> >         pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
> >
> > @@ -183,8 +183,8 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> >                 goto out;
> >
> >         event = FANOTIFY_E(fsn_event);
> > -       if (fid_mode)
> > -               event_size += fanotify_event_info_len(fid_mode, event);
> > +       if (info_mode)
> > +               event_size += fanotify_event_info_len(info_mode, event);
> >
> >         if (event_size > count) {
> >                 event = ERR_PTR(-EINVAL);
> > @@ -401,6 +401,86 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> >         return info_len;
> >  }
> >
> > +static int copy_info_records_to_user(struct fanotify_event *event,
> > +                                    struct fanotify_info *info,
> > +                                    unsigned int info_mode,
> > +                                    char __user *buf, size_t count)
> > +{
> > +       int ret, total_bytes = 0, info_type = 0;
> > +       unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
> > +
> > +       /*
> > +        * Event info records order is as follows: dir fid + name, child fid.
> > +        */
> > +       if (fanotify_event_dir_fh_len(event)) {
> > +               info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> > +                                            FAN_EVENT_INFO_TYPE_DFID;
> > +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> > +                                           fanotify_info_dir_fh(info),
> > +                                           info_type,
> > +                                           fanotify_info_name(info),
> > +                                           info->name_len, buf, count);
> > +               if (ret < 0)
> > +                       return ret;
> > +
> > +               buf += ret;
> > +               count -= ret;
> > +               total_bytes += ret;
> > +       }
> > +
> > +       if (fanotify_event_object_fh_len(event)) {
> > +               const char *dot = NULL;
> > +               int dot_len = 0;
> > +
> > +               if (fid_mode == FAN_REPORT_FID || info_type) {
> > +                       /*
> > +                        * With only group flag FAN_REPORT_FID only type FID is
> > +                        * reported. Second info record type is always FID.
> > +                        */
> > +                       info_type = FAN_EVENT_INFO_TYPE_FID;
> > +               } else if ((fid_mode & FAN_REPORT_NAME) &&
> > +                          (event->mask & FAN_ONDIR)) {
> > +                       /*
> > +                        * With group flag FAN_REPORT_NAME, if name was not
> > +                        * recorded in an event on a directory, report the name
> > +                        * "." with info type DFID_NAME.
> > +                        */
> > +                       info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
> > +                       dot = ".";
> > +                       dot_len = 1;
> > +               } else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
> > +                          (event->mask & FAN_ONDIR)) {
> > +                       /*
> > +                        * With group flag FAN_REPORT_DIR_FID, a single info
> > +                        * record has type DFID for directory entry
> > +                        * modificatio\ n event and for event on a directory.

Just notices this typo in the copied comment:
modificatio\ n

And for the next posting, please remove the mention of fanotify_user.c from
the commit title - it adds no information and clutters the git oneline log.
This concerns patch 3/5 as well. It does NOT concern the kernel/pid.c patches.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  7:03         ` Amir Goldstein
@ 2021-07-27  8:22           ` Christian Brauner
  2021-07-27  8:29             ` Christian Brauner
  0 siblings, 1 reply; 41+ messages in thread
From: Christian Brauner @ 2021-07-27  8:22 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Matthew Bobrowski, Jann Horn, Jan Kara, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Jul 27, 2021 at 10:03:20AM +0300, Amir Goldstein wrote:
> On Tue, Jul 27, 2021 at 8:10 AM Matthew Bobrowski <repnop@google.com> wrote:
> >
> > On Tue, Jul 27, 2021 at 07:19:43AM +0300, Amir Goldstein wrote:
> > > On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > > > allows userspace applications to control whether a pidfd info record
> > > > > containing a pidfd is to be returned with each event.
> > > > >
> > > > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > > > generic struct fanotify_event_metadata within a single event. This
> > > > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > > > the event structure is supplied to the userspace application. Usage of
> > > > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > > > will follow any struct fanotify_event_info_fid object.
> > > > >
> > > > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > > > limited to privileged processes only i.e. listeners that are running
> > > > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > > > EINVAL being returned to the caller.
> > > > >
> > > > > In the event of a pidfd creation error, there are two types of error
> > > > > values that can be reported back to the listener. There is
> > > > > FAN_NOPIDFD, which will be reported in cases where the process
> > > > > responsible for generating the event has terminated prior to fanotify
> > > > > being able to create pidfd for event->pid via pidfd_create(). The
> > > > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > > > creation error occurred when calling pidfd_create().
> > > > [...]
> > > > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > > > >         }
> > > > >         metadata.fd = fd;
> > > > >
> > > > > +       if (pidfd_mode) {
> > > > > +               /*
> > > > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > > > +                * support within fanotify, the pidfd API only supported the
> > > > > +                * creation of pidfds for thread-group leaders.
> > > > > +                */
> > > > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > > > +
> > > > > +               /*
> > > > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > > > +                * preemptively in attempt to catch those rare instances where
> > > > > +                * the process responsible for generating the event has
> > > > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > > > +                * cases. All other pidfd creation errors are represented as
> > > > > +                * FAN_EPIDFD.
> > > > > +                */
> > > > > +               if (metadata.pid == 0 ||
> > > > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > > > +                       pidfd = FAN_NOPIDFD;
> > > > > +               } else {
> > > > > +                       pidfd = pidfd_create(event->pid, 0);
> > > > > +                       if (pidfd < 0)
> > > > > +                               pidfd = FAN_EPIDFD;
> > > > > +               }
> > > > > +       }
> > > > > +
> > > >
> > > > As a general rule, f_op->read callbacks aren't allowed to mess with
> > > > the file descriptor table of the calling process. A process should be
> > > > able to receive a file descriptor from an untrusted source and call
> > > > functions like read() on it without worrying about affecting its own
> > > > file descriptor table state with that.
> > > >
> > >
> > > Interesting. I've never considered this interface flaw.
> > > Thanks for bringing this up!
> > >
> > > > I realize that existing fanotify code appears to be violating that
> > > > rule already, and that you're limiting creation of fanotify file
> > > > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > > > think fanotify_read() probably ought to be an ioctl, or something
> > > > along those lines, instead of an f_op->read handler if it messes with
> > > > the caller's fd table?
> > >
> > > Naturally, we cannot change the legacy interface.
> > > However, since fanotify has a modern FAN_REPORT_FID interface
> > > which does not mess with fd table maybe this is an opportunity not
> > > to repeat the same mistake for the FAN_REPORT_FID interface.
> >
> > You mean the FAN_REPORT_PIDFD interface, right?
> 
> No, I mean FAN_REPORT_FID.
> We have a new interface that does not pollute reader process fd table
> with fds of event->fd, so maybe let's try to avoiding regressing this
> use case by polluting the reader process fd table with pidfds.
> 
> >
> > > Matthew, can you explain what is the use case of the consumer
> > > application of pidfd. I am guessing this is for an audit user case?
> > > because if it were for permission events, event->pid would have been
> > > sufficient.
> >
> > Yes, the primary use case would be for reliable auditing i.e. what actual
> > process had accessed what filesystem object of interest. Generally, finding
> > what process is a little unreliable at the moment given that the reporting
> > event->pid and crawling through /proc based on that has been observed to
> > lead to certain inaccuracy in the past i.e. reporting an access that was in
> > fact not performed by event->pid.
> >
> > The permission model doesn't work in this case given that it takes the
> > "blocking" approach and not it's not something that can always be
> > afforded...
> >
> > > If that is the case, then I presume that the application does not really
> > > need to operate on the pidfd, it only need to avoid reporting wrong
> > > process details after pid wraparound?
> >
> > The idea is that the event listener, or receiver will use the
> > pidfd_send_signal(2) and specify event->info->pidfd as one of its arguments
> > in order to _reliably_ determine whether the process that generated the
> > event is still around. If so, it can freely ascertain further contextual
> > information from /proc reliably.
> >
> > > If that is the case, then maybe a model similar to inode generation
> > > can be used to report a "pid generation" in addition to event->pid
> > > and export pid generation in /proc/<pid>/stat
> >
> > TBH, I don't fully understand what you mean by this model...
> >
> 
> The model is this:
> 
> FAN_REPORT_UPID (or something) will report an info record
> with a unique identifier of the generating process or thread, because
> there is no restriction imposed by pidfd to support only group leaders.
> 
> That unique identifier may be obtained from /proc, e.g.:
> $ cat /proc/self/upid
> 633733.0
> 
> In this case .0 represents generation 0.
> If pid numbers would wrap around in that pid namespace
> generation would be bumped and next process to get pid
> 633733 would have a unique id 633733.1.
> 
> There are probably more pid namespace considerations of how
> that /proc API will be designed exactly.

I'm not a fan of this at all to be honest. This very much reminds me of
(a weak version of) pid uuids which has been very controversial. This
sounds all kinds of messy. If the pid gets recycled then you bump the
generation number in all pid namespace where that pid has been recycled
and not in the others and then you expose it through /proc. Then if a
process from one pid namespaces looks at a process from another pid
namespace through the proc file what would it see as the generation
number? That can probably all be solved but the API sounds justy very
unpleasant and hacky.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  8:22           ` Christian Brauner
@ 2021-07-27  8:29             ` Christian Brauner
  0 siblings, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2021-07-27  8:29 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Matthew Bobrowski, Jann Horn, Jan Kara, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Jul 27, 2021 at 10:22:30AM +0200, Christian Brauner wrote:
> On Tue, Jul 27, 2021 at 10:03:20AM +0300, Amir Goldstein wrote:
> > On Tue, Jul 27, 2021 at 8:10 AM Matthew Bobrowski <repnop@google.com> wrote:
> > >
> > > On Tue, Jul 27, 2021 at 07:19:43AM +0300, Amir Goldstein wrote:
> > > > On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> > > > >
> > > > > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > > > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > > > > allows userspace applications to control whether a pidfd info record
> > > > > > containing a pidfd is to be returned with each event.
> > > > > >
> > > > > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > > > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > > > > generic struct fanotify_event_metadata within a single event. This
> > > > > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > > > > the event structure is supplied to the userspace application. Usage of
> > > > > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > > > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > > > > will follow any struct fanotify_event_info_fid object.
> > > > > >
> > > > > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > > > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > > > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > > > > limited to privileged processes only i.e. listeners that are running
> > > > > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > > > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > > > > EINVAL being returned to the caller.
> > > > > >
> > > > > > In the event of a pidfd creation error, there are two types of error
> > > > > > values that can be reported back to the listener. There is
> > > > > > FAN_NOPIDFD, which will be reported in cases where the process
> > > > > > responsible for generating the event has terminated prior to fanotify
> > > > > > being able to create pidfd for event->pid via pidfd_create(). The
> > > > > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > > > > creation error occurred when calling pidfd_create().
> > > > > [...]
> > > > > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > > > > >         }
> > > > > >         metadata.fd = fd;
> > > > > >
> > > > > > +       if (pidfd_mode) {
> > > > > > +               /*
> > > > > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > > > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > > > > +                * support within fanotify, the pidfd API only supported the
> > > > > > +                * creation of pidfds for thread-group leaders.
> > > > > > +                */
> > > > > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > > > > +
> > > > > > +               /*
> > > > > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > > > > +                * preemptively in attempt to catch those rare instances where
> > > > > > +                * the process responsible for generating the event has
> > > > > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > > > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > > > > +                * cases. All other pidfd creation errors are represented as
> > > > > > +                * FAN_EPIDFD.
> > > > > > +                */
> > > > > > +               if (metadata.pid == 0 ||
> > > > > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > > > > +                       pidfd = FAN_NOPIDFD;
> > > > > > +               } else {
> > > > > > +                       pidfd = pidfd_create(event->pid, 0);
> > > > > > +                       if (pidfd < 0)
> > > > > > +                               pidfd = FAN_EPIDFD;
> > > > > > +               }
> > > > > > +       }
> > > > > > +
> > > > >
> > > > > As a general rule, f_op->read callbacks aren't allowed to mess with
> > > > > the file descriptor table of the calling process. A process should be
> > > > > able to receive a file descriptor from an untrusted source and call
> > > > > functions like read() on it without worrying about affecting its own
> > > > > file descriptor table state with that.
> > > > >
> > > >
> > > > Interesting. I've never considered this interface flaw.
> > > > Thanks for bringing this up!
> > > >
> > > > > I realize that existing fanotify code appears to be violating that
> > > > > rule already, and that you're limiting creation of fanotify file
> > > > > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > > > > think fanotify_read() probably ought to be an ioctl, or something
> > > > > along those lines, instead of an f_op->read handler if it messes with
> > > > > the caller's fd table?
> > > >
> > > > Naturally, we cannot change the legacy interface.
> > > > However, since fanotify has a modern FAN_REPORT_FID interface
> > > > which does not mess with fd table maybe this is an opportunity not
> > > > to repeat the same mistake for the FAN_REPORT_FID interface.
> > >
> > > You mean the FAN_REPORT_PIDFD interface, right?
> > 
> > No, I mean FAN_REPORT_FID.
> > We have a new interface that does not pollute reader process fd table
> > with fds of event->fd, so maybe let's try to avoiding regressing this
> > use case by polluting the reader process fd table with pidfds.
> > 
> > >
> > > > Matthew, can you explain what is the use case of the consumer
> > > > application of pidfd. I am guessing this is for an audit user case?
> > > > because if it were for permission events, event->pid would have been
> > > > sufficient.
> > >
> > > Yes, the primary use case would be for reliable auditing i.e. what actual
> > > process had accessed what filesystem object of interest. Generally, finding
> > > what process is a little unreliable at the moment given that the reporting
> > > event->pid and crawling through /proc based on that has been observed to
> > > lead to certain inaccuracy in the past i.e. reporting an access that was in
> > > fact not performed by event->pid.
> > >
> > > The permission model doesn't work in this case given that it takes the
> > > "blocking" approach and not it's not something that can always be
> > > afforded...
> > >
> > > > If that is the case, then I presume that the application does not really
> > > > need to operate on the pidfd, it only need to avoid reporting wrong
> > > > process details after pid wraparound?
> > >
> > > The idea is that the event listener, or receiver will use the
> > > pidfd_send_signal(2) and specify event->info->pidfd as one of its arguments
> > > in order to _reliably_ determine whether the process that generated the
> > > event is still around. If so, it can freely ascertain further contextual
> > > information from /proc reliably.
> > >
> > > > If that is the case, then maybe a model similar to inode generation
> > > > can be used to report a "pid generation" in addition to event->pid
> > > > and export pid generation in /proc/<pid>/stat
> > >
> > > TBH, I don't fully understand what you mean by this model...
> > >
> > 
> > The model is this:
> > 
> > FAN_REPORT_UPID (or something) will report an info record
> > with a unique identifier of the generating process or thread, because
> > there is no restriction imposed by pidfd to support only group leaders.
> > 
> > That unique identifier may be obtained from /proc, e.g.:
> > $ cat /proc/self/upid
> > 633733.0
> > 
> > In this case .0 represents generation 0.
> > If pid numbers would wrap around in that pid namespace
> > generation would be bumped and next process to get pid
> > 633733 would have a unique id 633733.1.
> > 
> > There are probably more pid namespace considerations of how
> > that /proc API will be designed exactly.
> 
> I'm not a fan of this at all to be honest. This very much reminds me of
> (a weak version of) pid uuids which has been very controversial. This
> sounds all kinds of messy. If the pid gets recycled then you bump the
> generation number in all pid namespace where that pid has been recycled
> and not in the others and then you expose it through /proc. Then if a
> process from one pid namespaces looks at a process from another pid
> namespace through the proc file what would it see as the generation
> number? That can probably all be solved but the API sounds justy very
> unpleasant and hacky.

Instead of making this a kernel-wide infrastructure problem you could
consider adding a new ioctl().

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  0:23   ` Jann Horn
  2021-07-27  4:19     ` Amir Goldstein
@ 2021-07-27 12:54     ` Matthew Bobrowski
  2021-07-29 22:48       ` Matthew Bobrowski
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-27 12:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: jack, amir73il, christian.brauner, linux-fsdevel, linux-api,
	Andy Lutomirski

Hey Jann,

On Tue, Jul 27, 2021 at 02:23:38AM +0200, Jann Horn wrote:
> On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > allows userspace applications to control whether a pidfd info record
> > containing a pidfd is to be returned with each event.
> >
> > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > struct fanotify_event_info_pidfd object will be supplied alongside the
> > generic struct fanotify_event_metadata within a single event. This
> > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > the event structure is supplied to the userspace application. Usage of
> > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > permitted, and in this case a struct fanotify_event_info_pidfd object
> > will follow any struct fanotify_event_info_fid object.
> >
> > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > limited to privileged processes only i.e. listeners that are running
> > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > these initialization flags with FAN_REPORT_PIDFD will result with
> > EINVAL being returned to the caller.
> >
> > In the event of a pidfd creation error, there are two types of error
> > values that can be reported back to the listener. There is
> > FAN_NOPIDFD, which will be reported in cases where the process
> > responsible for generating the event has terminated prior to fanotify
> > being able to create pidfd for event->pid via pidfd_create(). The
> > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > creation error occurred when calling pidfd_create().
> [...]
> > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> >         }
> >         metadata.fd = fd;
> >
> > +       if (pidfd_mode) {
> > +               /*
> > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > +                * support within fanotify, the pidfd API only supported the
> > +                * creation of pidfds for thread-group leaders.
> > +                */
> > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > +
> > +               /*
> > +                * The PIDTYPE_TGID check for an event->pid is performed
> > +                * preemptively in attempt to catch those rare instances where
> > +                * the process responsible for generating the event has
> > +                * terminated prior to calling into pidfd_create() and acquiring
> > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > +                * cases. All other pidfd creation errors are represented as
> > +                * FAN_EPIDFD.
> > +                */
> > +               if (metadata.pid == 0 ||
> > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > +                       pidfd = FAN_NOPIDFD;
> > +               } else {
> > +                       pidfd = pidfd_create(event->pid, 0);
> > +                       if (pidfd < 0)
> > +                               pidfd = FAN_EPIDFD;
> > +               }
> > +       }
> > +
> 
> As a general rule, f_op->read callbacks aren't allowed to mess with
> the file descriptor table of the calling process. A process should be
> able to receive a file descriptor from an untrusted source and call
> functions like read() on it without worrying about affecting its own
> file descriptor table state with that.

Interesting, thanks for bringing this up. I never knew about this general
rule. Do you mind elaborating a little on why f_op->read() callbacks aren't
allowed to mess with the fdtable of the calling process? I don't quite
exactly understand why this is considered to be suboptimal.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper
  2021-07-27  8:16     ` Amir Goldstein
@ 2021-07-27 12:57       ` Matthew Bobrowski
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-27 12:57 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Tue, Jul 27, 2021 at 11:16:33AM +0300, Amir Goldstein wrote:
> On Wed, Jul 21, 2021 at 9:35 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Wed, Jul 21, 2021 at 9:18 AM Matthew Bobrowski <repnop@google.com> wrote:
> > >
> > > The copy_info_records_to_user() helper allows for the separation of
> > > info record copying routines/conditionals from copy_event_to_user(),
> > > which reduces the overall clutter within this function. This becomes
> > > especially true as we start introducing additional info records in the
> > > future i.e. struct fanotify_event_info_pidfd. On success, this helper
> > > returns the total amount of bytes that have been copied into the user
> > > supplied buffer and on error, a negative value is returned to the
> > > caller.
> > >
> > > The newly defined macro FANOTIFY_INFO_MODES can be used to obtain info
> > > record types that have been enabled for a specific notification
> > > group. This macro becomes useful in the subsequent patch when the
> > > FAN_REPORT_PIDFD initialization flag is introduced.
> > >
> > > Signed-off-by: Matthew Bobrowski <repnop@google.com>
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> >
> > > ---
> > >  fs/notify/fanotify/fanotify_user.c | 155 ++++++++++++++++-------------
> > >  include/linux/fanotify.h           |   2 +
> > >  2 files changed, 90 insertions(+), 67 deletions(-)
> > >
> > > diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
> > > index 182fea255376..d19f70b2c24c 100644
> > > --- a/fs/notify/fanotify/fanotify_user.c
> > > +++ b/fs/notify/fanotify/fanotify_user.c
> > > @@ -173,7 +173,7 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> > >         size_t event_size = FAN_EVENT_METADATA_LEN;
> > >         struct fanotify_event *event = NULL;
> > >         struct fsnotify_event *fsn_event;
> > > -       unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> > > +       unsigned int info_mode = FAN_GROUP_FLAG(group, FANOTIFY_INFO_MODES);
> > >
> > >         pr_debug("%s: group=%p count=%zd\n", __func__, group, count);
> > >
> > > @@ -183,8 +183,8 @@ static struct fanotify_event *get_one_event(struct fsnotify_group *group,
> > >                 goto out;
> > >
> > >         event = FANOTIFY_E(fsn_event);
> > > -       if (fid_mode)
> > > -               event_size += fanotify_event_info_len(fid_mode, event);
> > > +       if (info_mode)
> > > +               event_size += fanotify_event_info_len(info_mode, event);
> > >
> > >         if (event_size > count) {
> > >                 event = ERR_PTR(-EINVAL);
> > > @@ -401,6 +401,86 @@ static int copy_fid_info_to_user(__kernel_fsid_t *fsid, struct fanotify_fh *fh,
> > >         return info_len;
> > >  }
> > >
> > > +static int copy_info_records_to_user(struct fanotify_event *event,
> > > +                                    struct fanotify_info *info,
> > > +                                    unsigned int info_mode,
> > > +                                    char __user *buf, size_t count)
> > > +{
> > > +       int ret, total_bytes = 0, info_type = 0;
> > > +       unsigned int fid_mode = info_mode & FANOTIFY_FID_BITS;
> > > +
> > > +       /*
> > > +        * Event info records order is as follows: dir fid + name, child fid.
> > > +        */
> > > +       if (fanotify_event_dir_fh_len(event)) {
> > > +               info_type = info->name_len ? FAN_EVENT_INFO_TYPE_DFID_NAME :
> > > +                                            FAN_EVENT_INFO_TYPE_DFID;
> > > +               ret = copy_fid_info_to_user(fanotify_event_fsid(event),
> > > +                                           fanotify_info_dir_fh(info),
> > > +                                           info_type,
> > > +                                           fanotify_info_name(info),
> > > +                                           info->name_len, buf, count);
> > > +               if (ret < 0)
> > > +                       return ret;
> > > +
> > > +               buf += ret;
> > > +               count -= ret;
> > > +               total_bytes += ret;
> > > +       }
> > > +
> > > +       if (fanotify_event_object_fh_len(event)) {
> > > +               const char *dot = NULL;
> > > +               int dot_len = 0;
> > > +
> > > +               if (fid_mode == FAN_REPORT_FID || info_type) {
> > > +                       /*
> > > +                        * With only group flag FAN_REPORT_FID only type FID is
> > > +                        * reported. Second info record type is always FID.
> > > +                        */
> > > +                       info_type = FAN_EVENT_INFO_TYPE_FID;
> > > +               } else if ((fid_mode & FAN_REPORT_NAME) &&
> > > +                          (event->mask & FAN_ONDIR)) {
> > > +                       /*
> > > +                        * With group flag FAN_REPORT_NAME, if name was not
> > > +                        * recorded in an event on a directory, report the name
> > > +                        * "." with info type DFID_NAME.
> > > +                        */
> > > +                       info_type = FAN_EVENT_INFO_TYPE_DFID_NAME;
> > > +                       dot = ".";
> > > +                       dot_len = 1;
> > > +               } else if ((event->mask & ALL_FSNOTIFY_DIRENT_EVENTS) ||
> > > +                          (event->mask & FAN_ONDIR)) {
> > > +                       /*
> > > +                        * With group flag FAN_REPORT_DIR_FID, a single info
> > > +                        * record has type DFID for directory entry
> > > +                        * modificatio\ n event and for event on a directory.
> 
> Just notices this typo in the copied comment:
> modificatio\ n
> 
> And for the next posting, please remove the mention of fanotify_user.c from
> the commit title - it adds no information and clutters the git oneline log.
> This concerns patch 3/5 as well. It does NOT concern the kernel/pid.c patches.

Noted.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27  4:19     ` Amir Goldstein
  2021-07-27  5:10       ` Matthew Bobrowski
@ 2021-07-29 13:39       ` Jan Kara
  2021-07-29 15:13         ` Amir Goldstein
  1 sibling, 1 reply; 41+ messages in thread
From: Jan Kara @ 2021-07-29 13:39 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jann Horn, Matthew Bobrowski, Jan Kara, Christian Brauner,
	linux-fsdevel, Linux API, Andy Lutomirski

On Tue 27-07-21 07:19:43, Amir Goldstein wrote:
> On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> >
> > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > allows userspace applications to control whether a pidfd info record
> > > containing a pidfd is to be returned with each event.
> > >
> > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > generic struct fanotify_event_metadata within a single event. This
> > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > the event structure is supplied to the userspace application. Usage of
> > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > will follow any struct fanotify_event_info_fid object.
> > >
> > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > limited to privileged processes only i.e. listeners that are running
> > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > EINVAL being returned to the caller.
> > >
> > > In the event of a pidfd creation error, there are two types of error
> > > values that can be reported back to the listener. There is
> > > FAN_NOPIDFD, which will be reported in cases where the process
> > > responsible for generating the event has terminated prior to fanotify
> > > being able to create pidfd for event->pid via pidfd_create(). The
> > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > creation error occurred when calling pidfd_create().
> > [...]
> > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > >         }
> > >         metadata.fd = fd;
> > >
> > > +       if (pidfd_mode) {
> > > +               /*
> > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > +                * support within fanotify, the pidfd API only supported the
> > > +                * creation of pidfds for thread-group leaders.
> > > +                */
> > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > +
> > > +               /*
> > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > +                * preemptively in attempt to catch those rare instances where
> > > +                * the process responsible for generating the event has
> > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > +                * cases. All other pidfd creation errors are represented as
> > > +                * FAN_EPIDFD.
> > > +                */
> > > +               if (metadata.pid == 0 ||
> > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > +                       pidfd = FAN_NOPIDFD;
> > > +               } else {
> > > +                       pidfd = pidfd_create(event->pid, 0);
> > > +                       if (pidfd < 0)
> > > +                               pidfd = FAN_EPIDFD;
> > > +               }
> > > +       }
> > > +
> >
> > As a general rule, f_op->read callbacks aren't allowed to mess with
> > the file descriptor table of the calling process. A process should be
> > able to receive a file descriptor from an untrusted source and call
> > functions like read() on it without worrying about affecting its own
> > file descriptor table state with that.
> >
> 
> Interesting. I've never considered this interface flaw.
> Thanks for bringing this up!

Me neither. But I guess it's one more reason why any fd-generating variant
of fanotify should stay priviledged.

> > I realize that existing fanotify code appears to be violating that
> > rule already, and that you're limiting creation of fanotify file
> > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > think fanotify_read() probably ought to be an ioctl, or something
> > along those lines, instead of an f_op->read handler if it messes with
> > the caller's fd table?
> 
> Naturally, we cannot change the legacy interface.
> However, since fanotify has a modern FAN_REPORT_FID interface
> which does not mess with fd table maybe this is an opportunity not
> to repeat the same mistake for the FAN_REPORT_FID interface.
> 
> Matthew, can you explain what is the use case of the consumer
> application of pidfd. I am guessing this is for an audit user case?
> because if it were for permission events, event->pid would have been
> sufficient.
> 
> If that is the case, then I presume that the application does not really
> need to operate on the pidfd, it only need to avoid reporting wrong
> process details after pid wraparound?
> 
> If that is the case, then maybe a model similar to inode generation
> can be used to report a "pid generation" in addition to event->pid
> and export pid generation in /proc/<pid>/status?
> 
> Or am I completely misunderstanding the use case?

Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
same process while you read various data from it. And you cannot achieve
that with pid+generation thing you've suggested. Plus the additional
concept and its complexity is non-trivial So I tend to agree with
Christian that we really want to return pidfd.

Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
whether it is worth the trouble to come up with some other mechanism how to
return pidfd with the event. We could return some cookie which could be
then (by some ioctl or so) either transformed into real pidfd or released
(so that we can release pid handle in the kernel) but it looks ugly and
complicates things for everybody without bringing significant security
improvement (we already can pass fd with the event). So I'm pondering
whether there's some other way how we could make the interface safer - e.g.
so that the process receiving the event (not the one creating the group)
would also need to opt in for getting fds created in its file table.

But so far nothing bright has come to my mind. :-|

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 0/5] Add pidfd support to the fanotify API
  2021-07-27  0:16     ` Matthew Bobrowski
@ 2021-07-29 13:40       ` Jan Kara
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2021-07-29 13:40 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Amir Goldstein, Jan Kara, Christian Brauner, linux-fsdevel, Linux API

On Tue 27-07-21 10:16:40, Matthew Bobrowski wrote:
> On Tue, Jul 27, 2021 at 09:07:28AM +1000, Matthew Bobrowski wrote:
> > On Wed, Jul 21, 2021 at 10:06:56AM +0300, Amir Goldstein wrote:
> > > On Wed, Jul 21, 2021 at 9:17 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > >
> > > > Hey Jan/Amir/Christian,
> > > >
> > > > This is an updated version of the FAN_REPORT_PIDFD series which contains
> > > > the addressed nits from the previous review [0]. As per request, you can
> > > > also find the draft LTP tests here [1] and man-pages update for this new
> > > > API change here [2].
> > > >
> > > > [0] https://lore.kernel.org/linux-fsdevel/cover.1623282854.git.repnop@google.com/
> > > > [1] https://github.com/matthewbobrowski/ltp/commits/fanotify_pidfd_v2
> > > > [2] https://github.com/matthewbobrowski/man-pages/commits/fanotify_pidfd_v1
> > > 
> > > FWIW, those test and man page drafts look good to me :)
> > 
> > Fantastic, thanks for the review!
> > 
> > I will adjust the minor comments/documentation on patch 5/5 and send
> > through an updated series.
> 
> Alright, so I've fixed up the git commit message and comment in the source
> code so that it's more accurate in terms of when/why FAN_NOPIDFD is
> reported.
> 
> I'm going to hold with sending through v4 until I have Jan also look peek
> and poke at v3 as I want to avoid doing any unnecessary round trips.

I've read through the series and I don't have any more comments besides the
objection raised by Jann. I'm still considering that one...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-29 13:39       ` Jan Kara
@ 2021-07-29 15:13         ` Amir Goldstein
  2021-07-30  5:03           ` Amir Goldstein
  0 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-29 15:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Matthew Bobrowski, Christian Brauner, linux-fsdevel,
	Linux API, Andy Lutomirski

On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 27-07-21 07:19:43, Amir Goldstein wrote:
> > On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> > >
> > > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > > allows userspace applications to control whether a pidfd info record
> > > > containing a pidfd is to be returned with each event.
> > > >
> > > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > > generic struct fanotify_event_metadata within a single event. This
> > > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > > the event structure is supplied to the userspace application. Usage of
> > > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > > will follow any struct fanotify_event_info_fid object.
> > > >
> > > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > > limited to privileged processes only i.e. listeners that are running
> > > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > > EINVAL being returned to the caller.
> > > >
> > > > In the event of a pidfd creation error, there are two types of error
> > > > values that can be reported back to the listener. There is
> > > > FAN_NOPIDFD, which will be reported in cases where the process
> > > > responsible for generating the event has terminated prior to fanotify
> > > > being able to create pidfd for event->pid via pidfd_create(). The
> > > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > > creation error occurred when calling pidfd_create().
> > > [...]
> > > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > > >         }
> > > >         metadata.fd = fd;
> > > >
> > > > +       if (pidfd_mode) {
> > > > +               /*
> > > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > > +                * support within fanotify, the pidfd API only supported the
> > > > +                * creation of pidfds for thread-group leaders.
> > > > +                */
> > > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > > +
> > > > +               /*
> > > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > > +                * preemptively in attempt to catch those rare instances where
> > > > +                * the process responsible for generating the event has
> > > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > > +                * cases. All other pidfd creation errors are represented as
> > > > +                * FAN_EPIDFD.
> > > > +                */
> > > > +               if (metadata.pid == 0 ||
> > > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > > +                       pidfd = FAN_NOPIDFD;
> > > > +               } else {
> > > > +                       pidfd = pidfd_create(event->pid, 0);
> > > > +                       if (pidfd < 0)
> > > > +                               pidfd = FAN_EPIDFD;
> > > > +               }
> > > > +       }
> > > > +
> > >
> > > As a general rule, f_op->read callbacks aren't allowed to mess with
> > > the file descriptor table of the calling process. A process should be
> > > able to receive a file descriptor from an untrusted source and call
> > > functions like read() on it without worrying about affecting its own
> > > file descriptor table state with that.
> > >
> >
> > Interesting. I've never considered this interface flaw.
> > Thanks for bringing this up!
>
> Me neither. But I guess it's one more reason why any fd-generating variant
> of fanotify should stay priviledged.
>
> > > I realize that existing fanotify code appears to be violating that
> > > rule already, and that you're limiting creation of fanotify file
> > > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > > think fanotify_read() probably ought to be an ioctl, or something
> > > along those lines, instead of an f_op->read handler if it messes with
> > > the caller's fd table?
> >
> > Naturally, we cannot change the legacy interface.
> > However, since fanotify has a modern FAN_REPORT_FID interface
> > which does not mess with fd table maybe this is an opportunity not
> > to repeat the same mistake for the FAN_REPORT_FID interface.
> >
> > Matthew, can you explain what is the use case of the consumer
> > application of pidfd. I am guessing this is for an audit user case?
> > because if it were for permission events, event->pid would have been
> > sufficient.
> >
> > If that is the case, then I presume that the application does not really
> > need to operate on the pidfd, it only need to avoid reporting wrong
> > process details after pid wraparound?
> >
> > If that is the case, then maybe a model similar to inode generation
> > can be used to report a "pid generation" in addition to event->pid
> > and export pid generation in /proc/<pid>/status?
> >
> > Or am I completely misunderstanding the use case?
>
> Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> same process while you read various data from it. And you cannot achieve
> that with pid+generation thing you've suggested. Plus the additional
> concept and its complexity is non-trivial So I tend to agree with
> Christian that we really want to return pidfd.
>
> Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> whether it is worth the trouble to come up with some other mechanism how to
> return pidfd with the event. We could return some cookie which could be
> then (by some ioctl or so) either transformed into real pidfd or released
> (so that we can release pid handle in the kernel) but it looks ugly and
> complicates things for everybody without bringing significant security
> improvement (we already can pass fd with the event). So I'm pondering
> whether there's some other way how we could make the interface safer - e.g.
> so that the process receiving the event (not the one creating the group)
> would also need to opt in for getting fds created in its file table.
>
> But so far nothing bright has come to my mind. :-|
>

There is a way, it is not bright, but it is pretty simple -
store an optional pid in group->fanotify_data.fd_reader.

With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
will be disabled to any process other than fd_reader.
Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
if fd_reaader is set to a process other than the reader.

A process can call ioctl START_FD_READER to set fd_reader to itself.
With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
process has CAP_SYS_ADMIN, read() sets fd_reader to itself.

Permission wise, START_FD_READER is allowed with
CAP_SYS_ADMIN or if fd_reader is not owned by another process.
We may consider YIELD_FD_READER ioctl if needed.

I think that this is a pretty cheap price for implementation
and maybe acceptable overhead for complicating the API?
Note that without passing fd, there is no need for any ioctl.

An added security benefit is that the ioctl adds is a way for the
caller of fanotify_init() to make sure that even if the fanotify_fd is
leaked, that event->fd will not be leaked, regardless of flag
FAN_REPORT_PIDFD.

So the START_FD_READER ioctl feature could be implemented
and documented first.
And then FAN_REPORT_PIDFD could use the feature with a
very minor API difference:
- Without the flag, other processes can read fds by default and
  group initiator can opt-out
- With the flag, other processes cannot read fds by default and
  need to opt-in

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-27 12:54     ` Matthew Bobrowski
@ 2021-07-29 22:48       ` Matthew Bobrowski
  0 siblings, 0 replies; 41+ messages in thread
From: Matthew Bobrowski @ 2021-07-29 22:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: jack, amir73il, christian.brauner, linux-fsdevel, linux-api,
	Andy Lutomirski

On Tue, Jul 27, 2021 at 10:54:18PM +1000, Matthew Bobrowski wrote:
> Hey Jann,
> 
> On Tue, Jul 27, 2021 at 02:23:38AM +0200, Jann Horn wrote:
> > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > allows userspace applications to control whether a pidfd info record
> > > containing a pidfd is to be returned with each event.
> > >
> > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > generic struct fanotify_event_metadata within a single event. This
> > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > the event structure is supplied to the userspace application. Usage of
> > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > will follow any struct fanotify_event_info_fid object.
> > >
> > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > limited to privileged processes only i.e. listeners that are running
> > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > EINVAL being returned to the caller.
> > >
> > > In the event of a pidfd creation error, there are two types of error
> > > values that can be reported back to the listener. There is
> > > FAN_NOPIDFD, which will be reported in cases where the process
> > > responsible for generating the event has terminated prior to fanotify
> > > being able to create pidfd for event->pid via pidfd_create(). The
> > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > creation error occurred when calling pidfd_create().
> > [...]
> > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > >         }
> > >         metadata.fd = fd;
> > >
> > > +       if (pidfd_mode) {
> > > +               /*
> > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > +                * support within fanotify, the pidfd API only supported the
> > > +                * creation of pidfds for thread-group leaders.
> > > +                */
> > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > +
> > > +               /*
> > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > +                * preemptively in attempt to catch those rare instances where
> > > +                * the process responsible for generating the event has
> > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > +                * cases. All other pidfd creation errors are represented as
> > > +                * FAN_EPIDFD.
> > > +                */
> > > +               if (metadata.pid == 0 ||
> > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > +                       pidfd = FAN_NOPIDFD;
> > > +               } else {
> > > +                       pidfd = pidfd_create(event->pid, 0);
> > > +                       if (pidfd < 0)
> > > +                               pidfd = FAN_EPIDFD;
> > > +               }
> > > +       }
> > > +
> > 
> > As a general rule, f_op->read callbacks aren't allowed to mess with
> > the file descriptor table of the calling process. A process should be
> > able to receive a file descriptor from an untrusted source and call
> > functions like read() on it without worrying about affecting its own
> > file descriptor table state with that.
> 
> Interesting, thanks for bringing this up. I never knew about this general
> rule. Do you mind elaborating a little on why f_op->read() callbacks aren't
> allowed to mess with the fdtable of the calling process? I don't quite
> exactly understand why this is considered to be suboptimal.

Nevermind. I done a little extra thinking about this and I can see exactly why
this could be problematic.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-29 15:13         ` Amir Goldstein
@ 2021-07-30  5:03           ` Amir Goldstein
  2021-08-02 12:34             ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-07-30  5:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Matthew Bobrowski, Christian Brauner, linux-fsdevel,
	Linux API, Andy Lutomirski

On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 27-07-21 07:19:43, Amir Goldstein wrote:
> > > On Tue, Jul 27, 2021 at 3:24 AM Jann Horn <jannh@google.com> wrote:
> > > >
> > > > On Wed, Jul 21, 2021 at 8:21 AM Matthew Bobrowski <repnop@google.com> wrote:
> > > > > Introduce a new flag FAN_REPORT_PIDFD for fanotify_init(2) which
> > > > > allows userspace applications to control whether a pidfd info record
> > > > > containing a pidfd is to be returned with each event.
> > > > >
> > > > > If FAN_REPORT_PIDFD is enabled for a notification group, an additional
> > > > > struct fanotify_event_info_pidfd object will be supplied alongside the
> > > > > generic struct fanotify_event_metadata within a single event. This
> > > > > functionality is analogous to that of FAN_REPORT_FID in terms of how
> > > > > the event structure is supplied to the userspace application. Usage of
> > > > > FAN_REPORT_PIDFD with FAN_REPORT_FID/FAN_REPORT_DFID_NAME is
> > > > > permitted, and in this case a struct fanotify_event_info_pidfd object
> > > > > will follow any struct fanotify_event_info_fid object.
> > > > >
> > > > > Currently, the usage of FAN_REPORT_TID is not permitted along with
> > > > > FAN_REPORT_PIDFD as the pidfd API only supports the creation of pidfds
> > > > > for thread-group leaders. Additionally, the FAN_REPORT_PIDFD is
> > > > > limited to privileged processes only i.e. listeners that are running
> > > > > with the CAP_SYS_ADMIN capability. Attempting to supply either of
> > > > > these initialization flags with FAN_REPORT_PIDFD will result with
> > > > > EINVAL being returned to the caller.
> > > > >
> > > > > In the event of a pidfd creation error, there are two types of error
> > > > > values that can be reported back to the listener. There is
> > > > > FAN_NOPIDFD, which will be reported in cases where the process
> > > > > responsible for generating the event has terminated prior to fanotify
> > > > > being able to create pidfd for event->pid via pidfd_create(). The
> > > > > there is FAN_EPIDFD, which will be reported if a more generic pidfd
> > > > > creation error occurred when calling pidfd_create().
> > > > [...]
> > > > > @@ -524,6 +562,34 @@ static ssize_t copy_event_to_user(struct fsnotify_group *group,
> > > > >         }
> > > > >         metadata.fd = fd;
> > > > >
> > > > > +       if (pidfd_mode) {
> > > > > +               /*
> > > > > +                * Complain if the FAN_REPORT_PIDFD and FAN_REPORT_TID mutual
> > > > > +                * exclusion is ever lifted. At the time of incoporating pidfd
> > > > > +                * support within fanotify, the pidfd API only supported the
> > > > > +                * creation of pidfds for thread-group leaders.
> > > > > +                */
> > > > > +               WARN_ON_ONCE(FAN_GROUP_FLAG(group, FAN_REPORT_TID));
> > > > > +
> > > > > +               /*
> > > > > +                * The PIDTYPE_TGID check for an event->pid is performed
> > > > > +                * preemptively in attempt to catch those rare instances where
> > > > > +                * the process responsible for generating the event has
> > > > > +                * terminated prior to calling into pidfd_create() and acquiring
> > > > > +                * a valid pidfd. Report FAN_NOPIDFD to the listener in those
> > > > > +                * cases. All other pidfd creation errors are represented as
> > > > > +                * FAN_EPIDFD.
> > > > > +                */
> > > > > +               if (metadata.pid == 0 ||
> > > > > +                   !pid_has_task(event->pid, PIDTYPE_TGID)) {
> > > > > +                       pidfd = FAN_NOPIDFD;
> > > > > +               } else {
> > > > > +                       pidfd = pidfd_create(event->pid, 0);
> > > > > +                       if (pidfd < 0)
> > > > > +                               pidfd = FAN_EPIDFD;
> > > > > +               }
> > > > > +       }
> > > > > +
> > > >
> > > > As a general rule, f_op->read callbacks aren't allowed to mess with
> > > > the file descriptor table of the calling process. A process should be
> > > > able to receive a file descriptor from an untrusted source and call
> > > > functions like read() on it without worrying about affecting its own
> > > > file descriptor table state with that.
> > > >
> > >
> > > Interesting. I've never considered this interface flaw.
> > > Thanks for bringing this up!
> >
> > Me neither. But I guess it's one more reason why any fd-generating variant
> > of fanotify should stay priviledged.
> >
> > > > I realize that existing fanotify code appears to be violating that
> > > > rule already, and that you're limiting creation of fanotify file
> > > > descriptors that can hit this codepath to CAP_SYS_ADMIN, but still, I
> > > > think fanotify_read() probably ought to be an ioctl, or something
> > > > along those lines, instead of an f_op->read handler if it messes with
> > > > the caller's fd table?
> > >
> > > Naturally, we cannot change the legacy interface.
> > > However, since fanotify has a modern FAN_REPORT_FID interface
> > > which does not mess with fd table maybe this is an opportunity not
> > > to repeat the same mistake for the FAN_REPORT_FID interface.
> > >
> > > Matthew, can you explain what is the use case of the consumer
> > > application of pidfd. I am guessing this is for an audit user case?
> > > because if it were for permission events, event->pid would have been
> > > sufficient.
> > >
> > > If that is the case, then I presume that the application does not really
> > > need to operate on the pidfd, it only need to avoid reporting wrong
> > > process details after pid wraparound?
> > >
> > > If that is the case, then maybe a model similar to inode generation
> > > can be used to report a "pid generation" in addition to event->pid
> > > and export pid generation in /proc/<pid>/status?
> > >
> > > Or am I completely misunderstanding the use case?
> >
> > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > same process while you read various data from it. And you cannot achieve
> > that with pid+generation thing you've suggested. Plus the additional
> > concept and its complexity is non-trivial So I tend to agree with
> > Christian that we really want to return pidfd.
> >
> > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > whether it is worth the trouble to come up with some other mechanism how to
> > return pidfd with the event. We could return some cookie which could be
> > then (by some ioctl or so) either transformed into real pidfd or released
> > (so that we can release pid handle in the kernel) but it looks ugly and
> > complicates things for everybody without bringing significant security
> > improvement (we already can pass fd with the event). So I'm pondering
> > whether there's some other way how we could make the interface safer - e.g.
> > so that the process receiving the event (not the one creating the group)
> > would also need to opt in for getting fds created in its file table.
> >
> > But so far nothing bright has come to my mind. :-|
> >
>
> There is a way, it is not bright, but it is pretty simple -
> store an optional pid in group->fanotify_data.fd_reader.
>
> With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> will be disabled to any process other than fd_reader.
> Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> if fd_reaader is set to a process other than the reader.
>
> A process can call ioctl START_FD_READER to set fd_reader to itself.
> With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
>
> Permission wise, START_FD_READER is allowed with
> CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> We may consider YIELD_FD_READER ioctl if needed.
>
> I think that this is a pretty cheap price for implementation
> and maybe acceptable overhead for complicating the API?
> Note that without passing fd, there is no need for any ioctl.
>
> An added security benefit is that the ioctl adds is a way for the
> caller of fanotify_init() to make sure that even if the fanotify_fd is
> leaked, that event->fd will not be leaked, regardless of flag
> FAN_REPORT_PIDFD.
>
> So the START_FD_READER ioctl feature could be implemented
> and documented first.
> And then FAN_REPORT_PIDFD could use the feature with a
> very minor API difference:
> - Without the flag, other processes can read fds by default and
>   group initiator can opt-out
> - With the flag, other processes cannot read fds by default and
>   need to opt-in
>

Or maybe something even simpler... fanotify_init() flag
FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
to the initiator process (not only fd reading).

FAN_REPORT_PIDFD requires FAN_PRIVATE.
If we do not know there is a use case for passing fanotify_fd
that reports pidfds to another process why implement the ioctl.
We can always implement it later if the need arises.
If we contemplate this future change, though, maybe the name
FAN_PROTECTED is better to start with.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-07-30  5:03           ` Amir Goldstein
@ 2021-08-02 12:34             ` Jan Kara
  2021-08-02 14:38               ` Amir Goldstein
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2021-08-02 12:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Jann Horn, Matthew Bobrowski, Christian Brauner,
	linux-fsdevel, Linux API, Andy Lutomirski

On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > same process while you read various data from it. And you cannot achieve
> > > that with pid+generation thing you've suggested. Plus the additional
> > > concept and its complexity is non-trivial So I tend to agree with
> > > Christian that we really want to return pidfd.
> > >
> > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > whether it is worth the trouble to come up with some other mechanism how to
> > > return pidfd with the event. We could return some cookie which could be
> > > then (by some ioctl or so) either transformed into real pidfd or released
> > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > complicates things for everybody without bringing significant security
> > > improvement (we already can pass fd with the event). So I'm pondering
> > > whether there's some other way how we could make the interface safer - e.g.
> > > so that the process receiving the event (not the one creating the group)
> > > would also need to opt in for getting fds created in its file table.
> > >
> > > But so far nothing bright has come to my mind. :-|
> > >
> >
> > There is a way, it is not bright, but it is pretty simple -
> > store an optional pid in group->fanotify_data.fd_reader.
> >
> > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > will be disabled to any process other than fd_reader.
> > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > if fd_reaader is set to a process other than the reader.
> >
> > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> >
> > Permission wise, START_FD_READER is allowed with
> > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > We may consider YIELD_FD_READER ioctl if needed.
> >
> > I think that this is a pretty cheap price for implementation
> > and maybe acceptable overhead for complicating the API?
> > Note that without passing fd, there is no need for any ioctl.
> >
> > An added security benefit is that the ioctl adds is a way for the
> > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > leaked, that event->fd will not be leaked, regardless of flag
> > FAN_REPORT_PIDFD.
> >
> > So the START_FD_READER ioctl feature could be implemented
> > and documented first.
> > And then FAN_REPORT_PIDFD could use the feature with a
> > very minor API difference:
> > - Without the flag, other processes can read fds by default and
> >   group initiator can opt-out
> > - With the flag, other processes cannot read fds by default and
> >   need to opt-in
> 
> Or maybe something even simpler... fanotify_init() flag
> FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> to the initiator process (not only fd reading).
> 
> FAN_REPORT_PIDFD requires FAN_PRIVATE.
> If we do not know there is a use case for passing fanotify_fd
> that reports pidfds to another process why implement the ioctl.
> We can always implement it later if the need arises.
> If we contemplate this future change, though, maybe the name
> FAN_PROTECTED is better to start with.

Good ideas. I think we are fine with returning pidfd only to the process
creating the fanotify group. Later we can add an ioctl which would indicate
that the process is also prepared to have fds created in its file table.
But I have still some open questions:
Do we want threads of the same process to still be able to receive fds?
Also pids can be recycled so they are probably not completely reliable
identifiers? What if someone wants to process events from fanotify group by
multiple processes / threads (fd can be inherited also through fork(2)...)?

I'm currently undecided whether explicit FAN_PROTECTED flag (and impact on
receiving / not receiving whole event) makes this better.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-02 12:34             ` Jan Kara
@ 2021-08-02 14:38               ` Amir Goldstein
  2021-08-02 20:10                 ` Jan Kara
  2021-08-03  9:37                 ` Christian Brauner
  0 siblings, 2 replies; 41+ messages in thread
From: Amir Goldstein @ 2021-08-02 14:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jann Horn, Matthew Bobrowski, Christian Brauner, linux-fsdevel,
	Linux API, Andy Lutomirski

On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > same process while you read various data from it. And you cannot achieve
> > > > that with pid+generation thing you've suggested. Plus the additional
> > > > concept and its complexity is non-trivial So I tend to agree with
> > > > Christian that we really want to return pidfd.
> > > >
> > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > return pidfd with the event. We could return some cookie which could be
> > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > complicates things for everybody without bringing significant security
> > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > whether there's some other way how we could make the interface safer - e.g.
> > > > so that the process receiving the event (not the one creating the group)
> > > > would also need to opt in for getting fds created in its file table.
> > > >
> > > > But so far nothing bright has come to my mind. :-|
> > > >
> > >
> > > There is a way, it is not bright, but it is pretty simple -
> > > store an optional pid in group->fanotify_data.fd_reader.
> > >
> > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > will be disabled to any process other than fd_reader.
> > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > if fd_reaader is set to a process other than the reader.
> > >
> > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > >
> > > Permission wise, START_FD_READER is allowed with
> > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > We may consider YIELD_FD_READER ioctl if needed.
> > >
> > > I think that this is a pretty cheap price for implementation
> > > and maybe acceptable overhead for complicating the API?
> > > Note that without passing fd, there is no need for any ioctl.
> > >
> > > An added security benefit is that the ioctl adds is a way for the
> > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > leaked, that event->fd will not be leaked, regardless of flag
> > > FAN_REPORT_PIDFD.
> > >
> > > So the START_FD_READER ioctl feature could be implemented
> > > and documented first.
> > > And then FAN_REPORT_PIDFD could use the feature with a
> > > very minor API difference:
> > > - Without the flag, other processes can read fds by default and
> > >   group initiator can opt-out
> > > - With the flag, other processes cannot read fds by default and
> > >   need to opt-in
> >
> > Or maybe something even simpler... fanotify_init() flag
> > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > to the initiator process (not only fd reading).
> >
> > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > If we do not know there is a use case for passing fanotify_fd
> > that reports pidfds to another process why implement the ioctl.
> > We can always implement it later if the need arises.
> > If we contemplate this future change, though, maybe the name
> > FAN_PROTECTED is better to start with.
>
> Good ideas. I think we are fine with returning pidfd only to the process
> creating the fanotify group. Later we can add an ioctl which would indicate
> that the process is also prepared to have fds created in its file table.
> But I have still some open questions:
> Do we want threads of the same process to still be able to receive fds?

I don't see why not.
They will be bloating the same fd table as the thread that called
fanotify_init().

> Also pids can be recycled so they are probably not completely reliable
> identifiers?

Not sure I follow. The group hold a refcount on struct pid of the process that
called fanotify_init() - I think that can used to check if reader process is
the same process, but not sure. Maybe there is another way (Christian?).

> What if someone wants to process events from fanotify group by
> multiple processes / threads (fd can be inherited also through fork(2)...)?
>

That's the same as passing fd between processes, no?
If users want to do that, we will need to implement the ioctl or
fanotify_init() flag FAN_SHARED.

> I'm currently undecided whether explicit FAN_PROTECTED flag (and impact on
> receiving / not receiving whole event) makes this better.
>

Yeh, I'm not sure either. You usually tell me not to overload different
meanings on one flag, which I always found to be good advice :-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-02 14:38               ` Amir Goldstein
@ 2021-08-02 20:10                 ` Jan Kara
  2021-08-03  1:29                   ` Matthew Bobrowski
  2021-08-03  9:46                   ` Christian Brauner
  2021-08-03  9:37                 ` Christian Brauner
  1 sibling, 2 replies; 41+ messages in thread
From: Jan Kara @ 2021-08-02 20:10 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Jann Horn, Matthew Bobrowski, Christian Brauner,
	linux-fsdevel, Linux API, Andy Lutomirski

On Mon 02-08-21 17:38:20, Amir Goldstein wrote:
> On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > same process while you read various data from it. And you cannot achieve
> > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > Christian that we really want to return pidfd.
> > > > >
> > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > return pidfd with the event. We could return some cookie which could be
> > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > complicates things for everybody without bringing significant security
> > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > so that the process receiving the event (not the one creating the group)
> > > > > would also need to opt in for getting fds created in its file table.
> > > > >
> > > > > But so far nothing bright has come to my mind. :-|
> > > > >
> > > >
> > > > There is a way, it is not bright, but it is pretty simple -
> > > > store an optional pid in group->fanotify_data.fd_reader.
> > > >
> > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > will be disabled to any process other than fd_reader.
> > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > if fd_reaader is set to a process other than the reader.
> > > >
> > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > >
> > > > Permission wise, START_FD_READER is allowed with
> > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > We may consider YIELD_FD_READER ioctl if needed.
> > > >
> > > > I think that this is a pretty cheap price for implementation
> > > > and maybe acceptable overhead for complicating the API?
> > > > Note that without passing fd, there is no need for any ioctl.
> > > >
> > > > An added security benefit is that the ioctl adds is a way for the
> > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > FAN_REPORT_PIDFD.
> > > >
> > > > So the START_FD_READER ioctl feature could be implemented
> > > > and documented first.
> > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > very minor API difference:
> > > > - Without the flag, other processes can read fds by default and
> > > >   group initiator can opt-out
> > > > - With the flag, other processes cannot read fds by default and
> > > >   need to opt-in
> > >
> > > Or maybe something even simpler... fanotify_init() flag
> > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > to the initiator process (not only fd reading).
> > >
> > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > If we do not know there is a use case for passing fanotify_fd
> > > that reports pidfds to another process why implement the ioctl.
> > > We can always implement it later if the need arises.
> > > If we contemplate this future change, though, maybe the name
> > > FAN_PROTECTED is better to start with.
> >
> > Good ideas. I think we are fine with returning pidfd only to the process
> > creating the fanotify group. Later we can add an ioctl which would indicate
> > that the process is also prepared to have fds created in its file table.
> > But I have still some open questions:
> > Do we want threads of the same process to still be able to receive fds?
> 
> I don't see why not.
> They will be bloating the same fd table as the thread that called
> fanotify_init().

I agree. So do we store thread group leader PID in fanotify group? What if
thread group leader changes? I guess I have to do some reading as I don't
know how all these details work internally.

> > Also pids can be recycled so they are probably not completely reliable
> > identifiers?
> 
> Not sure I follow. The group hold a refcount on struct pid of the process that
> called fanotify_init() - I think that can used to check if reader process is
> the same process, but not sure. Maybe there is another way (Christian?).

Yes, if we hold refcount on struct pid, it should be safe against recycling.
But cannot someone (even unpriviledged process in this case) mount some
attack by creating a process which creates fanotify group, passes fanotify fd,
and dies but pid would be still blocked because fanotify holds reference to
it? I guess this is not practical as the number of fanotify groups is limited
as well as number of fds.

> > What if someone wants to process events from fanotify group by
> > multiple processes / threads (fd can be inherited also through fork(2)...)?
> 
> That's the same as passing fd between processes, no?
> If users want to do that, we will need to implement the ioctl or
> fanotify_init() flag FAN_SHARED.

Well, FAN_SHARED would be the current behavior so I don't think there's any
point in that (we'd loose much of the security benefit gained by this
excercise). I agree we'd need to implement the ioctl for such usecase
but my point was that we could have a relatively sensible setup in which
multiple pids may need to read events from fanotify queue and so fanotify
group would need to track multiple pids allowed to read from it.

I'm sorry if I sound negative at times. I'm not settled on any particular
solution.  I'm just trying to brainstorm various pros and cons of possible
solutions to settle on what's going to be the best :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-02 20:10                 ` Jan Kara
@ 2021-08-03  1:29                   ` Matthew Bobrowski
  2021-08-03  5:51                     ` Amir Goldstein
  2021-08-03  9:46                   ` Christian Brauner
  1 sibling, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-08-03  1:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Jann Horn, Christian Brauner, linux-fsdevel,
	Linux API, Andy Lutomirski

On Mon, Aug 02, 2021 at 10:10:02PM +0200, Jan Kara wrote:
> On Mon 02-08-21 17:38:20, Amir Goldstein wrote:
> > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > Christian that we really want to return pidfd.
> > > > > >
> > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > complicates things for everybody without bringing significant security
> > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > would also need to opt in for getting fds created in its file table.
> > > > > >
> > > > > > But so far nothing bright has come to my mind. :-|
> > > > > >
> > > > >
> > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > >
> > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > will be disabled to any process other than fd_reader.
> > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > if fd_reaader is set to a process other than the reader.
> > > > >
> > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > >
> > > > > Permission wise, START_FD_READER is allowed with
> > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > >
> > > > > I think that this is a pretty cheap price for implementation
> > > > > and maybe acceptable overhead for complicating the API?
> > > > > Note that without passing fd, there is no need for any ioctl.
> > > > >
> > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > FAN_REPORT_PIDFD.
> > > > >
> > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > and documented first.
> > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > very minor API difference:
> > > > > - Without the flag, other processes can read fds by default and
> > > > >   group initiator can opt-out
> > > > > - With the flag, other processes cannot read fds by default and
> > > > >   need to opt-in
> > > >
> > > > Or maybe something even simpler... fanotify_init() flag
> > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > to the initiator process (not only fd reading).
> > > >
> > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > If we do not know there is a use case for passing fanotify_fd
> > > > that reports pidfds to another process why implement the ioctl.
> > > > We can always implement it later if the need arises.
> > > > If we contemplate this future change, though, maybe the name
> > > > FAN_PROTECTED is better to start with.
> > >
> > > Good ideas. I think we are fine with returning pidfd only to the process
> > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > that the process is also prepared to have fds created in its file table.
> > > But I have still some open questions:
> > > Do we want threads of the same process to still be able to receive fds?
> > 
> > I don't see why not.
> > They will be bloating the same fd table as the thread that called
> > fanotify_init().
> 
> I agree. So do we store thread group leader PID in fanotify group? What if
> thread group leader changes? I guess I have to do some reading as I don't
> know how all these details work internally.
> 
> > > Also pids can be recycled so they are probably not completely reliable
> > > identifiers?
> > 
> > Not sure I follow. The group hold a refcount on struct pid of the process that
> > called fanotify_init() - I think that can used to check if reader process is
> > the same process, but not sure. Maybe there is another way (Christian?).
> 
> Yes, if we hold refcount on struct pid, it should be safe against recycling.
> But cannot someone (even unpriviledged process in this case) mount some
> attack by creating a process which creates fanotify group, passes fanotify fd,
> and dies but pid would be still blocked because fanotify holds reference to
> it? I guess this is not practical as the number of fanotify groups is limited
> as well as number of fds.
> 
> > > What if someone wants to process events from fanotify group by
> > > multiple processes / threads (fd can be inherited also through fork(2)...)?
> > 
> > That's the same as passing fd between processes, no?
> > If users want to do that, we will need to implement the ioctl or
> > fanotify_init() flag FAN_SHARED.
> 
> Well, FAN_SHARED would be the current behavior so I don't think there's any
> point in that (we'd loose much of the security benefit gained by this
> excercise). I agree we'd need to implement the ioctl for such usecase
> but my point was that we could have a relatively sensible setup in which
> multiple pids may need to read events from fanotify queue and so fanotify
> group would need to track multiple pids allowed to read from it.
> 
> I'm sorry if I sound negative at times. I'm not settled on any particular
> solution.  I'm just trying to brainstorm various pros and cons of possible
> solutions to settle on what's going to be the best :).

Quite honestly, I'm struggling to understand what problem we're trying to
solve here... That is, whether this is an actual "security" problem, or
whether it's just attempting to come up with a solution which conforms to
the "general" rule of not modifying a callers fdtable upon calling
functions like read().

Can someone please elaborate a little on the exact "security" implications
or "threats" we're attempting to alleviate through the implementation of
the additional aforementioned group initialization flags and ioctls? What
is the exact scenario we're attempting to avoid which could lead to a
compromise in the systems overall integrity?

Also, I can understand this as a "general" rule:

  "A process should be able to receive a file descriptor from an untrusted
   source and call functions like read() on it without worrying about
   affecting its own file descriptor table state with that."

But, in instances where event processing is offloaded to a separate
dedicated "reader" process, does that actually fall under receiving a file
descriptor from an "untrusted" source? I don't think so. Modification of a
callers fdtable upon calling functions like read() may be considered
"undesired" in general, but this is just how fanotify has always worked, is
it not?

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-03  1:29                   ` Matthew Bobrowski
@ 2021-08-03  5:51                     ` Amir Goldstein
  0 siblings, 0 replies; 41+ messages in thread
From: Amir Goldstein @ 2021-08-03  5:51 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jan Kara, Jann Horn, Christian Brauner, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Aug 3, 2021 at 4:30 AM Matthew Bobrowski <repnop@google.com> wrote:
>
> On Mon, Aug 02, 2021 at 10:10:02PM +0200, Jan Kara wrote:
> > On Mon 02-08-21 17:38:20, Amir Goldstein wrote:
> > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > Christian that we really want to return pidfd.
> > > > > > >
> > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > complicates things for everybody without bringing significant security
> > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > >
> > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > >
> > > > > >
> > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > >
> > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > will be disabled to any process other than fd_reader.
> > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > if fd_reaader is set to a process other than the reader.
> > > > > >
> > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > >
> > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > >
> > > > > > I think that this is a pretty cheap price for implementation
> > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > >
> > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > FAN_REPORT_PIDFD.
> > > > > >
> > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > and documented first.
> > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > very minor API difference:
> > > > > > - Without the flag, other processes can read fds by default and
> > > > > >   group initiator can opt-out
> > > > > > - With the flag, other processes cannot read fds by default and
> > > > > >   need to opt-in
> > > > >
> > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > to the initiator process (not only fd reading).
> > > > >
> > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > that reports pidfds to another process why implement the ioctl.
> > > > > We can always implement it later if the need arises.
> > > > > If we contemplate this future change, though, maybe the name
> > > > > FAN_PROTECTED is better to start with.
> > > >
> > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > that the process is also prepared to have fds created in its file table.
> > > > But I have still some open questions:
> > > > Do we want threads of the same process to still be able to receive fds?
> > >
> > > I don't see why not.
> > > They will be bloating the same fd table as the thread that called
> > > fanotify_init().
> >
> > I agree. So do we store thread group leader PID in fanotify group? What if
> > thread group leader changes? I guess I have to do some reading as I don't
> > know how all these details work internally.
> >
> > > > Also pids can be recycled so they are probably not completely reliable
> > > > identifiers?
> > >
> > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > called fanotify_init() - I think that can used to check if reader process is
> > > the same process, but not sure. Maybe there is another way (Christian?).
> >
> > Yes, if we hold refcount on struct pid, it should be safe against recycling.
> > But cannot someone (even unpriviledged process in this case) mount some
> > attack by creating a process which creates fanotify group, passes fanotify fd,
> > and dies but pid would be still blocked because fanotify holds reference to
> > it? I guess this is not practical as the number of fanotify groups is limited
> > as well as number of fds.
> >
> > > > What if someone wants to process events from fanotify group by
> > > > multiple processes / threads (fd can be inherited also through fork(2)...)?
> > >
> > > That's the same as passing fd between processes, no?
> > > If users want to do that, we will need to implement the ioctl or
> > > fanotify_init() flag FAN_SHARED.
> >
> > Well, FAN_SHARED would be the current behavior so I don't think there's any
> > point in that (we'd loose much of the security benefit gained by this
> > excercise). I agree we'd need to implement the ioctl for such usecase
> > but my point was that we could have a relatively sensible setup in which
> > multiple pids may need to read events from fanotify queue and so fanotify
> > group would need to track multiple pids allowed to read from it.
> >
> > I'm sorry if I sound negative at times. I'm not settled on any particular
> > solution.  I'm just trying to brainstorm various pros and cons of possible
> > solutions to settle on what's going to be the best :).
>
> Quite honestly, I'm struggling to understand what problem we're trying to
> solve here... That is, whether this is an actual "security" problem, or
> whether it's just attempting to come up with a solution which conforms to
> the "general" rule of not modifying a callers fdtable upon calling
> functions like read().
>
> Can someone please elaborate a little on the exact "security" implications
> or "threats" we're attempting to alleviate through the implementation of
> the additional aforementioned group initialization flags and ioctls? What
> is the exact scenario we're attempting to avoid which could lead to a
> compromise in the systems overall integrity?
>
> Also, I can understand this as a "general" rule:
>
>   "A process should be able to receive a file descriptor from an untrusted
>    source and call functions like read() on it without worrying about
>    affecting its own file descriptor table state with that."
>
> But, in instances where event processing is offloaded to a separate
> dedicated "reader" process, does that actually fall under receiving a file
> descriptor from an "untrusted" source? I don't think so. Modification of a
> callers fdtable upon calling functions like read() may be considered
> "undesired" in general, but this is just how fanotify has always worked, is
> it not?
>

Yes, but IMO "it has always worked this way" cannot be used as a valid
argument to re-introduce the same bad pattern to a new interface.
If you want to report open pidfd to group without fid_mode go ahead -
it will not be a regression of any form, but I don't think this is what you
are aiming for.

Re-introducing open fds to reader is a regression, because the alleged
attacker does not have control over fanotify group flags, the alleged
attacker has no CAP_SYS_ADMIN, it gets an fanotify_fd somehow
and passes it on to an innocent reader, so the less programs out there
creating "unsafe" fanotify fd's the better.

If there is already a program out there handing out legacy fanotify fd's
there is nothing we can do about it, but we have a way to shape the
future:
Step #1: Provide new safer interface (i.e. FAN_REPORT_FID)
Step #2: Add shiny new features to new interface to "lure" application
              to switch to new interface (i.e. FAN_REPORT_PIDFD)

Don't get me wrong. I think that your doubts about severity of that
fd table bloat are perfectly valid - I am only opposing to using the
excuse that it has always been that way.

But is another argument that may be used to reject the claims:

Passing a fanotify fd irresponsibly to an untrusted process can have
far greater implications than bloating the victim process fd table.
fanotify events expose information about actions and identities of
other processes in the system and in the worst case (permission
events) enables severe DoS attack on the system.

So perhaps we need to address the specific problem of PIDFD
at all and only the more severe potential security aspects.

We could address them using fanotify_init flags FAN_PRIVATE
and FAN_SHARED and a Kconfig+sysfs knob to control the default
for fanotify groups.

A distro could audit all packages that use fanotify and once every
program that needs to use FAN_SHARED uses it explicitly, change
the system default and make the system more prone to future attacks.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-02 14:38               ` Amir Goldstein
  2021-08-02 20:10                 ` Jan Kara
@ 2021-08-03  9:37                 ` Christian Brauner
  2021-08-03 10:07                   ` Amir Goldstein
  2021-08-03 13:39                   ` Jan Kara
  1 sibling, 2 replies; 41+ messages in thread
From: Christian Brauner @ 2021-08-03  9:37 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Jan Kara, Jann Horn, Matthew Bobrowski, linux-fsdevel, Linux API,
	Andy Lutomirski

On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > same process while you read various data from it. And you cannot achieve
> > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > Christian that we really want to return pidfd.
> > > > >
> > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > return pidfd with the event. We could return some cookie which could be
> > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > complicates things for everybody without bringing significant security
> > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > so that the process receiving the event (not the one creating the group)
> > > > > would also need to opt in for getting fds created in its file table.
> > > > >
> > > > > But so far nothing bright has come to my mind. :-|
> > > > >
> > > >
> > > > There is a way, it is not bright, but it is pretty simple -
> > > > store an optional pid in group->fanotify_data.fd_reader.
> > > >
> > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > will be disabled to any process other than fd_reader.
> > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > if fd_reaader is set to a process other than the reader.
> > > >
> > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > >
> > > > Permission wise, START_FD_READER is allowed with
> > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > We may consider YIELD_FD_READER ioctl if needed.
> > > >
> > > > I think that this is a pretty cheap price for implementation
> > > > and maybe acceptable overhead for complicating the API?
> > > > Note that without passing fd, there is no need for any ioctl.
> > > >
> > > > An added security benefit is that the ioctl adds is a way for the
> > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > FAN_REPORT_PIDFD.
> > > >
> > > > So the START_FD_READER ioctl feature could be implemented
> > > > and documented first.
> > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > very minor API difference:
> > > > - Without the flag, other processes can read fds by default and
> > > >   group initiator can opt-out
> > > > - With the flag, other processes cannot read fds by default and
> > > >   need to opt-in
> > >
> > > Or maybe something even simpler... fanotify_init() flag
> > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > to the initiator process (not only fd reading).
> > >
> > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > If we do not know there is a use case for passing fanotify_fd
> > > that reports pidfds to another process why implement the ioctl.
> > > We can always implement it later if the need arises.
> > > If we contemplate this future change, though, maybe the name
> > > FAN_PROTECTED is better to start with.
> >
> > Good ideas. I think we are fine with returning pidfd only to the process
> > creating the fanotify group. Later we can add an ioctl which would indicate
> > that the process is also prepared to have fds created in its file table.
> > But I have still some open questions:
> > Do we want threads of the same process to still be able to receive fds?
> 
> I don't see why not.
> They will be bloating the same fd table as the thread that called
> fanotify_init().
> 
> > Also pids can be recycled so they are probably not completely reliable
> > identifiers?
> 
> Not sure I follow. The group hold a refcount on struct pid of the process that
> called fanotify_init() - I think that can used to check if reader process is
> the same process, but not sure. Maybe there is another way (Christian?).

If the fanotify group hold's a reference to struct pid it won't get
recycled. And it can be used to check if the reader thread is the same
thread with some care. You also have to be specific what exactly you
want to know.  If you're asking if the reading process is the same as
the fanotify_init() process you can be asking one of two things.

You can be asking if the reader is a thread in the same thread-group as
the thread that called fanotify_init(). In that case you might need to
do something like

rcu_read_lock();
struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
if (!fanotify_init_task_struct) {
	/* The thread which called fanotify_init() has died already. */ 
	return -ESRCH;
}
if (same_thread_group(fanotify_init_task_struct, current))
rcu_read_unlock();

though thinking about it makes me realise that there's a corner case. If
the thread that called fanotify_init() is a thread in a non-empty
thread-group it can already have died and been reaped. This would mean,
pid_task(..., PIDTYPE_PID) will return NULL but there are still other
threads alive in the thread-group. Handling that case might be a bit
complicated.

If you're asking whether the reading thread is really the same as the
thread that created the fanotify instance then you might need to do sm
like

rcu_read_lock();
if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
rcu_read_unlock();

Just for completeness if I remember all of this right: there's a corner
case because of how de_thread() works.
During exec the thread that is execing will assume the struct pid of the
old thread-group leader. (All other threads in the same thread-group
will get killed.)
Assume the thread that created the fanotify instance is not the
thread-group leader in its non-empty thread-group. And further assume it
exec's. Then it will assume the struct pid of the old thread-group
leader during de_thread().
Assume the thread inherits the fanotify fd across the exec. Now, when it
tries to read a new event after the exec then pid_task() will return
NULL.
However, if the thread was already the thread-group leader before the
exec then pid_task() will return the same task struct as before after
the exec (because no struct pid swapping needed to take place).

I hope this causes more clarity then confusion. :)
Christian

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-02 20:10                 ` Jan Kara
  2021-08-03  1:29                   ` Matthew Bobrowski
@ 2021-08-03  9:46                   ` Christian Brauner
  1 sibling, 0 replies; 41+ messages in thread
From: Christian Brauner @ 2021-08-03  9:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Jann Horn, Matthew Bobrowski, linux-fsdevel,
	Linux API, Andy Lutomirski

On Mon, Aug 02, 2021 at 10:10:02PM +0200, Jan Kara wrote:
> On Mon 02-08-21 17:38:20, Amir Goldstein wrote:
> > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > Christian that we really want to return pidfd.
> > > > > >
> > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > complicates things for everybody without bringing significant security
> > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > would also need to opt in for getting fds created in its file table.
> > > > > >
> > > > > > But so far nothing bright has come to my mind. :-|
> > > > > >
> > > > >
> > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > >
> > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > will be disabled to any process other than fd_reader.
> > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > if fd_reaader is set to a process other than the reader.
> > > > >
> > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > >
> > > > > Permission wise, START_FD_READER is allowed with
> > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > >
> > > > > I think that this is a pretty cheap price for implementation
> > > > > and maybe acceptable overhead for complicating the API?
> > > > > Note that without passing fd, there is no need for any ioctl.
> > > > >
> > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > FAN_REPORT_PIDFD.
> > > > >
> > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > and documented first.
> > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > very minor API difference:
> > > > > - Without the flag, other processes can read fds by default and
> > > > >   group initiator can opt-out
> > > > > - With the flag, other processes cannot read fds by default and
> > > > >   need to opt-in
> > > >
> > > > Or maybe something even simpler... fanotify_init() flag
> > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > to the initiator process (not only fd reading).
> > > >
> > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > If we do not know there is a use case for passing fanotify_fd
> > > > that reports pidfds to another process why implement the ioctl.
> > > > We can always implement it later if the need arises.
> > > > If we contemplate this future change, though, maybe the name
> > > > FAN_PROTECTED is better to start with.
> > >
> > > Good ideas. I think we are fine with returning pidfd only to the process
> > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > that the process is also prepared to have fds created in its file table.
> > > But I have still some open questions:
> > > Do we want threads of the same process to still be able to receive fds?
> > 
> > I don't see why not.
> > They will be bloating the same fd table as the thread that called
> > fanotify_init().
> 
> I agree. So do we store thread group leader PID in fanotify group? What if
> thread group leader changes? I guess I have to do some reading as I don't

I've touched on this in my answer to Amir.
In my other answer I pointed out that storing a
non-thread-group leader struct pid has consequences because of how
de_thread() works. This might lead to a slight inconsistency in api
behavior if a non-thread group leader execs in a non-empty thread-group
that created the fanotify instance. But if you store a thread-group
leader struct pid the api would be consistent even with de_thread() in
the picture unless I forgot some other nasty detail in my other mail.

> know how all these details work internally.
> 
> > > Also pids can be recycled so they are probably not completely reliable
> > > identifiers?
> > 
> > Not sure I follow. The group hold a refcount on struct pid of the process that
> > called fanotify_init() - I think that can used to check if reader process is
> > the same process, but not sure. Maybe there is another way (Christian?).
> 
> Yes, if we hold refcount on struct pid, it should be safe against recycling.
> But cannot someone (even unpriviledged process in this case) mount some
> attack by creating a process which creates fanotify group, passes fanotify fd,
> and dies but pid would be still blocked because fanotify holds reference to
> it? I guess this is not practical as the number of fanotify groups is limited
> as well as number of fds.

If this were a serious problem it would already be a problem today. But
yes, it is limited by the number of fds. Note that struct pid was (among
other reasons) created in order to prevent having to hold references to
struct task_struct because of the amount of memory that would be wasted.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-03  9:37                 ` Christian Brauner
@ 2021-08-03 10:07                   ` Amir Goldstein
  2021-08-03 14:04                     ` Jan Kara
  2021-08-03 13:39                   ` Jan Kara
  1 sibling, 1 reply; 41+ messages in thread
From: Amir Goldstein @ 2021-08-03 10:07 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jan Kara, Jann Horn, Matthew Bobrowski, linux-fsdevel, Linux API,
	Andy Lutomirski

On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > Christian that we really want to return pidfd.
> > > > > >
> > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > complicates things for everybody without bringing significant security
> > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > would also need to opt in for getting fds created in its file table.
> > > > > >
> > > > > > But so far nothing bright has come to my mind. :-|
> > > > > >
> > > > >
> > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > >
> > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > will be disabled to any process other than fd_reader.
> > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > if fd_reaader is set to a process other than the reader.
> > > > >
> > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > >
> > > > > Permission wise, START_FD_READER is allowed with
> > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > >
> > > > > I think that this is a pretty cheap price for implementation
> > > > > and maybe acceptable overhead for complicating the API?
> > > > > Note that without passing fd, there is no need for any ioctl.
> > > > >
> > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > FAN_REPORT_PIDFD.
> > > > >
> > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > and documented first.
> > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > very minor API difference:
> > > > > - Without the flag, other processes can read fds by default and
> > > > >   group initiator can opt-out
> > > > > - With the flag, other processes cannot read fds by default and
> > > > >   need to opt-in
> > > >
> > > > Or maybe something even simpler... fanotify_init() flag
> > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > to the initiator process (not only fd reading).
> > > >
> > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > If we do not know there is a use case for passing fanotify_fd
> > > > that reports pidfds to another process why implement the ioctl.
> > > > We can always implement it later if the need arises.
> > > > If we contemplate this future change, though, maybe the name
> > > > FAN_PROTECTED is better to start with.
> > >
> > > Good ideas. I think we are fine with returning pidfd only to the process
> > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > that the process is also prepared to have fds created in its file table.
> > > But I have still some open questions:
> > > Do we want threads of the same process to still be able to receive fds?
> >
> > I don't see why not.
> > They will be bloating the same fd table as the thread that called
> > fanotify_init().
> >
> > > Also pids can be recycled so they are probably not completely reliable
> > > identifiers?
> >
> > Not sure I follow. The group hold a refcount on struct pid of the process that
> > called fanotify_init() - I think that can used to check if reader process is
> > the same process, but not sure. Maybe there is another way (Christian?).
>
> If the fanotify group hold's a reference to struct pid it won't get
> recycled. And it can be used to check if the reader thread is the same
> thread with some care. You also have to be specific what exactly you
> want to know.  If you're asking if the reading process is the same as
> the fanotify_init() process you can be asking one of two things.
>
> You can be asking if the reader is a thread in the same thread-group as
> the thread that called fanotify_init(). In that case you might need to
> do something like
>
> rcu_read_lock();
> struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> if (!fanotify_init_task_struct) {
>         /* The thread which called fanotify_init() has died already. */
>         return -ESRCH;
> }
> if (same_thread_group(fanotify_init_task_struct, current))
> rcu_read_unlock();
>
> though thinking about it makes me realise that there's a corner case. If
> the thread that called fanotify_init() is a thread in a non-empty
> thread-group it can already have died and been reaped. This would mean,
> pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> threads alive in the thread-group. Handling that case might be a bit
> complicated.
>
> If you're asking whether the reading thread is really the same as the
> thread that created the fanotify instance then you might need to do sm
> like
>
> rcu_read_lock();
> if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> rcu_read_unlock();
>
> Just for completeness if I remember all of this right: there's a corner
> case because of how de_thread() works.
> During exec the thread that is execing will assume the struct pid of the
> old thread-group leader. (All other threads in the same thread-group
> will get killed.)
> Assume the thread that created the fanotify instance is not the
> thread-group leader in its non-empty thread-group. And further assume it
> exec's. Then it will assume the struct pid of the old thread-group
> leader during de_thread().
> Assume the thread inherits the fanotify fd across the exec. Now, when it
> tries to read a new event after the exec then pid_task() will return
> NULL.
> However, if the thread was already the thread-group leader before the
> exec then pid_task() will return the same task struct as before after
> the exec (because no struct pid swapping needed to take place).
>
> I hope this causes more clarity ?then confusion. :)

I'm afraid it's the latter :D

Sigh! We must simplify.

Thinking out loud, instead of sealing the possibility of another
process reading pidfd, maybe just avoid the most obvious unintentional
leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?

If users want to fork and use fanotify fd with pidfd reporting they can
opt-in back in with fcntl().

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-03  9:37                 ` Christian Brauner
  2021-08-03 10:07                   ` Amir Goldstein
@ 2021-08-03 13:39                   ` Jan Kara
  1 sibling, 0 replies; 41+ messages in thread
From: Jan Kara @ 2021-08-03 13:39 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Amir Goldstein, Jan Kara, Jann Horn, Matthew Bobrowski,
	linux-fsdevel, Linux API, Andy Lutomirski

On Tue 03-08-21 11:37:53, Christian Brauner wrote:
> On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > Christian that we really want to return pidfd.
> > > > > >
> > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > complicates things for everybody without bringing significant security
> > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > would also need to opt in for getting fds created in its file table.
> > > > > >
> > > > > > But so far nothing bright has come to my mind. :-|
> > > > > >
> > > > >
> > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > >
> > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > will be disabled to any process other than fd_reader.
> > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > if fd_reaader is set to a process other than the reader.
> > > > >
> > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > >
> > > > > Permission wise, START_FD_READER is allowed with
> > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > >
> > > > > I think that this is a pretty cheap price for implementation
> > > > > and maybe acceptable overhead for complicating the API?
> > > > > Note that without passing fd, there is no need for any ioctl.
> > > > >
> > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > FAN_REPORT_PIDFD.
> > > > >
> > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > and documented first.
> > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > very minor API difference:
> > > > > - Without the flag, other processes can read fds by default and
> > > > >   group initiator can opt-out
> > > > > - With the flag, other processes cannot read fds by default and
> > > > >   need to opt-in
> > > >
> > > > Or maybe something even simpler... fanotify_init() flag
> > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > to the initiator process (not only fd reading).
> > > >
> > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > If we do not know there is a use case for passing fanotify_fd
> > > > that reports pidfds to another process why implement the ioctl.
> > > > We can always implement it later if the need arises.
> > > > If we contemplate this future change, though, maybe the name
> > > > FAN_PROTECTED is better to start with.
> > >
> > > Good ideas. I think we are fine with returning pidfd only to the process
> > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > that the process is also prepared to have fds created in its file table.
> > > But I have still some open questions:
> > > Do we want threads of the same process to still be able to receive fds?
> > 
> > I don't see why not.
> > They will be bloating the same fd table as the thread that called
> > fanotify_init().
> > 
> > > Also pids can be recycled so they are probably not completely reliable
> > > identifiers?
> > 
> > Not sure I follow. The group hold a refcount on struct pid of the process that
> > called fanotify_init() - I think that can used to check if reader process is
> > the same process, but not sure. Maybe there is another way (Christian?).
> 
> If the fanotify group hold's a reference to struct pid it won't get
> recycled. And it can be used to check if the reader thread is the same
> thread with some care. You also have to be specific what exactly you
> want to know.  If you're asking if the reading process is the same as
> the fanotify_init() process you can be asking one of two things.
> 
> You can be asking if the reader is a thread in the same thread-group as
> the thread that called fanotify_init(). In that case you might need to
> do something like
> 
> rcu_read_lock();
> struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> if (!fanotify_init_task_struct) {
> 	/* The thread which called fanotify_init() has died already. */ 
> 	return -ESRCH;
> }
> if (same_thread_group(fanotify_init_task_struct, current))
> rcu_read_unlock();
> 
> though thinking about it makes me realise that there's a corner case. If
> the thread that called fanotify_init() is a thread in a non-empty
> thread-group it can already have died and been reaped. This would mean,
> pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> threads alive in the thread-group. Handling that case might be a bit
> complicated.
> 
> If you're asking whether the reading thread is really the same as the
> thread that created the fanotify instance then you might need to do sm
> like
> 
> rcu_read_lock();
> if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> rcu_read_unlock();
> 
> Just for completeness if I remember all of this right: there's a corner
> case because of how de_thread() works.
> During exec the thread that is execing will assume the struct pid of the
> old thread-group leader. (All other threads in the same thread-group
> will get killed.)
> Assume the thread that created the fanotify instance is not the
> thread-group leader in its non-empty thread-group. And further assume it
> exec's. Then it will assume the struct pid of the old thread-group
> leader during de_thread().
> Assume the thread inherits the fanotify fd across the exec. Now, when it
> tries to read a new event after the exec then pid_task() will return
> NULL.
> However, if the thread was already the thread-group leader before the
> exec then pid_task() will return the same task struct as before after
> the exec (because no struct pid swapping needed to take place).
> 
> I hope this causes more clarity then confusion. :)

Thanks for the details Christian! After some more reading and your comments
I think the situation is relatively clear. What we could do is:

On fanotify_init(2) do get_task_pid(current, PIDTYPE_TGID) and stash it. On
read from fanotify fd do
	rcu_read_lock();
	if (rcu_dereference(*task_pid_ptr(current, PIDTYPE_TGID)) !=
								stashed_pid) {
		rcu_read_unlock();
		return -EPERM;
	}
	rcu_read_unlock();

This should avoid all the nasty cornercases. So it can be done but I have
more and more doubts whether it is all worth it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-03 10:07                   ` Amir Goldstein
@ 2021-08-03 14:04                     ` Jan Kara
  2021-08-04  3:46                       ` Matthew Bobrowski
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2021-08-03 14:04 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Christian Brauner, Jan Kara, Jann Horn, Matthew Bobrowski,
	linux-fsdevel, Linux API, Andy Lutomirski

On Tue 03-08-21 13:07:57, Amir Goldstein wrote:
> On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > Christian that we really want to return pidfd.
> > > > > > >
> > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > complicates things for everybody without bringing significant security
> > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > >
> > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > >
> > > > > >
> > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > >
> > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > will be disabled to any process other than fd_reader.
> > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > if fd_reaader is set to a process other than the reader.
> > > > > >
> > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > >
> > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > >
> > > > > > I think that this is a pretty cheap price for implementation
> > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > >
> > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > FAN_REPORT_PIDFD.
> > > > > >
> > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > and documented first.
> > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > very minor API difference:
> > > > > > - Without the flag, other processes can read fds by default and
> > > > > >   group initiator can opt-out
> > > > > > - With the flag, other processes cannot read fds by default and
> > > > > >   need to opt-in
> > > > >
> > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > to the initiator process (not only fd reading).
> > > > >
> > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > that reports pidfds to another process why implement the ioctl.
> > > > > We can always implement it later if the need arises.
> > > > > If we contemplate this future change, though, maybe the name
> > > > > FAN_PROTECTED is better to start with.
> > > >
> > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > that the process is also prepared to have fds created in its file table.
> > > > But I have still some open questions:
> > > > Do we want threads of the same process to still be able to receive fds?
> > >
> > > I don't see why not.
> > > They will be bloating the same fd table as the thread that called
> > > fanotify_init().
> > >
> > > > Also pids can be recycled so they are probably not completely reliable
> > > > identifiers?
> > >
> > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > called fanotify_init() - I think that can used to check if reader process is
> > > the same process, but not sure. Maybe there is another way (Christian?).
> >
> > If the fanotify group hold's a reference to struct pid it won't get
> > recycled. And it can be used to check if the reader thread is the same
> > thread with some care. You also have to be specific what exactly you
> > want to know.  If you're asking if the reading process is the same as
> > the fanotify_init() process you can be asking one of two things.
> >
> > You can be asking if the reader is a thread in the same thread-group as
> > the thread that called fanotify_init(). In that case you might need to
> > do something like
> >
> > rcu_read_lock();
> > struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> > if (!fanotify_init_task_struct) {
> >         /* The thread which called fanotify_init() has died already. */
> >         return -ESRCH;
> > }
> > if (same_thread_group(fanotify_init_task_struct, current))
> > rcu_read_unlock();
> >
> > though thinking about it makes me realise that there's a corner case. If
> > the thread that called fanotify_init() is a thread in a non-empty
> > thread-group it can already have died and been reaped. This would mean,
> > pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> > threads alive in the thread-group. Handling that case might be a bit
> > complicated.
> >
> > If you're asking whether the reading thread is really the same as the
> > thread that created the fanotify instance then you might need to do sm
> > like
> >
> > rcu_read_lock();
> > if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> > rcu_read_unlock();
> >
> > Just for completeness if I remember all of this right: there's a corner
> > case because of how de_thread() works.
> > During exec the thread that is execing will assume the struct pid of the
> > old thread-group leader. (All other threads in the same thread-group
> > will get killed.)
> > Assume the thread that created the fanotify instance is not the
> > thread-group leader in its non-empty thread-group. And further assume it
> > exec's. Then it will assume the struct pid of the old thread-group
> > leader during de_thread().
> > Assume the thread inherits the fanotify fd across the exec. Now, when it
> > tries to read a new event after the exec then pid_task() will return
> > NULL.
> > However, if the thread was already the thread-group leader before the
> > exec then pid_task() will return the same task struct as before after
> > the exec (because no struct pid swapping needed to take place).
> >
> > I hope this causes more clarity ?then confusion. :)
> 
> I'm afraid it's the latter :D
> 
> Sigh! We must simplify.
> 
> Thinking out loud, instead of sealing the possibility of another
> process reading pidfd, maybe just avoid the most obvious unintentional
> leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?

Well, I don't think we need any protection from leaking fanotify_fd. It is
special fd with special priviledges as any other. If you leak it, well, bad
luck but that's how Unix priviledge model works.

The threat IMO is that you have a process X, that process expects to
receive fd to work with from process Y. Now process Y is malicious (or
taken over by an attacker) and passes to X fanotify_fd. X reads from
fanotify_fd to get data to process, it performs all kinds of validity
checks on untrusted input but it does not expect that the read has side
effects on X's file_table and in the worst case can lead to some compromise
of X or easily to DoS on X by exhausting its file_table space.

Currently this attack vector is moot because you have to have CAP_SYS_ADMIN
to get to fanotify_fd and then you can certainly do worse things. But OTOH
I can see why Jann was uneasy about this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-03 14:04                     ` Jan Kara
@ 2021-08-04  3:46                       ` Matthew Bobrowski
  2021-08-04 12:39                         ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-08-04  3:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Christian Brauner, Jann Horn, linux-fsdevel,
	Linux API, Andy Lutomirski

On Tue, Aug 03, 2021 at 04:04:21PM +0200, Jan Kara wrote:
> On Tue 03-08-21 13:07:57, Amir Goldstein wrote:
> > On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
> > <christian.brauner@ubuntu.com> wrote:
> > >
> > > On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > > Christian that we really want to return pidfd.
> > > > > > > >
> > > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > > complicates things for everybody without bringing significant security
> > > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > > >
> > > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > > >
> > > > > > >
> > > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > > >
> > > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > > will be disabled to any process other than fd_reader.
> > > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > > if fd_reaader is set to a process other than the reader.
> > > > > > >
> > > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > > >
> > > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > > >
> > > > > > > I think that this is a pretty cheap price for implementation
> > > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > > >
> > > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > > FAN_REPORT_PIDFD.
> > > > > > >
> > > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > > and documented first.
> > > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > > very minor API difference:
> > > > > > > - Without the flag, other processes can read fds by default and
> > > > > > >   group initiator can opt-out
> > > > > > > - With the flag, other processes cannot read fds by default and
> > > > > > >   need to opt-in
> > > > > >
> > > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > > to the initiator process (not only fd reading).
> > > > > >
> > > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > > that reports pidfds to another process why implement the ioctl.
> > > > > > We can always implement it later if the need arises.
> > > > > > If we contemplate this future change, though, maybe the name
> > > > > > FAN_PROTECTED is better to start with.
> > > > >
> > > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > > that the process is also prepared to have fds created in its file table.
> > > > > But I have still some open questions:
> > > > > Do we want threads of the same process to still be able to receive fds?
> > > >
> > > > I don't see why not.
> > > > They will be bloating the same fd table as the thread that called
> > > > fanotify_init().
> > > >
> > > > > Also pids can be recycled so they are probably not completely reliable
> > > > > identifiers?
> > > >
> > > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > > called fanotify_init() - I think that can used to check if reader process is
> > > > the same process, but not sure. Maybe there is another way (Christian?).
> > >
> > > If the fanotify group hold's a reference to struct pid it won't get
> > > recycled. And it can be used to check if the reader thread is the same
> > > thread with some care. You also have to be specific what exactly you
> > > want to know.  If you're asking if the reading process is the same as
> > > the fanotify_init() process you can be asking one of two things.
> > >
> > > You can be asking if the reader is a thread in the same thread-group as
> > > the thread that called fanotify_init(). In that case you might need to
> > > do something like
> > >
> > > rcu_read_lock();
> > > struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> > > if (!fanotify_init_task_struct) {
> > >         /* The thread which called fanotify_init() has died already. */
> > >         return -ESRCH;
> > > }
> > > if (same_thread_group(fanotify_init_task_struct, current))
> > > rcu_read_unlock();
> > >
> > > though thinking about it makes me realise that there's a corner case. If
> > > the thread that called fanotify_init() is a thread in a non-empty
> > > thread-group it can already have died and been reaped. This would mean,
> > > pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> > > threads alive in the thread-group. Handling that case might be a bit
> > > complicated.
> > >
> > > If you're asking whether the reading thread is really the same as the
> > > thread that created the fanotify instance then you might need to do sm
> > > like
> > >
> > > rcu_read_lock();
> > > if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> > > rcu_read_unlock();
> > >
> > > Just for completeness if I remember all of this right: there's a corner
> > > case because of how de_thread() works.
> > > During exec the thread that is execing will assume the struct pid of the
> > > old thread-group leader. (All other threads in the same thread-group
> > > will get killed.)
> > > Assume the thread that created the fanotify instance is not the
> > > thread-group leader in its non-empty thread-group. And further assume it
> > > exec's. Then it will assume the struct pid of the old thread-group
> > > leader during de_thread().
> > > Assume the thread inherits the fanotify fd across the exec. Now, when it
> > > tries to read a new event after the exec then pid_task() will return
> > > NULL.
> > > However, if the thread was already the thread-group leader before the
> > > exec then pid_task() will return the same task struct as before after
> > > the exec (because no struct pid swapping needed to take place).
> > >
> > > I hope this causes more clarity ?then confusion. :)
> > 
> > I'm afraid it's the latter :D
> > 
> > Sigh! We must simplify.
> > 
> > Thinking out loud, instead of sealing the possibility of another
> > process reading pidfd, maybe just avoid the most obvious unintentional
> > leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?
>
> Well, I don't think we need any protection from leaking fanotify_fd. It is
> special fd with special priviledges as any other. If you leak it, well, bad
> luck but that's how Unix priviledge model works.
> 
> The threat IMO is that you have a process X, that process expects to
> receive fd to work with from process Y. Now process Y is malicious (or
> taken over by an attacker) and passes to X fanotify_fd. X reads from
> fanotify_fd to get data to process, it performs all kinds of validity
> checks on untrusted input but it does not expect that the read has side
> effects on X's file_table and in the worst case can lead to some compromise
> of X or easily to DoS on X by exhausting its file_table space.
>
> Currently this attack vector is moot because you have to have CAP_SYS_ADMIN
> to get to fanotify_fd and then you can certainly do worse things. But OTOH
> I can see why Jann was uneasy about this.

As I have breifly expressed in my previous emails, the cause for concern
here is flakey IMO. If there's sensible something that I'm clearly missing,
then please explain.

From my perspective, the only sensible attack vector that's maybe worth
worrying about here is the possibility of exhausting the fdtable of a given
process, which yes, can be considered as a form of DoS. However, in any
case, there are other defensive protections/measures that a programmer
could employ in their application code which could prevent such from ever
happening.

The whole passing of file descriptors between process Y and process X and
the leaking of a file descriptor thing simply goes back to what you've
mentioned above Jan. I consider it a very weak argument. When enabling
FAN_REPORT_PIDFD, the process requires CAP_SYS_ADMIN. If that process ever
has its execution flow hijacked by an attacker, then I'm sorry, I think
there's other larger causes for concern at that point rather then worrying
about the state of some other child processes fdtable.

In general cases, I get that passing a file descriptor between process Y
and process X and then having process X's fdtable modified as result of
calling functions like read() is considered undesired. But, for
applications that makes use of fanotify is there ever a case where we pass
the fanotify file descriptor to a random/unexpected process and have it
process events? I don't think so. So, I suppose what I'm trying to say is
that, if an application chooses to opt-in and use a flag like
FAN_REPORT_PIDFD or any other future file descriptor generating variant,
the expectation is that which ever process is created and event processing
is passed to that process, then it should always expect to have its fdtable
modified when reading events.

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-04  3:46                       ` Matthew Bobrowski
@ 2021-08-04 12:39                         ` Jan Kara
  2021-08-05  5:51                           ` Matthew Bobrowski
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2021-08-04 12:39 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jan Kara, Amir Goldstein, Christian Brauner, Jann Horn,
	linux-fsdevel, Linux API, Andy Lutomirski

On Wed 04-08-21 13:46:05, Matthew Bobrowski wrote:
> On Tue, Aug 03, 2021 at 04:04:21PM +0200, Jan Kara wrote:
> > On Tue 03-08-21 13:07:57, Amir Goldstein wrote:
> > > On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
> > > <christian.brauner@ubuntu.com> wrote:
> > > >
> > > > On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > > > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > > > >
> > > > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > > > Christian that we really want to return pidfd.
> > > > > > > > >
> > > > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > > > complicates things for everybody without bringing significant security
> > > > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > > > >
> > > > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > > > >
> > > > > > > >
> > > > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > > > >
> > > > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > > > will be disabled to any process other than fd_reader.
> > > > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > > > if fd_reaader is set to a process other than the reader.
> > > > > > > >
> > > > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > > > >
> > > > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > > > >
> > > > > > > > I think that this is a pretty cheap price for implementation
> > > > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > > > >
> > > > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > > > FAN_REPORT_PIDFD.
> > > > > > > >
> > > > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > > > and documented first.
> > > > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > > > very minor API difference:
> > > > > > > > - Without the flag, other processes can read fds by default and
> > > > > > > >   group initiator can opt-out
> > > > > > > > - With the flag, other processes cannot read fds by default and
> > > > > > > >   need to opt-in
> > > > > > >
> > > > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > > > to the initiator process (not only fd reading).
> > > > > > >
> > > > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > > > that reports pidfds to another process why implement the ioctl.
> > > > > > > We can always implement it later if the need arises.
> > > > > > > If we contemplate this future change, though, maybe the name
> > > > > > > FAN_PROTECTED is better to start with.
> > > > > >
> > > > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > > > that the process is also prepared to have fds created in its file table.
> > > > > > But I have still some open questions:
> > > > > > Do we want threads of the same process to still be able to receive fds?
> > > > >
> > > > > I don't see why not.
> > > > > They will be bloating the same fd table as the thread that called
> > > > > fanotify_init().
> > > > >
> > > > > > Also pids can be recycled so they are probably not completely reliable
> > > > > > identifiers?
> > > > >
> > > > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > > > called fanotify_init() - I think that can used to check if reader process is
> > > > > the same process, but not sure. Maybe there is another way (Christian?).
> > > >
> > > > If the fanotify group hold's a reference to struct pid it won't get
> > > > recycled. And it can be used to check if the reader thread is the same
> > > > thread with some care. You also have to be specific what exactly you
> > > > want to know.  If you're asking if the reading process is the same as
> > > > the fanotify_init() process you can be asking one of two things.
> > > >
> > > > You can be asking if the reader is a thread in the same thread-group as
> > > > the thread that called fanotify_init(). In that case you might need to
> > > > do something like
> > > >
> > > > rcu_read_lock();
> > > > struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> > > > if (!fanotify_init_task_struct) {
> > > >         /* The thread which called fanotify_init() has died already. */
> > > >         return -ESRCH;
> > > > }
> > > > if (same_thread_group(fanotify_init_task_struct, current))
> > > > rcu_read_unlock();
> > > >
> > > > though thinking about it makes me realise that there's a corner case. If
> > > > the thread that called fanotify_init() is a thread in a non-empty
> > > > thread-group it can already have died and been reaped. This would mean,
> > > > pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> > > > threads alive in the thread-group. Handling that case might be a bit
> > > > complicated.
> > > >
> > > > If you're asking whether the reading thread is really the same as the
> > > > thread that created the fanotify instance then you might need to do sm
> > > > like
> > > >
> > > > rcu_read_lock();
> > > > if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> > > > rcu_read_unlock();
> > > >
> > > > Just for completeness if I remember all of this right: there's a corner
> > > > case because of how de_thread() works.
> > > > During exec the thread that is execing will assume the struct pid of the
> > > > old thread-group leader. (All other threads in the same thread-group
> > > > will get killed.)
> > > > Assume the thread that created the fanotify instance is not the
> > > > thread-group leader in its non-empty thread-group. And further assume it
> > > > exec's. Then it will assume the struct pid of the old thread-group
> > > > leader during de_thread().
> > > > Assume the thread inherits the fanotify fd across the exec. Now, when it
> > > > tries to read a new event after the exec then pid_task() will return
> > > > NULL.
> > > > However, if the thread was already the thread-group leader before the
> > > > exec then pid_task() will return the same task struct as before after
> > > > the exec (because no struct pid swapping needed to take place).
> > > >
> > > > I hope this causes more clarity ?then confusion. :)
> > > 
> > > I'm afraid it's the latter :D
> > > 
> > > Sigh! We must simplify.
> > > 
> > > Thinking out loud, instead of sealing the possibility of another
> > > process reading pidfd, maybe just avoid the most obvious unintentional
> > > leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?
> >
> > Well, I don't think we need any protection from leaking fanotify_fd. It is
> > special fd with special priviledges as any other. If you leak it, well, bad
> > luck but that's how Unix priviledge model works.
> > 
> > The threat IMO is that you have a process X, that process expects to
> > receive fd to work with from process Y. Now process Y is malicious (or
> > taken over by an attacker) and passes to X fanotify_fd. X reads from
> > fanotify_fd to get data to process, it performs all kinds of validity
> > checks on untrusted input but it does not expect that the read has side
> > effects on X's file_table and in the worst case can lead to some compromise
> > of X or easily to DoS on X by exhausting its file_table space.
> >
> > Currently this attack vector is moot because you have to have CAP_SYS_ADMIN
> > to get to fanotify_fd and then you can certainly do worse things. But OTOH
> > I can see why Jann was uneasy about this.
> 
> As I have breifly expressed in my previous emails, the cause for concern
> here is flakey IMO. If there's sensible something that I'm clearly missing,
> then please explain.

No, I think your understanding is correct.

> From my perspective, the only sensible attack vector that's maybe worth
> worrying about here is the possibility of exhausting the fdtable of a given
> process, which yes, can be considered as a form of DoS. However, in any
> case, there are other defensive protections/measures that a programmer
> could employ in their application code which could prevent such from ever
> happening.
> 
> The whole passing of file descriptors between process Y and process X and
> the leaking of a file descriptor thing simply goes back to what you've
> mentioned above Jan. I consider it a very weak argument. When enabling
> FAN_REPORT_PIDFD, the process requires CAP_SYS_ADMIN. If that process ever
> has its execution flow hijacked by an attacker, then I'm sorry, I think
> there's other larger causes for concern at that point rather then worrying
> about the state of some other child processes fdtable.
> 
> In general cases, I get that passing a file descriptor between process Y
> and process X and then having process X's fdtable modified as result of
> calling functions like read() is considered undesired. But, for
> applications that makes use of fanotify is there ever a case where we pass
> the fanotify file descriptor to a random/unexpected process and have it
> process events? I don't think so. So, I suppose what I'm trying to say is
> that, if an application chooses to opt-in and use a flag like
> FAN_REPORT_PIDFD or any other future file descriptor generating variant,
> the expectation is that which ever process is created and event processing
> is passed to that process, then it should always expect to have its fdtable
> modified when reading events.

Yes, I was thinking about this some more and at this point, given the lack
of convenient options for the hardening, I think the best option is to keep
the interface as originally planned. Because I'm afraid the hardening options
we were able to come up with would only cause confusion (and from confusion
bugs easily arise) for little security gain.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-04 12:39                         ` Jan Kara
@ 2021-08-05  5:51                           ` Matthew Bobrowski
  2021-08-05  8:55                             ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Bobrowski @ 2021-08-05  5:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Christian Brauner, Jann Horn, linux-fsdevel,
	Linux API, Andy Lutomirski

On Wed, Aug 04, 2021 at 02:39:40PM +0200, Jan Kara wrote:
> On Wed 04-08-21 13:46:05, Matthew Bobrowski wrote:
> > On Tue, Aug 03, 2021 at 04:04:21PM +0200, Jan Kara wrote:
> > > On Tue 03-08-21 13:07:57, Amir Goldstein wrote:
> > > > On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
> > > > <christian.brauner@ubuntu.com> wrote:
> > > > >
> > > > > On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > > > > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > >
> > > > > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > > > > Christian that we really want to return pidfd.
> > > > > > > > > >
> > > > > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > > > > complicates things for everybody without bringing significant security
> > > > > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > > > > >
> > > > > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > > > > >
> > > > > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > > > > will be disabled to any process other than fd_reader.
> > > > > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > > > > if fd_reaader is set to a process other than the reader.
> > > > > > > > >
> > > > > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > > > > >
> > > > > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > > > > >
> > > > > > > > > I think that this is a pretty cheap price for implementation
> > > > > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > > > > >
> > > > > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > > > > FAN_REPORT_PIDFD.
> > > > > > > > >
> > > > > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > > > > and documented first.
> > > > > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > > > > very minor API difference:
> > > > > > > > > - Without the flag, other processes can read fds by default and
> > > > > > > > >   group initiator can opt-out
> > > > > > > > > - With the flag, other processes cannot read fds by default and
> > > > > > > > >   need to opt-in
> > > > > > > >
> > > > > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > > > > to the initiator process (not only fd reading).
> > > > > > > >
> > > > > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > > > > that reports pidfds to another process why implement the ioctl.
> > > > > > > > We can always implement it later if the need arises.
> > > > > > > > If we contemplate this future change, though, maybe the name
> > > > > > > > FAN_PROTECTED is better to start with.
> > > > > > >
> > > > > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > > > > that the process is also prepared to have fds created in its file table.
> > > > > > > But I have still some open questions:
> > > > > > > Do we want threads of the same process to still be able to receive fds?
> > > > > >
> > > > > > I don't see why not.
> > > > > > They will be bloating the same fd table as the thread that called
> > > > > > fanotify_init().
> > > > > >
> > > > > > > Also pids can be recycled so they are probably not completely reliable
> > > > > > > identifiers?
> > > > > >
> > > > > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > > > > called fanotify_init() - I think that can used to check if reader process is
> > > > > > the same process, but not sure. Maybe there is another way (Christian?).
> > > > >
> > > > > If the fanotify group hold's a reference to struct pid it won't get
> > > > > recycled. And it can be used to check if the reader thread is the same
> > > > > thread with some care. You also have to be specific what exactly you
> > > > > want to know.  If you're asking if the reading process is the same as
> > > > > the fanotify_init() process you can be asking one of two things.
> > > > >
> > > > > You can be asking if the reader is a thread in the same thread-group as
> > > > > the thread that called fanotify_init(). In that case you might need to
> > > > > do something like
> > > > >
> > > > > rcu_read_lock();
> > > > > struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> > > > > if (!fanotify_init_task_struct) {
> > > > >         /* The thread which called fanotify_init() has died already. */
> > > > >         return -ESRCH;
> > > > > }
> > > > > if (same_thread_group(fanotify_init_task_struct, current))
> > > > > rcu_read_unlock();
> > > > >
> > > > > though thinking about it makes me realise that there's a corner case. If
> > > > > the thread that called fanotify_init() is a thread in a non-empty
> > > > > thread-group it can already have died and been reaped. This would mean,
> > > > > pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> > > > > threads alive in the thread-group. Handling that case might be a bit
> > > > > complicated.
> > > > >
> > > > > If you're asking whether the reading thread is really the same as the
> > > > > thread that created the fanotify instance then you might need to do sm
> > > > > like
> > > > >
> > > > > rcu_read_lock();
> > > > > if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> > > > > rcu_read_unlock();
> > > > >
> > > > > Just for completeness if I remember all of this right: there's a corner
> > > > > case because of how de_thread() works.
> > > > > During exec the thread that is execing will assume the struct pid of the
> > > > > old thread-group leader. (All other threads in the same thread-group
> > > > > will get killed.)
> > > > > Assume the thread that created the fanotify instance is not the
> > > > > thread-group leader in its non-empty thread-group. And further assume it
> > > > > exec's. Then it will assume the struct pid of the old thread-group
> > > > > leader during de_thread().
> > > > > Assume the thread inherits the fanotify fd across the exec. Now, when it
> > > > > tries to read a new event after the exec then pid_task() will return
> > > > > NULL.
> > > > > However, if the thread was already the thread-group leader before the
> > > > > exec then pid_task() will return the same task struct as before after
> > > > > the exec (because no struct pid swapping needed to take place).
> > > > >
> > > > > I hope this causes more clarity ?then confusion. :)
> > > > 
> > > > I'm afraid it's the latter :D
> > > > 
> > > > Sigh! We must simplify.
> > > > 
> > > > Thinking out loud, instead of sealing the possibility of another
> > > > process reading pidfd, maybe just avoid the most obvious unintentional
> > > > leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?
> > >
> > > Well, I don't think we need any protection from leaking fanotify_fd. It is
> > > special fd with special priviledges as any other. If you leak it, well, bad
> > > luck but that's how Unix priviledge model works.
> > > 
> > > The threat IMO is that you have a process X, that process expects to
> > > receive fd to work with from process Y. Now process Y is malicious (or
> > > taken over by an attacker) and passes to X fanotify_fd. X reads from
> > > fanotify_fd to get data to process, it performs all kinds of validity
> > > checks on untrusted input but it does not expect that the read has side
> > > effects on X's file_table and in the worst case can lead to some compromise
> > > of X or easily to DoS on X by exhausting its file_table space.
> > >
> > > Currently this attack vector is moot because you have to have CAP_SYS_ADMIN
> > > to get to fanotify_fd and then you can certainly do worse things. But OTOH
> > > I can see why Jann was uneasy about this.
> > 
> > As I have breifly expressed in my previous emails, the cause for concern
> > here is flakey IMO. If there's sensible something that I'm clearly missing,
> > then please explain.
> 
> No, I think your understanding is correct.
> 
> > From my perspective, the only sensible attack vector that's maybe worth
> > worrying about here is the possibility of exhausting the fdtable of a given
> > process, which yes, can be considered as a form of DoS. However, in any
> > case, there are other defensive protections/measures that a programmer
> > could employ in their application code which could prevent such from ever
> > happening.
> > 
> > The whole passing of file descriptors between process Y and process X and
> > the leaking of a file descriptor thing simply goes back to what you've
> > mentioned above Jan. I consider it a very weak argument. When enabling
> > FAN_REPORT_PIDFD, the process requires CAP_SYS_ADMIN. If that process ever
> > has its execution flow hijacked by an attacker, then I'm sorry, I think
> > there's other larger causes for concern at that point rather then worrying
> > about the state of some other child processes fdtable.
> > 
> > In general cases, I get that passing a file descriptor between process Y
> > and process X and then having process X's fdtable modified as result of
> > calling functions like read() is considered undesired. But, for
> > applications that makes use of fanotify is there ever a case where we pass
> > the fanotify file descriptor to a random/unexpected process and have it
> > process events? I don't think so. So, I suppose what I'm trying to say is
> > that, if an application chooses to opt-in and use a flag like
> > FAN_REPORT_PIDFD or any other future file descriptor generating variant,
> > the expectation is that which ever process is created and event processing
> > is passed to that process, then it should always expect to have its fdtable
> > modified when reading events.
> 
> Yes, I was thinking about this some more and at this point, given the lack
> of convenient options for the hardening, I think the best option is to keep
> the interface as originally planned. Because I'm afraid the hardening options
> we were able to come up with would only cause confusion (and from confusion
> bugs easily arise) for little security gain.

OK, in that case are you happy for me to post hopefully the last iteration
of this series with the minor nits addressed?

/M

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API
  2021-08-05  5:51                           ` Matthew Bobrowski
@ 2021-08-05  8:55                             ` Jan Kara
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2021-08-05  8:55 UTC (permalink / raw)
  To: Matthew Bobrowski
  Cc: Jan Kara, Amir Goldstein, Christian Brauner, Jann Horn,
	linux-fsdevel, Linux API, Andy Lutomirski

On Thu 05-08-21 15:51:16, Matthew Bobrowski wrote:
> On Wed, Aug 04, 2021 at 02:39:40PM +0200, Jan Kara wrote:
> > On Wed 04-08-21 13:46:05, Matthew Bobrowski wrote:
> > > On Tue, Aug 03, 2021 at 04:04:21PM +0200, Jan Kara wrote:
> > > > On Tue 03-08-21 13:07:57, Amir Goldstein wrote:
> > > > > On Tue, Aug 3, 2021 at 12:37 PM Christian Brauner
> > > > > <christian.brauner@ubuntu.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 02, 2021 at 05:38:20PM +0300, Amir Goldstein wrote:
> > > > > > > On Mon, Aug 2, 2021 at 3:34 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > >
> > > > > > > > On Fri 30-07-21 08:03:01, Amir Goldstein wrote:
> > > > > > > > > On Thu, Jul 29, 2021 at 6:13 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > > > > > > > On Thu, Jul 29, 2021 at 4:39 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > > > > > Well, but pidfd also makes sure that /proc/<pid>/ keeps belonging to the
> > > > > > > > > > > same process while you read various data from it. And you cannot achieve
> > > > > > > > > > > that with pid+generation thing you've suggested. Plus the additional
> > > > > > > > > > > concept and its complexity is non-trivial So I tend to agree with
> > > > > > > > > > > Christian that we really want to return pidfd.
> > > > > > > > > > >
> > > > > > > > > > > Given returning pidfd is CAP_SYS_ADMIN priviledged operation I'm undecided
> > > > > > > > > > > whether it is worth the trouble to come up with some other mechanism how to
> > > > > > > > > > > return pidfd with the event. We could return some cookie which could be
> > > > > > > > > > > then (by some ioctl or so) either transformed into real pidfd or released
> > > > > > > > > > > (so that we can release pid handle in the kernel) but it looks ugly and
> > > > > > > > > > > complicates things for everybody without bringing significant security
> > > > > > > > > > > improvement (we already can pass fd with the event). So I'm pondering
> > > > > > > > > > > whether there's some other way how we could make the interface safer - e.g.
> > > > > > > > > > > so that the process receiving the event (not the one creating the group)
> > > > > > > > > > > would also need to opt in for getting fds created in its file table.
> > > > > > > > > > >
> > > > > > > > > > > But so far nothing bright has come to my mind. :-|
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > There is a way, it is not bright, but it is pretty simple -
> > > > > > > > > > store an optional pid in group->fanotify_data.fd_reader.
> > > > > > > > > >
> > > > > > > > > > With flag FAN_REPORT_PIDFD, both pidfd and event->fd reporting
> > > > > > > > > > will be disabled to any process other than fd_reader.
> > > > > > > > > > Without FAN_REPORT_PIDFD, event->fd reporting will be disabled
> > > > > > > > > > if fd_reaader is set to a process other than the reader.
> > > > > > > > > >
> > > > > > > > > > A process can call ioctl START_FD_READER to set fd_reader to itself.
> > > > > > > > > > With FAN_REPORT_PIDFD, if reaader_fd is NULL and the reader
> > > > > > > > > > process has CAP_SYS_ADMIN, read() sets fd_reader to itself.
> > > > > > > > > >
> > > > > > > > > > Permission wise, START_FD_READER is allowed with
> > > > > > > > > > CAP_SYS_ADMIN or if fd_reader is not owned by another process.
> > > > > > > > > > We may consider YIELD_FD_READER ioctl if needed.
> > > > > > > > > >
> > > > > > > > > > I think that this is a pretty cheap price for implementation
> > > > > > > > > > and maybe acceptable overhead for complicating the API?
> > > > > > > > > > Note that without passing fd, there is no need for any ioctl.
> > > > > > > > > >
> > > > > > > > > > An added security benefit is that the ioctl adds is a way for the
> > > > > > > > > > caller of fanotify_init() to make sure that even if the fanotify_fd is
> > > > > > > > > > leaked, that event->fd will not be leaked, regardless of flag
> > > > > > > > > > FAN_REPORT_PIDFD.
> > > > > > > > > >
> > > > > > > > > > So the START_FD_READER ioctl feature could be implemented
> > > > > > > > > > and documented first.
> > > > > > > > > > And then FAN_REPORT_PIDFD could use the feature with a
> > > > > > > > > > very minor API difference:
> > > > > > > > > > - Without the flag, other processes can read fds by default and
> > > > > > > > > >   group initiator can opt-out
> > > > > > > > > > - With the flag, other processes cannot read fds by default and
> > > > > > > > > >   need to opt-in
> > > > > > > > >
> > > > > > > > > Or maybe something even simpler... fanotify_init() flag
> > > > > > > > > FAN_PRIVATE (or FAN_PROTECTED) that limits event reading
> > > > > > > > > to the initiator process (not only fd reading).
> > > > > > > > >
> > > > > > > > > FAN_REPORT_PIDFD requires FAN_PRIVATE.
> > > > > > > > > If we do not know there is a use case for passing fanotify_fd
> > > > > > > > > that reports pidfds to another process why implement the ioctl.
> > > > > > > > > We can always implement it later if the need arises.
> > > > > > > > > If we contemplate this future change, though, maybe the name
> > > > > > > > > FAN_PROTECTED is better to start with.
> > > > > > > >
> > > > > > > > Good ideas. I think we are fine with returning pidfd only to the process
> > > > > > > > creating the fanotify group. Later we can add an ioctl which would indicate
> > > > > > > > that the process is also prepared to have fds created in its file table.
> > > > > > > > But I have still some open questions:
> > > > > > > > Do we want threads of the same process to still be able to receive fds?
> > > > > > >
> > > > > > > I don't see why not.
> > > > > > > They will be bloating the same fd table as the thread that called
> > > > > > > fanotify_init().
> > > > > > >
> > > > > > > > Also pids can be recycled so they are probably not completely reliable
> > > > > > > > identifiers?
> > > > > > >
> > > > > > > Not sure I follow. The group hold a refcount on struct pid of the process that
> > > > > > > called fanotify_init() - I think that can used to check if reader process is
> > > > > > > the same process, but not sure. Maybe there is another way (Christian?).
> > > > > >
> > > > > > If the fanotify group hold's a reference to struct pid it won't get
> > > > > > recycled. And it can be used to check if the reader thread is the same
> > > > > > thread with some care. You also have to be specific what exactly you
> > > > > > want to know.  If you're asking if the reading process is the same as
> > > > > > the fanotify_init() process you can be asking one of two things.
> > > > > >
> > > > > > You can be asking if the reader is a thread in the same thread-group as
> > > > > > the thread that called fanotify_init(). In that case you might need to
> > > > > > do something like
> > > > > >
> > > > > > rcu_read_lock();
> > > > > > struct task_struct *fanotify_init_task_struct = pid_task(stashed_struct_pid, PIDTYPE_PID);
> > > > > > if (!fanotify_init_task_struct) {
> > > > > >         /* The thread which called fanotify_init() has died already. */
> > > > > >         return -ESRCH;
> > > > > > }
> > > > > > if (same_thread_group(fanotify_init_task_struct, current))
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > though thinking about it makes me realise that there's a corner case. If
> > > > > > the thread that called fanotify_init() is a thread in a non-empty
> > > > > > thread-group it can already have died and been reaped. This would mean,
> > > > > > pid_task(..., PIDTYPE_PID) will return NULL but there are still other
> > > > > > threads alive in the thread-group. Handling that case might be a bit
> > > > > > complicated.
> > > > > >
> > > > > > If you're asking whether the reading thread is really the same as the
> > > > > > thread that created the fanotify instance then you might need to do sm
> > > > > > like
> > > > > >
> > > > > > rcu_read_lock();
> > > > > > if (pid_task(stashed_struct_pid, PIDTYPE_PID) == current)
> > > > > > rcu_read_unlock();
> > > > > >
> > > > > > Just for completeness if I remember all of this right: there's a corner
> > > > > > case because of how de_thread() works.
> > > > > > During exec the thread that is execing will assume the struct pid of the
> > > > > > old thread-group leader. (All other threads in the same thread-group
> > > > > > will get killed.)
> > > > > > Assume the thread that created the fanotify instance is not the
> > > > > > thread-group leader in its non-empty thread-group. And further assume it
> > > > > > exec's. Then it will assume the struct pid of the old thread-group
> > > > > > leader during de_thread().
> > > > > > Assume the thread inherits the fanotify fd across the exec. Now, when it
> > > > > > tries to read a new event after the exec then pid_task() will return
> > > > > > NULL.
> > > > > > However, if the thread was already the thread-group leader before the
> > > > > > exec then pid_task() will return the same task struct as before after
> > > > > > the exec (because no struct pid swapping needed to take place).
> > > > > >
> > > > > > I hope this causes more clarity ?then confusion. :)
> > > > > 
> > > > > I'm afraid it's the latter :D
> > > > > 
> > > > > Sigh! We must simplify.
> > > > > 
> > > > > Thinking out loud, instead of sealing the possibility of another
> > > > > process reading pidfd, maybe just avoid the most obvious unintentional
> > > > > leak of fanotify_fd to another process by mandating  FAN_CLOEXEC?
> > > >
> > > > Well, I don't think we need any protection from leaking fanotify_fd. It is
> > > > special fd with special priviledges as any other. If you leak it, well, bad
> > > > luck but that's how Unix priviledge model works.
> > > > 
> > > > The threat IMO is that you have a process X, that process expects to
> > > > receive fd to work with from process Y. Now process Y is malicious (or
> > > > taken over by an attacker) and passes to X fanotify_fd. X reads from
> > > > fanotify_fd to get data to process, it performs all kinds of validity
> > > > checks on untrusted input but it does not expect that the read has side
> > > > effects on X's file_table and in the worst case can lead to some compromise
> > > > of X or easily to DoS on X by exhausting its file_table space.
> > > >
> > > > Currently this attack vector is moot because you have to have CAP_SYS_ADMIN
> > > > to get to fanotify_fd and then you can certainly do worse things. But OTOH
> > > > I can see why Jann was uneasy about this.
> > > 
> > > As I have breifly expressed in my previous emails, the cause for concern
> > > here is flakey IMO. If there's sensible something that I'm clearly missing,
> > > then please explain.
> > 
> > No, I think your understanding is correct.
> > 
> > > From my perspective, the only sensible attack vector that's maybe worth
> > > worrying about here is the possibility of exhausting the fdtable of a given
> > > process, which yes, can be considered as a form of DoS. However, in any
> > > case, there are other defensive protections/measures that a programmer
> > > could employ in their application code which could prevent such from ever
> > > happening.
> > > 
> > > The whole passing of file descriptors between process Y and process X and
> > > the leaking of a file descriptor thing simply goes back to what you've
> > > mentioned above Jan. I consider it a very weak argument. When enabling
> > > FAN_REPORT_PIDFD, the process requires CAP_SYS_ADMIN. If that process ever
> > > has its execution flow hijacked by an attacker, then I'm sorry, I think
> > > there's other larger causes for concern at that point rather then worrying
> > > about the state of some other child processes fdtable.
> > > 
> > > In general cases, I get that passing a file descriptor between process Y
> > > and process X and then having process X's fdtable modified as result of
> > > calling functions like read() is considered undesired. But, for
> > > applications that makes use of fanotify is there ever a case where we pass
> > > the fanotify file descriptor to a random/unexpected process and have it
> > > process events? I don't think so. So, I suppose what I'm trying to say is
> > > that, if an application chooses to opt-in and use a flag like
> > > FAN_REPORT_PIDFD or any other future file descriptor generating variant,
> > > the expectation is that which ever process is created and event processing
> > > is passed to that process, then it should always expect to have its fdtable
> > > modified when reading events.
> > 
> > Yes, I was thinking about this some more and at this point, given the lack
> > of convenient options for the hardening, I think the best option is to keep
> > the interface as originally planned. Because I'm afraid the hardening options
> > we were able to come up with would only cause confusion (and from confusion
> > bugs easily arise) for little security gain.
> 
> OK, in that case are you happy for me to post hopefully the last iteration
> of this series with the minor nits addressed?

Yes.
	
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-08-05  8:55 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-21  6:17 [PATCH v3 0/5] Add pidfd support to the fanotify API Matthew Bobrowski
2021-07-21  6:17 ` [PATCH v3 1/5] kernel/pid.c: remove static qualifier from pidfd_create() Matthew Bobrowski
2021-07-21  6:17 ` [PATCH v3 2/5] kernel/pid.c: implement additional checks upon pidfd_create() parameters Matthew Bobrowski
2021-07-21  6:18 ` [PATCH v3 3/5] fanotify/fanotify_user.c: minor cosmetic adjustments to fid labels Matthew Bobrowski
2021-07-21  6:34   ` Amir Goldstein
2021-07-21  6:18 ` [PATCH v3 4/5] fanotify/fanotify_user.c: introduce a generic info record copying helper Matthew Bobrowski
2021-07-21  6:35   ` Amir Goldstein
2021-07-27  8:16     ` Amir Goldstein
2021-07-27 12:57       ` Matthew Bobrowski
2021-07-21  6:19 ` [PATCH v3 5/5] fanotify: add pidfd support to the fanotify API Matthew Bobrowski
2021-07-21  7:05   ` Amir Goldstein
2021-07-26 23:04     ` Matthew Bobrowski
2021-07-27  0:23   ` Jann Horn
2021-07-27  4:19     ` Amir Goldstein
2021-07-27  5:10       ` Matthew Bobrowski
2021-07-27  7:03         ` Amir Goldstein
2021-07-27  8:22           ` Christian Brauner
2021-07-27  8:29             ` Christian Brauner
2021-07-29 13:39       ` Jan Kara
2021-07-29 15:13         ` Amir Goldstein
2021-07-30  5:03           ` Amir Goldstein
2021-08-02 12:34             ` Jan Kara
2021-08-02 14:38               ` Amir Goldstein
2021-08-02 20:10                 ` Jan Kara
2021-08-03  1:29                   ` Matthew Bobrowski
2021-08-03  5:51                     ` Amir Goldstein
2021-08-03  9:46                   ` Christian Brauner
2021-08-03  9:37                 ` Christian Brauner
2021-08-03 10:07                   ` Amir Goldstein
2021-08-03 14:04                     ` Jan Kara
2021-08-04  3:46                       ` Matthew Bobrowski
2021-08-04 12:39                         ` Jan Kara
2021-08-05  5:51                           ` Matthew Bobrowski
2021-08-05  8:55                             ` Jan Kara
2021-08-03 13:39                   ` Jan Kara
2021-07-27 12:54     ` Matthew Bobrowski
2021-07-29 22:48       ` Matthew Bobrowski
2021-07-21  7:06 ` [PATCH v3 0/5] Add " Amir Goldstein
2021-07-26 23:07   ` Matthew Bobrowski
2021-07-27  0:16     ` Matthew Bobrowski
2021-07-29 13:40       ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.