* Implement close-on-fork
@ 2020-04-20  7:15 ` Nate Karstens
  0 siblings, 0 replies; 69+ messages in thread
From: Nate Karstens @ 2020-04-20  7:15 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao

This series of four patches implements close-on-fork. Tests have been
published at https://github.com/nkarstens/ltp/tree/close-on-fork.

close-on-fork addresses race conditions in system(), which
(depending on the implementation) is non-atomic in that it
first calls fork() and then exec(). Any file descriptor open in
the parent at the time of the fork() is inherited by the child.
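
The inheritance behind this race can be observed from user space. The
following is a minimal sketch, not part of the series: child_inherits_fd()
and pipe_read_end_inherited() are hypothetical helper names, and the code
only shows that a descriptor open in the parent is also open in the child
when no close-on-fork flag is involved.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork and report whether `fd` is open in the child.
 * Returns 1 if the child inherited it, 0 if not, -1 on error.
 */
int child_inherits_fd(int fd)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: F_GETFD fails with EBADF if fd is not open here. */
		_exit(fcntl(fd, F_GETFD) != -1 ? 1 : 0);
	}
	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}

/* Hypothetical helper: open a pipe and check its read end across fork(). */
int pipe_read_end_inherited(void)
{
	int fds[2];
	int ret;

	if (pipe(fds) != 0)
		return -1;
	ret = child_inherits_fd(fds[0]);
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

On a kernel without this series, pipe_read_end_inherited() is expected to
return 1. With the series applied and FD_CLOFORK set on the descriptor
before the fork(), the child would presumably report 0 instead.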

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

[PATCH 1/4] fs: Implement close-on-fork
[PATCH 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2)
[PATCH 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2)
[PATCH 4/4] net: Add SOCK_CLOFORK



* [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20  7:15 ` Nate Karstens
  (?)
@ 2020-04-20  7:15   ` Nate Karstens
  -1 siblings, 0 replies; 69+ messages in thread
From: Nate Karstens @ 2020-04-20  7:15 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao, Nate Karstens

The close-on-fork flag causes the file descriptor to be closed
atomically in the child process before the child process returns
from fork(). Implement this feature and provide a method to
get/set the close-on-fork flag using fcntl(2).
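
A caller might use the new flag roughly as follows. This is a sketch
under assumptions: try_set_close_on_fork() is a hypothetical helper, and
FD_CLOFORK is defined locally with the value proposed in this patch
because current headers may not provide it. A kernel without the patch
silently ignores the extra bit, so the helper reads the flags back to
find out whether it took effect.

```c
#include <assert.h>
#include <fcntl.h>

#ifndef FD_CLOFORK
#define FD_CLOFORK	2	/* assumption: value proposed by this patch */
#endif

/*
 * Hypothetical helper: request close-on-fork on `fd` while preserving
 * the other descriptor flags (F_SETFD writes every flag from its
 * argument, so a read-modify-write is needed to keep FD_CLOEXEC).
 *
 * Returns 1 if the kernel reports the flag as set afterwards, 0 if the
 * kernel ignored it, -1 on error.
 */
int try_set_close_on_fork(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFD, flags | FD_CLOFORK) < 0)
		return -1;

	/* Read the flags back: unsupported bits are dropped silently. */
	flags = fcntl(fd, F_GETFD);
	if (flags < 0)
		return -1;
	return (flags & FD_CLOFORK) ? 1 : 0;
}
```

Existing callers that pass only FD_CLOEXEC to F_SETFD would clear
close-on-fork under this patch, which is why the sketch ORs the new bit
into the flags it has just read.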

This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  2 ++
 fs/file.c                              | 50 +++++++++++++++++++++++++-
 include/linux/fdtable.h                |  7 ++++
 include/linux/file.h                   |  2 ++
 include/uapi/asm-generic/fcntl.h       |  5 +--
 tools/include/uapi/asm-generic/fcntl.h |  5 +--
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..23964abf4a1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
 		break;
 	case F_SETFD:
 		err = 0;
 		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		set_close_on_fork(fd, arg & FD_CLOFORK);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..de7260ba718d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memset((char *)nfdt->open_fds + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
+	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+	memset((char *)nfdt->close_on_fork + cpy, 0, set);
 
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
@@ -118,7 +120,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +128,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +240,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +305,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -337,6 +353,12 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (test_bit(open_files - i, new_fdt->close_on_fork)) {
+			__clear_bit(open_files - i, new_fdt->open_fds);
+			f = NULL;
+		}
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +475,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -865,6 +888,31 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+void set_close_on_fork(unsigned int fd, int flag)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flag)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+bool get_close_on_fork(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	bool res;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	res = close_on_fork(fd, fdt);
+	rcu_read_unlock();
+	return res;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..86fbb36b438b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -85,6 +85,8 @@ extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
+extern void set_close_on_fork(unsigned int fd, int flag);
+extern bool get_close_on_fork(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
-- 
2.26.1
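
The effect of the dup_fd() hunk above can be modelled in user space.
This is a simplified sketch, not kernel code: every model_* name and
MODEL_NR_FDS are made up for illustration, and locking, RCU, struct file
reference counting and full_fds_bits are all left out. It only shows the
per-descriptor bitmaps being copied and the open bit being dropped in
the child for each descriptor marked close-on-fork.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MODEL_NR_FDS	64	/* hypothetical table size for the model */
#define MODEL_BPL	(sizeof(unsigned long) * CHAR_BIT)
#define MODEL_LONGS	((MODEL_NR_FDS + MODEL_BPL - 1) / MODEL_BPL)

/* Three bitmaps per table, mirroring the "3 * nr" allocation change. */
struct model_fdtable {
	unsigned long open_fds[MODEL_LONGS];
	unsigned long close_on_exec[MODEL_LONGS];
	unsigned long close_on_fork[MODEL_LONGS];
};

static int model_test_bit(unsigned int nr, const unsigned long *map)
{
	assert(nr < MODEL_NR_FDS);
	return (int)((map[nr / MODEL_BPL] >> (nr % MODEL_BPL)) & 1UL);
}

static void model_set_bit(unsigned int nr, unsigned long *map)
{
	assert(nr < MODEL_NR_FDS);
	map[nr / MODEL_BPL] |= 1UL << (nr % MODEL_BPL);
}

static void model_clear_bit(unsigned int nr, unsigned long *map)
{
	assert(nr < MODEL_NR_FDS);
	map[nr / MODEL_BPL] &= ~(1UL << (nr % MODEL_BPL));
}

/* Copy the parent's bitmaps, then drop close-on-fork descriptors. */
void model_dup_fd(struct model_fdtable *child,
		  const struct model_fdtable *parent)
{
	unsigned int fd;

	*child = *parent;
	for (fd = 0; fd < MODEL_NR_FDS; fd++) {
		/*
		 * As in the posted hunk, only the open bit is cleared;
		 * the close_on_fork bit itself is left as copied.
		 */
		if (model_test_bit(fd, child->close_on_fork))
			model_clear_bit(fd, child->open_fds);
	}
}

/* Returns 1 if `fd` is still open in the modelled child, else 0. */
int model_fd_survives_fork(unsigned int fd, int close_on_fork)
{
	struct model_fdtable parent, child;

	memset(&parent, 0, sizeof(parent));
	model_set_bit(fd, parent.open_fds);
	if (close_on_fork)
		model_set_bit(fd, parent.close_on_fork);
	model_dup_fd(&child, &parent);
	return model_test_bit(fd, child.open_fds);
}
```

In this model, model_fd_survives_fork(3, 0) yields 1 and
model_fd_survives_fork(3, 1) yields 0, while the parent table is never
modified.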



* [PATCH 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2)
  2020-04-20  7:15 ` Nate Karstens
  (?)
@ 2020-04-20  7:15   ` Nate Karstens
  -1 siblings, 0 replies; 69+ messages in thread
From: Nate Karstens @ 2020-04-20  7:15 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao, Nate Karstens

Add the O_CLOFORK flag to open(2) and dup3(2) to automatically
set the close-on-fork flag in the new file descriptor, saving
a separate call to fcntl(2).
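
The call this flag saves might look like the fallback branch below. This
is a sketch under assumptions: open_close_on_fork() is a hypothetical
wrapper, FD_CLOFORK is defined locally with the value proposed in patch
1/4, and O_CLOFORK is used only when the headers already define it. The
fallback leaves a window between open(2) and fcntl(2) in which another
thread may fork.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef FD_CLOFORK
#define FD_CLOFORK	2	/* assumption: value proposed in patch 1/4 */
#endif

/*
 * Hypothetical wrapper: open `path` and request close-on-fork on the
 * new descriptor. Does not support O_CREAT (no mode argument).
 * Returns the descriptor, or -1 on error.
 */
int open_close_on_fork(const char *path, int flags)
{
#ifdef O_CLOFORK
	/* Single step: the flag is applied when the descriptor is created. */
	return open(path, flags | O_CLOFORK);
#else
	/* Two steps: a fork() between them still inherits the descriptor. */
	int fd = open(path, flags);
	int fdflags;

	if (fd < 0)
		return -1;
	fdflags = fcntl(fd, F_GETFD);
	if (fdflags < 0 || fcntl(fd, F_SETFD, fdflags | FD_CLOFORK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
#endif
}
```

On a kernel without this series the fallback still returns a usable
descriptor, but the close-on-fork request is silently ignored.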

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 arch/alpha/include/uapi/asm/fcntl.h    |  2 ++
 arch/parisc/include/uapi/asm/fcntl.h   | 39 +++++++++++++-------------
 arch/sparc/include/uapi/asm/fcntl.h    |  1 +
 fs/fcntl.c                             |  2 +-
 fs/file.c                              | 10 ++++++-
 include/linux/fcntl.h                  |  2 +-
 include/uapi/asm-generic/fcntl.h       |  4 +++
 tools/include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 42 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fbab69b15f7f 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -35,6 +35,8 @@
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
 
+#define O_CLOFORK	0200000000 /* set close_on_fork */
+
 #define F_GETLK		7
 #define F_SETLK		8
 #define F_SETLKW	9
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..8f5989e75b05 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND	000000010
-#define O_BLKSEEK	000000100 /* HPUX only */
-#define O_CREAT		000000400 /* not fcntl */
-#define O_EXCL		000002000 /* not fcntl */
-#define O_LARGEFILE	000004000
-#define __O_SYNC	000100000
+#define O_APPEND	0000000010
+#define O_BLKSEEK	0000000100 /* HPUX only */
+#define O_CREAT		0000000400 /* not fcntl */
+#define O_EXCL		0000002000 /* not fcntl */
+#define O_LARGEFILE	0000004000
+#define __O_SYNC	0000100000
 #define O_SYNC		(__O_SYNC|O_DSYNC)
-#define O_NONBLOCK	000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY	000400000 /* not fcntl */
-#define O_DSYNC		001000000 /* HPUX only */
-#define O_RSYNC		002000000 /* HPUX only */
-#define O_NOATIME	004000000
-#define O_CLOEXEC	010000000 /* set close_on_exec */
-
-#define O_DIRECTORY	000010000 /* must be a directory */
-#define O_NOFOLLOW	000000200 /* don't follow links */
-#define O_INVISIBLE	004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH		020000000
-#define __O_TMPFILE	040000000
+#define O_NONBLOCK	0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY	0000400000 /* not fcntl */
+#define O_DSYNC		0001000000 /* HPUX only */
+#define O_RSYNC		0002000000 /* HPUX only */
+#define O_NOATIME	0004000000
+#define O_CLOEXEC	0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY	0000010000 /* must be a directory */
+#define O_NOFOLLOW	0000000200 /* don't follow links */
+#define O_INVISIBLE	0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH		0020000000
+#define __O_TMPFILE	0040000000
+#define O_CLOFORK	0100000000
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..d631ea13bac3 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_CLOFORK	0x4000000
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 23964abf4a1a..b59b27c3a338 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1035,7 +1035,7 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
 		HWEIGHT32(
 			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
 			__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index de7260ba718d..95774b7962d1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -544,6 +544,10 @@ int __alloc_fd(struct files_struct *files,
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	error = fd;
 #if 1
 	/* Sanity check */
@@ -945,6 +949,10 @@ __releases(&files->file_lock)
 		__set_close_on_exec(fd, fdt);
 	else
 		__clear_close_on_exec(fd, fdt);
+	if (flags & O_CLOFORK)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
 	spin_unlock(&files->file_lock);
 
 	if (tofree)
@@ -985,7 +993,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
 	struct file *file;
 	struct files_struct *files = current->files;
 
-	if ((flags & ~O_CLOEXEC) != 0)
+	if ((flags & ~(O_CLOEXEC | O_CLOFORK)) != 0)
 		return -EINVAL;
 
 	if (unlikely(oldfd == newfd))
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..cd4c625647db 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
 	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
 	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
 	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
-	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CLOFORK)
 
 /* List of all valid flags for the how->upgrade_mask argument: */
 #define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0cb7199a7743..165a0736a3aa 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index e04a00fecb4a..69d8a000ec65 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
 #define __O_TMPFILE	020000000
 #endif
 
+#ifndef O_CLOFORK
+#define O_CLOFORK	040000000	/* set close_on_fork */
+#endif
+
 /* a horrid kludge trying to make sure that this will fail on old kernels */
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2)
  2020-04-20  7:15 ` Nate Karstens
  (?)
@ 2020-04-20  7:15   ` Nate Karstens
  -1 siblings, 0 replies; 69+ messages in thread
From: Nate Karstens @ 2020-04-20  7:15 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao, Nate Karstens

Implement functionality for duplicating a file descriptor
and having the close-on-fork flag automatically set in the
new file descriptor.
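
The effect on the per-descriptor flags can be modeled in a few lines of
standalone C. This is an illustrative sketch, not kernel code: struct
fd_flags, model_dupfd(), and model_getfd() are hypothetical names, the
M_ prefix marks model constants, the O_* values are the asm-generic
ones, and M_FD_CLOFORK is the value 2 introduced in patch 1/4.

```c
/* Model constants (asm-generic values; the CLOFORK ones come from this
 * series and are assumptions on any kernel without it). */
#define M_O_CLOEXEC  02000000
#define M_O_CLOFORK  040000000
#define M_FD_CLOEXEC 1
#define M_FD_CLOFORK 2

/* Hypothetical stand-in for one descriptor's two bitmap entries. */
struct fd_flags {
	int close_on_exec;
	int close_on_fork;
};

/* Mirrors the __alloc_fd() logic: each flag is set or cleared explicitly. */
static struct fd_flags model_dupfd(int flags)
{
	struct fd_flags f;

	f.close_on_exec = (flags & M_O_CLOEXEC) != 0;
	f.close_on_fork = (flags & M_O_CLOFORK) != 0;
	return f;
}

/* Mirrors the F_GETFD case: both bits are reported in one value. */
static int model_getfd(struct fd_flags f)
{
	int ret = f.close_on_exec ? M_FD_CLOEXEC : 0;

	ret |= f.close_on_fork ? M_FD_CLOFORK : 0;
	return ret;
}
```

In this model, F_DUPFD_CLOFORK corresponds to model_dupfd(M_O_CLOFORK):
the new descriptor reports the close-on-fork bit through F_GETFD and
does not report close-on-exec.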

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                       | 4 ++++
 include/uapi/linux/fcntl.h       | 3 +++
 tools/include/uapi/linux/fcntl.h | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index b59b27c3a338..43ca3e3dacc5 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -333,6 +333,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_DUPFD_CLOEXEC:
 		err = f_dupfd(arg, filp, O_CLOEXEC);
 		break;
+	case F_DUPFD_CLOFORK:
+		err = f_dupfd(arg, filp, O_CLOFORK);
+		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
 		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
@@ -439,6 +442,7 @@ static int check_fcntl_cmd(unsigned cmd)
 	switch (cmd) {
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
+	case F_DUPFD_CLOFORK:
 	case F_GETFD:
 	case F_SETFD:
 	case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK	(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* [PATCH 4/4] net: Add SOCK_CLOFORK
  2020-04-20  7:15 ` Nate Karstens
  (?)
@ 2020-04-20  7:15   ` Nate Karstens
  -1 siblings, 0 replies; 69+ messages in thread
From: Nate Karstens @ 2020-04-20  7:15 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao, Nate Karstens

Implements a new socket flag that automatically sets the
close-on-fork flag for sockets created using socket(2),
socketpair(2), and accept4(2).
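
As a rough sketch of how the flag travels in the type argument, the
check at the top of __sys_socket() can be modeled in standalone C.
split_socket_type() is a hypothetical name, the S_ prefix marks model
constants, the flag bits are the asm-generic values, and S_TYPE_MASK
assumes SOCK_TYPE_MASK is 0xf as in include/linux/net.h (not shown in
this patch).

```c
#include <errno.h>

/* Model constants; S_SOCK_CLOFORK is the bit added by this series. */
#define S_TYPE_MASK     0xf
#define S_SOCK_NONBLOCK 00004000
#define S_SOCK_CLOEXEC  02000000
#define S_SOCK_CLOFORK  040000000

/* Hypothetical model of the flag check: strips the flag bits from *type
 * and returns them, or returns -EINVAL if an unknown flag bit is set. */
static int split_socket_type(int *type)
{
	int flags = *type & ~S_TYPE_MASK;

	if (flags & ~(S_SOCK_CLOEXEC | S_SOCK_CLOFORK | S_SOCK_NONBLOCK))
		return -EINVAL;
	*type &= S_TYPE_MASK;
	return flags;
}
```

The returned bits are what the patch then passes on to sock_map_fd(),
which is why SOCK_CLOFORK has to equal O_CLOFORK.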

Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 include/linux/net.h |  3 ++-
 net/socket.c        | 14 ++++++++------
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f..57663c9dc8c4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -17,7 +17,7 @@
 #include <linux/stringify.h>
 #include <linux/random.h>
 #include <linux/wait.h>
-#include <linux/fcntl.h>	/* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/fcntl.h>	/* For O_CLOEXEC, O_CLOFORK, and O_NONBLOCK */
 #include <linux/rcupdate.h>
 #include <linux/once.h>
 #include <linux/fs.h>
@@ -73,6 +73,7 @@ enum sock_type {
 
 /* Flags for socket, socketpair, accept4 */
 #define SOCK_CLOEXEC	O_CLOEXEC
+#define SOCK_CLOFORK	O_CLOFORK
 #ifndef SOCK_NONBLOCK
 #define SOCK_NONBLOCK	O_NONBLOCK
 #endif
diff --git a/net/socket.c b/net/socket.c
index 2eecf1517f76..ba6e971c7e78 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1511,12 +1511,14 @@ int __sys_socket(int family, int type, int protocol)
 
 	/* Check the SOCK_* constants for consistency.  */
 	BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
+	BUILD_BUG_ON(SOCK_CLOFORK != O_CLOFORK);
 	BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
+	BUILD_BUG_ON(SOCK_CLOFORK & SOCK_TYPE_MASK);
 	BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1527,7 +1529,7 @@ int __sys_socket(int family, int type, int protocol)
 	if (retval < 0)
 		return retval;
 
-	return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+	return sock_map_fd(sock, flags & (O_CLOEXEC | O_CLOFORK | O_NONBLOCK));
 }
 
 SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
@@ -1547,7 +1549,7 @@ int __sys_socketpair(int family, int type, int protocol, int __user *usockvec)
 	int flags;
 
 	flags = type & ~SOCK_TYPE_MASK;
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 	type &= SOCK_TYPE_MASK;
 
@@ -1715,7 +1717,7 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
 	int err, len, newfd;
 	struct sockaddr_storage address;
 
-	if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+	if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
 		return -EINVAL;
 
 	if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -3628,8 +3630,8 @@ EXPORT_SYMBOL(kernel_listen);
  *	@newsock: new connected socket
  *	@flags: flags
  *
- *	@flags must be SOCK_CLOEXEC, SOCK_NONBLOCK or 0.
- *	If it fails, @newsock is guaranteed to be %NULL.
+ *	@flags must be SOCK_CLOEXEC, SOCK_CLOFORK, SOCK_NONBLOCK,
+ *	or 0. If it fails, @newsock is guaranteed to be %NULL.
  *	Returns 0 or an error.
  */
 
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20  7:15   ` Nate Karstens
@ 2020-04-20 10:25     ` Eric Dumazet
  -1 siblings, 0 replies; 69+ messages in thread
From: Eric Dumazet @ 2020-04-20 10:25 UTC (permalink / raw)
  To: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao



On 4/20/20 12:15 AM, Nate Karstens wrote:
> The close-on-fork flag causes the file descriptor to be closed
> atomically in the child process before the child process returns
> from fork(). Implement this feature and provide a method to
> get/set the close-on-fork flag using fcntl(2).
> 
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

Oh well... yet another feature slowing down a critical path.

> 
> Co-developed-by: Changli Gao <xiaosuo@gmail.com>
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
> ---
>  fs/fcntl.c                             |  2 ++
>  fs/file.c                              | 50 +++++++++++++++++++++++++-
>  include/linux/fdtable.h                |  7 ++++
>  include/linux/file.h                   |  2 ++
>  include/uapi/asm-generic/fcntl.h       |  5 +--
>  tools/include/uapi/asm-generic/fcntl.h |  5 +--
>  6 files changed, 66 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 2e4c0fa2074b..23964abf4a1a 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  		break;
>  	case F_GETFD:
>  		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> +		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
>  		break;
>  	case F_SETFD:
>  		err = 0;
>  		set_close_on_exec(fd, arg & FD_CLOEXEC);
> +		set_close_on_fork(fd, arg & FD_CLOFORK);
>  		break;
>  	case F_GETFL:
>  		err = filp->f_flags;
> diff --git a/fs/file.c b/fs/file.c
> index c8a4e4c86e55..de7260ba718d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
>  	memset((char *)nfdt->open_fds + cpy, 0, set);
>  	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
>  	memset((char *)nfdt->close_on_exec + cpy, 0, set);
> +	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
> +	memset((char *)nfdt->close_on_fork + cpy, 0, set);
>  

I suggest we group the two bits of a file (close_on_exec, close_on_fork) together,
so that we do not have to dirty two separate cache lines.

Otherwise we will add yet another cache line miss at every file opening/closing for processes
with big file tables.

I.e. have a _single_ bitmap array, with even bits for close_on_exec and odd bits for close_on_fork:

static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt)
{
	__set_bit(fd * 2, fdt->close_on_fork_exec);
}

static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
{
	__set_bit(fd * 2 + 1, fdt->close_on_fork_exec);
}

Also, the F_GETFD/F_SETFD implementation must use a single function call,
so that the spinlock is not acquired twice.
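
The interleaved layout and the single-call accessors suggested above could
look roughly like the following user-space sketch. It is not kernel code:
struct demo_fdtable, the DEMO_* constants and the demo_* helpers are all
hypothetical names, plain bit operations stand in for __set_bit(), and
locking is left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_MAX_FDS       256
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS \
	((2 * DEMO_MAX_FDS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Illustrative flag values returned by demo_get_fd_flags(). */
#define DEMO_FD_CLOEXEC 1u
#define DEMO_FD_CLOFORK 2u

/* Hypothetical stand-in for struct fdtable: one bitmap, two bits per fd. */
struct demo_fdtable {
	unsigned long close_on_fork_exec[DEMO_WORDS];
};

static void demo_assign_bit(unsigned long *map, unsigned int bit, bool on)
{
	unsigned long mask = 1UL << (bit % DEMO_BITS_PER_LONG);

	if (on)
		map[bit / DEMO_BITS_PER_LONG] |= mask;
	else
		map[bit / DEMO_BITS_PER_LONG] &= ~mask;
}

static bool demo_test_bit(const unsigned long *map, unsigned int bit)
{
	return (map[bit / DEMO_BITS_PER_LONG] >> (bit % DEMO_BITS_PER_LONG)) & 1UL;
}

/*
 * Update both flags in one call: even bit is close-on-exec, odd bit is
 * close-on-fork. A caller that needs a lock would take it once around this.
 */
static void demo_set_fd_flags(struct demo_fdtable *fdt, unsigned int fd,
			      bool cloexec, bool clofork)
{
	assert(fd < DEMO_MAX_FDS);
	demo_assign_bit(fdt->close_on_fork_exec, fd * 2, cloexec);
	demo_assign_bit(fdt->close_on_fork_exec, fd * 2 + 1, clofork);
}

/* Read both flags in one call. */
static unsigned int demo_get_fd_flags(const struct demo_fdtable *fdt,
				      unsigned int fd)
{
	unsigned int flags = 0;

	assert(fd < DEMO_MAX_FDS);
	if (demo_test_bit(fdt->close_on_fork_exec, fd * 2))
		flags |= DEMO_FD_CLOEXEC;
	if (demo_test_bit(fdt->close_on_fork_exec, fd * 2 + 1))
		flags |= DEMO_FD_CLOFORK;
	return flags;
}
```

Both bits of a given fd always sit in the same word here, which is the
property the suggestion relies on to touch one cache line instead of two.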



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20 10:25     ` Eric Dumazet
@ 2020-04-22  3:38       ` Changli Gao
  -1 siblings, 0 replies; 69+ messages in thread
From: Changli Gao @ 2020-04-22  3:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, Linux Netdev List,
	Linux Kernel Mailing List

On Mon, Apr 20, 2020 at 6:25 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 4/20/20 12:15 AM, Nate Karstens wrote:
> > The close-on-fork flag causes the file descriptor to be closed
> > atomically in the child process before the child process returns
> > from fork(). Implement this feature and provide a method to
> > get/set the close-on-fork flag using fcntl(2).
> >
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
>
> Oh well... yet another feature slowing down a critical path.
>
> >
> > Co-developed-by: Changli Gao <xiaosuo@gmail.com>
> > Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> > Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
> > ---
> >  fs/fcntl.c                             |  2 ++
> >  fs/file.c                              | 50 +++++++++++++++++++++++++-
> >  include/linux/fdtable.h                |  7 ++++
> >  include/linux/file.h                   |  2 ++
> >  include/uapi/asm-generic/fcntl.h       |  5 +--
> >  tools/include/uapi/asm-generic/fcntl.h |  5 +--
> >  6 files changed, 66 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 2e4c0fa2074b..23964abf4a1a 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> >               break;
> >       case F_GETFD:
> >               err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> > +             err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
> >               break;
> >       case F_SETFD:
> >               err = 0;
> >               set_close_on_exec(fd, arg & FD_CLOEXEC);
> > +             set_close_on_fork(fd, arg & FD_CLOFORK);
> >               break;
> >       case F_GETFL:
> >               err = filp->f_flags;
> > diff --git a/fs/file.c b/fs/file.c
> > index c8a4e4c86e55..de7260ba718d 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
> >       memset((char *)nfdt->open_fds + cpy, 0, set);
> >       memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
> >       memset((char *)nfdt->close_on_exec + cpy, 0, set);
> > +     memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
> > +     memset((char *)nfdt->close_on_fork + cpy, 0, set);
> >
>
> I suggest we group the two bits of a file (close_on_exec, close_on_fork) together,
> so that we do not have to dirty two separate cache lines.
>
> Otherwise we will add yet another cache line miss at every file opening/closing for processes
> with big file tables.
>
> Ie having a _single_ bitmap array, even bit for close_on_exec, odd bit for close_on_fork
>
> static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt)
> {
>         __set_bit(fd * 2, fdt->close_on_fork_exec);
> }
>
> static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
> {
>         __set_bit(fd * 2 + 1, fdt->close_on_fork_exec);
> }
>
> Also the F_GETFD/F_SETFD implementation must use a single function call,
> to not acquire the spinlock twice.
>

Good suggestions.

At the same time, we'd better extend the other syscalls that can set
FD_CLOEXEC when creating FDs, e.g. open, pipe2...


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22  3:38       ` Changli Gao
@ 2020-04-22  3:41         ` Changli Gao
  -1 siblings, 0 replies; 69+ messages in thread
From: Changli Gao @ 2020-04-22  3:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, Linux Netdev List,
	Linux Kernel Mailing List

On Wed, Apr 22, 2020 at 11:38 AM Changli Gao <xiaosuo@gmail.com> wrote:
> At the same time, we'd better extend the other syscalls that can set
> FD_CLOEXEC when creating FDs, e.g. open, pipe2...
>

Ignore me, I missed the latter patches.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20 10:25     ` Eric Dumazet
  (?)
@ 2020-04-22  8:35       ` David Laight
  -1 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-04-22  8:35 UTC (permalink / raw)
  To: 'Eric Dumazet',
	Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao

From: Eric Dumazet
> Sent: 20 April 2020 11:26
> On 4/20/20 12:15 AM, Nate Karstens wrote:
> > The close-on-fork flag causes the file descriptor to be closed
> > atomically in the child process before the child process returns
> > from fork(). Implement this feature and provide a method to
> > get/set the close-on-fork flag using fcntl(2).
> >
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
> 
> Oh well... yet another feature slowing down a critical path.
...
> I suggest we group the two bits of a file (close_on_exec, close_on_fork) together,
> so that we do not have to dirty two separate cache lines.
> 
> Otherwise we will add yet another cache line miss at every file opening/closing for processes
> with big file tables.

How about only allocating the 'close on fork' bitmap the first time
a process sets a bit in it?
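
A rough user-space sketch of that lazy allocation, under loud assumptions:
struct demo_fdtable and the demo_* helpers are hypothetical, calloc() stands
in for whatever allocator the kernel would use, and locking, resizing and
copying the table on fork() are all ignored.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for struct fdtable. */
struct demo_fdtable {
	unsigned int max_fds;
	unsigned long *close_on_fork;	/* NULL until a bit is first set */
};

static size_t demo_bitmap_words(unsigned int max_fds)
{
	return (max_fds + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG;
}

/* Returns 0 on success, -1 if fd is out of range or allocation fails. */
static int demo_set_close_on_fork(struct demo_fdtable *fdt, unsigned int fd)
{
	assert(fdt != NULL);
	if (fd >= fdt->max_fds)
		return -1;

	/* Allocate the bitmap only when the first bit is about to be set. */
	if (!fdt->close_on_fork) {
		fdt->close_on_fork = calloc(demo_bitmap_words(fdt->max_fds),
					    sizeof(unsigned long));
		if (!fdt->close_on_fork)
			return -1;
	}

	fdt->close_on_fork[fd / DEMO_BITS_PER_LONG] |=
		1UL << (fd % DEMO_BITS_PER_LONG);
	return 0;
}

/* A table that never allocated the bitmap has no close-on-fork fds. */
static bool demo_close_on_fork(const struct demo_fdtable *fdt, unsigned int fd)
{
	assert(fdt != NULL);
	if (!fdt->close_on_fork || fd >= fdt->max_fds)
		return false;
	return (fdt->close_on_fork[fd / DEMO_BITS_PER_LONG] >>
		(fd % DEMO_BITS_PER_LONG)) & 1UL;
}
```

Processes that never set the flag pay only for a NULL check on the read
side in this sketch; the cost moves to the first set operation.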

Offhand, I can't imagine the use case.
I thought POSIX always shared fd tables across fork().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-20  7:15 ` Nate Karstens
@ 2020-04-22 14:32   ` James Bottomley
  -1 siblings, 0 replies; 69+ messages in thread
From: James Bottomley @ 2020-04-22 14:32 UTC (permalink / raw)
  To: Nate Karstens, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Helge Deller, David S. Miller, Jakub Kicinski, linux-fsdevel,
	linux-arch, linux-alpha, linux-parisc, sparclinux, netdev,
	linux-kernel
  Cc: Changli Gao

On Mon, 2020-04-20 at 02:15 -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> 
> close-on-fork addresses race conditions in system(), which
> (depending on the implementation) is non-atomic in that it
> first calls a fork() and then an exec().

Why is this a problem?  I get that there's a time between fork and exec
when you have open file descriptors, but they should still be running
in the binary context of the programme that called fork, i.e. under
your control.  The security problems don't seem to occur until you exec
some random binary, which close on exec covers.  So what problem would
close on fork fix?
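
The window between fork() and exec() mentioned above can be observed from
user space with existing interfaces. The sketch below is only an
illustration of that window, not part of the series;
demo_cloexec_fd_visible_after_fork() is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child created by fork() still sees a descriptor that the
 * parent marked FD_CLOEXEC, 0 if it does not, and -1 on error.
 */
static int demo_cloexec_fd_visible_after_fork(void)
{
	int fds[2];
	int status;
	pid_t pid;

	if (pipe(fds) < 0)
		return -1;

	if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child, before any exec(): F_GETFD fails only if fd is closed. */
		_exit(fcntl(fds[0], F_GETFD) == -1 ? 0 : 1);
	}

	close(fds[0]);
	close(fds[1]);

	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

FD_CLOEXEC takes effect only at exec(), so the child holds its own copy of
the descriptor until then; the helper reports that by returning 1.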

> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

URL?  Does this standard give a reason why the functionality might be
useful?

James


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-20  7:15 ` Nate Karstens
@ 2020-04-22 15:01   ` Al Viro
  -1 siblings, 0 replies; 69+ messages in thread
From: Al Viro @ 2020-04-22 15:01 UTC (permalink / raw)
  To: Nate Karstens
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, linux-fsdevel, linux-arch,
	linux-alpha, linux-parisc, sparclinux, netdev, linux-kernel,
	Changli Gao

On Mon, Apr 20, 2020 at 02:15:44AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> 
> close-on-fork addresses race conditions in system(), which
> (depending on the implementation) is non-atomic in that it
> first calls a fork() and then an exec().
> 
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).

What exactly are the reasons, and why would we want to implement that?

Pardon me, but going by the previous history, "The Austin Group Says It's
Good" is more of a source of concern regarding the merits, general sanity
and, most of all, good taste of a proposal.

I'm not saying that it's automatically bad, but you'll have to go much
deeper into the rationale of that change before your proposal is taken
seriously.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-22 15:01   ` Al Viro
@ 2020-04-22 15:18     ` Matthew Wilcox
  -1 siblings, 0 replies; 69+ messages in thread
From: Matthew Wilcox @ 2020-04-22 15:18 UTC (permalink / raw)
  To: Al Viro
  Cc: Nate Karstens, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

On Wed, Apr 22, 2020 at 04:01:07PM +0100, Al Viro wrote:
> On Mon, Apr 20, 2020 at 02:15:44AM -0500, Nate Karstens wrote:
> > Series of 4 patches to implement close-on-fork. Tests have been
> > published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> > 
> > close-on-fork addresses race conditions in system(), which
> > (depending on the implementation) is non-atomic in that it
> > first calls a fork() and then an exec().
> > 
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
> 
> What exactly the reasons are and why would we want to implement that?
> 
> Pardon me, but going by the previous history, "The Austin Group Says It's
> Good" is more of a source of concern regarding the merits, general sanity
> and, most of all, good taste of a proposal.
> 
> I'm not saying that it's automatically bad, but you'll have to go much
> deeper into the rationale of that change before your proposal is taken
> seriously.

https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html
might be useful

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-22 15:18     ` Matthew Wilcox
@ 2020-04-22 15:34       ` James Bottomley
  -1 siblings, 0 replies; 69+ messages in thread
From: James Bottomley @ 2020-04-22 15:34 UTC (permalink / raw)
  To: Matthew Wilcox, Al Viro
  Cc: Nate Karstens, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner, Helge Deller,
	David S. Miller, Jakub Kicinski, linux-fsdevel, linux-arch,
	linux-alpha, linux-parisc, sparclinux, netdev, linux-kernel,
	Changli Gao

On Wed, 2020-04-22 at 08:18 -0700, Matthew Wilcox wrote:
> On Wed, Apr 22, 2020 at 04:01:07PM +0100, Al Viro wrote:
> > On Mon, Apr 20, 2020 at 02:15:44AM -0500, Nate Karstens wrote:
> > > Series of 4 patches to implement close-on-fork. Tests have been
> > > published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> > > 
> > > close-on-fork addresses race conditions in system(), which
> > > (depending on the implementation) is non-atomic in that it
> > > first calls a fork() and then an exec().
> > > 
> > > This functionality was approved by the Austin Common Standards
> > > Revision Group for inclusion in the next revision of the POSIX
> > > standard (see issue 1318 in the Austin Group Defect Tracker).
> > 
> > What exactly the reasons are and why would we want to implement
> > that?
> > 
> > Pardon me, but going by the previous history, "The Austin Group
> > Says It's Good" is more of a source of concern regarding the
> > merits, general sanity and, most of all, good taste of a proposal.
> > 
> > I'm not saying that it's automatically bad, but you'll have to go
> > much deeper into the rationale of that change before your proposal
> > is taken seriously.
> 
> https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html
> might be useful

So the problem is an application is written in such a way that the time
window after it forks and before it execs can cause a file descriptor
based resource to be held when the application state thinks it should
have been released because of a mismatch in the expected use count?

Might it not be easier to rewrite the application for this problem
rather than the kernel?  Especially as the best justification in the
entire thread seems to be "because Solaris had it".

James


^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20  7:15   ` Nate Karstens
  (?)
  (?)
@ 2020-04-22 15:36     ` Karstens, Nate
  -1 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-04-22 15:36 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, David Laight,
	Matthew Wilcox
  Cc: Changli Gao

Thanks for everyone's feedback so far. There have been a few questions on why this feature is necessary/desired, so I'll describe that here.

We are running Linux on an embedded system. The platform can change the IP address either according to a proprietary negotiation scheme or a manual setting. The application uses netlink to listen for IP address changes; when this occurs the application closes all of its sockets and re-opens them using the new address. A problem can occur if the application is simultaneously fork/exec-ing a new process. The parent process attempts to bind a new socket to a port that it had previously bound to (before the IP address change), only to fail because the child process continues to hold a socket bound to that port.

Our initial solution was to use pthread_atfork() handlers to lock a mutex and wait for the child process to close all of its sockets (as signaled through an eventfd) before the parent attempts to create them again. This doesn't work if the application uses system() anywhere, because glibc's system() does not invoke pthread_atfork() handlers. Older versions did, but this was removed, and the Linux manpage for system(3) notes that "According to POSIX.1, it is unspecified whether handlers registered using pthread_atfork(3) are called during the execution of system(). In the glibc implementation, such handlers are not called."

This issue was discussed in the Austin Group mailing list; the root message is here:

https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html

There was some skepticism about whether our practice of closing/reopening sockets was advisable. Regardless, it does expose what I believe to be something that was overlooked in the forking process model. We posted two solutions to the Austin Group defect tracker:

http://austingroupbugs.net/view.php?id=1317
http://austingroupbugs.net/view.php?id=1318

Ultimately the Austin Group felt that close-on-fork was the preferred approach. I think it's also worth pointing out that Solaris reportedly has this feature (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).

Cheers,

Nate

-----Original Message-----
From: Karstens, Nate <Nate.Karstens@garmin.com> 
Sent: Monday, April 20, 2020 02:16
To: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@HansenPartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Changli Gao <xiaosuo@gmail.com>; Karstens, Nate <Nate.Karstens@garmin.com>
Subject: [PATCH 1/4] fs: Implement close-on-fork

The close-on-fork flag causes the file descriptor to be closed atomically in the child process before the child process returns from fork(). Implement this feature and provide a method to get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards Revision Group for inclusion in the next revision of the POSIX standard (see issue 1318 in the Austin Group Defect Tracker).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  2 ++
 fs/file.c                              | 50 +++++++++++++++++++++++++-
 include/linux/fdtable.h                |  7 ++++
 include/linux/file.h                   |  2 ++
 include/uapi/asm-generic/fcntl.h       |  5 +--
 tools/include/uapi/asm-generic/fcntl.h |  5 +--
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..23964abf4a1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
 		break;
 	case F_SETFD:
 		err = 0;
 		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		set_close_on_fork(fd, arg & FD_CLOFORK);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..de7260ba718d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memset((char *)nfdt->open_fds + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
+	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+	memset((char *)nfdt->close_on_fork + cpy, 0, set);
 
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
@@ -118,7 +120,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +128,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +240,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +305,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -337,6 +353,12 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (test_bit(open_files - i, new_fdt->close_on_fork)) {
+			__clear_bit(open_files - i, new_fdt->open_fds);
+			f = NULL;
+		}
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +475,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -865,6 +888,31 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+void set_close_on_fork(unsigned int fd, int flag)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flag)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+bool get_close_on_fork(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	bool res;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	res = close_on_fork(fd, fdt);
+	rcu_read_unlock();
+	return res;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..86fbb36b438b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -85,6 +85,8 @@ extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
+extern void set_close_on_fork(unsigned int fd, int flag);
+extern bool get_close_on_fork(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
--
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
@ 2020-04-22 15:36     ` Karstens, Nate
  0 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-04-22 15:36 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux
  Cc: Changli Gao

Thanks for everyone's feedback so far. There have been a few questions on why this feature is necessary/desired, so I'll describe that here.

We are running Linux on an embedded system. The platform can change the IP address either according to a proprietary negotiation scheme or a manual setting. The application uses netlink to listen for IP address changes; when this occurs the application closes all of its sockets and re-opens them using the new address. A problem can occur if the application is simultaneously fork/exec-ing a new process. The parent process attempts to bind a new socket to a port that it had previously bound to (before the IP address change), only to fail because the child process continues to hold a socket bound to that port.

Our initial solution was to use pthread_atfork() handlers to lock a mutex and wait for the child process to close all of its sockets (as signaled through an eventfd) before the parent attempts to create them again. This doesn't work if the application uses system() anywhere. glibc does not invoke pthread_atfork() handlers; older versions did but this was removed, and the Linux manpage for system(2) notes that "According  to  POSIX.1,  it is unspecified whether handlers registered using pthread_atfork(3) are called during the execution of system().  In the glibc implementation, such handlers are not called. "

This issue was discussed in the Austin Group mailing list; the root message is here:

https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html

There was some skepticism about whether our practice of closing/reopening sockets was advisable. Regardless, it does expose what I believe to be something that was overlooked in the forking process model. We posted two solutions to the Austin Group defect tracker:

http://austingroupbugs.net/view.php?id=1317
http://austingroupbugs.net/view.php?id=1318

Ultimately the Austin Group felt that close-on-fork was the preferred approach. I think it's also worth pointing that out Solaris reportedly has this feature (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).

Cheers,

Nate

-----Original Message-----
From: Karstens, Nate <Nate.Karstens@garmin.com> 
Sent: Monday, April 20, 2020 02:16
To: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@HansenPartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Changli Gao <xiaosuo@gmail.com>; Karstens, Nate <Nate.Karstens@garmin.com>
Subject: [PATCH 1/4] fs: Implement close-on-fork

The close-on-fork flag causes the file descriptor to be closed atomically in the child process before the child process returns from fork(). Implement this feature and provide a method to get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards Revision Group for inclusion in the next revision of the POSIX standard (see issue 1318 in the Austin Group Defect Tracker).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  2 ++
 fs/file.c                              | 50 +++++++++++++++++++++++++-
 include/linux/fdtable.h                |  7 ++++
 include/linux/file.h                   |  2 ++
 include/uapi/asm-generic/fcntl.h       |  5 +--
 tools/include/uapi/asm-generic/fcntl.h |  5 +--
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..23964abf4a1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
 		break;
 	case F_SETFD:
 		err = 0;
 		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		set_close_on_fork(fd, arg & FD_CLOFORK);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..de7260ba718d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memset((char *)nfdt->open_fds + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
+	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+	memset((char *)nfdt->close_on_fork + cpy, 0, set);
 
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy; @@ -118,7 +120,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +128,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +240,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable 
+*fdt) {
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct 
+fdtable *fdt) {
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)  {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +305,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -337,6 +353,12 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (test_bit(open_files - i, new_fdt->close_on_fork)) {
+			__clear_bit(open_files - i, new_fdt->open_fds);
+			f = NULL;
+		}
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +475,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -865,6 +888,31 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+void set_close_on_fork(unsigned int fd, int flag) {
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flag)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+bool get_close_on_fork(unsigned int fd) {
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	bool res;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	res = close_on_fork(fd, fdt);
+	rcu_read_unlock();
+	return res;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..86fbb36b438b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -85,6 +85,8 @@ extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
+extern void set_close_on_fork(unsigned int fd, int flag);
+extern bool get_close_on_fork(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
--
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
@ 2020-04-22 15:36     ` Karstens, Nate
  0 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-04-22 15:36 UTC (permalink / raw)
  To: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev@vger.kernel.org
  Cc: Changli Gao

Thanks for everyone's feedback so far. There have been a few questions on why this feature is necessary/desired, so I'll describe that here.

We are running Linux on an embedded system. The platform can change the IP address either according to a proprietary negotiation scheme or a manual setting. The application uses netlink to listen for IP address changes; when this occurs the application closes all of its sockets and re-opens them using the new address. A problem can occur if the application is simultaneously fork/exec-ing a new process. The parent process attempts to bind a new socket to a port that it had previously bound to (before the IP address change), only to fail because the child process continues to hold a socket bound to that port.
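
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not code from the application in question: the function names are hypothetical, the child simply blocks on a pipe to stand in for the window between fork() and exec(), and the socket is bound to INADDR_ANY on an ephemeral port rather than to a real service address. On Linux, the second bind() is expected to fail with EADDRINUSE while the child still holds the inherited socket.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bind a TCP socket to INADDR_ANY:port (port 0 lets the kernel pick one).
 * Returns the descriptor, or -1 with errno set. */
static int bind_any(in_port_t port_net)
{
	struct sockaddr_in sa;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_ANY);
	sa.sin_port = port_net;
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/* Returns the errno of the parent's re-bind attempt made while a forked
 * child still holds the inherited socket (0 if the re-bind succeeded),
 * or -1 if the setup itself failed. */
int rebind_errno_with_child_holding_socket(void)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof(sa);
	int gate[2], fd, fd2, result;
	pid_t pid;
	char c;

	fd = bind_any(0);
	if (fd < 0 || getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
		return -1;
	if (pipe(gate) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: inherited fd is still open; wait for the parent. */
		close(gate[1]);
		if (read(gate[0], &c, 1) < 0)
			_exit(1);
		_exit(0);
	}

	close(gate[0]);
	close(fd);			/* parent closes its socket... */
	fd2 = bind_any(sa.sin_port);	/* ...and tries to bind the port again */
	result = fd2 < 0 ? errno : 0;
	if (fd2 >= 0)
		close(fd2);

	close(gate[1]);			/* EOF on the pipe releases the child */
	waitpid(pid, NULL, 0);
	return result;
}
```

The parent has closed its own descriptor before the second bind(), so the only remaining reference to the old socket is the copy the child inherited across fork().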

Our initial solution was to use pthread_atfork() handlers to lock a mutex and wait for the child process to close all of its sockets (as signaled through an eventfd) before the parent attempts to create them again. This doesn't work if the application uses system() anywhere. glibc does not invoke pthread_atfork() handlers from system(); older versions did, but this was removed, and the Linux man page for system(3) notes that "According to POSIX.1, it is unspecified whether handlers registered using pthread_atfork(3) are called during the execution of system(). In the glibc implementation, such handlers are not called."

This issue was discussed in the Austin Group mailing list; the root message is here:

https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html

There was some skepticism about whether our practice of closing/reopening sockets was advisable. Regardless, it does expose what I believe to be something that was overlooked in the forking process model. We posted two solutions to the Austin Group defect tracker:

http://austingroupbugs.net/view.php?id=1317
http://austingroupbugs.net/view.php?id=1318

Ultimately the Austin Group felt that close-on-fork was the preferred approach. I think it's also worth pointing out that Solaris reportedly has this feature (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).
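
From user space, the interface proposed in patch 1/4 would be used through fcntl(2) much like FD_CLOEXEC. The following is a sketch only: FD_CLOFORK is defined here with the value 2 taken from this series and is not present in mainline Linux headers, the helper names are hypothetical, and on a kernel without the patch F_SETFD only looks at FD_CLOEXEC, so the extra bit is accepted but has no effect and is not reported back by F_GETFD.

```c
#include <fcntl.h>

#ifndef FD_CLOFORK
#define FD_CLOFORK 2	/* value proposed in this series; an assumption elsewhere */
#endif

/* Set close-on-exec and close-on-fork on fd, preserving any other
 * descriptor flags. Returns 0 on success, -1 with errno set on failure. */
int set_cloexec_and_clofork(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC | FD_CLOFORK) < 0 ? -1 : 0;
}

/* Returns 1 if the kernel reports close-on-fork for fd, 0 if it does not
 * (including kernels that do not implement the flag), -1 on error. */
int has_clofork(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOFORK) ? 1 : 0;
}
```

Calling has_clofork() after set_cloexec_and_clofork() gives a simple runtime check for whether the running kernel actually implements the flag.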

Cheers,

Nate

-----Original Message-----
From: Karstens, Nate <Nate.Karstens@garmin.com> 
Sent: Monday, April 20, 2020 02:16
To: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@HansenPartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Changli Gao <xiaosuo@gmail.com>; Karstens, Nate <Nate.Karstens@garmin.com>
Subject: [PATCH 1/4] fs: Implement close-on-fork

The close-on-fork flag causes the file descriptor to be closed atomically in the child process before the child process returns from fork(). Implement this feature and provide a method to get/set the close-on-fork flag using fcntl(2).

This functionality was approved by the Austin Common Standards Revision Group for inclusion in the next revision of the POSIX standard (see issue 1318 in the Austin Group Defect Tracker).

Co-developed-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
---
 fs/fcntl.c                             |  2 ++
 fs/file.c                              | 50 +++++++++++++++++++++++++-
 include/linux/fdtable.h                |  7 ++++
 include/linux/file.h                   |  2 ++
 include/uapi/asm-generic/fcntl.h       |  5 +--
 tools/include/uapi/asm-generic/fcntl.h |  5 +--
 6 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..23964abf4a1a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		break;
 	case F_GETFD:
 		err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+		err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
 		break;
 	case F_SETFD:
 		err = 0;
 		set_close_on_exec(fd, arg & FD_CLOEXEC);
+		set_close_on_fork(fd, arg & FD_CLOFORK);
 		break;
 	case F_GETFL:
 		err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..de7260ba718d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
 	memset((char *)nfdt->open_fds + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)nfdt->close_on_exec + cpy, 0, set);
+	memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+	memset((char *)nfdt->close_on_fork + cpy, 0, set);
 
 	cpy = BITBIT_SIZE(count);
 	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
@@ -118,7 +120,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = kvmalloc(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+				 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
 				 GFP_KERNEL_ACCOUNT);
 	if (!data)
 		goto out_arr;
@@ -126,6 +128,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
 	data += nr / BITS_PER_BYTE;
+	fdt->close_on_fork = data;
+	data += nr / BITS_PER_BYTE;
 	fdt->full_fds_bits = data;
 
 	return fdt;
@@ -236,6 +240,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
 		__clear_bit(fd, fdt->close_on_exec);
 }
 
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	__set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+	if (test_bit(fd, fdt->close_on_fork))
+		__clear_bit(fd, fdt->close_on_fork);
+}
+
 static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
@@ -290,6 +305,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
+	new_fdt->close_on_fork = newf->close_on_fork_init;
 	new_fdt->open_fds = newf->open_fds_init;
 	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
@@ -337,6 +353,12 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
+
+		if (test_bit(open_files - i, new_fdt->close_on_fork)) {
+			__clear_bit(open_files - i, new_fdt->open_fds);
+			f = NULL;
+		}
+
 		if (f) {
 			get_file(f);
 		} else {
@@ -453,6 +475,7 @@ struct files_struct init_files = {
 		.max_fds	= NR_OPEN_DEFAULT,
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
+		.close_on_fork	= init_files.close_on_fork_init,
 		.open_fds	= init_files.open_fds_init,
 		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
@@ -865,6 +888,31 @@ bool get_close_on_exec(unsigned int fd)
 	return res;
 }
 
+void set_close_on_fork(unsigned int fd, int flag)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+	if (flag)
+		__set_close_on_fork(fd, fdt);
+	else
+		__clear_close_on_fork(fd, fdt);
+	spin_unlock(&files->file_lock);
+}
+
+bool get_close_on_fork(unsigned int fd)
+{
+	struct files_struct *files = current->files;
+	struct fdtable *fdt;
+	bool res;
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	res = close_on_fork(fd, fdt);
+	rcu_read_unlock();
+	return res;
+}
+
 static int do_dup2(struct files_struct *files,
 	struct file *file, unsigned fd, unsigned flags)
 __releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
 	unsigned int max_fds;
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
+	unsigned long *close_on_fork;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
 	return test_bit(fd, fdt->close_on_exec);
 }
 
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+	return test_bit(fd, fdt->close_on_fork);
+}
+
 static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
 {
 	return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
 	unsigned int next_fd;
 	unsigned long close_on_exec_init[1];
+	unsigned long close_on_fork_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..86fbb36b438b 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -85,6 +85,8 @@ extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
 extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
 extern void set_close_on_exec(unsigned int fd, int flag);
 extern bool get_close_on_exec(unsigned int fd);
+extern void set_close_on_fork(unsigned int fd, int flag);
+extern bool get_close_on_fork(unsigned int fd);
 extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
 extern int get_unused_fd_flags(unsigned flags);
 extern void put_unused_fd(unsigned int fd);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
 #endif
 
 #define F_DUPFD		0	/* dup */
-#define F_GETFD		1	/* get close_on_exec */
-#define F_SETFD		2	/* set/clear close_on_exec */
+#define F_GETFD		1	/* get close_on_exec & close_on_fork */
+#define F_SETFD		2	/* set/clear close_on_exec & close_on_fork */
 #define F_GETFL		3	/* get file->f_flags */
 #define F_SETFL		4	/* set file->f_flags */
 #ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
 
 /* for F_[GET|SET]FL */
 #define FD_CLOEXEC	1	/* actually anything with low bit set goes */
+#define FD_CLOFORK	2
 
 /* for posix fcntl() and lockf() */
 #ifndef F_RDLCK
--
2.26.1


^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22 15:36     ` Karstens, Nate
  (?)
  (?)
@ 2020-04-22 15:43       ` Matthew Wilcox
  -1 siblings, 0 replies; 69+ messages in thread
From: Matthew Wilcox @ 2020-04-22 15:43 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, David Laight,
	Changli Gao

On Wed, Apr 22, 2020 at 03:36:09PM +0000, Karstens, Nate wrote:
> There was some skepticism about whether our practice of
> closing/reopening sockets was advisable. Regardless, it does expose what
> I believe to be something that was overlooked in the forking process
> model. We posted two solutions to the Austin Group defect tracker:

I don't think it was "overlooked" at all.  It's not safe to call system()
from a threaded app.  That's all.  It's right there in the DESCRIPTION:

   The system() function need not be thread-safe.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html

> Ultimately the Austin Group felt that close-on-fork
> was the preferred approach. I think it's also worth
> pointing out that Solaris reportedly has this feature
> (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).

I am perplexed that the Austin Group thought this was a good idea.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-22 15:18     ` Matthew Wilcox
@ 2020-04-22 16:00       ` Al Viro
  -1 siblings, 0 replies; 69+ messages in thread
From: Al Viro @ 2020-04-22 16:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

On Wed, Apr 22, 2020 at 08:18:15AM -0700, Matthew Wilcox wrote:
> On Wed, Apr 22, 2020 at 04:01:07PM +0100, Al Viro wrote:
> > On Mon, Apr 20, 2020 at 02:15:44AM -0500, Nate Karstens wrote:
> > > Series of 4 patches to implement close-on-fork. Tests have been
> > > published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> > > 
> > > close-on-fork addresses race conditions in system(), which
> > > (depending on the implementation) is non-atomic in that it
> > > first calls a fork() and then an exec().
> > > 
> > > This functionality was approved by the Austin Common Standards
> > > Revision Group for inclusion in the next revision of the POSIX
> > > standard (see issue 1318 in the Austin Group Defect Tracker).
> > 
> > What exactly the reasons are and why would we want to implement that?
> > 
> > Pardon me, but going by the previous history, "The Austin Group Says It's
> > Good" is more of a source of concern regarding the merits, general sanity
> > and, most of all, good taste of a proposal.
> > 
> > I'm not saying that it's automatically bad, but you'll have to go much
> > deeper into the rationale of that change before your proposal is taken
> > seriously.
> 
> https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html
> might be useful

*snort*

Alan Coopersmith in that thread:
|| https://lwn.net/Articles/785430/ suggests AIX, BSD, & MacOS have also defined
|| it, and though it's been proposed multiple times for Linux, never adopted there.

Now, look at the article in question.  You'll see that it should've been
"someone's posting in the end of comments thread under LWN article says that
apparently it exists on AIX, BSD, ..."

The strength of evidence aside, that got me curious; I have checked the
source of FreeBSD, NetBSD and OpenBSD.  No such thing exists in any of
their kernels, so at least that part can be considered an urban legend.

As for the original problem...  what kind of exclusion is used between
the reaction to netlink notifications (including closing every socket,
etc.) and actual IO done on those sockets?


^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22 15:43       ` Matthew Wilcox
  (?)
@ 2020-04-22 16:02         ` Karstens, Nate
  -1 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-04-22 16:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, David Laight,
	Changli Gao

> It's not safe to call system() from a threaded app.  That's all.  It's right there in the DESCRIPTION:

That is true, but that description is missing from both the Linux man page and the glibc documentation (https://www.gnu.org/software/libc/manual/html_mono/libc.html#Running-a-Command). It seems like a minor point that won't be noticed until it causes a problem, and problems are rare enough they might go unnoticed for a while. We have removed system() from our application, but we're also concerned that libraries we integrate will use system() without our knowledge.

-----Original Message-----
From: Matthew Wilcox <willy@infradead.org>
Sent: Wednesday, April 22, 2020 10:44
To: Karstens, Nate <Nate.Karstens@garmin.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; David Laight <David.Laight@aculab.com>; Changli Gao <xiaosuo@gmail.com>
Subject: Re: [PATCH 1/4] fs: Implement close-on-fork

On Wed, Apr 22, 2020 at 03:36:09PM +0000, Karstens, Nate wrote:
> There was some skepticism about whether our practice of
> closing/reopening sockets was advisable. Regardless, it does expose
> what I believe to be something that was overlooked in the forking
> process model. We posted two solutions to the Austin Group defect tracker:

I don't think it was "overlooked" at all.  It's not safe to call system() from a threaded app.  That's all.  It's right there in the DESCRIPTION:

   The system() function need not be thread-safe.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html

> Ultimately the Austin Group felt that close-on-fork was the preferred
> approach. I think it's also worth pointing out that Solaris reportedly
> has this feature
> (https://www.mail-archive.com/austin-group-l@opengroup.org/msg05359.html).

I am perplexed that the Austin Group thought this was a good idea.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: Implement close-on-fork
  2020-04-22 16:00       ` Al Viro
@ 2020-04-22 16:13         ` Al Viro
  -1 siblings, 0 replies; 69+ messages in thread
From: Al Viro @ 2020-04-22 16:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Nate Karstens, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

On Wed, Apr 22, 2020 at 05:00:32PM +0100, Al Viro wrote:

> *snort*
> 
> Alan Coopersmith in that thread:
> || https://lwn.net/Articles/785430/ suggests AIX, BSD, & MacOS have also defined
> || it, and though it's been proposed multiple times for Linux, never adopted there.
> 
> Now, look at the article in question.  You'll see that it should've been
> "someone's posting in the end of comments thread under LWN article says that
> apparently it exists on AIX, BSD, ..."
> 
> The strength of evidence aside, that got me curious; I have checked the
> source of FreeBSD, NetBSD and OpenBSD.  No such thing exists in any of
> their kernels, so at least that part can be considered an urban legend.
> 
> As for the original problem...  what kind of exclusion is used between
> the reaction to netlink notifications (including closing every socket,
> etc.) and actual IO done on those sockets?

Not an idle question, BTW - unlike Solaris we do NOT (and will not) have
close(2) abort IO on the same descriptor from another thread.  So if one
thread sits in recvmsg(2) while another does close(2), the socket will
*NOT* actually shut down until recvmsg(2) returns.
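
To make the point above concrete: since close(2) does not wake a thread
blocked in recvmsg(2), one common approach (not something proposed in
this thread) is to call shutdown(2) on the socket first. The sketch
below uses an AF_UNIX socketpair and recv(2) for brevity; the names
struct reader, reader_main() and shutdown_then_close() are hypothetical.

```c
#include <pthread.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct reader {
	int fd;
	ssize_t result;
};

/* Reader thread: sits in recv() until data arrives or the socket is shut down. */
static void *reader_main(void *arg)
{
	struct reader *r = arg;
	char buf[64];

	r->result = recv(r->fd, buf, sizeof(buf), 0);
	return NULL;
}

/*
 * Wake the reader with shutdown(2), then close the descriptors.
 * Returns what recv() returned in the reader thread (0 means end-of-file),
 * or -2 if the setup failed.
 */
static ssize_t shutdown_then_close(void)
{
	struct reader r = { .fd = -1, .result = -2 };
	pthread_t tid;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -2;
	r.fd = sv[0];
	if (pthread_create(&tid, NULL, reader_main, &r) != 0) {
		close(sv[0]);
		close(sv[1]);
		return -2;
	}

	usleep(100000);			/* crude: let the reader block first */
	shutdown(sv[0], SHUT_RDWR);	/* this is what unblocks recv() */
	pthread_join(tid, NULL);

	/* Only now is it safe to close: nobody is using the descriptor. */
	close(sv[0]);
	close(sv[1]);
	return r.result;
}
```

Build with -pthread. Joining the reader before close(2) also avoids the
descriptor number being reused while another thread still holds it.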

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22 16:02         ` Karstens, Nate
  (?)
@ 2020-04-22 16:31           ` Bernd Petrovitsch
  -1 siblings, 0 replies; 69+ messages in thread
From: Bernd Petrovitsch @ 2020-04-22 16:31 UTC (permalink / raw)
  To: Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, David Laight,
	Changli Gao

On 22/04/2020 16:02, Karstens, Nate wrote:
>> It's not safe to call system() from a threaded app.  That's all.  It's right there in the DESCRIPTION:
> 
> That is true, but that description is missing from both the Linux man page and the glibc documentation (https://www.gnu.org/software/libc/manual/html_mono/libc.html#Running-a-Command). It seems like a minor point that won't be noticed until it causes a problem, and problems are rare enough they might go unnoticed for a while. We have removed system() from our application, but we're also concerned that libraries we integrate will use system() without our knowledge.

Reimplementing system() is trivial.
LD_PRELOAD should take care of all system(3) calls.

I wonder if it has some value to add runtime checking for
"multi-threaded" to such lib functions and error out if
yes.

Apart from that, system() is a PITA even on
single/non-threaded apps.
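
A rough sketch of the reimplement-and-interpose idea mentioned above,
under the assumption that it is built as a shared object and loaded with
LD_PRELOAD. It is deliberately simplified: unlike the libc version it
does not touch SIGINT, SIGQUIT or SIGCHLD, and it adds no descriptor
policy of its own.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Simplified stand-in for system(3). When preloaded, calls made by the
 * application and by its libraries resolve to this definition.
 */
int system(const char *command)
{
	int status;
	pid_t pid;

	if (command == NULL)
		return 1;	/* report that a shell is available */

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		/*
		 * Child. A project-specific policy (for example closing
		 * inherited descriptors) would go here, before the exec.
		 */
		execl("/bin/sh", "sh", "-c", command, (char *)NULL);
		_exit(127);	/* exec failed */
	}

	/* Parent: wait for this child, retrying if interrupted by a signal. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return status;
}
```

With hypothetical file names, this could be built and used as:
cc -shared -fPIC -o libsystem.so system.c, then
LD_PRELOAD=./libsystem.so ./app.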

MfG,
	Bernd
-- 
There is no cloud, just other people's computers.
-- https://static.fsf.org/nosvn/stickers/thereisnocloud.svg

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22 16:31           ` Bernd Petrovitsch
  (?)
  (?)
@ 2020-04-22 16:55             ` David Laight
  -1 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-04-22 16:55 UTC (permalink / raw)
  To: 'Bernd Petrovitsch', Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

From: Bernd Petrovitsch
> Sent: 22 April 2020 17:32
...
> Apart from that, system() is a PITA even on
> single/non-threaded apps.

Not only that, it is bloody dangerous because (typically)
shell is doing post substitution syntax analysis.

If you need to run an external process you need to generate
an argv[] array containing the parameters.

	David
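
As an illustration of the argv[] approach described above, here is a
small sketch using posix_spawnp(3), which starts the program directly
with no shell in between. The helper name run_argv() is hypothetical.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Run argv[0] (looked up via PATH) with the given arguments.
 * No shell is involved, so the arguments are never re-parsed.
 * Returns the exit status of the program, or -1 on failure.
 */
static int run_argv(char *const argv[])
{
	pid_t pid;
	int status;

	if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
		return -1;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, with char *const args[] = { "ls", "-l", "a; rm -rf b", NULL },
run_argv(args) hands "a; rm -rf b" to ls as one literal file name,
whereas the same string pasted into a system() command line would be
split and executed by the shell.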

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
@ 2020-04-22 16:55             ` David Laight
  0 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-04-22 16:55 UTC (permalink / raw)
  To: 'Bernd Petrovitsch', Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux

RnJvbTogQmVybmQgUGV0cm92aXRzY2gNCj4gU2VudDogMjIgQXByaWwgMjAyMCAxNzozMg0KLi4u
DQo+IEFwYXJ0IGZyb20gdGhhdCwgc3lzdGVtKCkgaXMgYSBQSVRBIGV2ZW4gb24NCj4gc2luZ2xl
L25vbi10aHJlYWRlZCBhcHBzLg0KDQpOb3Qgb25seSB0aGF0LCBpdCBpcyBibG9vZHkgZGFuZ2Vy
b3VzIGJlY2F1c2UgKHR5cGljYWxseSkNCnNoZWxsIGlzIGRvaW5nIHBvc3Qgc3Vic3RpdHV0aW9u
IHN5bnRheCBhbmFseXNpcy4NCg0KSWYgeW91IG5lZWQgdG8gcnVuIGFuIGV4dGVybmFsIHByb2Nl
c3MgeW91IG5lZWQgdG8gZ2VuZXJhdGUNCmFuIGFydltdIGFycmF5IGNvbnRhaW5pbmcgdGhlIHBh
cmFtZXRlcnMuDQoNCglEYXZpZA0KDQotDQpSZWdpc3RlcmVkIEFkZHJlc3MgTGFrZXNpZGUsIEJy
YW1sZXkgUm9hZCwgTW91bnQgRmFybSwgTWlsdG9uIEtleW5lcywgTUsxIDFQVCwgVUsNClJlZ2lz
dHJhdGlvbiBObzogMTM5NzM4NiAoV2FsZXMpDQo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
@ 2020-04-22 16:55             ` David Laight
  0 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-04-22 16:55 UTC (permalink / raw)
  To: 'Bernd Petrovitsch', Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev@vger.kernel.org

From: Bernd Petrovitsch
> Sent: 22 April 2020 17:32
...
> Apart from that, system() is a PITA even on
> single/non-threaded apps.

Not only that, it is bloody dangerous because (typically)
shell is doing post substitution syntax analysis.

If you need to run an external process you need to generate
an arv[] array containing the parameters.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-22 16:55             ` David Laight
  (?)
@ 2020-04-23 12:34               ` Bernd Petrovitsch
  -1 siblings, 0 replies; 69+ messages in thread
From: Bernd Petrovitsch @ 2020-04-23 12:34 UTC (permalink / raw)
  To: David Laight, Karstens, Nate, Matthew Wilcox
  Cc: Alexander Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

Hi all!

On 22/04/2020 16:55, David Laight wrote:
> From: Bernd Petrovitsch
>> Sent: 22 April 2020 17:32
> ...
>> Apart from that, system() is a PITA even on
>> single/non-threaded apps.
> 
> Not only that, it is bloody dangerous because (typically)
> shell is doing post substitution syntax analysis.

I actually meant exactly that with PITA;-)

> If you need to run an external process you need to generate
> an argv[] array containing the parameters.

FullACK. That is usually similarly trivial ...

MfG,
	Bernd
-- 
There is no cloud, just other people's computers.
-- https://static.fsf.org/nosvn/stickers/thereisnocloud.svg

[-- Attachment #2: pEpkey.asc --]
[-- Type: application/pgp-keys, Size: 2513 bytes --]

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-04-20 10:25     ` Eric Dumazet
  (?)
@ 2020-05-01 14:45       ` Karstens, Nate
  -1 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-05-01 14:45 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao

Eric,

Thanks for the suggestion. I looked into it and noticed that do_close_on_exec() appears to have some optimizations as well:

> set = fdt->close_on_exec[i];
> if (!set)
> 	continue; 

If we interleave the close-on-exec and close-on-fork flags then this optimization will have to be removed. Do you have a sense of which optimization provides the most benefit?
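
To make the optimization in question concrete, here is a simplified userspace
sketch of a word-at-a-time bitmap scan. It is not the kernel code; the name
collect_set_bits() and its output-array interface are assumptions made for
illustration. A word that is entirely zero is skipped with a single comparison,
which is only possible when every bit in the word belongs to the same flag.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper: scan a bitmap of nwords words and store the index of
 * each set bit in out[] (at most out_len entries are written).
 * Returns the total number of set bits found.
 */
static size_t collect_set_bits(const unsigned long *bitmap, size_t nwords,
			       unsigned int *out, size_t out_len)
{
	size_t found = 0;

	for (size_t i = 0; i < nwords; i++) {
		unsigned long set = bitmap[i];
		unsigned int fd = (unsigned int)(i * BITS_PER_WORD);

		/* Whole word empty: skip BITS_PER_WORD descriptors at once. */
		if (!set)
			continue;

		/* Walk the remaining bits of this word, lowest first. */
		for (; set; fd++, set >>= 1) {
			if (!(set & 1))
				continue;
			if (found < out_len)
				out[found] = fd;
			found++;
		}
	}
	return found;
}
```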

I noticed a few other issues with the original patch that I will need to investigate or rework:

1) I'm not sure dup_fd() is the best place to check the close-on-fork flag. For example, the ksys_unshare() > unshare_fd() > dup_fd() execution path seems suspect. I will either add a parameter to the function indicating whether the flag should be checked or add a separate function, like do_close_on_fork().
2) If the close-on-fork flag is set, then __clear_open_fd() should be called instead of just __clear_bit(). This will ensure that fdt->full_fds_bits is updated.
3) Need to investigate whether the close-on-fork (or close-on-exec) flags need to be cleared when the file is closed as part of the close-on-fork execution path.

Others -- I will respond to feedback outside of implementation details in a separate message.

Thanks,

Nate

-----Original Message-----
From: Eric Dumazet <eric.dumazet@gmail.com> 
Sent: Monday, April 20, 2020 05:26
To: Karstens, Nate <Nate.Karstens@garmin.com>; Alexander Viro <viro@zeniv.linux.org.uk>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@HansenPartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org
Cc: Changli Gao <xiaosuo@gmail.com>
Subject: Re: [PATCH 1/4] fs: Implement close-on-fork


On 4/20/20 12:15 AM, Nate Karstens wrote:
> The close-on-fork flag causes the file descriptor to be closed 
> atomically in the child process before the child process returns from 
> fork(). Implement this feature and provide a method to get/set the 
> close-on-fork flag using fcntl(2).
>
> This functionality was approved by the Austin Common Standards 
> Revision Group for inclusion in the next revision of the POSIX 
> standard (see issue 1318 in the Austin Group Defect Tracker).

Oh well... yet another feature slowing down a critical path.

>
> Co-developed-by: Changli Gao <xiaosuo@gmail.com>
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> Signed-off-by: Nate Karstens <nate.karstens@garmin.com>
> ---
>  fs/fcntl.c                             |  2 ++
>  fs/file.c                              | 50 +++++++++++++++++++++++++-
>  include/linux/fdtable.h                |  7 ++++
>  include/linux/file.h                   |  2 ++
>  include/uapi/asm-generic/fcntl.h       |  5 +--
>  tools/include/uapi/asm-generic/fcntl.h |  5 +--
>  6 files changed, 66 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 2e4c0fa2074b..23964abf4a1a 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -335,10 +335,12 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>               break;
>       case F_GETFD:
>               err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
> +             err |= get_close_on_fork(fd) ? FD_CLOFORK : 0;
>               break;
>       case F_SETFD:
>               err = 0;
>               set_close_on_exec(fd, arg & FD_CLOEXEC);
> +             set_close_on_fork(fd, arg & FD_CLOFORK);
>               break;
>       case F_GETFL:
>               err = filp->f_flags;
> diff --git a/fs/file.c b/fs/file.c
> index c8a4e4c86e55..de7260ba718d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -57,6 +57,8 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
>       memset((char *)nfdt->open_fds + cpy, 0, set);
>       memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
>       memset((char *)nfdt->close_on_exec + cpy, 0, set);
> +     memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
> +     memset((char *)nfdt->close_on_fork + cpy, 0, set);
>

I suggest we group the two bits of a file (close_on_exec, close_on_fork) together, so that we do not have to dirty two separate cache lines.

Otherwise we will add yet another cache line miss at every file opening/closing for processes with big file tables.

I.e., having a _single_ bitmap array, with the even bit for close_on_exec and the odd bit for close_on_fork:

static inline void __set_close_on_exec(unsigned int fd, struct fdtable *fdt)
{
        __set_bit(fd * 2, fdt->close_on_fork_exec);
}

static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
{
        __set_bit(fd * 2 + 1, fdt->close_on_fork_exec);
}

Also the F_GETFD/F_SETFD implementation must use a single function call, to not acquire the spinlock twice.
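
A userspace sketch of that layout, assuming two bits per descriptor in one
array, might look like the following. The names get_fd_flags(), set_fd_flags(),
DEMO_CLOEXEC and DEMO_CLOFORK are hypothetical and the flag values are chosen
only for this example. Both bits are read or written in one access, which is
the property a single-call F_GETFD/F_SETFD helper would rely on; locking is
left out entirely.

```c
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical flag values for this sketch only. */
#define DEMO_CLOEXEC 1u	/* even bit: close-on-exec */
#define DEMO_CLOFORK 2u	/* odd bit:  close-on-fork */

/*
 * Bit 2*fd holds close-on-exec and bit 2*fd+1 holds close-on-fork.
 * 2*fd is even and BITS_PER_WORD is even, so the pair never straddles
 * a word boundary and can be fetched with one load.
 */
static unsigned int get_fd_flags(const unsigned long *map, unsigned int fd)
{
	unsigned int bit = fd * 2;

	return (unsigned int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 3UL);
}

/* Replace both flag bits of fd in one read-modify-write of a single word. */
static void set_fd_flags(unsigned long *map, unsigned int fd, unsigned int flags)
{
	unsigned int bit = fd * 2;
	unsigned int shift = bit % BITS_PER_WORD;
	unsigned long *word = &map[bit / BITS_PER_WORD];

	*word = (*word & ~(3UL << shift)) | ((unsigned long)(flags & 3u) << shift);
}
```

The cost of this layout is that a word now covers half as many descriptors,
and a plain "is this word zero?" test no longer distinguishes the two flags.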



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/4] fs: Implement close-on-fork
  2020-05-01 14:45       ` Karstens, Nate
  (?)
@ 2020-05-01 15:23         ` Matthew Wilcox
  -1 siblings, 0 replies; 69+ messages in thread
From: Matthew Wilcox @ 2020-05-01 15:23 UTC (permalink / raw)
  To: Karstens, Nate
  Cc: Eric Dumazet, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel, Changli Gao

On Fri, May 01, 2020 at 02:45:16PM +0000, Karstens, Nate wrote:
> Others -- I will respond to feedback outside of implementation details in a separate message.

FWIW, I'm opposed to the entire feature.  Improving the implementation
will not change that.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
  2020-05-01 14:45       ` Karstens, Nate
  (?)
@ 2020-05-03 13:52         ` David Laight
  -1 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-05-03 13:52 UTC (permalink / raw)
  To: 'Karstens, Nate',
	Eric Dumazet, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc, sparclinux, netdev, linux-kernel
  Cc: Changli Gao

From: Karstens, Nate
> Sent: 01 May 2020 15:45
> Thanks for the suggestion. I looked into it and noticed that do_close_on_exec() appears to have some
> optimizations as well:
> 
> > set = fdt->close_on_exec[i];
> > if (!set)
> > 	continue;
> 
> If we interleave the close-on-exec and close-on-fork flags then this optimization will have to be
> removed. Do you have a sense of which optimization provides the most benefit?

Thinks....
A moderate proportion of exec() will have at least one fd with 'close on exec' set.
Very few fork() will have any fd with 'close on fork' set.
The 'close on fork' table shouldn't be copied to the forked process.
The 'close on exec' table is deleted by exec().

So...
On fork() take a copy of the 'close_on_fork' bitmap and clear it.
For every bit set, look up the fd and close it if the live bit is set.
Similarly exec() clears and acts on the 'close on exec' map.

You should be able to use the same 'close the fds in this bitmap'
function for both cases.
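
A toy userspace model of that shared function could look like this. It is a
sketch under stated assumptions, not kernel code: struct demo_fdtable,
DEMO_WORDS and close_fds_in_bitmap() are hypothetical names, the table has a
fixed size, and "closing" is modelled by clearing the open bit.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_WORDS 4	/* fixed-size table, for illustration only */

/* Hypothetical, simplified stand-in for a descriptor table. */
struct demo_fdtable {
	unsigned long open_fds[DEMO_WORDS];
	unsigned long close_on_exec[DEMO_WORDS];
	unsigned long close_on_fork[DEMO_WORDS];
};

/*
 * Take a copy of 'pending' (either flag bitmap of t), clear it, and close
 * every descriptor whose bit was set and which is still open.
 * Returns the number of descriptors closed.
 */
static unsigned int close_fds_in_bitmap(struct demo_fdtable *t,
					unsigned long *pending)
{
	unsigned long snapshot[DEMO_WORDS];
	unsigned int closed = 0;

	memcpy(snapshot, pending, sizeof(snapshot));
	memset(pending, 0, sizeof(snapshot));

	for (size_t i = 0; i < DEMO_WORDS; i++) {
		/* Only act on descriptors that are both flagged and open. */
		unsigned long live = snapshot[i] & t->open_fds[i];

		if (!live)
			continue;

		/* Clearing the open bit stands in for closing the file. */
		t->open_fds[i] &= ~live;

		/* Count the closed descriptors, one set bit at a time. */
		for (; live; live &= live - 1)
			closed++;
	}
	return closed;
}
```

The fork path would pass t->close_on_fork and the exec path t->close_on_exec,
so the same loop serves both cases.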

So I think you need two bitmaps.
But the code needs to differentiate between requests to set bits
(which need to allocate/extend the bitmap) and ones to clear/read
bits (which do not).

You might even consider putting the 'live' flag into the fd structure
and using the bitmap value as a 'hint' - which might be hashed.

After all, it is likely that the 'close on exec' processing
will be faster overall if it just loops through the open fds and
checks each in turn!
I doubt many processes actually exec with more than a handful
of open files.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: [PATCH 1/4] fs: Implement close-on-fork
@ 2020-05-03 13:52         ` David Laight
  0 siblings, 0 replies; 69+ messages in thread
From: David Laight @ 2020-05-03 13:52 UTC (permalink / raw)
  To: 'Karstens, Nate',
	Eric Dumazet, Alexander Viro, Jeff Layton, J. Bruce Fields,
	Arnd Bergmann, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	James E.J. Bottomley, Helge Deller, David S. Miller,
	Jakub Kicinski, linux-fsdevel, linux-arch, linux-alpha,
	linux-parisc@vger.kernel.org
  Cc: Changli Gao

RnJvbTogS2Fyc3RlbnMsIE5hdGUNCj4gU2VudDogMDEgTWF5IDIwMjAgMTU6NDUNCj4gVGhhbmtz
IGZvciB0aGUgc3VnZ2VzdGlvbi4gSSBsb29rZWQgaW50byBpdCBhbmQgbm90aWNlZCB0aGF0IGRv
X2Nsb3NlX29uX2V4ZWMoKSBhcHBlYXJzIHRvIGhhdmUgc29tZQ0KPiBvcHRpbWl6YXRpb25zIGFz
IHdlbGw6DQo+IA0KPiA+IHNldCA9IGZkdC0+Y2xvc2Vfb25fZXhlY1tpXTsNCj4gPiBpZiAoIXNl
dCkNCj4gPiAJY29udGludWU7DQo+IA0KPiBJZiB3ZSBpbnRlcmxlYXZlIHRoZSBjbG9zZS1vbi1l
eGVjIGFuZCBjbG9zZS1vbi1mb3JrIGZsYWdzIHRoZW4gdGhpcyBvcHRpbWl6YXRpb24gd2lsbCBo
YXZlIHRvIGJlDQo+IHJlbW92ZWQuIERvIHlvdSBoYXZlIGEgc2Vuc2Ugb2Ygd2hpY2ggb3B0aW1p
emF0aW9uIHByb3ZpZGVzIHRoZSBtb3N0IGJlbmVmaXQ/DQoNClRoaW5rcy4uLi4NCkEgbW9kZXJh
dGUgcHJvcG9ydGlvbiBvZiBleGVjKCkgd2lsbCBoYXZlIGF0IGxlYXN0IG9uZSBmZCB3aXRoICdj
bG9zZSBvbiBleGVjJyBzZXQuDQpWZXJ5IGZldyBmb3JrKCkgd2lsbCBoYXZlIGFueSBmZCB3aXRo
ICdjbG9zZSBvbiBmb3JrJyBzZXQuDQpUaGUgJ2Nsb3NlIG9uIGZvcmsnIHRhYmxlIHNob3VsZG4n
dCBiZSBjb3BpZWQgdG8gdGhlIGZvcmtlZCBwcm9jZXNzLg0KVGhlICdjbG9zZSBvbiBleGVjJyB0
YWJsZSBpcyBkZWxldGVkIGJ5IGV4ZWMoKS4NCg0KU28uLi4NCk9uIGZvcmsoKSB0YWtlIGEgY29w
eSBhbmQgY2xlYXIgdGhlICdjbG9zZV9vbl9mb3JrJyBiaXRtYXAuDQpGb3IgZXZlcnkgYml0IHNl
dCBsb29rdXAgdGhlIGZkIGFuZCBjbG9zZSBpZiB0aGUgbGl2ZSBiaXQgaXMgc2V0Lg0KU2ltaWxh
cmx5IGV4ZWMoKSBjbGVhcnMgYW5kIGFjdHMgb24gdGhlICdjbG9zZSBvbiBleGVjJyBtYXAuDQoN
CllvdSBzaG91bGQgYmUgYWJsZSB0byB1c2UgdGhlIHNhbWUgJ2Nsb3NlIHRoZSBmZHMgaW4gdGhp
cyBiaXRtYXAnDQpmdW5jdGlvbiBmb3IgYm90aCBjYXNlcy4NCg0KU28gSSB0aGluayB5b3UgbmVl
ZCB0d28gYml0bWFwcy4NCkJ1dCB0aGUgY29kZSBuZWVkcyB0byBkaWZmZXJlbnRpYXRlIGJldHdl
ZW4gcmVxdWVzdHMgdG8gc2V0IGJpdHMNCih3aGljaCBuZWVkIHRvIGFsbG9jYXRlL2V4dGVuZCB0
aGUgYml0bWFwKSBhbmQgb25lcyB0byBjbGVhci9yZWFkDQpiaXRzICh3aGljaCBkbyBub3QpLg0K
DQpZb3UgbWlnaHQgZXZlbiBjb25zaWRlciBwdXR0aW5nIHRoZSAnbGl2ZScgZmxhZyBpbnRvIHRo
ZSBmZCBzdHJ1Y3R1cmUNCmFuZCB1c2luZyB0aGUgYml0bWFwIHZhbHVlIGFzIGEgJ2hpbnQnIC0g
d2hpY2ggbWlnaHQgYmUgaGFzaGVkLg0KDQpBZnRlciBhbGwsIGl0IGlzIGxpa2VseSB0aGF0IHRo
ZSAnY2xvc2Ugb24gZXhlYycgcHJvY2Vzc2luZw0Kd2lsbCBiZSBmYXN0ZXIgb3ZlcmFsbCBpZiBp
dCBqdXN0IGxvb3BzIHRocm91Z2ggdGhlIG9wZW4gZmQgYW5kDQpjaGVja3MgZWFjaCBpbiB0dXJu
IQ0KSSBkb3VidCBtYW55IHByb2Nlc3NlcyBhY3R1YWxseSBleGVjIHdpdGggbW9yZSB0aGFuIGFu
IGhhbmRmdWwNCm9mIG9wZW4gZmlsZXMuDQoNCglEYXZpZA0KDQotDQpSZWdpc3RlcmVkIEFkZHJl
c3MgTGFrZXNpZGUsIEJyYW1sZXkgUm9hZCwgTW91bnQgRmFybSwgTWlsdG9uIEtleW5lcywgTUsx
IDFQVCwgVUsNClJlZ2lzdHJhdGlvbiBObzogMTM5NzM4NiAoV2FsZXMpDQo

^ permalink raw reply	[flat|nested] 69+ messages in thread

* RE: Implement close-on-fork
  2020-04-22 16:00       ` Al Viro
  (?)
@ 2020-05-04 13:46         ` Karstens, Nate
  -1 siblings, 0 replies; 69+ messages in thread
From: Karstens, Nate @ 2020-05-04 13:46 UTC (permalink / raw)
  To: Al Viro, Matthew Wilcox
  Cc: Jeff Layton, J. Bruce Fields, Arnd Bergmann, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, James E.J. Bottomley, Helge Deller,
	David S. Miller, Jakub Kicinski, linux-fsdevel, linux-arch,
	linux-alpha, linux-parisc, sparclinux, netdev, linux-kernel,
	Changli Gao

Thanks, everyone, for your comments, and sorry for the delay in my reply.

> As for the original problem...  what kind of exclusion is used between
> the reaction to netlink notifications (including closing every socket,
> etc.) and actual IO done on those sockets?
> Not an idle question, BTW - unlike Solaris we do NOT (and will not) have
> close(2) abort IO on the same descriptor from another thread.  So if one
> thread sits in recvmsg(2) while another does close(2), the socket will
> *NOT* actually shut down until recvmsg(2) returns.

The netlink notification is received on a separate thread, but handling of that notification (closing and re-opening sockets) and the socket I/O is all done on the same thread. The call to system() happens sometime between when this thread decides to close all of its sockets and when the sockets have been closed. The child process is left with a reference to one or more sockets. The close-on-exec flag is set on the socket, so the period of time is brief, but because system() is not atomic this still leaves a window of opportunity for the failure to occur. The parent process tries to open the socket again but fails because the child process still has an open socket that controls the port.
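The window described above can be reproduced in a small, self-contained sketch. This is not our production code: `bind_any()` and `fork_window_demo()` are names invented for the example, a plain fork() stands in for the fork half of system(), and a pipe holds the child in the gap where it has forked but not yet reached exec(). The socket is created with SOCK_CLOEXEC, as in the scenario above.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Bind a new close-on-exec TCP socket to the given port on any address. */
static int bind_any(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved = errno;

        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Returns the errno seen when the parent re-binds the port while the child
 * is still between fork() and exec() (0 if the bind succeeded, -1 on setup
 * failure). *after_exit receives the same result once the child has exited.
 */
static int fork_window_demo(int *after_exit)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int gate[2], fd, during, status;
    unsigned short port;
    pid_t pid;
    char c;

    fd = bind_any(0);                   /* let the kernel pick a free port */
    if (fd < 0 || getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    port = ntohs(addr.sin_port);
    if (pipe(gate) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: exec() has not happened yet, so the socket is still open. */
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0)   /* wait until the parent lets us go */
            _exit(1);
        _exit(0);
    }

    close(gate[0]);
    close(fd);                          /* parent closes its socket ... */
    fd = bind_any(port);                /* ... and tries to re-open it */
    during = (fd < 0) ? errno : 0;
    if (fd >= 0)
        close(fd);

    close(gate[1]);                     /* release the child */
    waitpid(pid, &status, 0);

    fd = bind_any(port);                /* the child's reference is gone now */
    *after_exit = (fd < 0) ? errno : 0;
    if (fd >= 0)
        close(fd);
    return during;
}
```

On Linux I would expect the first re-bind to fail with EADDRINUSE and the second to succeed, which is the failure we see, just with the timing made deterministic.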

This phenomenon can really be generalized to any resource that 1) a process needs exclusive access to and 2) the operating system automatically creates a new reference in the child when the process forks.

> Reimplementing system() is trivial.
> LD_PRELOAD should take care of all system(3) calls.

Yes, that would solve the problem for our system. We identified what we believe to be a problem with the POSIX threading model and wanted to work with the community to improve this for others as well. The Austin Group agreed with the premise enough that they were willing to update the POSIX standard.

> I wonder if it has some value to add runtime checking for "multi-threaded" to such lib functions and error out if yes.
> Apart from that, system() is a PITA even on single/non-threaded apps.

That may be, but system() is convenient and there isn't much in the documentation that warns the average developer away from its use. The manpage indicates system() is thread-safe. The manpage is also somewhat contradictory in that it describes the operation as being equivalent to a fork() and an execl(), though it later points out that pthread_atfork() handlers may not be executed.

> FWIW, I'm opposed to the entire feature.  Improving the implementation will not change that.

I get it. From our perspective, changing the OS to resolve an issue seems like a drastic step. We tried hard to come up with an alternative (see https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html and https://austingroupbugs.net/view.php?id=1317), but nothing else addresses the underlying issue: there is no way to prevent a fork() from duplicating the resource. The close-on-exec flag partially addresses this by allowing the parent process to mark a file descriptor as exclusive to itself, but there is still a period of time during which the failure can occur because the auto-close only occurs during the exec(). Perhaps this would not be an issue with a different process/threading model, but that is another discussion entirely.

Best Regards,

Nate

-----Original Message-----
From: Al Viro <viro@ftp.linux.org.uk> On Behalf Of Al Viro
Sent: Wednesday, April 22, 2020 11:01
To: Matthew Wilcox <willy@infradead.org>
Cc: Karstens, Nate <Nate.Karstens@garmin.com>; Jeff Layton <jlayton@kernel.org>; J. Bruce Fields <bfields@fieldses.org>; Arnd Bergmann <arnd@arndb.de>; Richard Henderson <rth@twiddle.net>; Ivan Kokshaysky <ink@jurassic.park.msu.ru>; Matt Turner <mattst88@gmail.com>; James E.J. Bottomley <James.Bottomley@hansenpartnership.com>; Helge Deller <deller@gmx.de>; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; linux-fsdevel@vger.kernel.org; linux-arch@vger.kernel.org; linux-alpha@vger.kernel.org; linux-parisc@vger.kernel.org; sparclinux@vger.kernel.org; netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Changli Gao <xiaosuo@gmail.com>
Subject: Re: Implement close-on-fork


On Wed, Apr 22, 2020 at 08:18:15AM -0700, Matthew Wilcox wrote:
> On Wed, Apr 22, 2020 at 04:01:07PM +0100, Al Viro wrote:
> > On Mon, Apr 20, 2020 at 02:15:44AM -0500, Nate Karstens wrote:
> > > Series of 4 patches to implement close-on-fork. Tests have been
> > > published to https://github.com/nkarstens/ltp/tree/close-on-fork.
> > >
> > > close-on-fork addresses race conditions in system(), which
> > > (depending on the implementation) is non-atomic in that it first
> > > calls a fork() and then an exec().
> > >
> > > This functionality was approved by the Austin Common Standards
> > > Revision Group for inclusion in the next revision of the POSIX
> > > standard (see issue 1318 in the Austin Group Defect Tracker).
> >
> > What exactly the reasons are and why would we want to implement that?
> >
> > Pardon me, but going by the previous history, "The Austin Group Says
> > It's Good" is more of a source of concern regarding the merits,
> > general sanity and, most of all, good taste of a proposal.
> >
> > I'm not saying that it's automatically bad, but you'll have to go
> > much deeper into the rationale of that change before your proposal
> > is taken seriously.
>
> https://www.mail-archive.com/austin-group-l@opengroup.org/msg05324.html
> might be useful

*snort*

Alan Coopersmith in that thread:
|| https://lwn.net/Articles/785430/ suggests AIX, BSD, & MacOS have also
|| defined it, and though it's been proposed multiple times for Linux, never adopted there.

Now, look at the article in question.  You'll see that it should've been "someone's posting in the end of comments thread under LWN article says that apparently it exists on AIX, BSD, ..."

The strength of evidence aside, that got me curious; I have checked the source of FreeBSD, NetBSD and OpenBSD.  No such thing exists in any of their kernels, so at least that part can be considered an urban legend.

As for the original problem...  what kind of exclusion is used between the reaction to netlink notifications (including closing every socket,
etc.) and actual IO done on those sockets?



Thread overview: 69+ messages
2020-04-20  7:15 Implement close-on-fork Nate Karstens
2020-04-20  7:15 ` [PATCH 1/4] fs: " Nate Karstens
2020-04-20 10:25   ` Eric Dumazet
2020-04-22  3:38     ` Changli Gao
2020-04-22  3:41       ` Changli Gao
2020-04-22  8:35     ` David Laight
2020-05-01 14:45     ` Karstens, Nate
2020-05-01 15:23       ` Matthew Wilcox
2020-05-03 13:52       ` David Laight
2020-04-22 15:36   ` Karstens, Nate
2020-04-22 15:43     ` Matthew Wilcox
2020-04-22 16:02       ` Karstens, Nate
2020-04-22 16:31         ` Bernd Petrovitsch
2020-04-22 16:55           ` David Laight
2020-04-23 12:34             ` Bernd Petrovitsch
2020-04-20  7:15 ` [PATCH 2/4] fs: Add O_CLOFORK flag for open(2) and dup3(2) Nate Karstens
2020-04-20  7:15 ` [PATCH 3/4] fs: Add F_DUPFD_CLOFORK to fcntl(2) Nate Karstens
2020-04-20  7:15 ` [PATCH 4/4] net: Add SOCK_CLOFORK Nate Karstens
2020-04-22 14:32 ` Implement close-on-fork James Bottomley
2020-04-22 15:01 ` Al Viro
2020-04-22 15:18   ` Matthew Wilcox
2020-04-22 15:34     ` James Bottomley
2020-04-22 16:00     ` Al Viro
2020-04-22 16:13       ` Al Viro
2020-05-04 13:46       ` Karstens, Nate
